New Archives Intern: Lauren Wolters

Lauren Wolters

Lauren Wolters is a rising junior at Skidmore College. She is double majoring in History and Psychology and is interested in learning the basics of archival theory and practice. As a history major, Lauren is fascinated by old artifacts and is excited to have the opportunity to work with collections that are not always available to the public. She is currently taking inventory of a collection of photographs and organizing a digital list of metadata. Eventually, she will move on to assist with a project relating to the Langley Porter Psychiatric Institute Records, which suits her well because it combines her two majors.

Lauren was born and raised in San Francisco, CA. She plays volleyball at Skidmore College and enjoys photography as a hobby. Lauren is enjoying working in the library with the archivists and looks forward to learning even more about the archives.

Experiments with Digital Tools in the Archives — OCR

Working on digital “stuff” in the archives is always fascinating, because it blurs the borders between digital and physical. Most of the work that takes up my time is at these borders. Physical stuff requires lots of human touches to transition to “digital,” and digital stuff similarly requires lots of tending by humans to ensure that it is preserved physically. After all, the 1s and 0s are stored physically somewhere, even if on the cloud or in DNA.

We’re currently working on several projects to convert physical materials to digital text. The huge quantities of rich and complicated textual material in archival collections are full of potential for use as data in both computational health research and digital medical humanities work, but to be usable for these kinds of projects the material needs to be converted to digital text or data, so that it can be interpreted by computers. To get to this point the documents must be scanned, and the scanned documents must either be transcribed, which can be immensely labor intensive, or converted directly by computers using software that can perform Optical Character Recognition, or OCR. One of our projects using OCR to extract text from a document provides a fascinating look into the world of computer vision.

A pen and ink illustration of the lungs and a lymph gland from the Ralph Sweet Collection of Medical Illustrations

An example of the illustrations in the Ralph Sweet Collection

The Ralph Sweet Collection of Medical Illustrations contains extraordinary examples of the work of one of the most renowned medical illustrators in the United States, so we’re working on digitizing the collection and putting it online. To do this we need detailed metadata (the kind of information you might expect to find in a catalog record: title, date, author) about each illustration. Currently this metadata for the Sweet Collection exists only in the form of a printed index that was written on a typewriter. We can scan the index, but we do not have the labor to transcribe each of the 2500 or so entries. This is a job for OCR.

The image below shows what a page of the Ralph Sweet index looks like. This is the metadata that we want to be able to extract and turn into digital text so that it can be understood by a computer and used as data.

A page of a typewritten index of the Ralph Sweet Collection, showing metadata about each illustration in the collection.

A page of the index for the Ralph Sweet Collection.

One of the first problems we encountered in attempting to extract text from this document is a classic difficulty of computer vision. As English-speaking humans, we know by looking at this document that it contains three columns, and that the order in which to read the page is top to bottom by column, starting on the left and moving right. To a computer, however, it is simply a page full of text, and there is no way to know whether the text is broken into columns or whether each line should be read all the way across the page. This simple task presented a difficulty for most of the software that we tested, but we found one program that could identify these columns easily. The software is called Tesseract, and it was originally developed in the 1980s but continues to be a good open-source tool for performing OCR.
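To make the column problem concrete, here is a minimal Python sketch of the difference between reading straight across the page and reading column by column. It is not how Tesseract does it internally; the `(text, x_left, y_top)` box layout, the `column_gap` value, and the sample entries are all assumptions made up for illustration.

```python
from typing import List, Tuple

# Hypothetical layout: (text, x_left, y_top) in pixels.
Box = Tuple[str, int, int]


def reading_order(boxes: List[Box], column_gap: int = 100) -> List[str]:
    """Group boxes into columns by x position, then read each column top to bottom."""
    columns: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[1]):
        # Stay in the current column while the box starts close to the
        # left edge of that column's first box; otherwise open a new column.
        if columns and box[1] - columns[-1][0][1] < column_gap:
            columns[-1].append(box)
        else:
            columns.append([box])
    ordered: List[str] = []
    for column in columns:
        ordered.extend(b[0] for b in sorted(column, key=lambda b: b[2]))
    return ordered


def naive_order(boxes: List[Box]) -> List[str]:
    """Read every line all the way across the page, ignoring columns."""
    return [b[0] for b in sorted(boxes, key=lambda b: (b[2], b[1]))]


# Invented sample: two rows spread across three columns.
sample = [("A1", 10, 10), ("B1", 300, 10), ("C1", 600, 10),
          ("A2", 12, 50), ("B2", 305, 50)]
```

With the invented sample, `naive_order(sample)` interleaves the columns (A1, B1, C1, A2, B2), while `reading_order(sample)` keeps each column together (A1, A2, B1, B2, C1). A fixed gap like this would fail on skewed scans or uneven columns, which is part of why the task is harder than it looks.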

If we plug the above page into Tesseract, we get mostly recognizable text, which in itself is pretty miraculous when you think about it. Looking at the text though, it quickly becomes clear that it is not an exact transcription of what’s on the page. There are misspellings (“Iivev”), and some chunks of text have been separated from the entry in which they should appear (“horizontal”).

An image of the text output of the software Tesseract showing errors in transcription.

An example of the text extracted from the Ralph Sweet Collection Index by Tesseract.
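For readers who want to try this themselves, the sketch below shows one way to call the Tesseract command-line program from Python. It assumes Tesseract 4 or later is installed and on the PATH; the file names are hypothetical, and the page segmentation mode shown is simply Tesseract's default, not necessarily the setting we used.

```python
import subprocess
from typing import List


def tesseract_command(image_path: str, output_base: str,
                      psm: int = 3, lang: str = "eng") -> List[str]:
    """Build the argument list for the tesseract command-line tool.

    psm is the page segmentation mode; 3 is fully automatic layout
    analysis (the default), which is what lets Tesseract find columns.
    """
    return ["tesseract", image_path, output_base,
            "-l", lang, "--psm", str(psm)]


def run_ocr(image_path: str, output_base: str, psm: int = 3) -> None:
    """Run Tesseract; it writes the recognized text to <output_base>.txt."""
    subprocess.run(tesseract_command(image_path, output_base, psm), check=True)


# Example call (hypothetical file names, requires Tesseract to be installed):
# run_ocr("sweet_index_page.png", "sweet_index_page")
```

Trying other `psm` values on the same page is a quick way to see how much the layout analysis affects the output.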

Digging into the way that Tesseract (and OCR software more generally) works can help us begin to understand why these errors are cropping up. Plus, it looks really cool.

OCR programs have to go through a set of image manipulation processes to help them decide which marks on the page are text (and hence should be interpreted) and which are other marks that can be ignored. This all happens behind the scenes, and it usually involves deciding what the background parts of the image are and blurring them out, increasing the image contrast, and making the image bi-tonal so that everything on the page is only black or white. Then, the computer can trace the black pixels on the image and get a series of shapes that it can attempt to interpret as text. The image below shows the shapes that Tesseract has identified as letters and traced out for interpretation. Each new color indicates that the computer believes it has moved on to a new letter.

A page of colorful text on a black background illustrating the text that has been automatically traced from the Ralph Sweet Index by the computer program Tesseract.

The result of Tesseract tracing the letters it has interpreted. Each new color is something that’s been identified as a new letter.
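The clean-up steps described above (raising the contrast and making the image bi-tonal) can be sketched in plain Python. This is a simplified illustration, not Tesseract's actual pipeline: the image is assumed to be a small grid of grayscale values, and the fixed threshold of 128 is an arbitrary choice for the example.

```python
from typing import List

# Assumed representation: rows of pixel values, 0 (black) to 255 (white).
Gray = List[List[int]]


def stretch_contrast(image: Gray) -> Gray:
    """Rescale pixel values so the darkest becomes 0 and the lightest 255."""
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    if hi == lo:
        # A completely flat image has no contrast to stretch.
        return [row[:] for row in image]
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in image]


def binarize(image: Gray, threshold: int = 128) -> List[List[int]]:
    """Make the image bi-tonal: 1 marks ink, 0 marks background."""
    return [[1 if p < threshold else 0 for p in row] for row in image]


# Invented sample: a faded scan where ink is mid-gray on light-gray paper.
faded = [[200, 190, 90],
         [195, 80, 185]]
bitonal = binarize(stretch_contrast(faded))
```

After stretching, the faint ink pixels (90 and 80) fall well below the threshold and the paper stays above it, so `bitonal` marks only those two pixels as ink.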

Interestingly, when comparing the computer tracing of the letters to the original image you can see that Tesseract has already made the assumption that the black spaces from the three-hole punch in the top of the page are not letters, and thus it should not bother to trace them. Lucky for us, that’s correct.
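One common way to "trace" shapes in a bi-tonal image is to group touching ink pixels together, a technique usually called connected-component labeling. The sketch below shows that idea, plus a size filter that would discard a large blob such as a hole-punch mark. The source does not say how Tesseract actually decides to skip the punch holes, so the size filter is only one plausible guess, and `max_pixels` is a made-up parameter.

```python
from typing import List, Tuple

Shape = List[Tuple[int, int]]  # (row, col) positions of one traced shape


def trace_shapes(bitmap: List[List[int]]) -> List[Shape]:
    """Return each group of 4-connected ink pixels (value 1) as one shape."""
    height = len(bitmap)
    width = len(bitmap[0]) if bitmap else 0
    seen = [[False] * width for _ in range(height)]
    shapes: List[Shape] = []
    for r in range(height):
        for c in range(width):
            if bitmap[r][c] != 1 or seen[r][c]:
                continue
            # Flood-fill outward from this unvisited ink pixel.
            stack, shape = [(r, c)], []
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                shape.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            shapes.append(shape)
    return shapes


def drop_large_blobs(shapes: List[Shape], max_pixels: int) -> List[Shape]:
    """Discard shapes too big to be a letter (an assumed heuristic)."""
    return [s for s in shapes if len(s) <= max_pixels]


# Invented sample: a thin two-pixel stroke and a solid 3x3 blob.
page = [[1, 0, 0, 1, 1, 1],
        [1, 0, 0, 1, 1, 1],
        [0, 0, 0, 1, 1, 1]]
letters = drop_large_blobs(trace_shapes(page), max_pixels=4)
```

Here `trace_shapes(page)` finds two shapes, and the filter keeps only the thin stroke. Each surviving shape is what would get its own color in a tracing like the one pictured above.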

Next the computer has to take all these letters and turn them into words. In actual practice it’s not quite this simple, but basically the computer iterates on each letter identification that it believes it has made by testing whether or not the resulting word is in its dictionary, and thus whether or not it is likely to be a word. If the combination of letters that the computer thinks it sees is not a word, then it will go back and make a new guess about the letters and test whether or not that’s a word, and so on. Part of this whole process is to chunk the letters into words using their actual spacing on the page. Below you can see an image of how Tesseract has begun to identify words using the spaces between lines and letters.

A view of a page of the Ralph Sweet Index showing each word as a blue rectangle encompassing the space taken up by that block of text against a black background -- the "word" output of the OCR program Tesseract.

The “words” that the OCR software has identified on the page. Each blue rectangle represents a space that Tesseract has marked as containing a word.
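The guess-and-check loop against a dictionary can be sketched as follows. This is a toy version under stated assumptions: each letter position comes with a short list of candidate letters (most likely first), the dictionary is a tiny hand-made set, and the idea that “Iivev” was meant to be “liver” is our own guess about that particular error.

```python
from itertools import product
from typing import List, Optional, Set


def best_dictionary_word(letter_guesses: List[List[str]],
                         dictionary: Set[str]) -> Optional[str]:
    """Try combinations of candidate letters until one spells a known word.

    letter_guesses[i] lists the candidates for position i, best guess first,
    so the very first combination tried is the OCR's top guess.
    """
    for combo in product(*letter_guesses):
        word = "".join(combo)
        if word.lower() in dictionary:
            return word
    return None  # nothing matched; a real system would fall back to its top guess


# Hypothetical candidates for the misread word "Iivev".
guesses = [["I", "l"], ["i"], ["v"], ["e"], ["v", "r"]]
tiny_dictionary = {"liver", "lung", "gland"}
corrected = best_dictionary_word(guesses, tiny_dictionary)
```

With these candidates the function rejects “Iivev”, “Iiver”, and “livev” before settling on “liver”. Trying every combination grows quickly with word length, so this brute-force loop is only practical for a sketch.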

In addition to checking the word against the dictionary though, most OCR programs also use the context of the surrounding words to attempt to make a better guess about the text. In most cases this is helpful — if the computer has read a sentence that says “the plant requires wader” it should be a relatively easy task to decide that the last word is actually “water.” In our case though, this approach breaks down. The text we want the computer to extract in this case is not sentences, but rather (meta)data. The meaning of the language has little influence on how each individual word should be read. One of the next steps for us will be trying to figure out how to better instruct Tesseract about the importance of context in making word-identification decisions (i.e., that it’s not important).
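A very small model of "using context" is to count how often one word follows another and prefer the candidate that has been seen after the previous word. The sketch below does this with a hand-made table of counts; the counts, the function name, and the sample metadata value are all invented for illustration and say nothing about how Tesseract weighs context internally.

```python
from typing import Dict, List, Tuple


def pick_with_context(previous_word: str, candidates: List[str],
                      bigram_counts: Dict[Tuple[str, str], int]) -> str:
    """Choose the candidate seen most often after previous_word.

    When no candidate has ever followed previous_word, every score is 0
    and max() keeps the first candidate, i.e. the OCR's own top guess.
    """
    return max(candidates,
               key=lambda word: bigram_counts.get((previous_word, word), 0))


# Invented counts standing in for a large body of ordinary English text.
counts = {("requires", "water"): 12}

in_a_sentence = pick_with_context("requires", ["wader", "water"], counts)
in_metadata = pick_with_context("1947", ["wader", "water"], counts)
```

In the sentence, context correctly turns “wader” into “water”. After a metadata value like a date, the counts carry no signal at all, so the misreading survives. That is the situation the index puts us in: the neighboring words are other fields, not the rest of a sentence.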

Finally, as the OCR software interprets the text it also identifies blocks of words that it believes should be grouped together, like a paragraph. Below you can see the results of this process with Tesseract.

A view of the different elements of Tesseract's text identification showing letters traced in primary colors and contained in yellow bounding boxes, words set against blue rectangles outlining the space they encompass, and blocks of text outlined in larger bounding boxes and numbered -- all of this set against a black background.

This view shows all of the elements of Tesseract’s word identification combined. Text has been traced in color, separate letters are contained in bounding boxes, words are contained in blue rectangles, and blocks are contained in larger bounding boxes and are numbered (though the numbers are a bit difficult to see).

A line has been drawn around each block of text, and it has been given a number indicating the order in which the computer is reading it. Here we can see the source of one of the biggest problems of the OCR-generated text from earlier. Tesseract is inaccurately excluding a lot of words from their proper blocks. In the above photo, the word “Pen” is a good example. It is a part of block 20, but it has been interpreted by the computer as its own block (block 21) and has been set aside to appear in the text file after block 20. Attempting to solve this problem will be our next hurdle, and hopefully we can catch you up after we are successful.
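We have not solved this yet, but one direction worth sketching is to ignore the block boundaries within a column and rebuild each line from the positions of the individual words. The Python below is an untested idea rather than our method: the `(text, left, top)` word layout, the `line_tolerance` value, and the sample words are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical layout: (text, left, top) in pixels, all from one column.
Word = Tuple[str, int, int]


def rebuild_lines(words: List[Word], line_tolerance: int = 10) -> List[str]:
    """Regroup words into lines by vertical position, ignoring block labels."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[2]):
        # A word joins the current line if its top edge is close to the
        # top edge of that line's first word; otherwise it starts a new line.
        if lines and word[2] - lines[-1][0][2] <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, read left to right.
    return [" ".join(w[0] for w in sorted(line, key=lambda w: w[1]))
            for line in lines]


# Invented sample: "Pen" was split into its own block, but its position
# still places it at the start of the second line.
block_20 = [("Kidney", 10, 100), ("section", 90, 102),
            ("and", 70, 129), ("ink", 120, 131)]
block_21 = [("Pen", 10, 130)]
entry = rebuild_lines(block_20 + block_21)
```

For the invented sample this yields the lines "Kidney section" and "Pen and ink", putting the stray word back where it sits on the page. It only makes sense within a single column, so it would have to run after the columns have been separated.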

Using OCR to extract text from digital images can be a frustrating endeavor if accuracy is a necessity, but it is also a fascinating illustration of the way computers see and make decisions. Anytime we ask computers to perform tasks that interface with humans, we will always be grappling with similar issues.

Base Hospital No. 30, One Hundred Years Later – Part Two: France

This is a guest post by Aaron J. Jackson, PhD student, UCSF Department of Anthropology, History and Social Medicine. 

One hundred years ago, the men and women of U.S. Army Base Hospital No. 30—the University of California School of Medicine Unit—arrived in France to support the American war effort after more than a year of preparation in the United States. They had already faced many challenges by the time they first set foot in Europe, including navigating the Army bureaucracy, going through extensive military and medical training, traveling from San Francisco to New York, and treating thousands of soldiers who had developed acute infections as a result of the massive mobilization efforts taking place in 1917 and 1918. They crossed the Atlantic in late April and arrived in France in May, expecting to occupy a prepared site, where they could set to the important work of caring for America’s wounded soldiers as the American Expeditionary Forces moved into the Western Front and helped blunt the German Spring Offensive of 1918. But they would still have to overcome significant obstacles before that work could begin. In this entry—the second part of four planned posts—I will cover the experience of Base Hospital No. 30 as they landed in France and made preparations to support the wounded between May and June, 1918. These stories are derived from primary source materials on Base Hospital No. 30 kept at the UCSF Archives & Special Collections, and it is with great appreciation to the archivists there that I am able to write about the experiences of the men and women of the University of California School of Medicine and their experience in the Great War. If you have not done so yet, please read Part One: Organization, Mobilization, and Travel here.

The U.S.S. Leviathan arrived in Brest, France on May 2, 1918. The port city at that time was a bustle of activity as the Americans established supply depots and warehouses and scrambled to offload the massive influx of war materiel and men arriving from across the Atlantic. Base Hospital No. 30 was but one of hundreds of American units transitioning into France at the time. Until that point, the unit managed to keep track of the $100,000 worth of Red Cross supplies and equipment it had drawn from the quartermaster and medical supply depot back in San Francisco. But with everything operating at a frantic pace in Brest, they found it impossible to ensure that these supplies remained with the unit. They received orders to board a train for Royat a mere two days after arriving in Brest, and while they were assured that their supplies would catch up, Lieutenant Colonel Eugene S. Kilgore later recalled that the unit was “dismayed at the apparent rough handling of [their] cargo in shipment, and were not surprised that much of it failed to reach us in Royat.”

Figure 6 – Royat Advertisements ca. 1900-1910.

“Royat les Bains is a small town, situated in the very heart of France, in the Auvergne Mountains,” begins the U.S. Army Hospitalization Report prepared by acquisition officers who scouted the location and rented the buildings that Base Hospital No. 30 was to occupy. Royat was (and remains) a spa town that advertised its natural hot springs and a history dating back to the Roman occupation of Gaul—the Romans constructed baths that utilized the hot springs, making Royat’s tourist heritage a truly ancient affair. Unfortunately, for an American hospital unit interested in operating a modern medical institution, Royat’s ancient roots left much to be desired, despite its charms.

In their hospitalization report, the acquisition officers noted that the town was “clean, quiet and healthful” with plenty of fresh air and sunshine—an atmosphere that attracted a clientele “of a very high class, comprising, as it does for the most part, the wealthy and nobility.” Due to this, the acquisition officers noted that the rents were quite high in Royat, but the environment seemed appropriate to them for a hospital due to the town’s reputation as a health resort and the advertised healing properties of its thermal springs, which “are taken for gout, rheumatism, gravel, kidney and bladder trouble, and… anemia, blood trouble, diabetes and dyspepsia.” With this in mind, the acquisition officers rented eleven buildings—eight hotels, one villa, one casino, and one garage—for the purposes of establishing a base hospital in the town. They left detailed instructions for the officers of Base Hospital No. 30 regarding the costs of tram fares to the nearby city of Clermont, the costs of maintaining and operating telephone service in the rented buildings, and how to go about securing sewage and garbage disposal. They were even so helpful as to provide the locations of local laundries and markets and to coordinate with local restaurants and cafes to ensure that price lists were printed in English as well as French “to prevent the unfair exploitation of foreigners.” However, they advised the officers of Base Hospital No. 30 that, as elsewhere in France, manpower for labor was in drastically short supply as almost all of it was involved in the war effort. They warned that the enlisted men of the hospital unit would likely be tasked with “street cleaning and watering and removal of rubbage and waste,” for which the local municipality would be grateful.

In short, while the acquisition officers obviously considered Royat to be an ideal location for a hospital due to its atmosphere, access to clean water, location relative to a railway, and the availability of seemingly suitable buildings—assuming, of course, that one believes a hospital and a resort hotel are sufficiently alike as to seem suitable—they noted that Base Hospital No. 30 would have its work cut out for it. The acquisition officers had done what they could and moved on. The rest would be up to the medical personnel, some of whom arrived in Royat on the morning of May 7, 1918, and immediately set about the task of transforming the sleepy spa village into a modern medical facility.

The nurses of Base Hospital No. 30 were diverted to Vichy, France for a short stay while the men went ahead to prepare the hospital site. In Vichy, the nurses were attached to Base Hospital No. 1, which had organized out of Bellevue Hospital in New York City. While there, the nurses tended to a number of personnel who had contracted measles, which head nurse Arabella Lombard described as “a childhood pleasure evidently foregone in younger days.”

In Royat, Base Hospital No. 30’s officers immediately noted that the requisition team had secured the most undesirable hotels in the town and that the task before them to prepare the site to receive patients was indeed Herculean. The kitchens, primarily located in hotel basements, were particularly bad. Lt. Col. Kilgore later recalled in The Record that the kitchen in the hotel Continental was a veritable dungeon, but it had to be used because the adjoining mess hall was the only place large enough to install the main patients’ kitchen. Unfortunately, concerns about the kitchens were only the beginning.

The electrical supply, generated by hydraulic powerplants and initially reported to be ample, was found to be woefully insufficient in the dry summer months, leaving the hospital without electricity three days out of every seven and without power to run the new X-ray and laboratory incubator equipment. The hospital personnel installed multiple gas lines only to find that France’s coal scarcity reduced the gas pressure to such a point that the lines were practically useless. Worse, the water supply to the hotels, intended for a few dozen tourists, was inadequate for the needs of a several-hundred bed hospital. After installing a make-shift shower and bath system, hospital personnel discovered that it could not be operated without completely depriving at least one of the hotels of water entirely, and it was necessary to carry water in buckets up several flights of stairs in order to operate toilets and wash dishes. In spite of these complications, hospital personnel were able to establish well-prepared surgical clinics and patient wards, but the continuous complications they encountered certainly made the work more difficult than they anticipated.

The many complications were frustrating, “but worse than all these together was the hopeless inadequacy of the drainage system,” reported Lt. Col. Kilgore. Only two of the buildings had direct access to sewer lines, with the rest relying on antiquated cesspool systems that were, again, intended to service the needs of at most a few dozen tourists. Kilgore noted that, “even in our condition of what we felt to be disgraceful water economy, it was evident that the numbers we put in the buildings and the use they made of water was greatly in excess of that contemplated by those who have used the buildings heretofore as summer hotels and boarding places. And very shortly after our hospital became open to patients, our cesspools began to overflow.” This problem was compounded on multiple levels. First, the only way to deal with an overflowing cesspool was to have it pumped by the Societe d’Assainissement of Clermont, whose horse-drawn steam pump and tank wagons (dubbed “honey wagons” by the Americans) had to serve the entirety of Clermont and Royat and so required three weeks’ advance notice for services. This was an impossible situation for Base Hospital No. 30, as the cesspools would overflow again as soon as seven days after they had been emptied, and even when the honey wagons could be secured, they were insufficient to empty all the hospital’s cesspools at one time, often leaving the work half done at best. And if the misery of overflowing cesspools alone was not enough, the pools were often located directly under the hotel basements, which often housed the kitchens, including the main patients’ kitchen in the basement of the Continental. Thus, when the Continental’s cesspool overflowed, it did so directly into the newly refurbished main patients’ kitchen and dining hall. The officers and enlisted men attempted to deal with these issues as they were able, but they often lacked the tools and experience necessary to properly tackle the various tasks. The problems grew so great that the Army was forced to divert an engineer detachment from the front lines in August to install better drainage, additional cesspools, showers, wash troughs and heating stoves.

The nurses arrived from Vichy on May 23 and found that the hospital was far from the promised state of preparedness required. They immediately set to work sanitizing the hotels to “get them ready for the boys from the front,” as Arabella Lombard put it. The nurses scrubbed the rooms so thoroughly that one of the officers remarked that “three coats of paint were scrubbed off before we considered the buildings ready for occupancy.” It was hard work that left the nurses weary and sore, but there were breaks. On May 28, the nurses were invited to celebrate Decoration Day—the precursor to Memorial Day, celebrated in honor of the Civil War dead through the decoration of grave markers—with an aviation unit stationed in nearby Clermont. They observed an afternoon of sports, enjoyed a buffet dinner outside the Red Cross Headquarters overlooking the hills and the setting sun, and were entertained by a band concert, speeches, and dancing after dark.

By June, the men and women of Base Hospital No. 30 were still dealing with the cacophony of unforeseen difficulties associated with occupying Royat. The cesspools were still overflowing from time to time, the water and electrical supplies were still insufficient, the enlisted men were still helping clean the streets of Royat and installing new kitchens, and some of the old hotel rooms remained to be cleaned and refitted for the purpose of housing patients, but the hospital was operational enough to receive its first trainload of patients on June 12, 1918. Thankfully, the 360 patients aboard the train were convalescent for the most part, but even so these men represented a significant difficulty for hospital staff as the kitchen installations were not yet complete.

After more than a year of preparation for the deployment to France, the men and women of Base Hospital No. 30 found themselves scrambling to prepare their ad hoc hospital to receive patients, and time had run out. Patient trains were arriving, and the hospital was about to get very busy indeed. Approximately five-hundred kilometers north of Royat, near Château-Thierry, the German Spring Offensive of 1918 was grinding forward as the Germans attempted to cross the Marne River. Standing in their way were the men of the 5th and 6th Marines and the 9th and 23rd Infantry. The resulting fighting was among the most intense experienced by any Americans in the war, and as Base Hospital No. 30 attempted to figure out how they were going to feed their first trainload of convalescent patients, several more hospital trains carrying troops fresh from the front lines and in need of surgery were making their way to Royat.

In Part Three of this four-part blog installment, we will explore what Lt. Col. Kilgore characterized as the “Work of the Hospital” in treating battlefield casualties and how they dealt with the Influenza Pandemic of 1918. We will also discuss the experiences of the forward-deployed surgical teams led by Lt. Col. Alanson Weeks (Surgical Team No. 50) and Maj. Herbert S. Thomson (Surgical Team No. 51), who operated under extreme conditions between June 1918 and the Armistice in November.

Figures:

5 – “Loading at Brest for a Long Journey,” circa 1918, Woolsey (John Homer) Papers, UC San Francisco, Library, Special Collections, Calisphere, https://calisphere.org/item/8caee2bc-704e-4647-9218-d32ed2a4d9c8/, accessed May 21, 2018.

6 – “Royat Advertisements,” author’s compilation from “Affiche Chemin de Fer D’Orleans Auvergne Geo Dorival,” circa 1910 (left) and “Royat Vintage Poster” by Gustave Fraipont, c. 1900 (right).

7 – “Base Hospital #30 at Royat, France,” ca. 1917-1919, Woolsey (John Homer) Papers, UC San Francisco, Library, Special Collections, Calisphere, https://calisphere.org/item/2cb5cbf5-d0c0-412a-9e15-a161a291d1e2/, accessed May 21, 2018.

8 – “Surgical Clinic, Metropole Hotel,” 1918, Base Hospital #30 Collection, UC San Francisco, Library, Special Collections, Calisphere, https://calisphere.org/item/b7ca9276-989e-468f-adb2-ece162e4ad01/, accessed May 21, 2018.

9 – “Base Hospital #30 Nurses,” 1918, Woolsey (John Homer) Papers, UC San Francisco, Library, Special Collections, Calisphere, https://calisphere.org/item/500d7be8-79c5-430e-86f6-f3a15d3a0d87/, accessed May 21, 2018.

10 – “Hospital Train with Hiram Miller and ‘Rug’ Ruggles,” 1918, Woolsey (John Homer) Papers, UC San Francisco, Library, Special Collections, Calisphere, https://calisphere.org/item/f4ec0c1f-b30c-48b4-9746-2d34420fcc4d/, accessed May 21, 2018.