New Archives Intern: Elizabeth Popiel

Today’s post is an introduction from Elizabeth Popiel, our newest intern here in the Archives who will be working on piloting and testing some of the key pieces of our digital forensics lab and workstations.

Portrait of Elizabeth Popiel.

Elizabeth Popiel

Hello out there, readers! My name is Elizabeth Popiel and I’ll be interning at the UCSF Archives & Special Collections this summer, working with some of the early born-digital collections here in the Library. I’m a second-year graduate student in the School of Information at the University of Michigan in Ann Arbor, with a concentration in Digital Curation, Archives, and Human-Computer Interaction. I’ve always loved the exploration and discovery part of any research project, and I hope to do a little of that here this summer as well.

I’m enjoying being back in the Bay Area before heading back to the Midwest for my last year of school. I love road tripping along the coast and seeing everything out there, from the redwoods to the historic forts, museums, and other interesting locations. I was born in Canada and have traveled extensively, from Bern to Tasmania, Singapore to Beijing, and back again. It’s great to see and learn perspectives that differ from your own, and to appreciate them when you approach your work, especially when trying to figure out a puzzle or sort through a collection.

In the past I taught English overseas, worked in broadcasting, and gained experience with both hardware and software in Silicon Valley. I’m an old-school gamer and I still love text adventures, joystick-based games, and SCUMM engine games. Figuring out how to make them work on newer machines is always a challenge!

I like the challenge of working in research and preservation for born-digital archival collections, and at UCSF I’m hoping to gain practical experience in this area. I’ll assist in getting the digital forensics lab up and running for collections capture, processing, and use, as well as test-processing some of the collections. It’s my hope that I can better understand how to work with active collections and how digital archival models can be adapted to different and unique libraries and archives such as UCSF.

In archives, my passion lies in the challenges that lie ahead in digital curation, forensic work, and audiovisual materials. One of the reasons working with UCSF Special Collections interests me is that so many collection pieces need attention in order to remain usable for future generations. Everything from floppy disks with key scientific notes, to spreadsheets containing experiment setups in ontological medicine, to email communications that represent negotiations and crucial strategies during the height of the San Francisco AIDS epidemic – all of these are important parts of the history of UCSF, and I’m excited to contribute to preserving that legacy.

Experiments with Digital Tools in the Archives — OCR

Working on digital “stuff” in the archives is always fascinating, because it blurs the borders between digital and physical. Most of the work that takes up my time is at these borders. Physical stuff requires lots of human touches to transition to “digital,” and digital stuff similarly requires lots of tending by humans to ensure that it is preserved physically. After all, the 1s and 0s are stored physically somewhere, even if in the cloud or in DNA.

We’re currently working on several projects to convert physical materials to digital text. The huge quantities of rich and complicated textual material in archival collections are full of potential for use as data in both computational health research and digital medical humanities work, but to be usable for these kinds of projects the material needs to be converted to digital text or data so that it can be interpreted by computers. To get to this point the documents must be scanned, and the scanned documents must either be transcribed, which can be immensely labor intensive, or converted directly by software that can perform Optical Character Recognition, or OCR. One of our projects using OCR to extract text from a document provides a fascinating look into the world of computer vision.

A pen and ink illustration of the lungs and a lymph gland from the Ralph Sweet Collection of Medical Illustrations

An example of the illustrations in the Ralph Sweet Collection

The Ralph Sweet Collection of Medical Illustration contains extraordinary examples of the work of one of the most renowned medical illustrators in the United States, so we’re working on digitizing the collection and putting it online. To do this we need detailed metadata about each illustration: the kind of information you might expect to find in a catalog record, such as title, date, and author. Currently this metadata for the Sweet Collection exists only in the form of a printed index that was written on a typewriter. We can scan the index, but we do not have the labor to transcribe each of the 2,500 or so entries. This is a job for OCR.

The image below shows what a page of the Ralph Sweet index looks like. This is the metadata that we want to be able to extract and turn into digital text so that it can be understood by a computer and used as data.

A page of a type-written index of the Ralph Sweet Collection, showing metadata about each illustration in the collection.

A page of the index for the Ralph Sweet Collection.

One of the first problems we encountered in attempting to extract text from this document is a classic difficulty of computer vision. As English-speaking humans, we know by looking at this document that it contains three columns, and that the order in which to read the page is top to bottom by column, starting on the left and moving right. To a computer, however, it is simply a page full of text, and there is no way to know whether the text is broken into columns or whether each line should be read all the way across the page. This seemingly simple task presented a difficulty for most of the software that we tested, but we found one program that could identify the columns easily: Tesseract, which was originally developed in the 1980s and continues to be a good open-source tool for performing OCR.
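The difference between the two reading orders can be sketched with a toy three-column layout. The words below are invented placeholders, not text from the actual index:

```python
# A toy page laid out as three columns of text. Each inner list is one
# printed row; each row holds one word from each of the three columns.
rows = [
    ["Heart,", "Lung,", "Kidney,"],
    ["pen",    "ink",   "wash"],
    ["1948",   "1952",  "1960"],
]

# Naive reading order: straight across each row, left to right.
# This interleaves the three columns and scrambles every entry.
row_major = [word for row in rows for word in row]

# Correct reading order for columnar text: top to bottom within a
# column, then move one column to the right.
column_major = [row[col] for col in range(3) for row in rows]

print(row_major)     # columns interleaved
print(column_major)  # each column read as a unit
```

Reading row-major splices the three entries together; reading column-major keeps each entry intact, which is exactly the decision the OCR software has to get right before it interprets a single letter.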

If we feed the above page into Tesseract, we get mostly recognizable text, which in itself is pretty miraculous when you think about it. Looking at the text, though, it quickly becomes clear that it is not an exact transcription of what’s on the page. There are misspellings (“Iivev”), and some chunks of text have been separated from the entry in which they should appear (“horizontal”).

An image of the text output of the software Tesseract showing errors in transcription.

An example of the text extracted from the Ralph Sweet Collection Index by Tesseract.

Digging into the way that Tesseract (and OCR software more generally) works can help us begin to understand why these errors are cropping up. Plus, it looks really cool.

OCR programs have to go through a set of image-manipulation processes to help them decide which marks on the page are text — and hence should be interpreted — and which are other marks that can be ignored. This all happens behind the scenes, and usually involves deciding which parts of the image are background and blurring them out, increasing the image contrast, and making the image bi-tonal so that everything on the page is only black or white. Then the computer can trace the black pixels in the image to get a series of shapes that it can begin attempting to interpret as text. The image below shows the shapes that Tesseract has identified as letters and traced out for interpretation. Each new color indicates that the computer believes it has moved on to a new letter.

A page of colorful text on a black background illustrating the text that has been automatically traced from the Ralph Sweet Index by the computer program Tesseract.

The result of Tesseract tracing the letters it has interpreted. Each new color is something that’s been identified as a new letter.
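The bi-tonal step described above can be sketched in miniature: pick a threshold and push every grayscale pixel to pure black or pure white. The pixel values and the hand-picked threshold here are invented for illustration; real OCR engines choose thresholds adaptively:

```python
# A tiny 3x4 grayscale "image": 0 is black ink, 255 is white paper,
# values in between are shadows, paper texture, or faint marks.
pixels = [
    [250, 245,  30, 240],
    [248,  20,  25, 235],
    [251, 244, 242, 238],
]

THRESHOLD = 128  # chosen by hand for this sketch

# Bi-tonal version: anything darker than the threshold becomes pure
# black (0), everything else becomes pure white (255).
binary = [[0 if value < THRESHOLD else 255 for value in row] for row in pixels]

for row in binary:
    print(row)
```

After this step every pixel is unambiguously ink or paper, which is what makes the letter-tracing shown in the image above possible.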

Interestingly, when comparing the computer tracing of the letters to the original image you can see that Tesseract has already made the assumption that the black spaces from the three-hole punch in the top of the page are not letters, and thus it should not bother to trace them. Lucky for us, that’s correct.

Next the computer has to take all these letters and turn them into words. In actual practice it’s not quite this simple, but basically the computer iterates on each letter identification it believes it has made by testing whether the resulting word is in its dictionary, and thus whether it is likely to be a real word. If the combination of letters the computer thinks it sees does not form a word, it goes back, makes a new guess about the letters, and tests again, and so on. Part of this process is chunking the letters into words using their actual spacing on the page. Below you can see an image of how Tesseract has begun to identify words using the spaces between lines and letters.

A view of a page of the Ralph Sweet Index showing each word as a blue rectangle encompassing the space taken up by that block of text against a black background -- the "word" output of the OCR program Tesseract.

The “words” that the OCR software has identified on the page. Each blue rectangle represents a space that Tesseract has marked as containing a word.
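That guess-and-check loop can be sketched with a toy example: each character position carries a ranked list of candidate letters, and the engine keeps trying combinations until one matches a dictionary word. The candidate lists and the tiny dictionary below are invented, not Tesseract’s internals:

```python
from itertools import product

# Ranked candidate letters for each of five character positions, best
# guess first -- e.g. the first shape looks most like "l" but could be "I".
candidates = [["l", "I"], ["i"], ["v"], ["e", "c"], ["r", "v"]]

dictionary = {"liver", "lives", "lined"}  # toy word list

def best_word(candidates, dictionary):
    """Return the first candidate combination that is a dictionary word."""
    for letters in product(*candidates):
        word = "".join(letters)
        if word.lower() in dictionary:
            return word
    # No combination is in the dictionary: fall back to the top guesses.
    return "".join(c[0] for c in candidates)

print(best_word(candidates, dictionary))
```

Note the failure mode this creates for our index: when no candidate combination is a dictionary word, as with names, abbreviations, and catalog numbers, the engine has nothing to check against and must simply trust its shape guesses.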

In addition to checking the word against its dictionary, though, most OCR programs also use the context of the surrounding words to attempt a better guess about the text. In most cases this is helpful: if the computer has read a sentence that says “the plant requires wader,” it should be relatively easy to decide that the last word is actually “water.” In our case, though, this approach breaks down. The text we want the computer to extract is not sentences but (meta)data, and the meaning of the language has little influence on how each individual word should be read. One of our next steps will be figuring out how to instruct Tesseract that context should carry little weight in its word-identification decisions.
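One possible lever here, assuming our understanding of Tesseract’s configuration is right, is its pair of dictionary variables, which can be set to 0 in a config file so the engine stops nudging its guesses toward dictionary words:

```
load_system_dawg 0
load_freq_dawg 0
```

If that file were saved under Tesseract’s configs directory as, say, `no_dict` (a name we’ve made up), it could be passed by name on the command line along with the image. Whether turning the dictionaries off actually improves our results remains to be tested.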

Finally, as the OCR software interprets the text it also identifies blocks of words that it believes should be grouped together, like a paragraph. Below you can see the results of this process with Tesseract.

A view of the different elements of Tesseract’s text identification showing letters traced in primary colors and contained in yellow bounding boxes, words set against blue rectangles outlining the space they encompass, and blocks of text outlined in larger bounding boxes and numbered -- all of this set against a black background.

This view shows all of the elements of Tesseract’s word identification combined. Text has been traced in color, separate letters are contained in bounding boxes, words are contained in blue rectangles, and blocks are contained in larger bounding boxes and are numbered (though the numbers are a bit difficult to see).

A line has been drawn around each block of text, and each block has been given a number indicating the order in which the computer reads it. Here we can see the source of one of the biggest problems in the OCR-generated text from earlier: Tesseract is inaccurately excluding a lot of words from their proper blocks. In the image above, the word “Pen” is a good example. It is part of block 20, but it has been interpreted by the computer as its own block — block 21 — and so it appears in the text file after block 20. Attempting to solve this problem will be our next hurdle, and hopefully we can catch you up after we are successful.
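One way we might hunt for these stray blocks programmatically is to parse Tesseract’s tab-separated data output, which reports a block number for every word; a block containing a single word is a candidate mis-segmentation. The fragment below is an invented sample in the rough shape of that output, not real results from the index, and real output carries more columns (page, paragraph, line, bounding box, confidence):

```python
import csv
from collections import defaultdict

# An invented fragment shaped like Tesseract's TSV output: each word
# row records the block it was assigned to.
sample_tsv = """block_num\tword_num\ttext
20\t1\tHeart
20\t2\tand
20\t3\tlungs.
21\t1\tPen
22\t1\tKidney,
22\t2\tink
"""

# Group the words by their block number.
blocks = defaultdict(list)
for row in csv.DictReader(sample_tsv.splitlines(), delimiter="\t"):
    blocks[row["block_num"]].append(row["text"])

# Blocks holding a single word are suspects: words like "Pen" that
# were split away from their proper entry.
suspects = [num for num, words in blocks.items() if len(words) == 1]
print(suspects)  # -> ['21']
```

A heuristic like this would not fix the segmentation, but it could at least flag the entries that need human review.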

Using OCR to extract text from digital images can be a frustrating endeavor if accuracy is a necessity, but it is also a fascinating illustration of the way computers see and make decisions. Anytime we ask computers to perform tasks that interface with humans, we will always be grappling with similar issues.

New Sites in the UCSF Web-Archive

As discussed previously here, we’ve been working on expanding our web-archives presence across campus, and one of the developments we’re most excited about is getting the web-archiving process formalized in the centralized UCSF workflow for upgrading websites or retiring abandoned ones. Now that we have been successful in establishing this program, archiving a site is an official part of the website rollover or retirement process, which means we have a much better finger on the pulse of the UCSF web presence.

And as this process ramps up, we’ve been adding all sorts of fascinating UCSF websites to our collections, so we wanted to highlight a few recent acquisitions.

First is a complete copy of the website for the W. M. Keck Center for Noncoding RNAs, which is scheduled to be rolled over to a new platform soon. The Keck Center explores the 98.6% of the human genome that is “non-coding,” meaning it does not directly contain the code to create proteins. Since, in their words, most genetic research focuses on the protein-encoding genes — those genes whose purpose is clear — the area of non-coding RNA can be thought of as “genetic dark matter.” Even though the purpose of this genetic material is not clear, it still influences human health, and it is the mission of the Keck Center to figure out how.

screen shot of the homepage of the W.M. Keck Center for Non-Coding RNAs

Homepage of the archived version of the W.M. Keck Center website.

The lab uses mice in its research, modifying mouse stem cells and using mouse genes to examine the function of these non-coding RNAs. Conveniently, the lab website contains all the raw genetic data, as well as the experiment plans, images, and other associated data for these experiments. We’re excited about this capture because we were able to collect all this data at once and provide a snapshot of the lab’s work, complete with all the associated research materials. This is a huge help in tackling the problem of historic preservation of contemporary scientific work, and it even begins to address the very present problem of reproducibility in data-intensive and computing-intensive scientific research.

Additionally, another website we have recently captured illustrates the value of curating a selection of the UCSF institutional ecosystem all together: the site of the UCSF Institutional Animal Care and Use Committee. Say, for example, that in examining the archived site of the Keck Center, you also wondered what the legal protocols and procedures for research involving animal subjects were at the time, and whether the Keck Center was following them. With a little clicking around on the Wayback Machine you could quickly answer that question, and you would have a clear picture of where the Keck Center’s research fit into the larger legal and ethical questions, on campus and in the scientific community, about proper treatment of and care for the animals used in research.

screen shot of the homepage of the UCSF Institutional Animal Care and Use Committee

Homepage of the archived version of the UCSF Institutional Animal Care and Use Committee.

We look forward to continuing to build and enrich our web-archive collections, and remember that if you have a suggestion you can always request that we begin capturing your UCSF site!