Random Forests and Datashare at the CDL Code Jam

The California Digital Library hosted a code jam earlier this month at the Oakland City Conference center.

This gathering brought together librarians and developers from University of California campuses for a series of working meetings with an eye toward system-wide projects, especially involving data curation.

In the spirit of an informal, code-jam style meeting, I presented a bit on my recent experiments using machine learning to categorize data. As a starting example, I applied a random forest to suggest subject keywords for data sets uploaded to the recently launched Datashare website.

A researcher who uploads a data set to Datashare through the datashare-ingest app is prompted to provide keyword information along with other metadata describing the data set. One relatively common keyword is “middle-aged”.

http://datashare.ucsf.edu/xtf/search?f1-keyword=Middle%20Aged

For the code jam, I showed one possible way to use these existing records to train a random forest to determine whether a new dataset should be tagged with the keyword “middle-aged”.

For this particular exercise, there isn’t a great deal of data – we’d need a much larger data set to really apply this. However, the small dataset has some advantages for exploring the use of random forests, as it’s possible to visually inspect the input and gain a better understanding of how the random forest categorization is working.

“Middle aged” is a relatively good term to use for an experiment, as it isn’t as obvious (or as objective) as more technical terminology. Highly technical keywords often show up repeatedly in the data set, and are often very predictable. More subjective and less technical keywords such as “middle aged” may apply to a wide range of subjects, and they may show up in some records but not in others, as some researchers include the keyword and others don’t. Random Forest classification can be particularly useful in this case, as these subjective keywords often apply even when they don’t show up in the description, title, or technical methods for a particular dataset.

For example, here’s a record tagged with the keyword “middle aged”.

To get started, I used a simple “bag of words” approach, with a small modification. Rather than taking a very large bag of words for all records (or a sample of all records) in Datashare, I limited the count to words that show up in records with the keyword “middle aged”. Here are the most common of those words, with their counts:

data: 14
changes: 12
disease: 12
alzheimer: 11
diffusion: 10
dti: 9
subjects: 9
images: 9
slice: 9
brain: 8
ftd: 8
imaging: 8
axial: 7
white: 7
matter: 7
tracts: 7
reductions: 7
acquired: 7
gwi: 7
center: 7
diffusivity: 7
thickness: 6
type: 6
symptoms: 6

(For the full list, run the wordcount_summary.py script in the GitHub repo.)
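As a rough illustration of the restricted word count described above, here is a minimal sketch using only the Python standard library. It is not the wordcount_summary.py script itself: the sample records, the stop word list, and the function names are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical records; the real ones come from Datashare metadata.
records = [
    {"text": "Diffusion imaging data of brain changes",
     "keywords": ["Middle Aged"]},
    {"text": "Brain imaging data in Alzheimer disease",
     "keywords": ["Middle Aged", "Aged"]},
    {"text": "Survey data on library usage",
     "keywords": ["Libraries"]},
]

# A tiny, illustrative stop word list (an assumption, not the real one).
STOP_WORDS = {"of", "in", "on", "the", "and", "a"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]


def vocabulary_counts(records, keyword):
    """Count words only across the records tagged with the given keyword."""
    counts = Counter()
    for record in records:
        if keyword in record["keywords"]:
            counts.update(tokenize(record["text"]))
    return counts


counts = vocabulary_counts(records, "Middle Aged")
for word, n in counts.most_common(3):
    print(f"{word}: {n}")
```

Words that only occur in untagged records (such as "library" in this made-up sample) never enter the vocabulary, which keeps the feature list short and focused on the keyword in question.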

I used this bag of words to build the input for a random forest. For each record, I created a vector holding the count of each of the most common words, using the title, abstract, and methods as text fields.

Here’s the bag of words for a single record that includes the “middle-aged” subject tag:

ark+=b7272=q67p8w9z
data 1
changes 7
disease 0
alzheimer 0
diffusion 6
dti 5
subjects 2
images 4
slice 0
brain 2
ftd 0

And here’s the bag of words for a single record that doesn’t contain this keyword:

data 5
changes 6
disease 4
alzheimer 4
diffusion 0
dti 0
subjects 16
images 1
slice 0
brain 10
ftd 0

Each of these records is converted into a vector containing the word counts, along with information about whether it was tagged with the keyword “middle-aged” (for the Python scikit-learn library, I expressed this as a binary 0 or 1).
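The conversion might look something like the sketch below. The vocabulary is just the first few words from the list earlier in this post, and the function names and sample text are hypothetical, not taken from the scripts in the repo.

```python
import re

# The first few common words from the list above, in a fixed order.
VOCABULARY = ["data", "changes", "disease", "alzheimer", "diffusion", "dti"]


def to_vector(text, vocabulary=VOCABULARY):
    """Return one word count per vocabulary term, in vocabulary order."""
    words = re.findall(r"[a-z]+", text.lower())
    return [words.count(term) for term in vocabulary]


def to_label(keywords, keyword="Middle Aged"):
    """Return 1 if the record carries the keyword, otherwise 0."""
    return 1 if keyword in keywords else 0


# Made-up record text, for illustration only.
text = "DTI data show diffusion changes; diffusion changes were widespread."
vector = to_vector(text)
label = to_label(["Middle Aged", "Brain"])
print(vector, label)
```

Keeping the vocabulary in a fixed order matters: position 0 must mean "data" in every vector, or the classifier would be comparing counts of different words.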

A categorization problem like this, where a bag of words is used as the basis for determining whether a record belongs in a particular category, can be approached with a number of different supervised learning techniques. Logistic Regression and Decision Trees are common approaches, and Random Forest is a particularly accessible and often effective algorithm for many categorization problems. I used a Random Forest classifier here, though other algorithms might turn out to be more effective.

For this application, I converted each record into a vector representing the bag of words, with each position representing a common term, and each value representing the number of instances of that word. I ignored a number of stop words and common terms, but there’s a lot more that could be done here, especially around identifying common phrases rather than single words. Each vector, along with an indicator representing whether this record was tagged with the subject “middle aged”, is then used to train a random forest classifier.

The rforestDS-Middle-Age.py script contains Python code using scikit-learn to train and evaluate a random forest classifier on the Datashare records for the “middle aged” keyword tag, using the strategy described above.
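For readers who haven't used scikit-learn, the training step can be sketched roughly as follows. This is not the contents of rforestDS-Middle-Age.py: the vectors and labels are invented, and the parameter values (100 trees, a fixed random_state) are assumptions chosen for the example.

```python
from sklearn.ensemble import RandomForestClassifier

# Invented word-count vectors; columns stand for the counts of
# "data", "changes", "diffusion", and "dti". Not real Datashare records.
X = [
    [1, 7, 6, 5],
    [2, 5, 4, 3],
    [0, 6, 5, 4],
    [5, 6, 0, 0],
    [4, 1, 0, 0],
    [6, 2, 1, 0],
]
# 1 = tagged "middle aged", 0 = not tagged.
y = [1, 1, 1, 0, 0, 0]

# random_state is fixed here only so the sketch is repeatable.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Mean accuracy on the same records the forest was trained on.
train_score = forest.score(X, y)
print("Train data score:", train_score)

# Suggest a label for a new, unseen word-count vector.
prediction = forest.predict([[1, 6, 5, 4]])
print("Suggested label:", prediction[0])
```

Leaving out random_state lets the forest draw fresh random samples on every run, so the score and predictions can shift from one run to the next.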

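Note that if you run this multiple times, you’ll get slightly different output, since each tree in a random forest is built from a random sample of the records and considers random subsets of the features. This variation is probably amplified by the relatively small number of training samples.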

The train data score indicates how well the random forest categorizes the records that were used to train it. Because of the small sample size and relatively specific vocabulary, the score is fairly high.

Train data score: 0.941176470588

While it is interesting to observe the Random Forest’s assessment of how it performs on its own training set, a common practice is to split the training data into a training set and testing set (often at a two-thirds training, one-third testing ratio). You can then use the testing data (which was not used to build or train the random forest) to evaluate the accuracy of the classifier. I didn’t in this particular case, as the training set was very small and this is a small experiment/exercise, but it would be an important step on a larger dataset where we plan to make real use of a classifier.
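A two-thirds/one-third split of that kind could be sketched as below, using only the standard library. The function name, the seed, and the placeholder data are assumptions for illustration; scikit-learn also provides a train_test_split helper in sklearn.model_selection for the same job.

```python
import random


def split_train_test(vectors, labels, test_fraction=1 / 3, seed=0):
    """Shuffle record indices, then hold out a fraction of them for testing."""
    indices = list(range(len(vectors)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    return (pick(vectors, train_idx), pick(vectors, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Placeholder vectors and labels, not real Datashare records.
vectors = [[i, i % 3] for i in range(12)]
labels = [i % 2 for i in range(12)]

X_train, X_test, y_train, y_test = split_train_test(vectors, labels)
print(len(X_train), "training records,", len(X_test), "testing records")
```

The classifier would then be trained on X_train and y_train only, and scored on X_test and y_test, so the score reflects records it has never seen.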

Random Forests can also estimate the relative importance of the features (in this case, the importance of the word count for each of the most common words in determining whether a record has been tagged with the subject “middle aged”). Here’s the feature importance estimate for a run of the random forest. The numbers will change slightly each time the forest is trained.

data: 0.115221184049
changes: 0.0497659524022
disease: 0.0414350226064
alzheimer: 0.0438029075727
diffusion: 0.0870807358224
dti: 0.0287297792369
subjects: 0.0548861615427
images: 0.0783953700445
slice: 0.401884434536
brain: 0.0378775508884
ftd: 0.0609209012994

Although such a small sample size isn’t ideal for building a useful classifier, it can be illuminating, as the data set is small enough to hint at why certain words are so important. “Slice”, for instance, probably wouldn’t be such a strong predictor for whether a record should be tagged as “middle aged” in a larger data set. This is almost certainly a quirk related to our small sample size.
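To make the imbalance easier to see, here is a small sketch that ranks the estimates listed above and checks their total. The numbers are copied from that run; the variable names are mine.

```python
# Feature importance estimates copied from the run shown above.
importances = {
    "data": 0.115221184049,
    "changes": 0.0497659524022,
    "disease": 0.0414350226064,
    "alzheimer": 0.0438029075727,
    "diffusion": 0.0870807358224,
    "dti": 0.0287297792369,
    "subjects": 0.0548861615427,
    "images": 0.0783953700445,
    "slice": 0.401884434536,
    "brain": 0.0378775508884,
    "ftd": 0.0609209012994,
}

# Rank the words from most to least important.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
total = sum(importances.values())

for word, score in ranked:
    print(f"{word:10s} {score:.3f}")
print(f"total      {total:.3f}")
```

The estimates add up to 1, so each value can be read as a share of the whole: “slice” alone accounts for roughly 40 percent, more than three times the next word, “data”.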

The output for this script is stored in the “categories” folder in a file named “1-Middle-Aged.txt”.

ark+=b7272=q6154f00,""
ark+=b7272=q61z429d,""
ark+=b7272=q62z13fs,"Middle-Aged"
ark+=b7272=q65q4t1r,""
ark+=b7272=q66q1v54,"Middle-Aged"
ark+=b7272=q67p8w9z,"Middle-Aged"
ark+=b7272=q6bg2kwf,""
ark+=b7272=q6cc0xmh,"Middle-Aged"
ark+=b7272=q6h41pb7,""
ark+=b7272=q6kw5cxv,"Middle-Aged"
ark+=b7272=q6mw2f2n,""
ark+=b7272=q6pn93h6,""
ark+=b7272=q6qn64nk,"Middle-Aged"
ark+=b7272=q6rn35sz,""
ark+=b7272=q6td9v7j,""
ark+=b7272=q6x63jt1,""
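Assuming each line pairs a record identifier with the tag the classifier suggests (or an empty string when it suggests none), lines in this format could be produced with something like the sketch below. The function name is hypothetical and the predictions are made up; this is not the code from the repo.

```python
def format_rows(identifiers, predictions, tag="Middle-Aged"):
    """Pair each identifier with the tag when predicted (1), else an empty string."""
    rows = []
    for identifier, predicted in zip(identifiers, predictions):
        value = tag if predicted == 1 else ""
        rows.append(f'{identifier},"{value}"')
    return rows


# Two identifiers from the listing above, with made-up 0/1 predictions.
rows = format_rows(["ark+=b7272=q6154f00", "ark+=b7272=q62z13fs"], [0, 1])
print("\n".join(rows))
```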

To apply the random forest to all subject tags, run the rforestDS.py script. You should see similar output for each keyword tag, along with the training data score and feature importance estimates. To assemble all the estimates into a single file, run the MergeFiles.py script.

GitHub repository for this exercise

GitHub repository for Dash (formerly Datashare)