Category Archives: data and analytics

Beyond Basic SQL

Thanks to everyone who attended the Beyond Basic SQL class at Mission Bay on April 25th!

This class was inspired by the questions and problems I’ve encountered while trying to work with various datasets at the Programming and Pizza sessions.  SQL (the language) and sqlite (the database) can be excellent tools for data analysis, but getting the data into a form where you can use it can be a real challenge.

In the first part of this class, we reviewed the basics of querying a single database table through Selection, Filtering, and Aggregation.  We used the “gapminder” dataset from Software Carpentry to  write queries and do analysis on worldwide per capita gdp, life expectancy, and population by country and continent.

The second part of this class focused on dealing with data that isn’t in a convenient, normalized format.  To do this, we analyzed the kind of spreadsheet researchers often download from the internet.  As a simplified example, we looked at a sample enrollment spreadsheet with information about students, courses, and reading lists.  We reviewed the first two normal forms, discussed how these forms can reduce ambiguity and redundancy, and transformed the non-normalized spreadsheet into a database in second normal form.

It was a lot of material to cover in two hours, thanks for sticking with it.  Here are the promised slides for everyone who wanted to try it out.

If you’re interested in trying this out on a real world example, take a look at the blog post below on data munging and excel.  The examples in that post  (you’ll need to use Python as well as SQL) will present strategies for getting an unstructured spreadsheet into a form where you can use sql, pandas, data frames, and other tools to analyze a dataset.

Class files are available on the ucsf-ckm dsi GitHub repo here.



Data Munging Addendum: The Long Way To Handle Comma Delimited Lists

In an earlier post, we discussed the issue of comma delimited lists within an excel spreadsheet. This way of one-to-many relationships in data can make it more difficult to build look up tables, run queries, and do other types of analysis. Although there are some concise coding approaches, both in SQL and pandas, sometimes you just want to [give up on trying to be clever and] reconstruct your data frame line by line. Here’s a quick overview on how to do this.

You can follow along, cut and paste into your own notebook, or view/checkout the code from github.

Let’s go ahead and build a pandas dataframe with comma delimited information in some of the cells.

import pandas as pd
import numpy as np

First, we’ll create three lists

ar1 = [1,2,3,4,5]
ar2 = [2,4,6,8,10]
ar3 = ['one','two,three,four','three,four,five','four,five','five']

Next, add the lists as columns to a dataframe

df = pd.DataFrame({'A' : ar1, 'B' : ar2, 'C' : ar3})


And, of course, if you query it through pandasql, you’ll get a single row with a comma delimited list

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
pysqldf("SELECT * FROM df WHERE A = 3")


Like last time, we want to have each value for C on a separate row, with corresponding values for A and B.

Here’s how to do this the computationally long and expensive way (which is, in fact, sometimes the way you have to do things – sometimes because performance doesn’t matter and you’re tired of trying to be clever, sometimes because the logic is so intricate that you have to knit it all together line by line anyway).

We’ll create three new arrays (again, to hold the column values).

a0 = []
a1 = []
a2 = []

Next, we’ll loop through each row of our existing dataframe. We’ll split on the the third row (we start counting from 0 in a list, so that will be at index 2). Splitting on the comma will create a new array with three new strings. We’ll add a new value to each column for each string (word) in that row.

for index, row in df.iterrows():
for s in row[2].split(','):

Now, let’s create a new data frame out of the three columns we just created

ndf = pd.DataFrame({'A' : a0, 'B' : a1, 'C' : a2})

Take a look



We can now query both dataframes using pandasql

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

Querying our original dataframe, you’ll see everything on one comma delimited line

pysqldf("SELECT * FROM df WHERE A = 3")


Quering our new dataframe, you’ll set a separate line for each entry

pysqldf("SELECT * FROM ndf WHERE A = 3")


It’s the long way to do this, but sometimes the long way can save you a lot of time.

Data Munging with Python, SQL, and Excel

So much of data science is, in fact, data munging and wrangling, moving and transforming it from one format to another. Data, when you’re fortunate enough to know where to find it, is almost never in the nicely organized format you need for your analysis. This is one of the reasons python is so popular among data scientists – it is a good language and environment for collecting, formatting, parsing, combining and splitting data from different sources.

Data frequently arrives in comma delimited files or excel spreadsheets. You can do certainly some analysis with spreadsheet operations, but odds are you’ll eventually want to load it into a data frame and use python (or R) for more meaningful analysis. This post is a write up of a few tips I learned from my recent efforts to wrangle some genomic data available on the web as Excel downloads. These spreadsheets presented a few common issues that arise when dealing with non-normalized data sets in single table format.

The exact nature of the data isn’t really the focus here, but for some context: researchers at UCSF often need information about a gene, variants of that gene, and the effect this Gene has on responsiveness of different tumors or conditions to treatment. There are a number of different places to find this data, a number of different datasets, and (of course) varied ways to search, filter, or query those datasets. Searching and finding can be a long, error prone, irritating, manual process.

We’ll use Python, Pandas, and Jupyter Notebook to try to get a better handle on some of this data. I’m assuming you already know how to open a 1) Jupyter Notebook and issue basic Python commands, and 2) Use pandasql to run SQL queries against a dataframe. (If you don’t, consider signing up for a Software Carpentry class or attend one of our Python/R workshops at Mission Bay).

Otherwise, you can follow the jupyter and python installation documents (I used anaconda for both).

A full jupyter notebook for the code in this post is available on the ucsf-ckm github repository.

Create a DataFrame from an Excel Spreadsheet

We’ll use a spreadsheet from The Precision Medicine Knowledge Base. To follow along, click on the “Download All Interpretations (Excel)” link.

Before analyzing, let’s load the excel spreadsheet into a pandas DataFrame. Open up a jupyter notebook (or just the python interactive shell)  to start.

First off, remember to include the pandas module…

import pandas as pd

Pandas has an easy method to load a spreadsheet (I’m assuming the spreadsheet is in your working directory, otherwise you’ll need to edit the path)

ipm = pd.read_excel("IPM_Knowledgebase_Interpretations_Complete_20160913-2207.xlsx")

Now take at your newly populated dataframe


You’ll see that your spreadsheet headers and data have become the column names and rows of a pandas dataframe. Let’s try using pandasql to run a query on it. First, load the module and globals (more information on this)

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

And try running a query. Let’s get the variants that match a particular Gene.

pysqldf("SELECT Gene, [Tumor Type(s)], [Variant(s)] FROM ipm WHERE Gene = 'PIK3CA'")

Note: you must use the brackets around Tumor Type(s) so the white space and parentheses around (s) won’t be interpreted as SQL.

Even without any additional data munging, you’re in a pretty good spot for analyzing your data. You have it in a dataframe, where you can run queries, python expressions, and pandas operations on it. However, there are a few issues, common to spreadsheets, that may make this data harder to work with and analyse.

Relational Databases frequently have “one to many” relationships. In this case, a Gene has a one to many relationship with Tumor Types, Variants, and Citations. Each Gene has an effect on multiple Tumor Types, each Gene can have multiple Variants, and the effect a Gene and Variant has on a Tumor Type can have multiple Citations.

This spreadsheet stores the data for one to many relationships in two different ways. For Genes to Tumor Types and Variants, the spreadsheet provides a comma delimited list in a single cell. For Citations, the spreadsheet adds tacks on a varying number of columns to the right side of the spreadsheet. Although this does provide the data, it can make the data harder to work with.

The next two sections will review techniques for converting comma delimited lists and multiple columns into a one-to-many lookup table.

Common Issue #1 – Comma Delimited Lists

You may notice that although the Genes are provided as single (atomic) values, other fields, such as Variants or Tumor Types are provided as a comma delimited list. This won’t be much of a problem if you want to find all Tumor Types associated with a Gene – the query is straightforward:

pysqldf("SELECT Gene, [Tumor Type(s)] FROM ipm WHERE Gene = 'CSF3R'")

You’ll get back a single row with a comma delimited list of Tumor Types, rather than a separate row for each Tumor Type, but you can parse that relatively easily.

Now, suppose you wanted to reverse this query, to find all genes that that match a particular Tumor Type. In this case, a query like the one above won’t work, as it will miss fields that have multiple Tumor Types separated by commas. Because SQL will look for an exact match, you’ll won’t get all the results for a query like this.

 pysqldf("SELECT Gene, [Tumor Type(s)] FROM ipm WHERE [Tumor Type(s)] = 'Diffuse Large B Cell Lymphoma'") 

Note that you only received a single row from this query, even though there are multiple records that match this tumor type.  SQL does provide a way to find sub-patterns in a text field. You can get all records (sort of/kind of/hand waving) with a LIKE query

pysqldf("SELECT Gene, [Tumor Type(s)] FROM ipm WHERE [Tumor Type(s)] LIKE '%Diffuse Large B Cell Lymphoma%'")

NOTE: you may not want the text truncated in your results. To handle this, set a property on your dataframe:

pd.set_option('display.max_colwidth', -1)

Although this works, you might want to split the comma delimited values into separate rows to create a Tumor_Type to Gene lookup table (perhaps to put it into first or higher normal forms As always, there are a number of different ways to do this. You can certainly do this through SQL and temporary tables, but since we’re in python and can access this table as a dataframe, let’s try a python solution.

First, let’s get the Gene and Tumor Type as a dataframe

tumor_types = pysqldf("SELECT Gene, [Tumor Type(s)] as Tumor_Type FROM ipm")

Next, we’ll split the comma delimited tumor_types into separate rows.

gene_tumor_types = pd.DataFrame(tumor_types["Tumor_Type"].str.split(',').tolist(), index=tumor_types["Gene"]).stack()
gene_tumor_types = gene_tumor_types.reset_index()[[0, 'Gene']]
gene_tumor_types.columns = ['Tumor_Type', 'Gene']

See stack overflow for a nice discussion of this solution and other pandas dataframe based solutions.

Take a look at the resulting gene_tumor_types data frame.


You now have an association from Gene to Tumor Type, with each tumor type as an individual row rather than as a comma delimited list. To get the Genes associated with a particular Tumor Type, we no longer need a LIKE query.

pysqldf("SELECT Gene, Tumor_Type FROM gene_tumor_types WHERE Tumor_Type = 'Diffuse Large B Cell Lymphoma'")

Wonderful! Except that… not so wonderful, it didn’t work – we’re missing data! There’s a big gotcha here. Compare the earlier LIKE query and this one. As an exercise, you might want to stop and try to figure out why (answer is in the next paragraph).

Common Issue # 2: Leading or Trailing White Space

This takes us to another common pitfall – white space! Don’t forget, an equals operator in SQL (and programming languages in general) is an exact match. “ Hello” and “Hello” do not match!

Take a look at the dataframe for gene_tumor_types  – you’ll notice many of the Tumor Types have leading whitespace. This prevents the exact match from occurring, though you will find them through like queries, which find it as a partial match. You can still them them through a LIKE query

pysqldf("SELECT Gene, Tumor_Type FROM gene_tumor_types WHERE Tumor_Type LIKE '%Diffuse Large B Cell Lymphoma'")

But that’s a hack and kind of defeats the purpose of creating a new lookup table. We should be able to get this through an equality operator. Let’s trim the whitespace from this column.

gene_tumor_types["Tumor_Type"] = gene_tumor_types["Tumor_Type"].str.strip()

And take a look at the list to see the whitespace has been removed


Now retry the LIKE query and the exact match query – you’ll see that you are now retrieving all the rows.

pysqldf("SELECT Gene, Tumor_Type FROM gene_tumor_types WHERE Tumor_Type = 'Diffuse Large B Cell Lymphoma'")

Common Issue # 3: Repeated Columns

Another common spreadsheet practice is to tack on a variable number of columns to store one-to-many data relationships. Take a look toward the end (right hand side) of the spreadsheet (or ipm dataframe)


Each row has one or more citations. This spreadsheet stores the one to many relationship by creating a new column for each set.

Unfortunately, this does make it more difficult to query, since we need to know in advance how many Citations to query. Furthermore, the column headers that hold Citations beyond the first one don’t have names, making the query less informative.

For instance, not all Citations have a multiple citations. To get the citations for Gene JAK1, we’d need to write:

pysqldf("SELECT Gene, Citations, [Unnamed: 7], [Unnamed: 8], [Unnamed: 9], [Unnamed: 10], [Unnamed: 11], [Unnamed: 12], [Unnamed: 13], [Unnamed: 14] FROM ipm WHERE Gene = 'JAK1'")

This query will return all the citations for Gene “JAK1”. However, if you run this query against Gene “MPL”, you’ll receive a value of “None” for several columns. By contrast, if you run this query against Gene “MYD88”, you’ll miss a number of citations that extend out to “Unnamed: 26”.

It would be more convenient to be able to write a query like this:

pysqldf("SELECT Gene, Citation from ipm”)

And receive a separate row for each citation.

Let’s create a lookup table for Gene and Citations. There are, as always, a number of different ways to accomplish this, through SQL or pandas. In this case, we’ll use SQL with a python loop to create a “UNION ALL” query.

query = "SELECT Gene, [Tumor Type(s)], [Variant(s)], Tier, Interpretations, Citations as Citation FROM ipm WHERE Citations != 'None'"
for i in range(7, 27):
    query += (" UNION ALL SELECT Gene, [Tumor Type(s)], [Variant(s)], Tier, Interpretations, [Unnamed: {val}] as Citation FROM ipm WHERE [Unnamed: {val}] != 'None' ".format(val=i))
query += ("ORDER BY Gene")

This approach uses a python loop to build a SQL statement. The UNION ALL statement combines the results of more than one query into a single output table. In this case, we are taking querying each Gene and Citation combination and outputting the results into separate rows. You can take a look at the full query (it’s long, so I’m not posting it here, just view the output of the “query” string in jupyter or interactive python).

Let’s look at the results of this table

gene_citations = pysqldf(query)

Let’s re-run the query for JAK1 and MPL.

pysqldf("SELECT Gene, Citation FROM gene_citations WHERE Gene = 'JAK1'")
pysqldf("SELECT Gene, Citation FROM gene_citations WHERE Gene = 'MPL'")

You are now able to get clean, multiple row results from the Python dataframe through a one-to-many table relation.

We’ll tackle a few more data sources for responsiveness of various tumors and conditions to genes and variants in future posts. Stay tuned.

And, as always, if you’d like to learn more about Python or R, please consider signing up for a Software Carpentry Class or attend one of our workshops at Mission Bay!

Always Be Collecting Feedback

As part of our ongoing efforts to get to know our users better, the UCSF Library web team decided we wanted some of that sweet, sweet microfeedback action, and so we deployed a very short satisfaction survey at back in July.

How Are We Doing tabAnytime a patron clicked the How Are We Doing button at the bottom of every page, they were asked a simple question: How would you rate your experience today? Patrons could let us know if they were satisfied or unsatisfied with their web experience.

Regardless of their answer, they’d be prompted to go into detail and provide some demographic information, but the only required question was whether they were satisfied/unsatisfied. Our hope was that keeping the survey short and to the point, and constantly available on every page, would encourage participation.

The not-so-secret goal of this survey structure was for the web team to learn directly from our patrons where they’re having problems, so that we make improvements to the website based on their needs and not our own assumptions. Our thinking was that a user frustrated enough to leave a comment was a user we’d want to hear from.

Enough Background Already, What Did We Learn?

help-btnThe stats below are from July 7 through August 23, 2015, the day before we introduced our new Help feature. We’re excluding overlapping dates from this analysis because the Help button began competing with the How Are We Doing button for user attention (more on this below), and we wanted to base our analysis on clean data.

Of the 201 responses received during that period, 65% had a satisfactory experience at our site. Hooray!

If we drill down to the 76% of respondents who shared their demographic information, the overwhelming number of responses came from UCSF affiliates (94%), with Staff leading the way, closely followed by Faculty, then Students. It’s likely the data was skewed by the summer months, and it’ll be interesting to see if the affiliation breakdown changes now that the fall semester is in full swing.


Patron satisfaction is awesome, but remember our not-so-secret goal was to learn why our users are unsatisfied. While only 20% of all respondents bothered to comment at all, our hypothesis about frustrated users being more likely to go into detail was correct: 87% of comments came from unsatisfied users. Hooray (really)!

Unsatisfied Users are More Likely to Comment

What’s Making our Users Unhappy?

Most of the frustration came from patrons who felt that navigating the site and finding what they needed was just too hard. 2nd prize for user frustration goes to remote access issues, with patrons expressing difficulties getting to articles using EZproxy, UC-eLinks, and VPN.

Connection errors and Library service issues (comments like you don’t have a book I want and my barcode number doesn’t work anymore) tied for 3rd place, and I was personally amused and humbled to know that 9% of the feedback was about how annoying they found the feedback survey popup window (removed after a few weeks).

Unsatisfied Users - Comments Breakdown
* If a patron gave feedback in more than one category, I picked the dominant issue.

So What’s Next?

We were fortunate to meet with some respondents, and we used their comments and other key takeaways from our satisfaction survey, the interviews we conducted with our customer-facing staff, and the LibQUAL 2013 survey to finalize the Library User Personas we’ll rely on as we make changes going forward.

We’ll keep our satisfaction survey going, but with the successful rollout of the Help feature, the time has come to modify how we ask for feedback. The How Are We Doing and Help buttons serve different purposes; unfortunately, the current design doesn’t make that distinction clear. Getting requests for help in the satisfaction survey gave us useful information before we rolled out the Help feature, but now it’s more appropriate that help me please comments get routed to the right person right away for faster customer service.

We’ll launch our more low-key request for feedback this week.

New Feedback Button
The new Feedback button will live at the bottom of every web page.

Random Forests and Datashare at the CDL Code Jam

The California Digital Library hosted a code jam earlier this month at the Oakland City Conference center.

This gathering brought together librarians and developers from University of California campuses for a series of working meetings with an eye toward system-wide projects, especially involving data curation.

In the spirit of an informal, code-jam style meeting, I presented a bit on my recent experiments using machine learning to categorize data. As a starting example, I applied a random forest to suggest subject keywords for data sets uploaded to the recently launched DataShare website.

Continue reading Random Forests and Datashare at the CDL Code Jam

On Metrics

Collecting metrics is important. But we all know that many metrics are chosen for collection because they are inexpensive and obvious, not because they are actually useful.

(Quick pre-emptive strike #1: I’m using metrics very broadly here. Yes, sometimes I really mean measurements, etc. For better or for worse, this is the way metrics is used in the real world. Oh well.)

(Quick pre-emptive strike #2: Sure, if you’re Google or Amazon, you probably collect crazy amounts of data that allow highly informative and statistically valid metrics through sophisticated tools. I’m not talking about you.)

I try to avoid going the route of just supplying whatever numbers I can dig up and hope that it meets the person’s need. Instead, I ask the requester to tell me what it is they are trying to figure out and how they think they will interpret the data they receive. If pageviews have gone up 10% from last year, what does that tell us? How will we act differently if pageviews have only gone up 3%?

This has helped me avoid iterative metric fishing expeditions. People often ask for statistics hoping that, when the data comes back, it will tell an obvious story that they like. Usually it doesn’t tell any obvious story or tells a story they don’t like, so they start fishing. “Now can you also give me the same numbers for our competitors?” “Now can you divide these visitors into various demographics?”

When I first started doing this, I was afraid that people would get frustrated with my push-back on their requests. For the most part, that didn’t happen.

Instead, people started asking better questions as they thought through and explained how the data would be interpreted. And I felt better about spending resources getting people the information they need because I understood its value.

Just like IT leaders need to “consistently articulate the business value of IT”, it is healthy for data requesters to articulate the value of their data requests.