Data Munging with Python, SQL, and Excel

So much of data science is, in fact, data munging and wrangling: moving data around and transforming it from one format to another. Data, when you’re fortunate enough to know where to find it, is almost never in the nicely organized format you need for your analysis. This is one of the reasons Python is so popular among data scientists – it is a good language and environment for collecting, formatting, parsing, combining, and splitting data from different sources.

Data frequently arrives in comma delimited files or Excel spreadsheets. You can certainly do some analysis with spreadsheet operations, but odds are you’ll eventually want to load it into a data frame and use Python (or R) for more meaningful analysis. This post is a write-up of a few tips I learned from my recent efforts to wrangle some genomic data available on the web as Excel downloads. These spreadsheets presented a few common issues that arise when dealing with non-normalized data sets in single table format.

The exact nature of the data isn’t really the focus here, but for some context: researchers at UCSF often need information about a gene, variants of that gene, and the effect this gene has on the responsiveness of different tumors or conditions to treatment. There are a number of different places to find this data, a number of different datasets, and (of course) varied ways to search, filter, or query those datasets. Searching and finding can be a long, error-prone, irritating, manual process.

We’ll use Python, Pandas, and Jupyter Notebook to try to get a better handle on some of this data. I’m assuming you already know how to 1) open a Jupyter Notebook and issue basic Python commands, and 2) use pandasql to run SQL queries against a dataframe. (If you don’t, consider signing up for a Software Carpentry class or attending one of our Python/R workshops at Mission Bay.)

Alternatively, you can follow the Jupyter and Python installation documents (I used Anaconda for both).

A full Jupyter notebook for the code in this post is available on the ucsf-ckm GitHub repository.

Create a DataFrame from an Excel Spreadsheet

We’ll use a spreadsheet from The Precision Medicine Knowledge Base. To follow along, click on the “Download All Interpretations (Excel)” link.

Before analyzing, let’s load the Excel spreadsheet into a pandas DataFrame. Open up a Jupyter notebook (or just the Python interactive shell) to start.

First off, remember to include the pandas module…

import pandas as pd

Pandas has an easy function for loading a spreadsheet (I’m assuming the spreadsheet is in your working directory; otherwise you’ll need to edit the path):

ipm = pd.read_excel("IPM_Knowledgebase_Interpretations_Complete_20160913-2207.xlsx")

Now take a look at your newly populated dataframe (for example, by evaluating ipm.head() in a notebook cell).


You’ll see that your spreadsheet headers and data have become the column names and rows of a pandas dataframe. Let’s try using pandasql to run a query on it. First, load the module and globals (more information on this)

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

And try running a query. Let’s get the variants that match a particular Gene.

pysqldf("SELECT Gene, [Tumor Type(s)], [Variant(s)] FROM ipm WHERE Gene = 'PIK3CA'")

Note: you must use the brackets around Tumor Type(s) so the white space and parentheses around (s) won’t be interpreted as SQL.
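
To see the quoting rule in isolation, here is a minimal sketch that uses Python's built-in sqlite3 module (pandasql runs its queries through SQLite by default). The table holds a single made-up row, and GENE_A and Example Tumor Type are placeholders rather than values from the spreadsheet.

```python
import sqlite3

# In-memory SQLite table whose column name contains a space and parentheses.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE ipm (Gene TEXT, "Tumor Type(s)" TEXT)')
conn.execute("INSERT INTO ipm VALUES ('GENE_A', 'Example Tumor Type')")  # made-up row

# Square brackets and double quotes both mark the whole name as one identifier.
bracketed = conn.execute("SELECT Gene, [Tumor Type(s)] FROM ipm").fetchall()
double_quoted = conn.execute('SELECT Gene, "Tumor Type(s)" FROM ipm').fetchall()
print(bracketed)
print(double_quoted)

# Without quoting, SQLite tries to parse the name as separate tokens and fails.
unquoted_error = None
try:
    conn.execute("SELECT Gene, Tumor Type(s) FROM ipm")
except sqlite3.OperationalError as exc:
    unquoted_error = exc
    print("Unquoted column name fails:", exc)
```

Either quoting style should work in pandasql queries as well; the brackets simply avoid nesting double quotes inside a Python string.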

Even without any additional data munging, you’re in a pretty good spot for analyzing your data. You have it in a dataframe, where you can run queries, Python expressions, and pandas operations on it. However, there are a few issues, common to spreadsheets, that may make this data harder to work with and analyze.

Relational Databases frequently have “one to many” relationships. In this case, a Gene has a one to many relationship with Tumor Types, Variants, and Citations. Each Gene has an effect on multiple Tumor Types, each Gene can have multiple Variants, and the effect a Gene and Variant has on a Tumor Type can have multiple Citations.

This spreadsheet stores the data for one to many relationships in two different ways. For Genes to Tumor Types and Variants, the spreadsheet provides a comma delimited list in a single cell. For Citations, the spreadsheet tacks on a varying number of columns to the right side of the spreadsheet. Although this does provide the data, it can make the data harder to work with.

The next two sections will review techniques for converting comma delimited lists and multiple columns into a one-to-many lookup table.

Common Issue #1 – Comma Delimited Lists

You may notice that although the Genes are provided as single (atomic) values, other fields, such as Variants or Tumor Types, are provided as comma delimited lists. This won’t be much of a problem if you want to find all Tumor Types associated with a Gene – the query is straightforward:

pysqldf("SELECT Gene, [Tumor Type(s)] FROM ipm WHERE Gene = 'CSF3R'")

You’ll get back a single row with a comma delimited list of Tumor Types, rather than a separate row for each Tumor Type, but you can parse that relatively easily.

Now, suppose you wanted to reverse this query, to find all genes that match a particular Tumor Type. In this case, a query like the one above won’t work, as it will miss fields that have multiple Tumor Types separated by commas. Because SQL will look for an exact match, you won’t get all the results for a query like this:

pysqldf("SELECT Gene, [Tumor Type(s)] FROM ipm WHERE [Tumor Type(s)] = 'Diffuse Large B Cell Lymphoma'")

Note that you only received a single row from this query, even though there are multiple records that match this tumor type. SQL does provide a way to find sub-patterns in a text field. You can get all records (sort of/kind of/hand waving) with a LIKE query:

pysqldf("SELECT Gene, [Tumor Type(s)] FROM ipm WHERE [Tumor Type(s)] LIKE '%Diffuse Large B Cell Lymphoma%'")
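
The difference between the two queries can be reproduced on a tiny stand-in table. This is a sketch using the standard-library sqlite3 module; the gene names and the Tumor X and Tumor Y entries are invented for illustration and are not rows from the spreadsheet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ipm (Gene TEXT, [Tumor Type(s)] TEXT)")
conn.executemany(
    "INSERT INTO ipm VALUES (?, ?)",
    [
        ("GENE_A", "Diffuse Large B Cell Lymphoma"),           # single value in the cell
        ("GENE_B", "Tumor X, Diffuse Large B Cell Lymphoma"),  # comma delimited list
        ("GENE_C", "Tumor Y"),                                 # unrelated tumor type
    ],
)

target = "Diffuse Large B Cell Lymphoma"

# Equality only matches cells whose entire content is the target string.
exact = conn.execute(
    "SELECT Gene FROM ipm WHERE [Tumor Type(s)] = ? ORDER BY Gene", (target,)
).fetchall()

# LIKE with wildcards also matches cells that contain the target inside a list.
partial = conn.execute(
    "SELECT Gene FROM ipm WHERE [Tumor Type(s)] LIKE ? ORDER BY Gene",
    ("%" + target + "%",),
).fetchall()

print("exact match:", exact)
print("LIKE match: ", partial)
```

The "sort of" caveat is that a wildcard pattern matches any cell containing the text, so it would also pick up a longer tumor type name that happens to include the search phrase.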

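NOTE: you may not want the text truncated in your results. To handle this, set the pandas display option for column width (None means no limit in current pandas versions; older releases used -1, which has since been deprecated):

pd.set_option('display.max_colwidth', None)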

Although this works, you might want to split the comma delimited values into separate rows to create a Tumor_Type to Gene lookup table (perhaps to put it into first or higher normal form). As always, there are a number of different ways to do this. You can certainly do this through SQL and temporary tables, but since we’re in Python and can access this table as a dataframe, let’s try a Python solution.

First, let’s get the Gene and Tumor Type as a dataframe

tumor_types = pysqldf("SELECT Gene, [Tumor Type(s)] as Tumor_Type FROM ipm")

Next, we’ll split the comma delimited tumor_types into separate rows.

gene_tumor_types = pd.DataFrame(tumor_types["Tumor_Type"].str.split(',').tolist(), index=tumor_types["Gene"]).stack()
gene_tumor_types = gene_tumor_types.reset_index()[[0, 'Gene']]
gene_tumor_types.columns = ['Tumor_Type', 'Gene']

See Stack Overflow for a nice discussion of this solution and other pandas dataframe based solutions.
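
One of those alternatives, sketched here as an assumption rather than as the approach used in this post, is DataFrame.explode, available in pandas 0.25 and later. The small tumor_types frame below is a made-up stand-in with placeholder gene and tumor names so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the result of the "SELECT Gene, [Tumor Type(s)] as Tumor_Type" query.
tumor_types = pd.DataFrame({
    "Gene": ["GENE_A", "GENE_B"],
    "Tumor_Type": ["Tumor X, Tumor Y", "Tumor Z"],
})

gene_tumor_types = (
    tumor_types
    # Turn each comma delimited string into a Python list.
    .assign(Tumor_Type=tumor_types["Tumor_Type"].str.split(","))
    # Emit one row per list element, repeating the Gene value.
    .explode("Tumor_Type")
    .reset_index(drop=True)
    # Match the column order used by the stack-based version.
    [["Tumor_Type", "Gene"]]
)
print(gene_tumor_types)
```

Like the stack-based version, this only splits on the comma and leaves the individual values otherwise untouched.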

Take a look at the resulting gene_tumor_types data frame.


You now have an association from Gene to Tumor Type, with each tumor type as an individual row rather than as a comma delimited list. To get the Genes associated with a particular Tumor Type, we no longer need a LIKE query.

pysqldf("SELECT Gene, Tumor_Type FROM gene_tumor_types WHERE Tumor_Type = 'Diffuse Large B Cell Lymphoma'")

Wonderful! Except that… not so wonderful, it didn’t work – we’re missing data! There’s a big gotcha here. Compare the earlier LIKE query and this one. As an exercise, you might want to stop and try to figure out why (the answer is in the next section).

Common Issue #2 – Leading or Trailing White Space

This takes us to another common pitfall – white space! Don’t forget, an equals operator in SQL (and programming languages in general) is an exact match. “ Hello” and “Hello” do not match!

Take a look at the dataframe for gene_tumor_types – you’ll notice many of the Tumor Types have leading whitespace. This prevents the exact match from occurring, though a LIKE query with a leading wildcard will still find them as partial matches:

pysqldf("SELECT Gene, Tumor_Type FROM gene_tumor_types WHERE Tumor_Type LIKE '%Diffuse Large B Cell Lymphoma'")

But that’s a hack and kind of defeats the purpose of creating a new lookup table. We should be able to get this through an equality operator. Let’s trim the whitespace from this column.

gene_tumor_types["Tumor_Type"] = gene_tumor_types["Tumor_Type"].str.strip()

And take a look at the list to see the whitespace has been removed
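
If the table is too long to check by eye, the cleanup can also be verified programmatically. The sketch below uses a made-up gene_tumor_types frame with placeholder values, and count_padded is a hypothetical helper name; it assumes the column has no missing values.

```python
import pandas as pd

# Stand-in lookup table with one leading and one trailing space.
gene_tumor_types = pd.DataFrame({
    "Tumor_Type": ["Tumor X", " Tumor Y", "Tumor Z "],
    "Gene": ["GENE_A", "GENE_A", "GENE_B"],
})

def count_padded(values):
    """Count entries that change when leading/trailing whitespace is stripped."""
    return int((values != values.str.strip()).sum())

before = count_padded(gene_tumor_types["Tumor_Type"])
gene_tumor_types["Tumor_Type"] = gene_tumor_types["Tumor_Type"].str.strip()
after = count_padded(gene_tumor_types["Tumor_Type"])

print("padded values before strip:", before)
print("padded values after strip: ", after)
```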


Now retry the LIKE query and the exact match query – you’ll see that you are now retrieving all the rows.

pysqldf("SELECT Gene, Tumor_Type FROM gene_tumor_types WHERE Tumor_Type = 'Diffuse Large B Cell Lymphoma'")

Common Issue #3 – Repeated Columns

Another common spreadsheet practice is to tack on a variable number of columns to store one-to-many data relationships. Take a look toward the end (right hand side) of the spreadsheet (or ipm dataframe)


Each row has one or more citations. This spreadsheet stores the one to many relationship by creating a new column for each additional citation.

Unfortunately, this does make it more difficult to query, since we need to know in advance how many Citations to query. Furthermore, the column headers that hold Citations beyond the first one don’t have names, making the query less informative.

For instance, not all rows have the same number of citations. To get the citations for Gene JAK1, we’d need to write:

pysqldf("SELECT Gene, Citations, [Unnamed: 7], [Unnamed: 8], [Unnamed: 9], [Unnamed: 10], [Unnamed: 11], [Unnamed: 12], [Unnamed: 13], [Unnamed: 14] FROM ipm WHERE Gene = 'JAK1'")

This query will return all the citations for Gene “JAK1”. However, if you run this query against Gene “MPL”, you’ll receive a value of “None” for several columns. By contrast, if you run this query against Gene “MYD88”, you’ll miss a number of citations that extend out to “Unnamed: 26”.

It would be more convenient to be able to write a query like this:

pysqldf("SELECT Gene, Citation FROM ipm")

And receive a separate row for each citation.

Let’s create a lookup table for Gene and Citations. There are, as always, a number of different ways to accomplish this, through SQL or pandas. In this case, we’ll use SQL with a python loop to create a “UNION ALL” query.

query = "SELECT Gene, [Tumor Type(s)], [Variant(s)], Tier, Interpretations, Citations as Citation FROM ipm WHERE Citations != 'None'"
for i in range(7, 27):
    query += (" UNION ALL SELECT Gene, [Tumor Type(s)], [Variant(s)], Tier, Interpretations, [Unnamed: {val}] as Citation FROM ipm WHERE [Unnamed: {val}] != 'None' ".format(val=i))
query += ("ORDER BY Gene")

This approach uses a Python loop to build a SQL statement. The UNION ALL statement combines the results of more than one query into a single output table. In this case, we are querying each Gene and Citation combination and outputting the results into separate rows. You can take a look at the full query (it’s long, so I’m not posting it here; just view the output of the “query” string in Jupyter or interactive Python).

Let’s look at the results of this table

gene_citations = pysqldf(query)

Let’s re-run the query for JAK1 and MPL.

pysqldf("SELECT Gene, Citation FROM gene_citations WHERE Gene = 'JAK1'")
pysqldf("SELECT Gene, Citation FROM gene_citations WHERE Gene = 'MPL'")

You are now able to get clean, multiple row results from the Python dataframe through a one-to-many table relation.
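
As mentioned earlier, the same lookup table can also be built with pandas instead of SQL. The following is a rough sketch using DataFrame.melt, not the method used above. It runs on a made-up stand-in for ipm with placeholder values, and it assumes that every column whose name starts with "Unnamed:" holds overflow citations.

```python
import pandas as pd

# Stand-in for the ipm dataframe: one named citation column plus unnamed overflow columns.
ipm = pd.DataFrame({
    "Gene": ["GENE_A", "GENE_B"],
    "Citations": ["Citation 1", "Citation 2"],
    "Unnamed: 7": ["Citation 3", None],
    "Unnamed: 8": ["Citation 4", None],
})

# Discover the overflow columns by name instead of hard-coding their numbers.
citation_columns = ["Citations"] + [
    col for col in ipm.columns if str(col).startswith("Unnamed:")
]

gene_citations = (
    # Unpivot: one row per (Gene, citation column) pair.
    ipm.melt(id_vars=["Gene"], value_vars=citation_columns, value_name="Citation")
    # Drop the empty cells left by rows with fewer citations.
    .dropna(subset=["Citation"])
    # The original column name is not needed in the lookup table.
    .drop(columns="variable")
    # Stable sort keeps each gene's citations in column order.
    .sort_values("Gene", kind="mergesort")
    .reset_index(drop=True)
)
print(gene_citations)
```

To carry along the other fields from the UNION ALL query (Tumor Type(s), Variant(s), Tier, Interpretations), they could be added to id_vars.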

We’ll tackle a few more data sources for responsiveness of various tumors and conditions to genes and variants in future posts. Stay tuned.

And, as always, if you’d like to learn more about Python or R, please consider signing up for a Software Carpentry class or attending one of our workshops at Mission Bay!