Data Munging Addendum: The Long Way To Handle Comma Delimited Lists

In an earlier post, we discussed the issue of comma delimited lists within an excel spreadsheet. This way of one-to-many relationships in data can make it more difficult to build look up tables, run queries, and do other types of analysis. Although there are some concise coding approaches, both in SQL and pandas, sometimes you just want to [give up on trying to be clever and] reconstruct your data frame line by line. Here’s a quick overview on how to do this.

You can follow along, cut and paste into your own notebook, or view/checkout the code from github.

Let’s go ahead and build a pandas dataframe with comma delimited information in some of the cells.

import pandas as pd
import numpy as np

First, we’ll create three lists

ar1 = [1,2,3,4,5]
ar2 = [2,4,6,8,10]
ar3 = ['one','two,three,four','three,four,five','four,five','five']

Next, add the lists as columns to a dataframe

df = pd.DataFrame({'A' : ar1, 'B' : ar2, 'C' : ar3})

screen-shot-2016-10-10-at-9-00-20-am

And, of course, if you query it through pandasql, you’ll get a single row with a comma delimited list

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
pysqldf("SELECT * FROM df WHERE A = 3")

screen-shot-2016-10-10-at-9-01-49-am

Like last time, we want to have each value for C on a separate row, with corresponding values for A and B.

Here’s how to do this the computationally long and expensive way (which is, in fact, sometimes the way you have to do things – sometimes because performance doesn’t matter and you’re tired of trying to be clever, sometimes because the logic is so intricate that you have to knit it all together line by line anyway).

We’ll create three new arrays (again, to hold the column values).

a0 = []
a1 = []
a2 = []

Next, we’ll loop through each row of our existing dataframe. We’ll split on the the third row (we start counting from 0 in a list, so that will be at index 2). Splitting on the comma will create a new array with three new strings. We’ll add a new value to each column for each string (word) in that row.

for index, row in df.iterrows():
for s in row[2].split(','):
a0.append(row[0])
a1.append(row[1])
a2.append(s)

Now, let’s create a new data frame out of the three columns we just created

ndf = pd.DataFrame({'A' : a0, 'B' : a1, 'C' : a2})

Take a look

ndf

screen-shot-2016-10-10-at-9-04-45-am

We can now query both dataframes using pandasql

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

Querying our original dataframe, you’ll see everything on one comma delimited line

pysqldf("SELECT * FROM df WHERE A = 3")

screen-shot-2016-10-10-at-9-05-37-am

Quering our new dataframe, you’ll set a separate line for each entry

pysqldf("SELECT * FROM ndf WHERE A = 3")

screen-shot-2016-10-10-at-9-06-04-am

It’s the long way to do this, but sometimes the long way can save you a lot of time.