Data Munging Addendum: The Long Way To Handle Comma Delimited Lists

In an earlier post, we discussed the issue of comma delimited lists within an excel spreadsheet. This way of one-to-many relationships in data can make it more difficult to build look up tables, run queries, and do other types of analysis. Although there are some concise coding approaches, both in SQL and pandas, sometimes you just want to [give up on trying to be clever and] reconstruct your data frame line by line. Here’s a quick overview on how to do this.

You can follow along, cut and paste into your own notebook, or view/checkout the code from github.

Let’s go ahead and build a pandas dataframe with comma delimited information in some of the cells.

[code lang=”python”]
import pandas as pd
import numpy as np
[/code]

First, we’ll create three lists

[code lang=”python”]
ar1 = [1,2,3,4,5]
ar2 = [2,4,6,8,10]
ar3 = [‘one’,’two,three,four’,’three,four,five’,’four,five’,’five’]
[/code]

Next, add the lists as columns to a dataframe

[code lang=”python”]
df = pd.DataFrame({‘A’ : ar1, ‘B’ : ar2, ‘C’ : ar3})
[/code]

screen-shot-2016-10-10-at-9-00-20-am

And, of course, if you query it through pandasql, you’ll get a single row with a comma delimited list

[code lang=”python”]
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
pysqldf("SELECT * FROM df WHERE A = 3")
[/code]

screen-shot-2016-10-10-at-9-01-49-am

Like last time, we want to have each value for C on a separate row, with corresponding values for A and B.

Here’s how to do this the computationally long and expensive way (which is, in fact, sometimes the way you have to do things – sometimes because performance doesn’t matter and you’re tired of trying to be clever, sometimes because the logic is so intricate that you have to knit it all together line by line anyway).

We’ll create three new arrays (again, to hold the column values).

[code lang=”python”]
a0 = []
a1 = []
a2 = []
[/code]

Next, we’ll loop through each row of our existing dataframe. We’ll split on the the third row (we start counting from 0 in a list, so that will be at index 2). Splitting on the comma will create a new array with three new strings. We’ll add a new value to each column for each string (word) in that row.

[code lang=”python”]
for index, row in df.iterrows():
for s in row[2].split(‘,’):
a0.append(row[0])
a1.append(row[1])
a2.append(s)
[/code]

Now, let’s create a new data frame out of the three columns we just created

[code lang=”python”]
ndf = pd.DataFrame({‘A’ : a0, ‘B’ : a1, ‘C’ : a2})
[/code]

Take a look

[code lang=”python”]
ndf
[/code]

screen-shot-2016-10-10-at-9-04-45-am

We can now query both dataframes using pandasql

[code lang=”python”]
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
[/code]

Querying our original dataframe, you’ll see everything on one comma delimited line

[code lang=”python”]
pysqldf("SELECT * FROM df WHERE A = 3")
[/code]

screen-shot-2016-10-10-at-9-05-37-am

Quering our new dataframe, you’ll set a separate line for each entry

[code lang=”python”]
pysqldf("SELECT * FROM ndf WHERE A = 3")
[/code]

screen-shot-2016-10-10-at-9-06-04-am

It’s the long way to do this, but sometimes the long way can save you a lot of time.