Beyond Basic SQL

Thanks to everyone who attended the Beyond Basic SQL class at Mission Bay on April 25th!

This class was inspired by the questions and problems I’ve encountered while trying to work with various datasets at the Programming and Pizza sessions.  SQL (the language) and sqlite (the database) can be excellent tools for data analysis, but getting the data into a form where you can use it can be a real challenge.

In the first part of this class, we reviewed the basics of querying a single database table through Selection, Filtering, and Aggregation.  We used the “gapminder” dataset from Software Carpentry to  write queries and do analysis on worldwide per capita gdp, life expectancy, and population by country and continent.

The second part of this class focused on dealing with data that isn’t in a convenient, normalized format.  To do this, we analyzed the kind of spreadsheet researchers often download from the internet.  As a simplified example, we looked at a sample enrollment spreadsheet with information about students, courses, and reading lists.  We reviewed the first two normal forms, discussed how these forms can reduce ambiguity and redundancy, and transformed the non-normalized spreadsheet into a database in second normal form.

It was a lot of material to cover in two hours, thanks for sticking with it.  Here are the promised slides for everyone who wanted to try it out.

If you’re interested in trying this out on a real world example, take a look at the blog post below on data munging and excel.  The examples in that post  (you’ll need to use Python as well as SQL) will present strategies for getting an unstructured spreadsheet into a form where you can use sql, pandas, data frames, and other tools to analyze a dataset.

Class files are available on the ucsf-ckm dsi GitHub repo here.