All posts by Geoff Boushey

Topics in Data Science: Python and JSON workshop

Thanks to everyone who attended the “Topics in Data Science: JSON and Python” workshop on November 14th, 2017!

The format for this workshop was a little different from previous Software Carpentry workshops. Unlike core SWC workshops, which assume no prior programming knowledge, the “Topics” courses are designed for people who have some programming background or who have taken a workshop or course in the past. Although we continue to emphasize hands-on programming, workshops in the “Topics” series will cover more material than we can manually type in the time allotted, so some code will be presented through copy and paste, and some through review.

This means that to get the most out of these workshops, participants will want to review and work with the code after the class.

In the JSON and Python workshop, we first covered dictionaries, an essential data structure for parsing JSON that isn’t generally covered in the Python section of a standard SWC workshop at UCSF. We covered keys, looking up values, and the practice of nesting sub-dictionaries and lists within a dictionary or list.
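For review, here’s a small, standalone example along the lines of what we covered (the names and values here are made up for illustration):

researcher = {
    "name": "Jane Doe",
    "department": "Epidemiology",
    "publications": [
        {"title": "Paper One", "year": 2016},
        {"title": "Paper Two", "year": 2017},
    ],
}

print(researcher["name"])                      # look up a value by key
print(researcher["publications"][0]["title"])  # reach into a nested list of dictionaries

for pub in researcher["publications"]:         # loop over the nested records
    print(pub["year"], pub["title"])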

We then used the UCSF Profiles application developed at the CTSI to demonstrate how to request a JSON document, send query parameters, and convert the result to a dictionary. As an example, we reviewed how to parse a JSON document to generate a list of publications for a researcher at UCSF.
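The general pattern looks something like this. Note that the URL and parameter names below are placeholders for illustration, not the actual Profiles endpoint – see the workshop code for the real queries.

import requests

# Hypothetical endpoint and parameter names, for illustration only.
response = requests.get("https://profiles.example.edu/api/search",
                        params={"name": "Jane Doe"})
profile = response.json()  # convert the JSON response into Python dictionaries and lists

# Walk the nested structure to pull out publication titles.
for pub in profile.get("publications", []):
    print(pub.get("title"))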

After this, we briefly reviewed some of the JSON-based web APIs available through the National Library of Medicine, applying the same techniques to generate a list of interactions for a particular medication. I highly encourage everyone who took this class to take a more extensive look at that website, and to think about what kinds of services you’d like to see here at UCSF.

Lastly, we reviewed and ran code using the bottle module to create a JSON-based web service, running on localhost, to demonstrate how to write web services that do more than simply provide access to data.
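If you’d like to experiment further, a minimal bottle service looks something like the sketch below (an illustration in the same spirit as the class code, not the workshop code itself):

import json
from bottle import route, run, response

@route('/double/<n:int>')
def double(n):
    # do a (small) computation rather than just serving stored data
    response.content_type = 'application/json'
    return json.dumps({'input': n, 'result': n * 2})

run(host='localhost', port=8080)

Run it and request http://localhost:8080/double/21 to see the JSON response.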

As promised, all code, with comments and explanations, is available on GitHub.

One last note – I mentioned, briefly, that you will sometimes need to parse HTML from a website directly, since many useful data feeds still don’t have a nice JSON API available (or, if they do, it’s hard to find or poorly documented). Here’s a link to a short application that shows how to use BeautifulSoup to quickly parse XML.

Because HTML, like XML, is tag based, this approach will work for HTML as well.
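As a quick taste of the approach (a minimal sketch – see the linked application for a fuller example):

from bs4 import BeautifulSoup

html = "<html><body><a href='/one'>One</a> <a href='/two'>Two</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find_all works the same way whether the tags come from HTML or XML
for link in soup.find_all("a"):
    print(link["href"], link.get_text())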

Thanks again for attending!

Beyond Basic SQL

Thanks to everyone who attended the Beyond Basic SQL class at Mission Bay on April 25th!

This class was inspired by the questions and problems I’ve encountered while trying to work with various datasets at the Programming and Pizza sessions. SQL (the language) and SQLite (the database) can be excellent tools for data analysis, but getting the data into a form where you can use it can be a real challenge.

In the first part of this class, we reviewed the basics of querying a single database table through selection, filtering, and aggregation. We used the “gapminder” dataset from Software Carpentry to write queries and do analysis on worldwide per capita GDP, life expectancy, and population by country and continent.
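If you’d like to recreate the flavor of those queries in Python, here’s a sketch using the sqlite3 module. The table and column names follow the usual gapminder conventions (country, continent, year, pop, gdpPercap, lifeExp) and may differ in your copy of the database.

import sqlite3

conn = sqlite3.connect("gapminder.db")
cur = conn.cursor()

# selection and filtering: per capita GDP over time for one country
cur.execute("SELECT year, gdpPercap FROM gapminder WHERE country = 'Canada'")
print(cur.fetchall())

# aggregation: average life expectancy by continent
cur.execute("SELECT continent, AVG(lifeExp) FROM gapminder GROUP BY continent")
print(cur.fetchall())

conn.close()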

The second part of this class focused on dealing with data that isn’t in a convenient, normalized format.  To do this, we analyzed the kind of spreadsheet researchers often download from the internet.  As a simplified example, we looked at a sample enrollment spreadsheet with information about students, courses, and reading lists.  We reviewed the first two normal forms, discussed how these forms can reduce ambiguity and redundancy, and transformed the non-normalized spreadsheet into a database in second normal form.
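To give a sense of the end result (the table and column names here are invented for illustration – the actual class files use their own), the single enrollment sheet ends up as separate tables joined by keys:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students    (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE courses     (course_id  INTEGER PRIMARY KEY, title TEXT);
-- one row per student/course pair replaces the repeated spreadsheet columns
CREATE TABLE enrollments (
    student_id INTEGER REFERENCES students(student_id),
    course_id  INTEGER REFERENCES courses(course_id)
);
""")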

It was a lot of material to cover in two hours; thanks for sticking with it. Here are the promised slides for everyone who wanted to try it out.

If you’re interested in trying this out on a real-world example, take a look at the blog post below on data munging and Excel. The examples in that post (you’ll need to use Python as well as SQL) present strategies for getting an unstructured spreadsheet into a form where you can use SQL, pandas, data frames, and other tools to analyze a dataset.

Class files are available on the ucsf-ckm dsi GitHub repo here.


Data Munging Addendum: The Long Way To Handle Comma Delimited Lists

In an earlier post, we discussed the issue of comma delimited lists within an Excel spreadsheet. This way of representing one-to-many relationships in data can make it more difficult to build lookup tables, run queries, and do other types of analysis. Although there are some concise coding approaches, both in SQL and pandas, sometimes you just want to [give up on trying to be clever and] reconstruct your data frame line by line. Here’s a quick overview of how to do this.

You can follow along, cut and paste into your own notebook, or view/check out the code on GitHub.

Let’s go ahead and build a pandas dataframe with comma delimited information in some of the cells.

import pandas as pd
import numpy as np

First, we’ll create three lists

ar1 = [1,2,3,4,5]
ar2 = [2,4,6,8,10]
ar3 = ['one','two,three,four','three,four,five','four,five','five']

Next, add the lists as columns to a dataframe

df = pd.DataFrame({'A' : ar1, 'B' : ar2, 'C' : ar3})

df

   A   B                C
0  1   2              one
1  2   4   two,three,four
2  3   6  three,four,five
3  4   8        four,five
4  5  10             five

And, of course, if you query it through pandasql, you’ll get a single row with a comma delimited list

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
pysqldf("SELECT * FROM df WHERE A = 3")

   A  B                C
0  3  6  three,four,five

Like last time, we want to have each value for C on a separate row, with corresponding values for A and B.

Here’s how to do this the computationally long and expensive way (which is, in fact, sometimes the way you have to do things – sometimes because performance doesn’t matter and you’re tired of trying to be clever, sometimes because the logic is so intricate that you have to knit it all together line by line anyway).

We’ll create three new arrays (again, to hold the column values).

a0 = []
a1 = []
a2 = []

Next, we’ll loop through each row of our existing dataframe. For each row, we’ll split the value in the third column (we start counting from 0, so that’s at index 2) on the comma, which produces a list of strings. We’ll then append a value to each of the three column lists for every string (word) in that cell.

for index, row in df.iterrows():
    for s in row[2].split(','):
        a0.append(row[0])
        a1.append(row[1])
        a2.append(s)

Now, let’s create a new data frame out of the three columns we just created

ndf = pd.DataFrame({'A' : a0, 'B' : a1, 'C' : a2})

Take a look

ndf

   A   B      C
0  1   2    one
1  2   4    two
2  2   4  three
3  2   4   four
4  3   6  three
5  3   6   four
6  3   6   five
7  4   8   four
8  4   8   five
9  5  10   five

We can now query both dataframes using pandasql

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

Querying our original dataframe, you’ll see everything on one comma delimited line

pysqldf("SELECT * FROM df WHERE A = 3")

   A  B                C
0  3  6  three,four,five

Querying our new dataframe, you’ll see a separate line for each entry

pysqldf("SELECT * FROM ndf WHERE A = 3")

   A  B      C
0  3  6  three
1  3  6   four
2  3  6   five

It’s the long way to do this, but sometimes the long way can save you a lot of time.

Data Munging with Python, SQL, and Excel

So much of data science is, in fact, data munging and wrangling: moving and transforming data from one format to another. Data, when you’re fortunate enough to know where to find it, is almost never in the nicely organized format you need for your analysis. This is one of the reasons Python is so popular among data scientists – it is a good language and environment for collecting, formatting, parsing, combining, and splitting data from different sources.

Data frequently arrives in comma delimited files or Excel spreadsheets. You can certainly do some analysis with spreadsheet operations, but odds are you’ll eventually want to load it into a data frame and use Python (or R) for more meaningful analysis. This post is a write-up of a few tips I learned from my recent efforts to wrangle some genomic data available on the web as Excel downloads. These spreadsheets presented a few common issues that arise when dealing with non-normalized data sets in single table format.

The exact nature of the data isn’t really the focus here, but for some context: researchers at UCSF often need information about a gene, variants of that gene, and the effect this gene has on the responsiveness of different tumors or conditions to treatment. There are a number of different places to find this data, a number of different datasets, and (of course) varied ways to search, filter, or query those datasets. Searching and finding can be a long, error-prone, irritating, manual process.

We’ll use Python, pandas, and Jupyter Notebook to try to get a better handle on some of this data. I’m assuming you already know how to 1) open a Jupyter Notebook and issue basic Python commands, and 2) use pandasql to run SQL queries against a dataframe. (If you don’t, consider signing up for a Software Carpentry class or attending one of our Python/R workshops at Mission Bay.)

Otherwise, you can follow the Jupyter and Python installation documents (I used Anaconda for both).

A full Jupyter notebook for the code in this post is available on the ucsf-ckm GitHub repository.

Create a DataFrame from an Excel Spreadsheet

We’ll use a spreadsheet from The Precision Medicine Knowledge Base. To follow along, click on the “Download All Interpretations (Excel)” link.

Before analyzing, let’s load the Excel spreadsheet into a pandas DataFrame. Open up a Jupyter notebook (or just the Python interactive shell) to start.

First off, remember to include the pandas module…

import pandas as pd

Pandas has an easy method to load a spreadsheet. (I’m assuming the spreadsheet is in your working directory; otherwise you’ll need to edit the path.)

ipm = pd.read_excel("IPM_Knowledgebase_Interpretations_Complete_20160913-2207.xlsx")

Now take a look at your newly populated dataframe

ipm

You’ll see that your spreadsheet headers and data have become the column names and rows of a pandas dataframe. Let’s try using pandasql to run a query on it. First, load the module and globals (more information on this)

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

And try running a query. Let’s get the variants that match a particular Gene.

pysqldf("SELECT Gene, [Tumor Type(s)], [Variant(s)] FROM ipm WHERE Gene = 'PIK3CA'")


Note: you must use the brackets around Tumor Type(s) so that the whitespace and the parentheses around (s) won’t be interpreted as SQL.

Even without any additional data munging, you’re in a pretty good spot for analyzing your data. You have it in a dataframe, where you can run queries, Python expressions, and pandas operations on it. However, there are a few issues, common to spreadsheets, that may make this data harder to work with and analyze.

Relational Databases frequently have “one to many” relationships. In this case, a Gene has a one to many relationship with Tumor Types, Variants, and Citations. Each Gene has an effect on multiple Tumor Types, each Gene can have multiple Variants, and the effect a Gene and Variant has on a Tumor Type can have multiple Citations.

This spreadsheet stores the data for one-to-many relationships in two different ways. For Genes to Tumor Types and Variants, the spreadsheet provides a comma delimited list in a single cell. For Citations, the spreadsheet tacks on a varying number of columns at the right side of the spreadsheet. Although this does provide the data, it can make the data harder to work with.

The next two sections will review techniques for converting comma delimited lists and multiple columns into a one-to-many lookup table.

Common Issue #1 – Comma Delimited Lists

You may notice that although the Genes are provided as single (atomic) values, other fields, such as Variants or Tumor Types, are provided as comma delimited lists. This won’t be much of a problem if you want to find all Tumor Types associated with a Gene – the query is straightforward:

pysqldf("SELECT Gene, [Tumor Type(s)] FROM ipm WHERE Gene = 'CSF3R'")

You’ll get back a single row with a comma delimited list of Tumor Types, rather than a separate row for each Tumor Type, but you can parse that relatively easily.
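For instance, here’s one way to split that single cell back into individual values (reusing the query from above, with the column aliased to make it easier to reference):

result = pysqldf("SELECT Gene, [Tumor Type(s)] as Tumor_Type FROM ipm WHERE Gene = 'CSF3R'")
# split the comma delimited cell into a list, trimming stray whitespace
tumor_types = [t.strip() for t in result["Tumor_Type"][0].split(",")]
print(tumor_types)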

Now, suppose you wanted to reverse this query, to find all genes that match a particular Tumor Type. In this case, a query like the one above won’t work, as it will miss fields that have multiple Tumor Types separated by commas. Because SQL will look for an exact match, you won’t get all the results for a query like this.

 pysqldf("SELECT Gene, [Tumor Type(s)] FROM ipm WHERE [Tumor Type(s)] = 'Diffuse Large B Cell Lymphoma'") 

Note that you only received a single row from this query, even though there are multiple records that match this tumor type.  SQL does provide a way to find sub-patterns in a text field. You can get all records (sort of/kind of/hand waving) with a LIKE query

pysqldf("SELECT Gene, [Tumor Type(s)] FROM ipm WHERE [Tumor Type(s)] LIKE '%Diffuse Large B Cell Lymphoma%'")

NOTE: you may not want the text truncated in your results. To handle this, set a property on your dataframe:

pd.set_option('display.max_colwidth', -1)

Although this works, you might want to split the comma delimited values into separate rows to create a Tumor_Type-to-Gene lookup table (perhaps to put the data into first or higher normal form: https://en.wikipedia.org/wiki/First_normal_form). As always, there are a number of different ways to do this. You could certainly do it through SQL and temporary tables, but since we’re in Python and can access this table as a dataframe, let’s try a Python solution.

First, let’s get the Gene and Tumor Type as a dataframe

tumor_types = pysqldf("SELECT Gene, [Tumor Type(s)] as Tumor_Type FROM ipm")

Next, we’ll split the comma delimited tumor_types into separate rows.

gene_tumor_types = pd.DataFrame(tumor_types["Tumor_Type"].str.split(',').tolist(), index=tumor_types["Gene"]).stack()
gene_tumor_types = gene_tumor_types.reset_index()[[0, 'Gene']]
gene_tumor_types.columns = ['Tumor_Type', 'Gene']

See Stack Overflow for a nice discussion of this solution and other pandas dataframe based solutions.

Take a look at the resulting gene_tumor_types data frame.

gene_tumor_types

You now have an association from Gene to Tumor Type, with each tumor type as an individual row rather than as a comma delimited list. To get the Genes associated with a particular Tumor Type, we no longer need a LIKE query.

pysqldf("SELECT Gene, Tumor_Type FROM gene_tumor_types WHERE Tumor_Type = 'Diffuse Large B Cell Lymphoma'")

Wonderful! Except that… not so wonderful, it didn’t work – we’re missing data! There’s a big gotcha here. Compare the earlier LIKE query and this one. As an exercise, you might want to stop and try to figure out why (the answer is in the next section).

Common Issue #2: Leading or Trailing White Space

This takes us to another common pitfall – white space! Don’t forget, an equals operator in SQL (and programming languages in general) is an exact match. “ Hello” and “Hello” do not match!

Take a look at the gene_tumor_types dataframe – you’ll notice many of the Tumor Types have leading whitespace. This prevents the exact match from occurring. You can still find these rows through a LIKE query, which matches them as partial strings

pysqldf("SELECT Gene, Tumor_Type FROM gene_tumor_types WHERE Tumor_Type LIKE '%Diffuse Large B Cell Lymphoma'")

But that’s a hack and kind of defeats the purpose of creating a new lookup table. We should be able to get this through an equality operator. Let’s trim the whitespace from this column.

gene_tumor_types["Tumor_Type"] = gene_tumor_types["Tumor_Type"].str.strip()

And take a look at the list to see the whitespace has been removed

gene_tumor_types["Tumor_Type"]

Now retry the LIKE query and the exact match query – you’ll see that you are now retrieving all the rows.

pysqldf("SELECT Gene, Tumor_Type FROM gene_tumor_types WHERE Tumor_Type = 'Diffuse Large B Cell Lymphoma'")

Common Issue #3: Repeated Columns

Another common spreadsheet practice is to tack on a variable number of columns to store one-to-many data relationships. Take a look toward the end (right-hand side) of the spreadsheet (or the ipm dataframe)

ipm

Each row has one or more citations. This spreadsheet stores the one-to-many relationship by creating a new column for each additional citation.

Unfortunately, this does make it more difficult to query, since we need to know in advance how many Citations to query. Furthermore, the column headers that hold Citations beyond the first one don’t have names, making the query less informative.

For instance, not all rows have multiple citations. To get the citations for Gene JAK1, we’d need to write:

pysqldf("SELECT Gene, Citations, [Unnamed: 7], [Unnamed: 8], [Unnamed: 9], [Unnamed: 10], [Unnamed: 11], [Unnamed: 12], [Unnamed: 13], [Unnamed: 14] FROM ipm WHERE Gene = 'JAK1'")

This query will return all the citations for Gene “JAK1”. However, if you run this query against Gene “MPL”, you’ll receive a value of “None” for several columns. By contrast, if you run this query against Gene “MYD88”, you’ll miss a number of citations that extend out to “Unnamed: 26”.

It would be more convenient to be able to write a query like this:

pysqldf("SELECT Gene, Citation from ipm”)

And receive a separate row for each citation.

Let’s create a lookup table for Gene and Citations. There are, as always, a number of different ways to accomplish this, through SQL or pandas. In this case, we’ll use SQL with a python loop to create a “UNION ALL” query.

query = "SELECT Gene, [Tumor Type(s)], [Variant(s)], Tier, Interpretations, Citations as Citation FROM ipm WHERE Citations != 'None'"
for i in range(7, 27):
    query += (" UNION ALL SELECT Gene, [Tumor Type(s)], [Variant(s)], Tier, Interpretations, [Unnamed: {val}] as Citation FROM ipm WHERE [Unnamed: {val}] != 'None' ".format(val=i))
query += ("ORDER BY Gene")

This approach uses a Python loop to build a SQL statement. The UNION ALL statement combines the results of more than one query into a single output table. In this case, we are querying each Gene and Citation combination and outputting the results into separate rows. You can take a look at the full query (it’s long, so I’m not posting it here – just view the output of the “query” string in Jupyter or interactive Python).

Let’s run the query and store the results in a new dataframe

gene_citations = pysqldf(query)

Let’s re-run the query for JAK1 and MPL.

pysqldf("SELECT Gene, Citation FROM gene_citations WHERE Gene = 'JAK1'")
pysqldf("SELECT Gene, Citation FROM gene_citations WHERE Gene = 'MPL'")

You are now able to get clean, multiple row results from the Python dataframe through a one-to-many table relation.

We’ll tackle a few more data sources for responsiveness of various tumors and conditions to genes and variants in future posts. Stay tuned.

And, as always, if you’d like to learn more about Python or R, please consider signing up for a Software Carpentry Class or attend one of our workshops at Mission Bay!

Rails and Ember without rails-api

This is a follow up to a post I wrote a couple months ago, “Ember and Rails Scaffolding for Simple CRUD”. In that post, I gave an overview of how to generate simple CRUD scaffolding for an Ember app using Rails purely as an API through the rails-api gem.

Here’s the thing… if you take the API-only approach, you by design give up the ability to write standard Rails views. This can be a good thing, of course, which is why the gem was integrated into Rails 5. If all you want is an API, you don’t want the extra weight of everything else that comes with Rails, and you can always add various gems back in as needed. But you may want to preserve the ability to write a full MVC Rails app while still providing an API for Ember or other single-page JavaScript frameworks.

Fortunately, this isn’t especially difficult. The Rails side will get a little more verbose, at least the way I’m writing it, but all you need to do is ensure that your app responds to both HTML and JSON – and, of course, be particularly careful not to mix view logic with backend logic.

So, here we go…

First, create a full rails app with basic CRUD for the User model in the previous tutorial. I’m not going to repeat the steps here since they won’t change much. The only difference here is that instead of doing this with the rails-api gem and command, you’ll now do this with traditional rails. You will still need to create serializers, add rack/cors, allow access to various HTTP actions in the Rails app, and so forth. This is all available through the previous tutorial, with one change – you don’t need to install the rails-api gem, and wherever it says “rails-api generate…”, instead just use “rails generate…”.

You should now have a fully functional rails app for CRUD operations on a User that also provides json formatting as an api. The main difference between the api for a traditional Rails app and the rails-api generated app is that the traditional rails app responds by default as html, whereas rails-api responds as json. To get a json response from the traditional rails app, you will need to append “.json” to the url – in other words, to get the list of users rendered as json rather than displayed as html, you’d need to request:

http://localhost:3000/users.json

whereas the rails-api version doesn’t require this extension, as a rails-api app by default returns json (and wouldn’t normally respond as html at all).

On the Ember side, we need to instruct the adapter to specifically request json from the Rails app, as this is no longer the default Rails response.

To accomplish this, we will modify the Ember adapter in app/user/adapter.js

import DS from 'ember-data';

export default DS.RESTAdapter.extend({
  host: 'http://localhost:3000',
  buildURL: function(record, suffix) {
    var s = this._super(record, suffix);
    return s + ".json";
  }
});

As you can see, this will append “.json” to all the requests sent from Ember to Rails – even POST, PUT, and DELETE requests – so you’ll need to explicitly handle the JSON format in any Rails controllers you intend to make available to Ember. As a result, we’ll need to modify the update and create methods in the Rails controller to specifically respond with JSON for Ember.

There is, inevitably, one more wrinkle – although Rails does respond by default to the “.json” extension, Ember expects a slightly different formatting, so you’ll need to make a few tweaks to get it working with Ember. Here’s the full controller code:

class UsersController < ApplicationController
  before_action :set_user, only: [:show, :edit, :update, :destroy]

  # GET /users
  # GET /users.json
  def index
    @users = User.all
    
    #render json: @users
    
    respond_to do |format|
      format.html
      format.json { render json: @users }
    end
  end

  # GET /users/1
  # GET /users/1.json
  def show
    respond_to do |format|
      format.html
      format.json { render json: @user }
    end
  end

  # GET /users/new
  def new
    @user = User.new
  end

  # GET /users/1/edit
  def edit
  end

  # POST /users
  # POST /users.json
  def create
    @user = User.new(user_params)

    respond_to do |format|
      if @user.save
        format.html { redirect_to @user, notice: 'User was successfully created.' }
        format.json { render :show, status: :created, location: @user }
      else
        format.html { render :new }
        format.json { render json: @user.errors, status: :unprocessable_entity }
      end
    end
  end

  # PATCH/PUT /users/1
  # PATCH/PUT /users/1.json
  def update
    respond_to do |format|
      if @user.update(user_params)
        format.html { redirect_to @user, notice: 'User was successfully updated.' }
        format.json { render :show, status: :ok, location: @user }
      else
        format.html { render :edit }
        format.json { render json: @user.errors, status: :unprocessable_entity }
      end
    end
  end

  # DELETE /users/1
  # DELETE /users/1.json
  def destroy
    @user.destroy
    respond_to do |format|
      format.html { redirect_to users_url, notice: 'User was successfully destroyed.' }
      format.json { head :no_content }
    end
  end

  private
    # Use callbacks to share common setup or constraints between actions.
    def set_user
      @user = User.find(params[:id])
    end

    # Never trust parameters from the scary internet, only allow the white list through.
    def user_params
      params.require(:user).permit(:name)
    end
end

You may notice some additional code in create and update. This is because we need to respond as json for Ember, which we configured to submit all requests with the .json extension (even POST and PUT requests).

At this point, you can bring up both a Rails app on port 3000 and an Ember app on port 4200 and use both a standard Rails view and the Ember client for CRUD operations on your User model.

This does require some extra overhead, but it does keep open the possibility of writing a traditional Rails app while providing an API for not just Ember but any other app that might want to consume a Rails API.

Ember and Rails Scaffolding for Simple CRUD

Like many developers who use Rails, I haven’t thought about scaffolding for a while, but I found myself wanting something like it when I started looking into Ember as a front end for Rails. As with Rails, my guess is that I won’t use Ember scaffolding for long, but I wanted a way to get that initial Ember application up and running, with Rails as a backend.

It turns out that there are easy to use scaffolding generators for an Ember front end and a Rails API backend, with just a few minor gotchas you need to be aware of when integrating the two.

For this tutorial, we’re going to do the simplest thing possible for a CRUD app. We will create a single model (User), with one field (name), and we will create a web app with Create, Retrieve, Update, and Delete operations. Ember will supply the front end, Rails will handle the back end.

One quick note – while there are ways to integrate Rails with Ember in the same application, this tutorial will build the Ember frontend out as a completely separate, stand-alone app from the Rails backend. This way, the front end can be integrated with any backend that provides the API that Rails provides.

Here we go…

Part 1 – Create an Ember front end

We’ll create a client side, front end application that provides an interface for CRUD operations.

Step 0 – Install ember and ember-cli.

This will also require installing Node and npm. I’m pretty sure you’ll need git as well. I’m not going to cover all this (or the process for installing Ruby or Rails); there are plenty of guides on the web for getting it all installed. Just make sure you can execute the following command

ember -v

I’m using ember 1.13.8, node 0.12.6, and npm 2.13.4 (on Mac OS X).

Step 1 – Create an ember app

ember new ember-crud

Step 2 – Install the ember-cli-scaffold package

cd into ember-crud and run

ember install ember-cli-scaffold

(for more information, see https://www.npmjs.com/package/ember-cli-scaffold)

Step 3 – Generate scaffolding for a Model (User) with one field (Name)

ember generate scaffold user name:string --pod

The “pod” option creates a different organizational structure for your files than the standard ember defaults. I prefer it and will be using it here, but the differences are very minimal.

Step 4 – Verify that your app is working

ember serve

and go to http://localhost:4200/users

You should see a very Rails-ish interface with full CRUD for a user with a single input field, name. Go ahead and create, edit, delete a few to verify that it is all working.

The Ember CRUD app is using local dev storage provided by Mirage. In the next step, we’ll swap this out for a Rails backend.

Part 2 – Provide a Rails API for CRUD operations

Ember is already providing the view, so rather than creating a full blown rails app, we will limit the Rails app to providing an API for persistence.

Step 0: Install Ruby and Rails

As with Ember, there are lots of resources on the web for getting Ruby and Rails installed. Make sure that you can run

ruby -v

(I’m using 2.1.0)

and

rails -v

(I’m using Rails 4.2.0. You will need this version or later for Rails to use the rails-api gem, which I believe will be included in Rails 5).

Step 1 – Create a rails API only application

Install the rails-api gem

gem install rails-api

And generate a new Rails application

rails-api new rails-crud

Step 2 – Create an API scaffold for CRUD operations for a User

cd into rails-crud and type

rails-api g scaffold user name:string

While I promised not to go into a lot of detail here, you may notice that no view tier is created, and if you look at the user controller, you’ll see that it is set up for rendering json, not html.

Step 3 – Seed with a bit of data

Technically, you don’t need to do this step, since you’ll populate from your Ember app, but it can help to verify everything is working properly on the Rails side before integrating with your Ember app.

in db/seeds.rb, create something like this

user1 = User.create(name: 'user1')
user2 = User.create(name: 'user2')

and run

rake db:migrate
rake db:seed

Step 4 – Check out the Rails API

run:

rails server

go to localhost:3000/users

and you should see a json representation of the two users you created in the seeds.rb file.

Part 3 – Use the Rails API service as a backend for the Ember application

This is relatively straightforward, though there are a few wrinkles.

Step 1 – Modify the way Rails is rendering JSON for a list of objects

Take another look at the json returned from http://localhost:3000/users

[{"id":1,"name":"user1","created_at":"2015-10-21T22:17:32.778Z","updated_at":"2015-10-21T22:17:32.778Z"},{"id":2,"name":"user2","created_at":"2015-10-21T22:17:32.783Z","updated_at":"2015-10-21T22:17:32.783Z"}]

You may notice that Rails has flattened this into a single array. Ember, by default, expects a slightly different formatting where the json array of User objects is stored in a hash with the model type as a key.

One approach to this problem is to use a serializer to establish the format for JSON output from the rails api.

Add the serializer to your Gemfile

gem 'active_model_serializers', '~> 0.8.3'

and run

bundle update

and create a new serializer for the user model

rails g serializer user

This will create a user_serializer.rb file in app/serializers.

class UserSerializer < ActiveModel::Serializer
  embed :ids, embed_in_root: true
  attributes :id, :name
end

This code will format users the way Ember expects by default, and will include only the id and name attributes expected by the Ember model we created earlier (the various Rails additions like created_at or updated_at will not be serialized and sent to Ember as JSON).

NOTE:

I recently tried this with ember-cli 2.11.0, and it looks like the formatting for JSON may have changed since I wrote this. To get this working, I had to create a new file named json_api.rb in the initializers folder containing the following code (per this tutorial from EmberIgniter).

ActiveSupport.on_load(:action_controller) do
  require 'active_model_serializers/register_jsonapi_renderer'
end

ActiveModelSerializers.config.adapter = :json_api

Once you’ve made these changes, reload http://localhost:3000/users or go to one of the individual users, e.g. http://localhost:3000/users/1. You should see the following change to the json representation

{"users":[{"id":1,"name":"user1"},{"id":2,"name":"user2"},{"id":3,"name":"user1"},{"id":4,"name":"user2"}]}

Step 2 – Tell Rails to allow Ember to use the API

For security reasons, Rails by default won’t allow an application running on a different port to access the API. To solve this, add the following to your Gemfile

gem 'rack-cors', :require => 'rack/cors'

And add the following configuration to your Rails config/application.rb file

class Application < Rails::Application
  config.middleware.use Rack::Cors do
    allow do
      origins '*'
      resource '*', headers: :any, methods: [:get, :post, :put, :delete, :options]
    end
  end
end

and run

bundle update

and restart the rails server

This is the minimal configuration to get this example working – for more information on how to allow cross-origin requests properly, check out the rack-cors documentation at https://github.com/cyu/rack-cors

Step 3 – Point Ember at the Rails API

In your Ember application, open the adapter file at app/user/adapter.js (if you didn’t use the --pod structure, this will be in app/adapters instead). You should see a line

namespace: 'api'

change this to

host: 'http://localhost:3000'

You will also need to disable Mirage so that it won’t intercept Ember’s communications with the Rails app. In the config directory of the ember-crud application, open the environment.js file and add

if (environment === 'development') {
  ENV['ember-cli-mirage'] = {
    enabled: false
  };
  // …
}

Step 4 – Turn off Ember’s cli security policy

Like Rails, Ember comes with configuration defaults to protect against cross-domain security problems. To get this example running quickly, you can remove this line from package.json in your Ember app.

"ember-cli-content-security-policy": "0.4.0",

As with other config options in this tutorial, this is something you’ll want to read about and understand rather than just disabling.

Step 5 – Verify that the changes to the Ember UI are directed to and properly handled by the Rails application

Restart ember by typing

ember serve

(you may need to stop and restart the server if it is still running), and navigate to

http://localhost:4200/users

You should see a list of the users you created in the Rails database seed.

Try adding, editing, or deleting a few users. You can verify the changes at the Rails back end by rendering the list of Users in the database as JSON by going to

localhost:3000/users

Step 6 – Fix the duplicates problem

You may have noticed that when you create a new record, two new items are added to your list – but if you look at the rails service, only one record was persisted and one of the records in the ember list has a null id. If you refresh the page, the record with the null id will disappear.

I’ve been looking around for a real solution to this. If you just want to make the record with the null id disappear, you can hack it in the index route:

model: function() {
   return this.store.find('user', {id: true});
}

Postscript

I wrote this as part of my notes on getting up and running with Ember and Rails. I’ve found that if I don’t do these write-ups while I’m learning something (figuring I’ll do it later when I understand it all better), there’s a good chance I’ll never do it at all. However, I figured it would be a good idea to run it by a colleague here at the UCSF CKM, Jon Johnson, who has some Ember experience. He said it was no problem to post his reply:


There are a couple of things you might do a little bit differently, but they aren’t wrong. I’m not sure if its a drop in but https://github.com/fotinakis/jsonapi-serializers looks like a better serializer to use in rails. Active record will work and continue to be supported, but if I were going at this from scratch I would start with that.

For Ember you might want to setup the adapter globally to talk to rails instead of just for the user model. You can do that with `ember g adapter application` It looks like that paragraph might be stuck between these two things as you reference application.js being in the user pod.

Instead of killing mirage in development you could also restart the server in production mode. Thats what we do to talk to the real API. Something like `ember s –env=production` will not go through mirage at all. I’m not sure if that is easier or harder than your way.

Random Forests and Datashare at the CDL Code Jam

The California Digital Library hosted a code jam earlier this month at the Oakland City Conference center.

This gathering brought together librarians and developers from University of California campuses for a series of working meetings with an eye toward system-wide projects, especially involving data curation.

In the spirit of an informal, code-jam style meeting, I presented a bit on my recent experiments using machine learning to categorize data. As a starting example, I applied a random forest to suggest subject keywords for data sets uploaded to the recently launched DataShare website.


Working with Blacklight Part 2 – displaying the result count

This is the second in a series of posts about customizing Blacklight.

Last time, we implemented a feature that emailed a list of saved searches. We’d also like to display the number of results retrieved by each search. This task is a good way to learn about how a Solr response is stored and processed in Blacklight.

You can either start from a clean installation of Blacklight or build on the results of the previous exercise. A completed version is available on GitHub at https://github.com/ucsf-ckm/blacklight-email-tutorial.

Step 1: Add a “numfound” attribute to the Search model

Search history and saved searches are stored in an array of Search objects. The Search model in Blacklight holds the query_params for a search but doesn’t store the number of results. We’ll add an attribute, “numfound”, to store this value.

There are a few ways to do this in Rails – here, we’ll go with a migration.


rails g migration add_numfound_to_search numfound:integer

This should produce a new migration


class AddNumfoundToSearch < ActiveRecord::Migration
  def change
    add_column :searches, :numfound, :integer
  end
end

.. and run the migration


rake db:migrate

You may want to inspect the new schema or object to make sure that the model has been modified properly.

Step 2: Retrieve the number of results and store them in the Search object

Searches are created and stored in the search_context.rb class in the Blacklight gem (under lib/assets/blacklight/catalog/search_context.rb).


    saved_search ||= begin
      s = Search.create(:query_params => params_copy)
      add_to_search_history(s)
      s
    end

This code is not called explicitly in a controller – instead, it is run as a before_filter prior to the execution of any controller that includes it. This is mentioned in the comments at the top of the search_context.rb file.

This works for storing the query parameters, which are known before the controller is called. However, we won’t know the number of results in the Solr response until after the controller runs, so we’ll need to move the code for creating and saving a Search into a controller method.

We can get access to the object holding the solr response in the index method of the catalog controller (under lib/blacklight/catalog.rb in the Blacklight gem).


      (@response, @document_list) = get_search_results
      @filters = params[:f] || []

The get_search_results method in solr_helper.rb runs a Solr query and returns a SolrResponse object (lib/solr_response.rb). Since this exercise is really about getting familiar with the Blacklight code base, it’s worth opening these classes and taking a look at how a query is executed and how results are stored.

The solr_response object (stored in @response, above) provides a hash with results data. The number of results is stored under “numFound”. We can now modify the index method to retrieve the number of results associated with a Solr query, add them to the Search object, and save the results.

Here’s the full code (add this to catalog_controller.rb in your local app).


 # get search results from the solr index
    def index

      (@response, @document_list) = get_search_results
      @filters = params[:f] || []

      params_copy = params.reject { |k,v| blacklisted_search_session_params.include?(k.to_sym) or v.blank? }

      return if params_copy.reject { |k,v| [:action, :controller].include? k.to_sym }.blank?

      saved_search = searches_from_history.select { |x| x.query_params == params_copy }.first

      s = Search.new(:query_params => params_copy)
      s.numfound = @response.response["numFound"]
      s.save
      add_to_search_history(s)

      respond_to do |format|
        format.html { }
        format.rss  { render :layout => false }
        format.atom { render :layout => false }

        format.json do
          render json: render_search_results_as_json
        end
      end
    end

Step 3: Add the number of results to the view

Now that the number of results is available in the Search object, you can easily display them in the index page in the saved_searches or search_history views.

Here’s the snippet for index.html.erb under saved_searches


<table class="table table-striped">
  <%- @searches.each do |search| -%>
    <tr>
      <td><%= link_to_previous_search(search.query_params) %></td>
      <td>results: <%= search.numfound %></td>
      <td><%= button_to t('blacklight.saved_searches.delete'), forget_search_path(search.id) %></td>
    </tr>
  <%- end -%>
</table>

The only change here is the addition of “search.numfound” populated in the controller method above.

You can add the number of results to the search_history similarly.

Step 4: Try it out

You should now be able to run a search, list the search history (or saved searches, depending on which views you modified), and view the number of results associated with each search.

One note – this numfound value won’t automatically update if new material is added to the index; clicking on the search link will re-run the search and display the new, larger count. So the stored value can get out of sync here.

Working with Blacklight Part 1 – email search history

This is the first of a series of posts about configuring and modifying Blacklight at UCSF. It’s less about emailing search history and more about getting familiar with Blacklight by picking something to modify and seeing how it goes…

We are developing a front end for a Solr repository of tobacco industry documents. Blacklight, out of the box, provides a lot of what we’d need. We decided to come up with a business requirement that isn’t currently in Blacklight and see what it’s like working with the code.

We decided to try emailing a list of saved searches. This blog post is a write-up of my notes. I’m hoping it will be useful as a tutorial/exercise for developers looking to get up to speed with working with Blacklight code.

You should be able to start with a clean installation of Blacklight and add the functionality to email search histories from the notes here. A completed version is available on GitHub at https://github.com/ucsf-ckm/blacklight-email-tutorial.

Step 1: Get a clean installation of a Blacklight app going

Use the quickstart guide at
https://github.com/projectblacklight/blacklight/wiki/Quickstart

(do all of it, including the Jetty Solr part).

Step 2: Configure an SMTP mailer (optional)

This is optional, but I prefer not to use a system mailer on my dev machine.

in config/environments/development.rb


# Expands the lines which load the assets
config.assets.debug = true
config.action_mailer.delivery_method = :smtp
config.action_mailer.default_url_options = { host: 'myhost.com' }
config.action_mailer.perform_deliveries = true

config.action_mailer.smtp_settings = {
    :address => "smtp.gmail.com",
    :port => 587,
    :domain => "localhost:3000",
    :user_name => "username",
    :password => "password",
    :authentication => "plain",
    :enable_starttls_auto => true
}

Test this to be sure it works by creating and emailing a Blacklight bookmark to yourself (the next steps won’t work if this doesn’t work).

Step 3: Add a feature to send an email history through the saved searches page

1) Create and save a few searches

Do a few searches (anything you like), then go to Saved Searches and save a few of them.
You’ll notice that unlike the Bookmarks page, there’s no functionality to email your saved searches yet.

2) Add a button to email saved searches.

First, we need to add an email button to the saved searches page. We’ll piggyback on the email button used for bookmarks.

If you look in your views directory, you won’t see any code in your local app – it is currently stored in the Blacklight gem. Because our customizations are local, we (of course) won’t hack the gem directly; we’ll add or override things in our local app.

You can follow this tutorial without looking at the Blacklight gem source directly, but I’d recommend unpacking the gem so that you can look at the code. Do not change the gem code.

We’ll need to add an email button to the Saved Searches page. To do this, we’ll need to both create a new view and override an existing view in the Blacklight gem.

The view code for the main display page for saved searches is in /app/views/saved_searches/index.html.erb

We’ll override this page locally to add the email button. To do this, create a new directory called saved_searches in the views directory and create a file called index.html.erb with this content (modified from the same file in the gem itself):


<div id="content" class="span9">
<h1><%= t('blacklight.saved_searches.title') %></h1>

<%- if current_or_guest_user.blank? -%>
<h2><%= t('blacklight.saved_searches.need_login') %></h2>
<%- elsif @searches.blank? -%>
<h2><%= t('blacklight.saved_searches.no_searches') %></h2>
<%- else -%>
<p>
<%= link_to t('blacklight.saved_searches.clear.action_title'), clear_saved_searches_path, :method => :delete, :data => { :confirm => t('blacklight.saved_searches.clear.action_confirm') } %>
</p>

<h2><%= t('blacklight.saved_searches.list_title') %></h2>
<%= render 'search_tools' %>

<table class="table table-striped">
<%- @searches.each do |search| -%>
<tr>
<td><%= link_to_previous_search(search.query_params) %></td>
<td><%= button_to t('blacklight.saved_searches.delete'), forget_search_path(search.id) %></td>
</tr>
<%- end -%>
</table>

<%- end -%>

</div>

This will add a search tools partial (through <%= render ‘search_tools’ %>) to the index page.

The _search_tools.html.erb partial doesn’t exist in the gem. To create one, we’ll copy and modify the _tools.html.erb partial from the gem (used to render the various tools for bookmarks) to create a partial _search_tools.html.erb (also in the saved_searches view folder).


<ul class="bookmarkTools">

<li class="email">
<%= link_to t('blacklight.tools.email'), email_search_path(:id => @searches), {:id => 'emailLink', :class => 'lightboxLink'} %>
</li>
</ul>

3) Create routes for the email_search path

This email button links to a new path (email_search_path) that will need routes. Your first instinct as a Rails programmer might be to look in config/routes.rb. But the Blacklight gem uses a separate class in /lib/blacklight/routes.rb to generate most of the routes.

Instead of manually creating a new route in the config folder, we’ll modify Blacklight’s routes class. There are a few ways to do this. You could override the entire class by creating a routes.rb file under the same directory path in your Rails app. For this exercise, we’ll limit our modifications to the method we need to override and put the code in an initializer. Although we’re only overriding one method, I would recommend taking a look at the full source of lib/blacklight/routes.rb in the gem to get a better sense of what this class does.


require "#{Blacklight.root}/lib/blacklight/routes.rb"

# -*- encoding : utf-8 -*-
require 'deprecation'
module Blacklight
  class Routes
    extend Deprecation
    protected
    module RouteSets
      def saved_searches(_)
        add_routes do |options|
          delete "saved_searches/clear",       :to => "saved_searches#clear",   :as => "clear_saved_searches"
          get "saved_searches",       :to => "saved_searches#index",   :as => "saved_searches"
          put "saved_searches/save/:id",    :to => "saved_searches#save",    :as => "save_search"
          delete "saved_searches/forget/:id",  :to => "saved_searches#forget",  :as => "forget_search"
          post "saved_searches/forget/:id",  :to => "saved_searches#forget"
          get "saved_searches/email",       :to => "saved_searches#email",   :as => "email_saved_searches"
          post "saved_searches/email"
        end
      end
    end
    include RouteSets
  end
end

4) Add a form to submit the email

Now that the routes are in place, we can create the form needed to submit an email.

In app/views/saved_searches create an email.html.erb view. This is based on the email.html.erb used to email bookmarks (under app/views/catalog in the blacklight gem).


<div class="modal-header">
<button type="button" class="close" data-dismiss="modal" aria-hidden="true">×</button>
<h1><%= t('blacklight.email.form.title') %></h1>
</div>
<%= render :partial => 'email_search_form' %>

In the same directory, create a partial to provide the form fields.

_email_search_form.html.erb


<%= form_tag url_for(:controller => "saved_searches", :action => "email"), :id => 'email_search_form', :class => "form-horizontal ajax_form", :method => :post do %>

<div class="modal-body">
<%= render :partial=>'/flash_msg' %>
<div class="control-group">
<label class="control-label" for="to">
<%= t('blacklight.email.form.send_to') %>
</label>
<div class="controls">
<%= text_field_tag :to, params[:to] %><br/>
</div>
</div>
<div class="control-group">
<label class="control-label" for="message">
<%= t('blacklight.email.form.message') %>
</label>
<div class="controls">
<%= text_area_tag :message, params[:message] %>
</div>
</div>
</div>
<div class="modal-footer">
<button type="submit" class="btn btn-primary"> <%= t('blacklight.sms.form.submit') %></button>
</div>
<% end %>

5) Add an email_search action to the controller

The partial form invokes a controller action (email) that doesn’t exist yet. We’ll add this next.

The Blacklight gem has a class, saved_searches_controller.rb, that holds the controller methods for saved searches. We’ll be basing our new controller method on the email_record action that already exists in the gem’s catalog controller (in lib/blacklight/catalog.rb), which is worth taking a look at.

In app/controllers/saved_searches_controller.rb (in your local instance), put:


require "#{Blacklight.root}/app/controllers/saved_searches_controller.rb"

# -*- encoding : utf-8 -*-
class SavedSearchesController < ApplicationController
  include Blacklight::Configurable

  # Email Action (this will render the appropriate view on GET requests and process the form and send the email on POST requests)

  def email

     @searches = current_user.searches

     if request.post? and validate_email_params

        email = SearchMailer.email_search(@searches, {:to => params[:to], :message => params[:message]}, url_options)
        email.deliver

        flash[:success] = I18n.t("blacklight.email.success")

        respond_to do |format|
           format.html { redirect_to catalog_path(params['id']) }
           format.js { render 'email_sent' }
        end and return
     end

     respond_to do |format|
        format.html
        format.js { render :layout => false }
     end
  end

  def validate_email_params
    case
    when params[:to].blank?
      flash[:error] = I18n.t('blacklight.email.errors.to.blank')
    when !params[:to].match(defined?(Devise) ? Devise.email_regexp : /^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$/)
      flash[:error] = I18n.t('blacklight.email.errors.to.invalid', :to => params[:to])
    end

    flash[:error].blank?
  end

end

Here, the email action is grabbing the saved searches from the current_user object and storing them in an array.

@searches = current_user.searches

If the request is a POST, the form has been submitted, so the method will call a mailer method (email_search, which we still need to write) and pass the @searches array as a parameter.

6) Create the mailer method

Create a new file named search_mailer.rb in the app/mailers directory. This is similar to the record_mailer.rb file in the Blacklight gem, adapted for a list of searches rather than bookmarks.


require "#{Blacklight.models_dir}/record_mailer.rb"

# -*- encoding : utf-8 -*-
class SearchMailer < ActionMailer::Base

   def email_search(searches, details, url_gen_params)
      subject = I18n.t('blacklight.email_search_subject', :title => ("search results") )

      @searches = searches
      @message = details[:message]
      @url_gen_params = url_gen_params

      mail(:from => "youremail@yourserver.edu", :to => details[:to], :subject => subject)
   end
end

The subject text (blacklight.email_search_subject) doesn’t exist yet. You can see a full list of these strings in the gem under config/locales. We’ll add the new text required for our local app to blacklight.en.yml.


en:
  blacklight:
    application_name: 'Blacklight'
    email_search_subject: 'Your saved search history'

7) Create the mailer view

You will also need a view for this mailer to create the body of the email that will be sent. The view for document emails in the Blacklight gem is in app/views/record_mailer/email_record.html.erb.

We’ll create a similar view for the search history email.

In your local app, create a search_mailer directory in app/views, and create a new view named email_search.text.erb. (In other words, create app/views/search_mailer/email_search.text.erb.)


Here are your saved searches

sent with the message: <%= @message %>

<% @searches.each do |s| %>
http://localhost:3000/?<%=(s.query_params).to_query%>
<%end%>

Give it a try! You should now be able to email your saved searches through Blacklight.

8) Next steps

As you can see, the email view for search is hacky. You don’t want to hardcode localhost, and you should probably exclude the action and controller name in the URL. You might also want to consider moving some of the headers and text to a configuration file. (Check out config/locales/blacklight.en.yml and blacklight.fr.yml in the gem for a starting point.)