Introducing Amalgamatic

“Search!” by Jeffrey Beall licensed CC BY-ND 2.0

Academic libraries offer many resources, but users cannot be expected to search in, say, a half dozen different interfaces to find what they’re looking for. So academic libraries typically offer federated search.

Sometimes, a solution is purchased. Many libraries, for example, use 360 Search.

Here at UCSF, we are among the libraries that have built our own federated search. Twice.

There are (at least) three ways to pull data out of other resources in real time.

  1. Cool, they have an API for that!
    This almost never happens.

  2. I will screen-scrape the #*%!?@ out of your website!
    This is by far the most common scenario.

  3. Web New-dot-Oh: It’s full of JavaScript that injects the content.
    An edge case, but one that is becoming more important all the time. Make friends with PhantomJS to scrape these sites.

A common way to implement these solutions is to build your screen-scraping federated search tool in a traditional server-side language like Java or PHP.

These strategies and technologies bring with them pitfalls to be avoided. Recompiling your WAR every time one of your target systems modifies their HTML layout, anyone?

Here’s another pitfall: Our group built a solution years ago (our first one) that is implemented in Drupal with no external-facing API. So, if I want to experiment with a different results interface, I need to write it in Drupal. This tight coupling prevents experimentation with other technologies or things that don’t fit neatly into the Drupal paradigm.

A lot of the pitfalls can be avoided by following sound software architecture principles. But one thing should be uncontroversial:

No programming language has a more robust and widely-understood set of conventions and tools for processing blobs of HTML than JavaScript.

So how about building your federated search server using Node.js? Or maybe even take it a step further and just let your user’s browser execute the federated search entirely by itself, no need to talk to your server! If it’s all just JavaScript, why not?

That is our approach this second time around.

First, we wrote a pluggable, extensible federated search tool called Amalgamatic.
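To make "pluggable" a bit more concrete, here is a minimal sketch of what a plugin-based federated search core could look like. It is not Amalgamatic's actual code or API: the `FederatedSearch` class, the `add()` and `search()` methods, the result shape, and the two stand-in plugins are all hypothetical names made up for illustration.

```javascript
// Hypothetical sketch of a pluggable federated search core.
// None of these names are taken from Amalgamatic itself.
class FederatedSearch {
  constructor() {
    this.plugins = new Map();
  }

  // Register a plugin: any object with a search(query) method that
  // returns (or resolves to) an array of results.
  add(name, plugin) {
    if (!plugin || typeof plugin.search !== 'function') {
      throw new TypeError(`plugin "${name}" must provide a search() method`);
    }
    this.plugins.set(name, plugin);
  }

  // Query every registered plugin in parallel. A failing plugin is
  // reported under its own name instead of failing the whole search.
  async search(query) {
    const entries = [...this.plugins.entries()];
    const settled = await Promise.allSettled(
      entries.map(([, plugin]) =>
        // Wrapping in then() turns a synchronous throw into a rejection.
        Promise.resolve().then(() => plugin.search(query))
      )
    );
    return settled.map((outcome, i) => {
      const name = entries[i][0];
      if (outcome.status === 'fulfilled') {
        return { name, data: outcome.value };
      }
      const reason = outcome.reason;
      return { name, error: reason instanceof Error ? reason.message : String(reason) };
    });
  }
}

// Stand-in plugins; a real one would call an API or scrape a page.
const demo = new FederatedSearch();
demo.add('catalog', {
  search: async (q) => [{ name: `Catalog hit for ${q.searchTerm}`, url: 'https://example.org/catalog/1' }]
});
demo.add('flaky', {
  search: async () => { throw new Error('target site is down'); }
});

demo.search({ searchTerm: 'medicine' }).then((results) => {
  console.log(JSON.stringify(results, null, 2));
});
```

The point of the design is that the core knows nothing about how any one resource is searched. Each plugin hides its own retrieval technique behind the same small interface, so a change to one target site should only touch one plugin.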

Second, we wrote the plugins that we needed to search the resources we were interested in:

In the course of writing these plugins, we used all three of the techniques described above (API, scraping HTML, and using a headless browser to get JavaScript-generated content).

Third, we used Amalgamatic to expose federated search on our API server. (source code)
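For a rough idea of what exposing federated search over HTTP can involve, here is a small sketch using only Node's built-in `http` module. It is not the code from our API server: the `/search` path, the `q` parameter, the status codes, and the `handleSearch` and `createApiServer` functions are assumptions chosen for illustration, and `search` stands in for whatever function runs the federated search.

```javascript
const http = require('http');

// Turn a request URL into a { status, body } pair. Kept separate from
// the server so it can be exercised without opening a socket.
// `search` is a hypothetical function: (term) => Promise<results>.
async function handleSearch(requestUrl, search) {
  const url = new URL(requestUrl, 'http://localhost');
  if (url.pathname !== '/search') {
    return { status: 404, body: { error: 'not found' } };
  }
  const term = (url.searchParams.get('q') || '').trim();
  if (!term) {
    return { status: 400, body: { error: 'missing q parameter' } };
  }
  try {
    return { status: 200, body: await search(term) };
  } catch (err) {
    // The search itself failed (for example, a target site was down).
    return { status: 502, body: { error: 'search failed' } };
  }
}

// Wire the handler into an HTTP server that always answers in JSON.
function createApiServer(search) {
  return http.createServer(async (req, res) => {
    const { status, body } = await handleSearch(req.url, search);
    res.writeHead(status, {
      'Content-Type': 'application/json',
      // Let front ends hosted on other origins call the API.
      'Access-Control-Allow-Origin': '*'
    });
    res.end(JSON.stringify(body));
  });
}

// To try it locally with a stand-in search function:
// createApiServer(async (term) => [{ name: 'catalog', data: [term] }]).listen(3000);
```

Because the results come back as plain JSON over HTTP, any front end can consume them, which is exactly the freedom the Drupal-only version lacked.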

Fourth, we set up a prototype search interface to use that API. (source code)

Lastly, because we could, we used Browserify to create a demo showing how to use Amalgamatic so that all the retrieval and processing happens in the browser—no need for an intermediary API or search server! (source code)

I hope others find this work useful. Use Amalgamatic, ask questions, file issues for bugs or feature requests, and write and publish your own plugins.

(Or tell me about your project that already does this better, so I can fold up shop or at least steal all your ideas.)

While I’m at it with the small-text thing, here’s a caveat on the Browserify-ed version: The one thing the browser couldn’t do was launch the PhantomJS headless browser for scraping sites that depend on JavaScript execution to display results. Fortunately for us, that was needed in the LibGuides plugin only. And LibGuides offers an API, so if we really wanted LibGuides results, we could use the API. We initially implemented it that way, actually, but found that the API results differed from the LibGuides search page results. We thought that might be confusing to users, so we went with PhantomJS-assisted scraping.