One of the datasets we use in the DATA Lab is a collection of millions of JSON files (tens of gigabytes in total), each representing a single entity. I got tired of, in this order:

  • scanning the entire dataset when I wanted something specific
  • running a DBMS service on my workstation just to INSERT and index the data
  • rewriting and rerunning the Python script that builds a dict/hashmap from field values to filenames every time something changed (see the sketch after this list)

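The index itself was nothing fancy: read each file once, pull out the field you care about, and remember which files contained which value. Here's a minimal sketch of what that script did (the `build_index` name, the `name` field, and the `data/` directory are illustrative, not the actual code):

```python
import json
from collections import defaultdict
from pathlib import Path

def build_index(data_dir: str, field: str) -> dict[str, list[str]]:
    """Map each value of `field` to the JSON files whose entity contains it."""
    index: defaultdict[str, list[str]] = defaultdict(list)
    for path in Path(data_dir).rglob("*.json"):
        try:
            entity = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed files
        value = entity.get(field)
        if value is not None:
            index[str(value)].append(str(path))
    return dict(index)

# A lookup is now a dict access instead of a scan over millions of files.
index = build_index("data/", "name")
print(index.get("some value", []))
```

The catch, of course, is that any change to the data or to the fields I cared about meant editing and rerunning something like this by hand, which is what pushed me toward the tool below.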
I generalized the Python script I was using to manage the index into a more user-friendly, configurable tool. It’s a pretty niche piece of software, targeted at people who have enough data to warrant indexing but don’t want to run a DBMS for the task (i.e., myself and a few colleagues, specifically).