One of the datasets we use in the DATA Lab is a collection of millions of JSON files, totaling tens of gigabytes, each representing a single entity. I got tired of, in this order:
- scanning the entire dataset when I wanted something specific
- running a DBMS service on my workstation and INSERTing and indexing the data
- rewriting and rerunning the Python script that built a dict/hashmap from field values to filenames every time something changed
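That last approach, building a dict from field values to filenames, is roughly what the tool grew out of. A minimal sketch of the idea (the directory layout, field name, and function name here are illustrative, not the tool's actual code):

```python
import json
from collections import defaultdict
from pathlib import Path

def build_index(data_dir, field):
    """Map each value of `field` to the JSON files that contain it."""
    index = defaultdict(list)
    for path in sorted(Path(data_dir).glob("*.json")):
        with open(path) as f:
            entity = json.load(f)
        if field in entity:
            index[entity[field]].append(path.name)
    return dict(index)
```

A lookup is then a dict access instead of a full scan, but the script hardcodes the field and has to be edited and rerun whenever the question changes, which is exactly the pain point.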
So I generalized the Python script I was using to manage the index into a more user-friendly, configurable tool. It's a pretty niche piece of software, aimed at people who have enough data to warrant indexing but don't want to run a DBMS for the task (i.e., myself and a few colleagues specifically).