On Fungi and Data Stores

Warning: Rant ahead. Feel free to skip the nstore backend section.

Motivation

I’ve spent the past year looking into the fungi kingdom, and the deeper I look, the weirder it gets. One barrier to entry is identifying mushrooms, and there are two schools of thought on how to do it:

  • Carefully observing their features and using a dichotomous key system to narrow down to a manageable set of matches. I found Michael Kuo’s website useful for this.
  • Taking a few photos and letting a neural network analyze them.

I’m not a fan of the latter approach for various reasons: you’re at the mercy of the training set’s quality, such networks are easy to subvert, and they’re essentially undebuggable. I also found that Wikipedia has basic identification data on mushrooms, so I thought it would be a fun exercise to build my own web application for quickly narrowing down interesting Wikipedia articles to read. You can find the code over at https://depp.brause.cc/brause.cc/wald/, with the web application itself hosted at https://wald.brause.cc/.

Data munging

Wikipedia’s mushroom articles use so-called mycomorphbox templates to hold each species’ characteristics. Using the Wikipedia API, one can query for the latest revision of every page containing a mycomorphbox template and fetch its contents as JSON and Wiki markup.
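The extraction step boils down to requests like the following. This is a minimal sketch, assuming CHICKEN’s http-client and medea eggs and the Template:Mycomorphbox template name; continuation handling and rate limiting are omitted.

;; Fetch the latest revision of every page embedding the mycomorphbox
;; template and parse the JSON response into an alist/vector tree.
(import (chicken io) http-client medea)

(define api-url
  (string-append
   "https://en.wikipedia.org/w/api.php"
   "?action=query&format=json"
   "&generator=embeddedin&geititle=Template:Mycomorphbox&geilimit=50"
   "&prop=revisions&rvprop=content&rvslots=main"))

(define (fetch-batch)
  (read-json (with-input-from-request api-url #f read-string)))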

While I like writing scrapers, I dislike that the programs tend to be messy and require an online connection for every test run. I took the chance to try out the ETL pattern, that is, writing separate programs that perform the extraction (downloading data from the service while avoiding tripping API limits), transformation (massaging the data into a form that’s easier to process) and loading (putting the data into a database). I quite like it. Each part has its own unique challenges, and by keeping each one in a separate program I can focus on it fully. Instead of fetching, transforming and loading the data on every run, I first focus on fetching it correctly to disk, then on transforming the dump into a more useful form, then on figuring out how to load it into the database. If more patterns of that kind emerge, I can see myself writing utility libraries for them.
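To make the split concrete, the transform step looks roughly like the sketch below: hypothetical file layout, a stand-in for the real wiki markup parsing, and the medea egg assumed for JSON. The point is that it never touches the network, only the dump the extract step left on disk.

(import (chicken io) medea)

(define (extract-species page)
  ;; stand-in for the real code, which digs the mycomorphbox parameters
  ;; out of the page's wiki markup
  page)

(define (transform raw-file out-file)
  ;; read the raw dump, normalise it, write the cleaned-up file for the
  ;; load step to pick up
  (let* ((raw (read-json (with-input-from-file raw-file read-string)))
         (species (map extract-species (vector->list raw))))
    (with-output-to-file out-file
      (lambda () (write-json (list->vector species))))))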

Data stores

There were two obvious choices for storing the data:

  • Keeping it as JSON and just writing ugly code traversing the parse tree.
  • Using SQLite because it’s a fast and reliable solution. That is, once you’ve come up with a suitable schema fitting the problem at hand.

I wanted to try out something different this time, though - something other than JSON or a relational database. Perhaps something in the NoSQL realm that’s both pleasant to use and comes with a query language, or maybe a dumb key-value store to speed up loading and dumping the data. I ended up going with a tuple store, but I’m still considering giving graph and document databases a try. Here are some benchmark figures for querying all available filters and for filtering species with a complicated query:

[wasa@box ~]$ time DB=json ./benchmark mushrooms.json >/dev/null
Filters: 14898.5027832031μs
Query stats: 1808.65561523438μs
DB=json ./benchmark mushrooms.json > /dev/null  1.37s user 0.09s system 98% cpu 1.480 total
[wasa@box ~]$ time DB=sqlite ./benchmark db.sqlite3 >/dev/null
Filters: 214.554809570313μs
Query stats: 3953.87497558594μs
DB=sqlite ./benchmark db.sqlite3 > /dev/null  0.24s user 0.01s system 96% cpu 0.253 total
[wasa@box ~]$ time DB=nstore ./benchmark db.lmdb >/dev/null
Filters: 355414.137402344μs
Query stats: 407887.70847168μs
DB=nstore ./benchmark db.lmdb > /dev/null  8.15s user 0.05s system 99% cpu 8.250 total

Bonus requirement: the storage solution shouldn’t be hardcoded, but selectable at runtime. I hoped this wouldn’t complicate things too much and would encourage a cleaner design. For this I came up with a simple API revolving around establishing/closing a database connection, performing a transaction on that connection and querying filters/species within a transaction.
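The actual code lives in the repository; the sketch below only shows the shape of that abstraction, with made-up names: a backend is a handful of procedures, picked at runtime (the DB environment variable in the benchmark runs above), and the frontend calls nothing else.

;; Shape of the storage abstraction, hypothetical names: every backend
;; provides these five operations.
(define-record-type backend
  (make-backend connect disconnect with-transaction filters species)
  backend?
  (connect backend-connect)                   ; file/URI -> connection
  (disconnect backend-disconnect)             ; connection -> unspecified
  (with-transaction backend-with-transaction) ; connection, (tx -> result) -> result
  (filters backend-filters)                   ; tx -> available filters
  (species backend-species))                  ; tx, query -> matching species

;; Using one such record b then looks roughly like this:
;; (let ((conn ((backend-connect b) "db.sqlite3")))
;;   ((backend-with-transaction b) conn
;;    (lambda (tx) ((backend-species b) tx query))))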

JSON backend

This was rather braindead code. It’s far from pretty, but does the job surprisingly well. Queries are acceptably fast, so it makes for a nice fallback. Initial loading time is a bit slow though; a key-value store like LMDB would help here. Maybe it’s time for a binary Scheme serialization solution along the lines of Python’s pickle format, but without the arbitrary code execution parts.
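The ugly part is mostly tree traversal. Condensed, the query path looks something like this - a sketch assuming medea-style parse trees (objects as alists with symbol keys, arrays as vectors) and a made-up record layout:

(import (chicken base) (chicken io) srfi-1 medea)

(define (load-db file)
  ;; the whole dump is read into memory up front, hence the slow startup
  (read-json (with-input-from-file file read-string)))

(define (species-matching db wanted)
  ;; wanted is an alist of (attribute . value) pairs; a species matches if
  ;; every requested attribute is present with the requested value
  (filter (lambda (species)
            (every (lambda (criterion)
                     (equal? (alist-ref (car criterion) species)
                             (cdr criterion)))
                   wanted))
          (vector->list db)))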

SQLite backend

It took considerable time to get the schema right. I ended up asking another SQL expert for help with this and they taught me about EAV tables. Another oddity was that the database only performed properly after running ANALYZE once. The code itself is relatively short, but makes use of lots of string concatenation to generate the search query.
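To make the EAV idea concrete, here is roughly what it amounts to - a sketch with made-up table and attribute names, assuming the sql-de-lite egg, not the actual schema:

;; Every mycomorphbox characteristic becomes one (species, attribute, value)
;; row, so new characteristics need no schema change.
(import srfi-1 (chicken string) sql-de-lite)

(define db (open-database "db.sqlite3"))

(exec (sql db "CREATE TABLE IF NOT EXISTS eav
               (species TEXT, attribute TEXT, value TEXT)"))

;; This is where the string concatenation comes in: one subquery per
;; search criterion, INTERSECTed together.
(define (search-query n)
  (string-intersperse
   (make-list n "SELECT species FROM eav WHERE attribute = ? AND value = ?")
   " INTERSECT "))

(query fetch-all
       (sql db (search-query 2))
       "cap-shape" "convex" "spore-print-color" "brown")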

nstore backend

Retrospectively, this was quite the rabbit hole. I ignored the warning signs, persisted and eventually got something working. But at what cost?

My original plan was to use a graph database like Neo4j. I’ve seen it used for analysis of social graphs, Active Directory networks and source code. It’s powerful, though clunky and oversized for my purposes. If I can avoid it, I’d rather not run a separate Java process and tune its garbage collection settings to play well with everything else running on my VPS. On top of that I’d need to write a database adaptor, be it for their HTTP API or the Bolt protocol. If you’re aware of a comparable in-process solution, I’d be all ears. It doesn’t even need to do graphs (the data set doesn’t have any connections), a JSON store with a powerful query language would be sufficient.

I asked the #scheme channel on Freenode about graph databases and was told that tuple stores have equivalent power while being considerably easier to implement. SRFI-168 describes a so-called nstore and comes with a sample in-memory implementation depending on SRFI-167 and a few others. Getting it running seemed like an ideal weekend exercise. Or so I thought. I ran into the following obstacles, and weeks turned into months of drudgery:

  • The specifications themselves are of subpar quality. It seems little proofreading was done: there are minor errors in the language, several unclear parts and outright design mistakes that render parts of the library unusable. Unfortunately I noticed this long after the SRFI had been finalized. While the process allows for errata, it took some back and forth to get the most egregious faults in SRFI-167 fixed. Some faults remain in SRFI-168, and the sample implementation is incompatible with SRFI-167 due to an API change.
  • There is no such thing as a query language. You get basic pattern matching and SRFI-158 generators; everything else, like grouping results or sorting them, you must do yourself (see the sketch after this list). For this reason the nstore implementation is a bit more verbose than the JSON one. Relevant webcomic.
  • The sample implementation itself depends on several other SRFIs, most of which I had to port first. Granted, I only did this because I wanted to contribute them properly to the CHICKEN coop, but it was still bothersome. I hacked on SRFI-125, SRFI-126, SRFI-145, SRFI-146, SRFI-158, SRFI-167, SRFI-168 plus alternative versions of SRFI-125 (using a portable hash tables implementation instead of the stock one) and SRFI-167 (using LMDB for its backend).
  • Some of the SRFIs were particularly difficult to port. SRFI-125 turned out to work with neither the stock hash tables (they’re incompatible with R6RS-style APIs) nor the R6RS-style hash table implementation provided by SRFI-126 (the stock implementation fails with custom comparators and the portable SRFI-69 implementation runs into an infinite loop when executing the test suite). SRFI-167 requires a custom backend for on-disk storage; I initially messed around with Sophia for this (it turned out to be unusable) and eventually settled on an LMDB-backed implementation. The SRFI-167 and SRFI-168 eggs deviate from the official APIs and have therefore not been published. For this reason only SRFI-145, SRFI-146 and SRFI-158 have been added to the coop.
  • During the time I worked on the project, some of the links pointing towards documentation, implementations and example code broke. When I communicated with the author, I got the impression they had become dissatisfied with the project and wanted to start over with a clean slate. Links have been replaced, but some code has been permanently lost. Most recently they admitted they don’t have any working implementation of SRFI-167 and SRFI-168 at hand. I consider this a deeply troubling sign for the health of the project and therefore discourage anyone from relying on it.
  • Once I actually got everything running with LMDB for the backing store, I was surprised to see awful overall performance. Even with JSON a query takes only a few milliseconds, whereas here it’s two orders of magnitude more. I did some light profiling and identified hot paths in both SRFI-128 and SRFI-167. For this reason the web application is currently using the SQLite backend.
  • The APIs themselves are kind of clumsy. I worked around this with my data storage abstraction, but it’s still something to look out for. Compared to clojure.jdbc or the sql-de-lite egg, there are a few obvious usability improvements to be made.
  • Eventually, after criticism from other people, there was talk of withdrawing the entire SRFI. It hasn’t been withdrawn so far, as the process requires a replacement SRFI. I believe this to be a mistake.
  • The SRFI process in general has accelerated in the last few years due to R7RS-large making heavy use of it for its dockets. There is the occasional SRFI among them that is too ambitious in scope and bound to become outdated. I believe this to be an abuse of the standardization process; instead there should be experimentation on a decoupled platform such as Snow or Akku. Once a project has been approved by the community and implemented by several Scheme systems, it can be considered for standardization. The pre-srfi repository lists a few upcoming projects of that kind, such as HTTP servers/clients, a P2P network proposal, a web UI library and Emacs-style text manipulation. I’m doubtful they will be anywhere near as successful as existing non-portable Scheme libraries.
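To illustrate the missing query language from the list above: the pattern matching hands back a SRFI-158 generator of variable bindings, and anything beyond that is manual labour. The sketch below shows the kind of glue this forces; bindings are treated as plain alists, with binding-ref as a stand-in for the actual accessor, since that part isn’t the point here.

(import srfi-1 srfi-158 (chicken sort))

(define (binding-ref binding name)     ; stand-in for the actual binding API
  (cdr (assq name binding)))

;; group the matches by colour, then sort each group by species name;
;; a query language would normally do this for you
(define (species-by-color bindings)    ; bindings: a SRFI-158 generator
  (let ((rows (generator->list
               (gmap (lambda (b)
                       (cons (binding-ref b 'color)
                             (binding-ref b 'species)))
                     bindings))))
    (map (lambda (color)
           (cons color
                 (sort (map cdr (filter (lambda (row) (equal? (car row) color))
                                        rows))
                       string<?)))
         (delete-duplicates (map car rows)))))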

Needless to say, I became increasingly frustrated over time. To the SRFI-168 author’s credit, they’ve always been civil, recognized the design mistakes and are working on a less ambitious replacement library. While I do regret the time that went into this adventure, I have learned a few lessons:

  • LMDB and key-value stores in general are great. They’re easy to comprehend, have fast load times and can be a quick and dirty solution when dealing with relational models is complete overkill. I’m not sure whether ordered key-value stores are worth it though.
  • While it’s true that tuple stores are roughly equivalent in power to graph databases, graph databases still have the edge. Mind you, the piece was written by a Neo4j person, so it’s most likely biased. Still, I’m inclined to believe their claims.
  • Portable code is cool, but it cannot compete with highly tuned solutions. Do not expect a sample implementation of a database to rival SQLite and friends.

Web frontend

I assumed this part would be way harder, but it only took me two days of hacking, without any sort of web framework backing it. I do miss some of the conveniences I learned from writing Clojure web applications though:

  • I had to write my own database abstraction instead of using clojure.jdbc and a connection string. On top of that there’s ugly code to detect which database to use and perform a dynamic import.
  • Stuart Sierra’s component library gives you easy dependency injection. For example, you can access configuration and database connections from an HTTP handler directly instead of having to use global or dynamically bound variables.
  • A ring-style API with request/response alists and middleware manipulating them would improve discoverability considerably (see the sketch below). It’s no deal breaker though.
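To show what I mean by ring-style, here is a minimal sketch, not tied to any particular egg: a handler maps a request alist to a response alist, and middleware wraps a handler to produce another handler.

(define (hello-handler request)
  `((status . 200)
    (headers . ((content-type . "text/plain")))
    (body . ,(string-append "Hello " (cdr (assq 'path request))))))

(define (wrap-logging handler)
  (lambda (request)
    ;; middleware: inspect/modify the request, delegate, then optionally
    ;; post-process the response
    (display (cdr (assq 'path request)))
    (newline)
    (handler request)))

;; ((wrap-logging hello-handler) '((path . "/species") (method . GET)))
;; => ((status . 200) (headers . ((content-type . "text/plain")))
;;     (body . "Hello /species"))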

Further thoughts

I’d have expected this project to suck any remaining enthusiasm for writing web applications out of me, but it didn’t. While I’m not sure whether I’ll stick to Scheme for them, I could see myself doing another one soonish. I think I’ll abstain from investing more time into databases though and hack on something else for the time being.