7 x R7RS

14/09/2017

The nice thing about Scheme is that not only it comes in more than seven standards[1], but with tons of implementations as well. I’ve heard from various people that R7RS-small is good enough to write portable code in it that will work on more than one implementation, so I decided to try implementing MAL in it. This has been mostly successful as I’ve started out with 11 implementations, narrowed it down to 7 and found a new one for future experimentation. During this time I’ve reported 14 bugs for two implementations, Cyclone and Foment. My impressions of the implementations can be summarized as follows:

Chibi
The recommended implementation if you’re looking for a fully standards-compliant one. It comes with its own comprehensive test suite that has been borrowed by others. I’ve had zero issues with it, although the overall speed could be better, earning it the last place in the toy benchmarks (where it’s tied with Foment). This is somewhat weird considering that it’s advertised as small yet embeddable (look at Lua for an example of a fast one in that domain); Picrin would be another candidate for that purpose.
Kawa
It proudly wears the GNU banner (despite not being the Scheme officially endorsed by GNU), but is otherwise completely unknown. This is a shame because unlike the other JVM languages I’ve worked with it comes with a fast compiler, has a small boot time and didn’t pose any issues for me, other than having to learn how this classpath thing works in Java. Speed is decent as it comes third place in the toy benchmarks, interop doesn’t look painful either. Its author did a few more interesting things with it, such as implementing other languages like Emacs Lisp (which forms the basis of JEmacs). I might just as well invest more time into it and use it to write things where Clojure would otherwise be a bad idea.
CHICKEN
My favorite implementation for writing all kinds of small, standalone programs. I occasionally stumble upon issues, but none for this run. It’s not completely obvious how to create R7RS programs using libraries from anywhere else than the current directory, this will be hopefully addressed in the next major release, among with a more R7RS-like reorganization of the namespaces and improvements to the package system. Speedwise it’s the best, despite the r7rs egg being known for incurring a huge speed penalty in numerical benchmarks.
Gauche
The domain says it all. I haven’t expected it to support R7RS considering how long it has been around, but it does! Thanks to this it contains many contrib libraries and performs fairly well for an interpreter-only system. If I hadn’t discovered CHICKEN, this might have been my preferred Scheme to go for.
Picrin
The other contender in the embeddable domain. Unfortunately I ran into several problems, the most problematic ones being that it’s currently not buildable from master and that there are no official plans to support loading user-written libraries. For this reason I wouldn’t recommend using it over Chibi.
Sagittarius
A fork of Gauche with more features, some changes (build system, command-line switches) and a bit of performance improvements.
Cyclone
Judging from the design documents this is basically CHICKEN, but with a different approach to GC and threading that allows it to do native threads. It’s a relatively young project and therefore bound to have countless bugs, yet it’s second place in the speed department.
Foment
Someone’s personal project for learning Scheme. It’s about the same speed as Chibi and not quite finished yet.
Guile
The Scheme officially endorsed by GNU. It had a long time in making, but its R7RS support is minimal and only covers reader adjustments.
Racket
I’m somewhat surprised there is unofficial support for R7RS, given that the language itself has been derived from R6RS. Not tested yet.
Larceny
Research project that supports R5RS, R6RS and R7RS to varying degrees. Due to R7RS incompatibilities, a somewhat outdated documentation and a ridiculous toolchain, I’ve decided to test it at some later point to see whether it becomes the fastest implementation for MAL.
Gerbil
I’ve learned about this one from a blog post by Fare, the maintainer of ASDF, a build system for CL packages. It turns out that not only he likes Racket, no, he also switched to this Scheme implementation because it’s made by a close friend who not only ported Racket’s module system to Gambit, but also added R7RS support to it. I expect it to become usable soon and to make for another contender to the fastest Scheme implementation for MAL.
[1]Additionally to RnRS there’s the IEEE standard (which mostly resembles R5RS) and the upcoming R7RS-large standard that aims at giving Scheme a big standard library.

Parsing the Hard Way

26/08/2017

Hello again and sorry for the long downtime! My current Emacs project is an EPUB reader, something that requires parsing XML and validating/extracting data from it. The former can be done just fine in Emacs Lisp with the help of libxml2 (or alternatively, xml.el), for the latter there is no good solution. Typically people go for one of the following approaches:

  • Don’t parse at all and just use regular expressions on the raw XML. This works somewhat okayish if your input is predictable and doesn’t change much.
  • Parse and walk manually through the parse tree with car, cdr and assoc. Rather tedious and requires writing your own tree traversal functions for anything less than static XML.
  • Invent your own library and use a selector DSL for DOM traversal. I’ve seen a few of those, like xml+.el, enlive.el and xml-query.el, however they support relatively little features in their selectors, use their own language instead of a well-established one (such as CSS selectors or XPath) and are usually not available from a package archive for easy installation.

As I’m a big fan of APIs like Python’s lxml with the cssselect module and used the esxml package before, I decided to implement CSS selectors for it. The general strategy was to take parse a CSS selector into a suitable form, do tree traversal by interpreting the parse tree and return the nodes satisfying the selector. Surprisingly enough, the hardest part of this were the parsing bits, so I’ll go into a bit more of detail on how you’d do it properly without any dependencies.

The approach taken in esxml-query.el is recursive-descent parsing, as seen in software like GCC. Generally speaking, a language can be described by a set of rules where the left side refers to its name and the right side explains what it expands to. Expansions are sequences of other rules or constants (which naturally cannot be expanded) and may contain syntactic sugar, such as the Kleene star (as seen in regular expressions). Given an input string described by the grammar, a parser breaks it down according to its rules until it has found a valid combination. The easiest way to turn a grammar into code is by expressing it with a function for each rule, with each function being free to call others. Success and failure can be expressed by returning a piece of the parse tree, a special sentinel value (I’ve chosen to return nil if the rule wasn’t completely matched) or throwing an error, thereby terminating the computation. If all recursively called rule functions returned a bit of the parse tree, the top-level call returns the complete parse tree and the parsing attempt has been successful.

Traditionally there is an extra step before parsing the string, as it’s a bit tedious to express the terminating rules as a sequence of characters, the string is typically preprocessed by a so-called lexer into a list of tagged tokens. This is relatively simple to do in Emacs Lisp by treating the string like a buffer, finding a token that matches the current position, adding it to the list of found tokens and advancing the position until the input has been exhausted. There is one non-trivial problem though, depending on the token definitions it can happen that there are two different kinds of tokens for a given position in the input string. A simple solution here is picking the longer match, this is why the tokenization in esxml--tokenize-css-selector finds all possible matches and picks the longest one.

The syntactical sugar used for the official CSS grammars consists of alternation (|), grouping ([...]), optionals (?) and greedy repetition (* and +). Given the basic token operations (peek) (return first token in the stream) and (next) (pop first token in the stream), it’s straight-forward to translate them to working code by using conditionals and loops. For example, the rule whitespace: SPACE* is consumed by calling (next) while (pop) returns a whitespace. To make things easier, I’ve also introduced an (accept TYPE) helper that uses (peek) to check whether the following token matches TYPE and either consumes it and returns the value or returns nil without consuming. With it the preceding example can be shortened to (while (accept 'space)). Similarly, alternation is expressed with cond and grouping with a while where the body checks whether the grouped content could be matched.

This parsing strategy allows for highly flexible error reporting going beyond “Invalid selector” errors I’ve seen previously in a browser console as you immediately know at which place the parser fails and are free to insert code dealing with the error as you see fit. Be warned though that you must understand the grammar well enough to transform it into a more suitable form, yet equivalent form if you run into rules that are hard or even impossible to express as code. Debugging isn’t too bad either, you can observe the junctions taken by your code and quickly spot at which it goes wrong.

I’m looking forward to venture into parser combinators and PEGs next as they follow the same approach, but involve less code to achieve similar results.


One of Those Who Can Type

11/07/2017
<paluche> wasamasa: how many times have you done MAL?
<wasamasa> paluche: this is the fourth time
<paluche> So...
<wasamasa> so, three times so far
<paluche> you're a MAL-coholic?

I don’t even need to answer this one, do I? PR #4 is for Smalltalk, one of those languages I always wanted to try out because of their influence. What convinced me to finally do it was that I had to learn a bit of Objective-C for work and this made the second obviously Smalltalk-inspired language I’ve encountered, other than Ruby. Overall, it was a pretty nice experience, despite not using an image-based and IDE-centric workflow with GNU Smalltalk. As usual, here’s my list of random notes:

  • Smalltalk has relatively minimalistic syntax
  • My biggest stumbling points with it were sequencing syntax (semicolon and period) and precedence rules
  • It’s kind of telling that the standard all implementations go by omitted how the syntax for defining a method looks like, this makes it more difficult than it should to share self-contained code examples on the web
  • Documentation or rather, finding the right method is a big problem, the canonical solution would be to use the browser, however the search function there errors out
  • I’ve therefore resorted to searching the locally installed sources and internets for things not specific to the implementation
  • Apply (aka valueWithArguments) is supported, variadic args aren’t or rather, there is no syntax for them
  • There is no cond/case construct, instead you’re supposed to either return from every conditional, do a dictionary lookup, use polymorphism or nest boolean tests
  • There is no continue/break construct, but it can be emulated easily with the non-standard valueWithExit method
  • It’s awesome just how much of the language is implemented in itself
  • It’s interesting that there are only five keywords and no special forms, only methods that are implemented in terms of VM primitives
  • The class hierarchy makes more sense than in Java, generally this is the cleanest implementation of OO I came across (though Self dares challenging this by replacing metaclasses with a prototyping mechanism as seen in JS)
  • The debugging experience is less than stellar, I have to try out Pharo for the real deal
  • Blocks are considerably cleaner than in Ruby where you have three options for subtly different purposes with subtly different semantics
  • Block syntax however is imperfect, you cannot return from them (attempting to do so gives you a silly error message), instead the last expression is used as return value
  • This gets stranger if you consider that (non-local) returns are exclusive to methods; in other words blocks aren’t intended to be as powerful as method bodies
  • There is no stack overflow in step 5, however if you push far enough, you can reach an unrecoverable OOM condition
  • The class library is far less forgiving than in Ruby, for instance slicing/access errors out instead of returning an empty array or nil
  • String syntax is a bit weird with regards to quoting as strings are delimited with , no backslash escapes exist and escaping of is done by doubling it
  • If there’s a language that invented monkey-patching, this is the one; I’ve successfully made use of this capability to fix a runtime bug that only happens in the CI environment

Nostalgia

18/05/2017

My first programming experience was with VB6. I considered trying TurboPascal, but couldn’t figure out how to use it from cmd.exe on my measly Windows 98 SE machine. Many things have been said about the quality of the BASIC family of languages and the subpar programs created with VB6, but it’s easy to forget how convenient they make it for the beginner to create something workable without losing their head in irrelevant details in the windowing toolkit. The workflow looks mostly like this: Create and arrange widgets, edit their properties, double-click each widget with custom behavior and write code for the respective event handler. Needless to say that the few original programs I did write weren’t of the useful kind. Only much later after a brief encounter with C++ for game programming (which made me quit programming for a few years) I discovered scripting languages, learned to love Python and eventually spun off into the polyglossia required by modern software development jobs.

windows93.net reminded me of the good ol’ times, leaving me wondering just how hard it would be to recreate the irretrievably lost programs from back then. My excursions into GUI programming have taught me that GNOME offers two separate projects that connect their introspection for GTK and friends with a JavaScript interpreter, seed and gjs. Unfortunately seed is no longer in the Arch repositories, so I picked gjs, just to find out that there is no official documentation, aside from an auto-generated set of API docs provided by a user.

The good news is that with the help of #gtk+ on Freenode, I managed porting all applications in less than a day. The bad news is that I doubt I’ll use gjs again as its situation reminds me of LLVM: Lots of potential, intriguing claims (in their case, introspection giving you language bindings for free) but with outdated documentation if any (you’re better off with reading the source and experimenting a lot) and tooling being half-assed. Rewriting teapub might be possible, but for what gain? It’s no surprise Qt is eating their lunch despite being more complex and requiring you to use C++. Had I been given this environment ten years ago, I’d probably have gone for browsers instead. So, in a way it’s not surprising that environment appears to breed the new generation of programmers.


A Visit to a Sad Planet

18/04/2017

One of the things that always irked me about my EPUB reader was that part of it is written in JavaScript, simply because that’s the only way one can script a browser view these days. I’ve always wanted to give the SPOCK compiler a try to see how well it does compared to ClojureScript, so this project gave me the perfect chance to cure me of delusions. After an evening, I got it running, it took me a few more evenings to iron out bugs introduced by the conversion and add a new bugfix. Here’s a list of observations made in that time:

  • Both versions are about the same size, with the Scheme version being a bit shorter (which is mostly closing parentheses not going on a separate line). I’d expect a greater difference in favor of the Scheme version if I had any noteworthy business logic embedded into this, but alas.
  • Debugging got significantly harder as there is no REPL, no source maps integration and no debugger for the Scheme code. I’ve had to do with classic printf-Debugging, except that it looked more like (%inline "console.log" (jstring (string-append "Foo " (number->string 42)))).
  • While there is documentation (which includes a few working examples), it isn’t clear how to use the compiler to its fullest abilities. I’ve resorted to compiling all kinds of code and staring at the compiler output to see what works and what doesn’t. This experimentation revealed that you’ll want to use %inline for most interop, with a bit of dot syntax for property access. ClojureScript beats SPOCK easily in this aspect, including its #js reader macro and conversion macros from/to JS data structures.
  • Error reporting is extremely basic, with some errors being silent and merely preventing code from executing any further.
  • The supported language is restricted to R5RS with a few useful macros and JS-specific helpers. In other words, while you might manage compiling other Scheme libraries to JavaScript, you’re better off writing your own helpers as needed.
  • Tooling is simple and quick. Recompiling code is instantaneous, it’s easy to see what part of your own code maps to the generated parts. This is the only benefit I see in SPOCK over ClojureScript.

To summarize, if you want maximum comfort and features, go for ClojureScript. The price you pay for it is significant friction while developing, but other than that it’s pretty advanced. Personally I think I’ll stay with vanilla JavaScript for my other toy projects to keep things as simple and painless as possible.

I predict that Guile Emacs won’t lead to a significant increase in packages written in Scheme for similar reasons. Much like in browsers, the majority of Emacs Lisp usage doesn’t have complex business logic and follows the principle of practicality over purity. Perhaps it’s different for big projects like Magit or Evil, but even these cases are doubtful to me, simply because they have higher priorities than speculative rewrites that might as well kill the project. I could keep rambling about my reasons for this assessment, but that is better left for a separate blog post…