Forbidden Emacs Lisp Knowledge: Block Comments

12/02/2022

Note: The \037 sequence appearing in the code snippets is one character, escaped for readability.

It’s been eight years since I started using Emacs and Emacs Lisp, and I still keep running into dusty corners. Traditionally, Lisp dialects use the semicolon for line comments, with block and s-expression comments being optional features.

Dialect	Line comment	Block comment	S-expression comment
Clojure, Hy	`;`	n/a	`#_`
Common Lisp[1]	`;`	`#|...|#`	`#+(or)`
Emacs Lisp, Lush	`;`	n/a	n/a
ISLisp, LFE, uLisp	`;`	`#|...|#`	n/a
NewLisp	`;`, `#`	n/a	n/a
Picolisp[2]	`#`	`#{...}#`	n/a
Racket, Scheme[3]	`;`	`#|...|#`	`#;`
TXR Lisp	`;`	n/a	`#;`
WAT[4]	`;;`	`(;...;)`	n/a

Emacs Lisp is special though. Here’s an unusual section from the Emacs Lisp reference on comments:

The #@COUNT construct, which skips the next COUNT characters, is useful for program-generated comments containing binary data. The Emacs Lisp byte compiler uses this in its output files (see “Byte Compilation”). It isn’t meant for source files, however.

At first sight, this seems useless. The feature is meant for .elc files, not .el files, and judging by a file produced by the byte compiler, its only use is to emit docstrings:

;;; This file uses dynamic docstrings, first added in Emacs 19.29.

[...]

#@11 docstring\037
(defalias 'my-test #[...])

This is kind of like a block comment, except there is no comment terminator. For this reason, the characters to be commented out need to be counted. You’d think that the following would work, but it fails with an “End of file during parsing” error:

(defvar my-variable #@8 (/ 1 0) 123)

It took me a dive into the reader to find out why:

#define FROM_FILE_P(readcharfun)                            \
  (EQ (readcharfun, Qget_file_char)                         \
   || EQ (readcharfun, Qget_emacs_mule_file_char))

static void
skip_dyn_bytes (Lisp_Object readcharfun, ptrdiff_t n)
{
  if (FROM_FILE_P (readcharfun))
    {
      block_input ();                /* FIXME: Not sure if it's needed.  */
      fseek (infile->stream, n - infile->lookahead, SEEK_CUR);
      unblock_input ();
      infile->lookahead = 0;
    }
  else
    { /* We're not reading directly from a file.  In that case, it's difficult
         to reliably count bytes, since these are usually meant for the file's
         encoding, whereas we're now typically in the internal encoding.
         But luckily, skip_dyn_bytes is used to skip over a single
         dynamic-docstring (or dynamic byte-code) which is always quoted such
         that \037 is the final char.  */
      int c;
      do {
        c = READCHAR;
      } while (c >= 0 && c != '\037');
    }
}

Due to encoding difficulties, the #@COUNT construct is always used with a terminating \037 character, AKA the unit separator. While it seems that the FROM_FILE_P branch applies when using the reader with get-file-char or get-emacs-mule-file-char (which are used by load internally), I never managed to trigger that code path. In every case I tried, the reader ignored the count argument and skipped up to and including the next \037 instead, essentially turning #@COUNT into a block comment facility.
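The effect is easy to observe with read-from-string, which cannot take the FROM_FILE_P branch because its input is a string. A minimal sketch, assuming a reader that behaves like the excerpt above; the helper name my-read-first-form is made up for illustration:

```elisp
(defun my-read-first-form (text)
  "Read and return the first form in TEXT (hypothetical helper)."
  (car (read-from-string text)))

;; The count is wrong in both cases, yet the reader simply skips
;; everything up to and including the \037 terminator.
(my-read-first-form "(1 #@5 junk\037 2)")    ; => (1 2)
(my-read-first-form "(1 #@999 junk\037 2)")  ; => (1 2)

;; Without a terminator the reader runs off the end of the input,
;; which is the failure seen in the defvar example.
(condition-case nil
    (my-read-first-form "(defvar my-variable #@8 (/ 1 0) 123)")
  (end-of-file 'hit-end-of-input))            ; => hit-end-of-input
```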

Given this information, one could obfuscate Emacs Lisp code to hide something unusual going on:

(message "Fire the %s!!!" #@11 "rockets")\037

(reverse "sekun"))
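Reading the snippet as data shows what the evaluator actually receives. A small sketch, assuming the same reader behaviour as above; the variable name my-obfuscated-source is made up for illustration:

```elisp
(defvar my-obfuscated-source
  "(message \"Fire the %s!!!\" #@11 \"rockets\")\037\n\n(reverse \"sekun\"))"
  "The obfuscated snippet as a string, containing a real \\037 character.")

;; Everything from #@11 up to the \037 character is skipped, so the
;; apparent closing parenthesis after "rockets" never reaches the reader.
(car (read-from-string my-obfuscated-source))
;; => (message "Fire the %s!!!" (reverse "sekun"))
```

Evaluating the resulting form prints "Fire the nukes!!!", not the "rockets" a casual reader would expect.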

A more legitimate use case is a multi-line shebang:

#!/bin/sh
#@0 -*- emacs-lisp -*-
exec emacs -Q --script "$0" -- "$@"
exit
#\037

(when (equal (car argv) "--")
  (pop argv))

(while argv
  (message "Argument: %S" (pop argv)))

In case you want to experiment with this and want to use the correct counts, here’s a quick and dirty command:

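(defun cursed-elisp-block-comment (beg end)
  "Wrap the region between BEG and END in a #@COUNT block comment."
  (interactive "r")
  (save-excursion
    (save-restriction
      (narrow-to-region beg end)
      (goto-char (point-min))
      ;; COUNT covers the region plus the separating space and the
      ;; \037 terminator.  It counts characters, which equals bytes
      ;; only for ASCII text.
      (insert (format "#@%d " (+ (- end beg) 2)))
      (goto-char (point-max))
      (insert "\037"))))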
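The same counting rule can be checked without touching a buffer. A minimal sketch; my-wrap-in-block-comment is a hypothetical helper, and it assumes the count should be the byte length in Emacs' internal encoding, which only differs from the character count for non-ASCII text:

```elisp
(defun my-wrap-in-block-comment (text)
  "Return TEXT wrapped as a #@COUNT block comment (hypothetical helper)."
  ;; COUNT covers the separating space, TEXT itself and the \037 terminator.
  (format "#@%d %s\037" (+ (string-bytes text) 2) text))

;; Reproduces the byte compiler output shown earlier:
;; "#@11 docstring" followed by the \037 character.
(my-wrap-in-block-comment "docstring")

;; Round trip: the wrapped form is invisible to the reader.
(car (read-from-string
      (concat "(1 " (my-wrap-in-block-comment "(/ 1 0)") " 2)")))
;; => (1 2)
```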

There’s one more undocumented feature though: #@00 is special-cased as an EOF comment, as this excerpt from the reader shows:

/* Read a decimal integer.  */
while ((c = READCHAR) >= 0
       && c >= '0' && c <= '9')
  {
    if ((STRING_BYTES_BOUND - extra) / 10 <= nskip)
      string_overflow ();
    digits++;
    nskip *= 10;
    nskip += c - '0';
    if (digits == 2 && nskip == 0)
      { /* We've just seen #@00, which means "skip to end".  */
        skip_dyn_eof (readcharfun);
        return Qnil;
      }
  }
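To see the special case in action, one can evaluate a buffer whose tail would otherwise be a syntax error. A small sketch, assuming eval-buffer goes through the same reader code; the variable my-eof-demo-log is made up for illustration:

```elisp
(defvar my-eof-demo-log nil
  "Hypothetical variable recording which forms were evaluated.")
(setq my-eof-demo-log nil)

(with-temp-buffer
  ;; Everything after #@00 is skipped, including the second form and
  ;; the unbalanced parentheses.
  (insert "(push 'before my-eof-demo-log)"
          "#@00 (push 'after my-eof-demo-log) ))) garbage")
  (eval-buffer))

my-eof-demo-log  ; => (before)
```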

The EOF comment version can be used to create polyglots. An Emacs Lisp script could end with #@00 and then be concatenated with a file in a format that tolerates leading garbage. The ZIP format is known for its permissive behavior, thereby allowing you to embed several resources into one file:

[wasa@box ~]$ cat polyglot.el
(message "This could be a whole wordle game")
(message "I've attached some dictionaries for you though")#@00
[wasa@box ~]$ cat polyglot.el wordle.zip > wordle.el
[wasa@box ~]$ file wordle.el
wordle.el: data
[wasa@box ~]$ emacs --script wordle.el
This could be a whole wordle game
I've attached some dictionaries for you though
[wasa@box ~]$ unzip wordle.el
Archive:  wordle.el
warning [wordle.el]:  109 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  inflating: wordle.de
  inflating: wordle.uk

This could be combined with the multi-line shebang trick to create a self-extracting archive format. Or maybe an installer? Or just a script that can access its own resources? Let me know if you have any interesting ideas.
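For the last idea, a script could locate its own payload by searching for the marker. This is a naive sketch and not something the reader offers itself: my-polyglot-payload is a hypothetical helper, and it assumes the first #@00 in the file is the real EOF comment.

```elisp
(defun my-polyglot-payload (file)
  "Return the raw bytes following the first #@00 marker in FILE, or nil.
Hypothetical helper; it does not parse Lisp, so a literal #@00 inside
an earlier string or comment would be mistaken for the marker."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    ;; Read the file byte for byte, without any decoding.
    (insert-file-contents-literally file)
    (goto-char (point-min))
    (when (search-forward "#@00" nil t)
      (buffer-substring-no-properties (point) (point-max)))))
```

Called with the path of a file like wordle.el, this would return the newline after the marker followed by the archive bytes, which could then be written to a temporary file and handed to an unzip tool.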

[1]	Strictly speaking, `#+(or)` isn’t a comment, but a conditional reader construct with an always-false feature test. While one may shorten it to `#+nil` or `#-t`, that would be incorrect because both `nil` and `t` may be registered features.

[2]	Here’s a notable exception using the number sign instead. The semicolon is a function for property access.

[3]	`#|...|#` and `#;` are available as of R6RS and R7RS. R5RS implementations may support them as a non-standard extension.

[4]	Semicolons must be doubled or part of a block comment. This feels like an unfortunate design choice for implementors.