Code Conversion Language

20/07/2019

Update: I forgot that I did a brief analysis on this many years ago, using ROT13 as an example.

Update: Noam Postavsky pointed out on #emacs that CCL is not Turing-complete after all, as a full simulation (as opposed to just interpreting a single line) requires an Emacs Lisp loop. This loop cannot be done in CCL itself as it doesn’t allow feeding its output back in as input. The I/O restrictions most likely put it into the weaker category of finite-state transducers.

Emacs is, most famously, a re-imagination of a Lisp machine, with the Emacs Lisp byte-code interpreter at its core. A lesser-known fact is that there are two more byte-code interpreters in its C sources: one for compiled regular expressions and another designed for encoding and decoding text, known as Code Conversion Language (CCL). This blog post focuses on the latter, as it has largely gone unnoticed and hasn’t seen much experimentation.

The CCL implementation is split into the byte-code interpreter (ccl.c) and compiler (ccl.el) parts. There is no official documentation other than comments and docstrings found in these files. From this I’ve learned that CCL programs are represented as integer vectors and that there’s a higher-level language compiling to them, described in the ccl-define-program docstring. By reading that information I’ve deduced the following:

  • The VM has eight integer-sized registers r0 to r7 and an instruction counter ic
  • Register r7 is used as a status register and may be clobbered at any time by an arithmetic operation
  • A CCL program can either be run on a string, returning a new string, or standalone for its side effects
  • The former mode requires a nine-element status vector representing the registers and instruction counter, the latter an eight-element status vector representing the registers only
  • As a side-effect, the status vector contains the new state of the registers and instruction counter after executing the program
  • The VM supports the standard C arithmetic, comparison and assignment operators
  • The language translates several control flow statements to equivalent goto statements, such as if, branch (look-up table) and loop with repeat inside
  • Statements are grouped by surrounding them with parentheses
  • When operating on a string, characters are read in and written out serially; there is no random access whatsoever
  • It’s possible to do a look-up on an array, translation table or hash table
  • There is a call operator, but no stack to save/restore arguments to/from, so you’ll have to come up with a calling convention fitting the available registers
  • Each CCL program specifies a magnification factor which determines the ratio between output and input string size

Armed with that knowledge I wrote some boilerplate code for experimentation:

;; -*- lexical-binding: t; -*-

(require 'ccl)

(defvar ccl-status (make-vector 8 0))
(defvar ccl-status+ic (make-vector 9 0))

(defun ccl-init-status (status args)
  (let ((i 0))
    (fillarray status 0)
    (dolist (arg args)
      (aset status i arg)
      (setq i (1+ i)))))

(defun ccl-run (program string &rest args)
  (let ((status ccl-status+ic))
    (ccl-init-status status args)
    (ccl-execute-on-string program status string)))

(defun ccl-run-pure (program &rest args)
  (let ((status ccl-status))
    (ccl-init-status status args)
    (ccl-execute program status)
    status))

There will be some benchmark numbers; none of them are to be taken seriously. Do your own benchmarks before basing decisions on mine.

Hello World!

For starters I’ll focus on processing strings. The simplest possible program that still does something useful reads in input and writes it out as is:

(define-ccl-program ccl-identity
  '(1
    (loop
     (read r0)
     (write r0)
     (repeat))))

(ccl-run 'ccl-identity "Hello World!") ;=> "Hello World!"

Let’s go through that program carefully. The S-expression starts with a magnification factor of 1, meaning that the output buffer should be as large as the input buffer. If it were zero, no I/O would be permitted in the first place, whereas a factor greater than one would allocate enough space to produce a string larger than the input.

The magnification factor is followed by an S-expression that’s executed until it’s done or an error occurs, such as there being no more input. It may be followed by another S-expression that’s executed after the main one, regardless of whether it failed with an error or not.
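To illustrate a magnification factor greater than one, here’s a sketch (the ccl-double name is my own invention) that writes every input character twice, which requires the output buffer to be twice the input size:

```elisp
(require 'ccl)

;; Magnification factor 2: the output may grow up to twice the
;; input size, so writing each character out twice is permitted.
(define-ccl-program ccl-double
  '(2
    (loop
     (read r0)
     (write r0)
     (write r0)
     (repeat))))

;; Same calling convention as before, spelled out explicitly.
(ccl-execute-on-string 'ccl-double (make-vector 9 0) "abc") ;=> "aabbcc"
```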

ccl-identity uses a pattern that will come up a few more times in this blog post. It enters a loop, reads a character into the r0 register, writes out a character from the r0 register and jumps to the beginning of the loop. If there are no more characters left, the read operation fails and terminates the loop. Let’s spice things up by adding an extra processing step before writing out the character:

(define-ccl-program ccl-xor
  '(1
    (loop
     (read r1)
     (r1 ^= r0)
     (write r1)
     (repeat))))

(ccl-run 'ccl-xor "Secret" 42) ;=> "yOIXO^"
(ccl-run 'ccl-xor "yOIXO^" 42) ;=> "Secret"

XOR is the bread and butter operator in modern cryptography. A text can be encrypted by replacing each character with the result of XORing it against a secret byte, similarly it can be decrypted by applying the same transformation again. To pass the secret byte as an argument, I’ve placed it in the r0 register and read the string into the r1 register instead. On each iteration of the loop r1 is set to r1 ^ r0 and written out again.

More on translation

In the real world translating characters isn’t as simple as applying some arithmetic to them. Suppose I wanted to challenge the upcase built-in:

(define-ccl-program ccl-upcase
  '(1
    (loop
     (read r0)
     (if (r0 >= ?a)
         (if (r0 <= ?z)
             (r0 -= 32)))
     (write r0)
     (repeat))))

The processing step is a bit more involved this time. If the read-in character appears to be between the a and z characters, transform it by subtracting 32. Why 32? Take a look at an ASCII table and you’ll see that this is the distance between uppercase and lowercase letters. Unfortunately this implementation cannot challenge upcase as it fails to translate non-ASCII characters correctly and is slower than the real deal:

(ccl-run 'ccl-upcase "Hello World!") ;=> "HELLO WORLD!"
(ccl-run 'ccl-upcase "Mötley Crüe") ;=> "MöTLEY CRüE"
(benchmark 100000 '(ccl-run 'ccl-upcase "Hello World!"))
;; => "Elapsed time: 0.165250s (0.072059s in 1 GCs)"
(benchmark 100000 '(upcase "Hello World!"))
;; => "Elapsed time: 0.119050s (0.072329s in 1 GCs)"
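As an aside, the magic constant 32 used above can be checked directly in Emacs Lisp; it’s also why clearing bit 5 of an ASCII letter upcases it:

```elisp
;; The distance between lowercase and uppercase ASCII letters:
(- ?a ?A)              ;=> 32
;; Equivalently, clearing bit 5 (value 32) upcases a letter:
(logand ?a #b11011111) ;=> 65, i.e. ?A
```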

Let’s try again with a different text transformation where I actually have a chance to win, ROT13:

(define-ccl-program ccl-rot13
  '(1
    (loop
     (read r0)
     (if (r0 >= ?a)
         (if (r0 <= ?z)
             ((r0 -= ?a)
              (r0 += 13)
              (r0 %= 26)
              (r0 += ?a))))
     (if (r0 >= ?A)
         (if (r0 <= ?Z)
             ((r0 -= ?A)
              (r0 += 13)
              (r0 %= 26)
              (r0 += ?A))))
     (write r0)
     (repeat))))

This time the program needs to recognize two different character ranges to process, lowercase and uppercase ASCII characters. In either case they’re translated to their position in the alphabet, rotated by 13, then translated back to ASCII again. Surprisingly enough, this is enough to beat both rot13-string and rot13-region:

(ccl-run 'ccl-rot13 "Hello World!") ;=> "Uryyb Jbeyq!"
(ccl-run 'ccl-rot13 (ccl-run 'ccl-rot13 "Hello World!"))
;; => "Hello World!"
(benchmark 100000 '(ccl-run 'ccl-rot13 "Hello World!"))
;; => "Elapsed time: 0.248791s (0.072622s in 1 GCs)"
(benchmark 100000 '(rot13-string "Hello World!"))
;; => "Elapsed time: 6.108861s (2.360862s in 32 GCs)"
(with-temp-buffer
  (insert "Hello World!")
  (benchmark 100000 '(rot13-region (point-min) (point-max))))
;; => "Elapsed time: 1.489205s (1.017631s in 14 GCs)"

I then tried to use translation tables for a final example, a “Vaporwave” converter, but failed. Funnily enough this mirrors my overall experience with Emacs: it’s easy to write fun things, but the moment you try to write something useful, you discover it’s not fun and sometimes not even up to the task. At least it’s possible to salvage the translation tables and use them with translate-region instead, the built-in used by rot13-string and rot13-region:

(defvar ccl-vaporwave-table
  (make-translation-table-from-alist
   (cons '(?\s . 12288)
         (mapcar (lambda (i) (cons i (+ i 65248)))
                 (number-sequence 33 126)))))

(defun vaporwave-it (string)
  (with-temp-buffer
    (insert string)
    (translate-region (point-min) (point-max) ccl-vaporwave-table)
    (buffer-string)))

(vaporwave-it (upcase "aesthetic")) ;=> "ＡＥＳＴＨＥＴＩＣ"

Edging towards general-purpose computing

All examples so far have worked on text. If you limit yourself to numbers, you can solve some basic arithmetic problems. Here’s a classic, calculating the factorial of a number:

(define-ccl-program ccl-factorial
  '(0
    ((r1 = 1)
     (loop
      (if r0
          ((r1 *= r0)
           (r0 -= 1)
           (repeat)))))))

(defun factorial (n)
  (let ((acc 1))
    (while (not (zerop n))
      (setq acc (* acc n))
      (setq n (1- n)))
    acc))

While the regular version is more concise, the logic is nearly the same in both. Here are some numbers:

(aref (ccl-run-pure 'ccl-factorial 10) 1) ;=> 3628800
(factorial 10) ;=> 3628800
(benchmark 100000 '(ccl-run-pure 'ccl-factorial 10))
;; => "Elapsed time: 0.069063s"
(benchmark 100000 '(factorial 10))
;; => "Elapsed time: 0.080212s"

This isn’t nearly as much of a speed-up as I’d hoped for. Perhaps CCL pays off more for arithmetic than for looping? Another explanation is that the Emacs Lisp byte-code compiler has an edge over CCL’s rather simple one. Here’s a more entertaining example, printing out the lyrics of 99 Bottles of Beer on the Wall:

(define-ccl-program ccl-print-bottle-count
  '(1
    (if (r0 < 10)
        (write (r0 + ?0))
      ((write ((r0 / 10) + ?0))
       (write ((r0 % 10) + ?0))))))

(define-ccl-program ccl-99-bottles
  '(1
    (loop
     (if (r0 > 2)
         ((call ccl-print-bottle-count)
          (write " bottles of beer on the wall, ")
          (call ccl-print-bottle-count)
          (write " bottles of beer.\n")
          (write "Take one down and pass it around, ")
          (r0 -= 1)
          (call ccl-print-bottle-count)
          (write " bottles of beer on the wall.\n\n")
          (repeat))
       ((write "2 bottles of beer on the wall, 2 bottles of beer.\n")
        (write "Take one down and pass it around, 1 bottle of beer on the wall.\n\n")
        (write "1 bottle of beer on the wall, 1 bottle of beer.\n")
        (write "Take one down and pass it around, no more bottles of beer on the wall.\n\n")
        (write "No more bottles of beer on the wall, no more bottles of beer.\n")
        (write "Go to the store and buy some more, 99 bottles of beer on the wall.\n"))))))

(defun 99-bottles ()
  (with-output-to-string
    (let ((i 99))
      (while (> i 2)
        (princ (format "%d bottles of beer on the wall, %d bottles of beer.\n" i i))
        (princ (format "Take one down and pass it around, %d bottles of beer on the wall.\n\n" (1- i)))
        (setq i (- i 1))))
    (princ "2 bottles of beer on the wall, 2 bottles of beer.\n")
    (princ "Take one down and pass it around, 1 bottle of beer on the wall.\n\n")
    (princ "1 bottle of beer on the wall, 1 bottle of beer.\n")
    (princ "Take one down and pass it around, no more bottles of beer.\n\n")
    (princ "No more bottles of beer on the wall, no more bottles of beer.\n")
    (princ "Go to the store and buy some more, 99 bottles of beer on the wall.\n")))

This example shows a few more interesting things. Generating text of unknown length is rather hard, so I’m using the standard magnification factor of 1 and pass in an appropriately sized input string based on an estimate of how big the output will be. call is useful to avoid repeating yourself, at the cost of having to carefully plan register usage. Printing out the bottle count is feasible if you limit yourself to whole numbers below 100; a generic solution would be hard without random access to the output string. The performance numbers for this one are somewhat surprising:

(let ((input (make-string 15000 ?\s)))
  (benchmark 1000 '(ccl-run 'ccl-99-bottles input 99)))
;; => "Elapsed time: 0.301170s (0.217804s in 3 GCs)"
(benchmark 1000 '(99-bottles))
;; => "Elapsed time: 1.735386s (0.507231s in 7 GCs)"

This doesn’t make much sense. Is using format that expensive? It’s hard to tell in advance whether CCL will make a noticeable difference or not.

But is it Turing-complete?

My experimentation so far left me wondering: is this language Turing-complete? You can perform arithmetic and there’s goto, but the I/O facilities, number of registers and memory access are limited. The easiest way of proving this property is to implement another known Turing-complete system on top of the current one. I researched a bit and found the following candidates:

  • Brainfuck: A classic; however, it requires writable memory. Registers could be used for this, but you don’t have many to play with, and you’d need the branch instruction to simulate the data pointer.
  • subleq: Implementing subleq looks easy, but it suffers from the same problem as Brainfuck: it requires modifying arbitrary memory locations. I’ve found a compiler from a C subset to subleq, but it generates code operating beyond the handful of available registers, so that’s not an option either.
  • Rule 110: It’s basically Game of Life, but one-dimensional, and can be implemented in a serial fashion. With some tricks it doesn’t require random access either. The proof of it being Turing-complete looks painful, but whatever, I don’t care. It’s perfect. There are more elementary cellular automata, so I’ll try to implement it in a generic fashion and demonstrate it on Rule 90, which produces the Sierpinski triangle.

(defmacro define-ccl-automaton (n)
  (let ((print-sym (intern (format "ccl-rule%d-print" n)))
        (rule-sym (intern (format "ccl-rule%d" n))))
    `(progn
       (define-ccl-program ,print-sym
         '(1
           ((r4 = 0)
            (if (r0 == ?1)
                (r4 += 4))
            (if (r1 == ?1)
                (r4 += 2))
            (if (r2 == ?1)
                (r4 += 1))
            (branch r4
                    (write ,(if (zerop (logand n 1)) ?0 ?1))
                    (write ,(if (zerop (logand n 2)) ?0 ?1))
                    (write ,(if (zerop (logand n 4)) ?0 ?1))
                    (write ,(if (zerop (logand n 8)) ?0 ?1))
                    (write ,(if (zerop (logand n 16)) ?0 ?1))
                    (write ,(if (zerop (logand n 32)) ?0 ?1))
                    (write ,(if (zerop (logand n 64)) ?0 ?1))
                    (write ,(if (zerop (logand n 128)) ?0 ?1))))))
       (define-ccl-program ,rule-sym
         '(1
           ((r6 = ,n)
            (r0 = 0)
            (read r1)
            (read r2)
            (r5 = r1)
            (loop
             (call ,print-sym)
             (read r3)
             (r0 = r1)
             (r1 = r2)
             (r2 = r3)
             (repeat)))
           ((r0 = r1)
            (r1 = r2)
            (r2 = r5)
            (call ,print-sym)))))))

(define-ccl-automaton 30)
(define-ccl-automaton 90)
(define-ccl-automaton 110)

(defun ccl-sierpinski ()
  (with-output-to-string
    (let ((line "0000000001000000000"))
      (dotimes (_ 20)
        (princ line)
        (terpri)
        (setq line (ccl-run 'ccl-rule90 line))))))

The macro may look scary, but all it does is define two CCL programs. An elementary cellular automaton looks at the current cell and the two cells around it, maps the three of them to a new cell according to a rule, and emits it. There are two edge cases, the first and the last cell: in my implementation the first cell assumes the previous one was a zero, and the last cell uses the first cell. Since there’s no random access, the first cell is stored in a spare register at the beginning and accessed in an S-expression after the main loop has terminated due to running out of input. The current cell and its surroundings are kept in three registers which are rotated every time a new cell is read in. The mapping is done in the print program by summing up the ones and zeroes, then using the branch instruction to apply the rule. If you find this hard to follow, here’s an Emacs Lisp version using random access and less limited arithmetic to do the job:

(defun rule--evolve (prev cur next n)
  (let ((acc (+ (if (= prev ?1) 4 0)
                (if (= cur ?1) 2 0)
                (if (= next ?1) 1 0))))
    (if (zerop (logand n (ash 1 acc))) ?0 ?1)))

(defun rule-evolve (line n)
  (let ((out (make-string (length line) ?0)))
    (dotimes (i (length line))
      (cond
       ((zerop i)
        (aset out i (rule--evolve ?0 (aref line i) (aref line (1+ i)) n)))
       ((= i (1- (length line)))
        (aset out i (rule--evolve (aref line (1- i)) (aref line i) (aref line 0) n)))
       (t
        (aset out i (rule--evolve (aref line (1- i)) (aref line i) (aref line (1+ i)) n)))))
    out))

(defun sierpinski ()
  (with-output-to-string
    (let ((line "0000000001000000000"))
      (dotimes (_ 20)
        (princ line)
        (terpri)
        (setq line (rule-evolve line 90))))))

One more benchmark run, this time with less surprising performance numbers:

(benchmark 1000 '(ccl-sierpinski))
;; => "Elapsed time: 0.365031s (0.071827s in 1 GCs)"
(benchmark 1000 '(sierpinski))
;; => "Elapsed time: 0.545512s (0.071829s in 1 GCs)"

If you want to see it in action, try evaluating (progn (princ (sierpinski)) nil) in M-x ielm.

Now for a big letdown: despite everything I’ve demonstrated, this system is not Turing-complete after all. While it’s capable of processing a single line of input, the proof of Rule 110 being Turing-complete relies on feeding its output back in as input over and over again; that part has been done in Emacs Lisp because it’s impossible to do in CCL. I’m not 100% sure what CCL’s computing power is, but Noam Postavsky suggested on #emacs that it’s most likely a finite-state transducer.

Final words

You may ask yourself now whether you should rewrite all of your slow code to CCL programs. I don’t believe that’s the way to go for a number of reasons:

  • Speedups vary wildly, somewhere between -40% and 450%. There’s no obvious way of predicting whether the optimization is worth it.
  • It’s hard to write an equivalent program, sometimes even impossible, especially if it requires many variables or random-access reads and writes.
  • It’s hard to debug a CCL program. While the compiler does a good job at catching syntax errors, runtime errors are far harder to figure out if you can only stare at the registers. Maybe a debugger could be built that uses the “continue” facility and instruction counter register…
  • It’s hard to maintain a CCL program. Not to mention, there are hardly any people who know how to write them. Most of the examples I’ve found online do text encoding/decoding. The only exception is pgg-parse-crc24-string, which lives in a file marked as obsolete.
  • This leads me to my last point: CCL may become obsolete as well. Granted, it will take time, but so far I haven’t seen people enthusiastic about it being a thing.

If you still believe that despite this it’s worth giving CCL a try, please let me know, especially if you’re doing something non-standard with it, like advanced cryptography or number crunching. Likewise, if you’re not convinced that it’s Turing-complete or believe I could be doing some things far better than presented, send me a message.


Trapping Attackers Without Nyan Cat

19/05/2019

As of a little over two weeks ago I’ve stopped offering the service I described in an earlier blog post. Someone™ managed to max out my server’s resources by establishing as many connections as possible and keeping them open for hours. Eventually my hosting provider sent me an email about unusually high resource usage. After a reboot everything was back to normal, but I didn’t feel like allowing this to happen again, so I shopped around for a more efficient SSH honeypot.

endlessh fits the bill. It does as little work as possible, yet wastes the unsuspecting attacker’s time in a rather sneaky way: by sending an infinite SSH banner. The downside is that you won’t ever find out what kind of software the attacker is using, since it never proceeds beyond that stage. I was curious how well it works in practice, so I collected logs for two weeks, wrote a bit less than 100 SLOC of Ruby to do basic statistics and plotted the overall distribution with gnuplot.


Hosts: 21286
Unique hosts: 1665
Total time wasted: 17 days, 2 hours, 51 minutes and 40 seconds
Max time wasted: 7 days, 9 hours, 19 minutes and 39 seconds
Average time wasted: 20 minutes and 38 seconds
Total bytes sent: 43.6M
Max bytes sent: 1.1M
Average bytes sent: 2.1K
Max connections established: 58
Greatest netsplit: 47

Top 10 hosts:
  112.***.***.***: 12160 connections
   58.***.***.***: 2427 connections
  218.***.***.***: 1792 connections
   58.***.***.***: 358 connections
  195.***.***.***: 174 connections
  185.***.***.***: 109 connections
  222.***.***.***: 65 connections
  223.***.***.***: 58 connections
  115.***.***.***: 52 connections
  180.***.***.***: 51 connections
[Plot: overall distribution of wasted time (endlessh-thumb.png)]

"That's a spicy meatball!"

26/04/2019

QEMU is pretty neat, particularly when combined with KVM for hardware-accelerated virtualization. I’m currently using it with an Ubuntu guest for research purposes and missed a few convenience features that just magically work in VMware and VirtualBox: Seamless resizing and copy/paste between guest and host. I’ve found a few guides online telling me to use SPICE, but nearly all of them assumed you start the VM via virt-manager, a nice frontend with an ugly XML configuration format. Personally I prefer writing a small shell script, so here’s mine, adapted from the relevant Arch Wiki article:

#!/bin/bash
socket='/tmp/vm_spice.socket'
qemu-system-x86_64 \
    -m 2G -enable-kvm -drive file=disk.qcow2,format=qcow2 \
    -vga qxl -device virtio-serial-pci \
    -device virtserialport,chardev=spicechannel0,name=com.redhat.spice.0 \
    -chardev spicevmc,id=spicechannel0,name=vdagent \
    -spice "unix,addr=$socket,disable-ticketing" &

sleep 5
remote-viewer -f "spice+unix://$socket"

The first line of options is specific to my setup: the guest won’t run comfortably without at least 2G of RAM, KVM makes virtualization run close to native speed, and the drive points to a qcow2 file. The remaining options set up SPICE with a QXL video device and all the guest interop jazz, listening on a UNIX socket. After waiting for a while, remote-viewer is spawned against the UNIX socket in full-screen mode. You might want to experiment here; for example you can configure it to use your favorite mouse cursor release key combination.

Inside the guest you’ll have to do a few more things:

[wasa@box ~]# apt-get install spice-vdagent xserver-xorg-video-qxl
[wasa@box ~]# systemctl enable spice-vdagentd
[wasa@box ~]# systemctl start spice-vdagentd
[wasa@box ~]$ spice-vdagent

The last line runs the client, which communicates clipboard requests and the like to the daemon. I discovered the hard way that the spice-vdagent package only comes with autorun entries for popular desktop environments, so I launch it inside i3 as soon as my session starts.

To see changes to screen resolution you’ll have to restart your session and check /var/log/Xorg.0.log to verify that the QXL driver was correctly detected and used. If that’s indeed the case, xrandr will show a Virtual-0 device and xrandr --output Virtual-0 --auto changes its resolution to the best fitting one.

Finally, there’s one more underappreciated feature this setup gives you: changing focus is seamless, so you’ll no longer need a dedicated key combination for releasing or capturing the mouse cursor. In windowed mode, focus returns to the host once you hover over anything outside the VM’s screen; in fullscreen mode, you can hover over the top middle part of the screen, an OSD appears (with options such as disabling fullscreen mode) and focus is back to the host again. Switching focus between host and guest no longer sends spurious keys to the guest; for example, pressing $mod+3 to switch to a VM on workspace 3 used to enter the number 3 into whatever application had focus inside the VM. This is no longer the case with remote-viewer and makes for a smooth user experience.


Microcorruption

24/04/2019

Hey folks. After successfully completing the original Cryptopals exercises, I’ve decided to try the Microcorruption challenge Thomas Ptacek helped design. The challenges are about binary exploitation, the art of turning a buffer overflow into arbitrary code execution. There are plenty of write-ups about solving them, so mine will be about the low-level debugging involved. While the notes below are specific to the challenge, you may find them helpful for understanding debuggers in general. For more debugging insights, I can recommend reading “The Art of Debugging” by Norman Matloff and Peter Jay Salzman.

Challenge-specific points

Every program in the challenge follows the same pattern of munging some data, asking for user input and conditionally unlocking a door if all requirements have been met. Your task is figuring out user input that unlocks the door in a reproducible manner. The big revelation is that, unlike in a reverse engineering challenge, you don’t even have to provide a valid password; all that matters is that you trick the system into unlocking the door.

The challenge tries to make it as simple as possible to figure out what’s going on. The programs run on an MSP430 microcontroller, a RISC architecture so simple that the full list of mnemonics fits on two manual pages. Most of the reverse engineering work has been done for you: you get a disassembled view of the program code, with function names for each section. There is a separate page for assembling and disassembling code. The debugger is graphical and features a dashboard of widgets to follow the program’s control flow. This makes it easy to spot patterns, as opposed to textual debuggers where you need to keep all relevant information in your head. It’s a bit like the difference between Nethack (top-down rogue-like) and Zork (classic text adventure).

Using the debugger

The most fundamental tool in your toolbox is the breakpoint. With it you put stop signs on your program and can skip the boring parts easily. Typically you put breakpoints on functions, but you can put them on absolute addresses as well. Note that if you reset the debugger (as opposed to rebooting it), breakpoints are preserved; this makes it easy to try a different way of executing the program. Another pattern that comes up is creating a breakpoint to skip ahead to some point of the program, then undoing it later to inspect that part of the program more closely.

Stepping through code is crucial to get right. You can continue execution to proceed to the next breakpoint or user input action. To proceed one instruction, use step, also known as step in: when faced with a call to a function, step steps into it, with the next instruction being inside that function. To avoid this there’s next, also known as step over, which behaves almost the same except that, if faced with a call, it will not enter the function, but step over it so that you stay in the same code snippet. If you find yourself in a function you want to get out of quickly, use finish, which executes instructions until the function returns.

There are short forms of these commands, c (continue), s (step), n (next) and f (finish). To repeat the last command, use the enter key. It’s possible to repeat commands with a numerical argument and to define macros (mostly useful for more complex repetitive patterns), but I haven’t made use of these yet.

Poking the memory

To understand how memory can be corrupted in ways useful to an attacker, it’s important to know how a running program is represented in memory. For this, it’s useful to track the memory and register views as the program is running. You’ll find that in this challenge there are separate regions of memory dedicated to the stack, heap and program code. If the memory layout permits it, interesting things can happen when the attacker writes beyond the designated boundaries. For example, if the stack pointer points to attacker-controlled memory, anything reading memory from the stack will be influenced, such as the return address to the last function (which could be changed to jump to a different function). Another example is overwriting adjacent stack memory to change a local variable’s contents. It’s even possible to put machine code (also known as shellcode because it’s typically designed to spawn a shell) on the stack, then jump into it and continue execution from there on, provided that the stack is executable.

Endianness is a subtle point here. The MSP430 architecture is little-endian, so the least significant byte comes first. The number 256 for example is encoded as the bytes 0x00 and 0x01. This is important to know when trying to make sense of numbers in the disassembly and interpreting user input (like, when entering an address).
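That byte order can be sketched in Emacs Lisp (le16-bytes is a made-up helper for illustration, not part of the challenge):

```elisp
;; Split a 16-bit word into little-endian byte order:
;; the least significant byte comes first.
(defun le16-bytes (n)
  (list (logand n #xff)            ; low byte first
        (logand (ash n -8) #xff))) ; high byte second

(le16-bytes 256) ;=> (0 1), i.e. the bytes 0x00 and 0x01
```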


Smooth Video Game Emulation in Emacs

21/02/2019

I have a lengthy TODO.org of things I might eventually implement for Emacs, most of which are not exactly useful, are challenging to do, or fulfill both conditions. A NES emulator fits all of these criteria neatly. I’d kept hearing that they can run on poor hardware and learned that the graphics fit into a tiled model (meaning I wouldn’t have to draw each pixel separately, only each tile), so given good enough rendering speed it shouldn’t be an impossible task. Then the unexpected happened: someone else beat me to the punch with nes.el. It’s an impressive feat, but with one wrinkle, its overall speed is unacceptable. Mario runs with a slowdown of over 100x, rendering it essentially unplayable. For this reason I adjusted my goals a bit: emulate a simpler game platform smoothly in Emacs at full speed.

Enter the CHIP-8. It’s not a console as such, but a video game VM designed in the 1970s with the following properties:

  • CPU: 8-Bit, 16 general-purpose registers, 36 instructions, each two bytes large
  • RAM: 4KB
  • Stack: 16 return addresses
  • Resolution: 64 x 32 black/white pixels
  • Rendering: Sprites are drawn in XOR mode
  • Sound: Monotone buzzer
  • Input: Hexadecimal keypad

It’s perfect. Sound is the only real issue here, as native sound support in Emacs is blocking, but this can be worked around with sufficient effort. Once it’s implemented there’s a selection of almost a hundred games to play, with a few dozen more if you implement the Super CHIP-8 extensions. I wouldn’t have to implement Space Invaders, Pacman or Breakout with gamegrid.el. What could possibly be hard about this? As it turns out, enough to keep me entertained for a few weeks. Here’s the repo.

General strategy

First of all, I located a reasonably complete-looking ROM pack. It’s not included with the code as I’m not 100% sure about its legal status; some claim the games are old enough to be public domain, but since there are plenty of new ones, I decided to go the safe route. Sorry about that.

Cowgod’s Chip-8 Technical Reference is the main document I relied upon. It’s clearly written and covers nearly everything I’d want to know about the architecture, with a few exceptions I’d have to find out on my own. Another helpful one is Mastering CHIP-8 to fill in some of the gaps.

To boot up a CHIP-8 game on real hardware you’d use a machine where the interpreter is loaded between the memory offsets #x000 and #x200, load the game starting at offset #x200, then start the interpreter. It would start with the program counter set to #x200, execute the instruction there, continue with the next instruction the program counter points to, and so on. To make things more complicated, there are two timers in the system running at 60Hz; these decrement a special register if non-zero, which is used to measure delays accurately and to play a buzzing sound. However, there is no specification of how fast the CPU runs or how display updates are to be synchronized, so I had to come up with a strategy to accommodate potentially varying clock speeds.

The standard solution to this is a game loop where you aim for each cycle to take a fixed time, for example by executing a loop iteration, then sleeping for enough time to arrive at the desired cycle duration. This kind of thing doesn’t work too well in Emacs: if you use sit-for you get user-interruptible sleep, and if you use sleep-for you get uninterruptible sleep that doesn’t allow user input to be registered. The solution here is to invert the control flow by using a timer running at the frame rate, then being careful not to do too much work in the timer function. This way Emacs can handle user input while rendering as quickly as possible. The timer function executes as many CPU cycles as needed, decrements the timer registers if necessary and finally repaints the display.
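The inverted control flow can be sketched as follows. This is an illustrative Python model, not the actual Emacs Lisp; the names and the cycles-per-frame figure are made up:

```python
CYCLES_PER_FRAME = 8  # assumed CPU speed; the real value is a tuning knob

class Machine:
    def __init__(self):
        self.delay_timer = 0
        self.sound_timer = 0
        self.cycles_run = 0

    def cycle(self):
        """Fetch, decode and execute one instruction (elided here)."""
        self.cycles_run += 1

def on_frame(machine):
    """Timer function, fired at the frame rate (e.g. every 1/60 s)."""
    # 1. execute as many CPU cycles as needed for one frame
    for _ in range(CYCLES_PER_FRAME):
        machine.cycle()
    # 2. the 60Hz timer registers tick down once per frame if non-zero
    if machine.delay_timer > 0:
        machine.delay_timer -= 1
    if machine.sound_timer > 0:
        machine.sound_timer -= 1
    # 3. finally, repaint the display if anything changed (elided here)
```

In Emacs the repeating timer would be something like run-at-time with a repeat interval, with the equivalent of on_frame as the timer function.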

Each component of the system is represented by a variable holding an appropriate data structure, most of which are vectors. RAM is a vector of bytes, the stack is a vector of addresses, the screen is a vector of bits, and so on. I opted for vectors over structs for simplicity’s sake. The registers are a special case: since they’re represented by a vector, I need to index into it using parts of the opcode. It therefore makes sense to have constants representing each register, with their values equal to the values used in the opcodes. Initially I defined these constants by copy-pasting, but later switched to a chip8-enum macro which defines them for me.
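As a rough Python model of that state layout (the names are illustrative; the real code uses Emacs Lisp vectors and the chip8-enum macro):

```python
RAM = bytearray(4096)        # 4KB of memory
STACK = [0] * 16             # 16 return addresses
SCREEN = [0] * (64 * 32)     # one entry per pixel
V = [0] * 16                 # 16 general-purpose registers

# Register constants mirror their opcode encoding, so a nibble taken
# straight from an opcode can be used directly to index into V.
V0, V1, V2 = 0, 1, 2         # ...continuing up to VF
VF = 15                      # doubles as the flag register

opcode = 0x8125              # hypothetical opcode operating on V1 and V2
x = (opcode & 0x0F00) >> 8   # extracts 1, i.e. V1
y = (opcode & 0x00F0) >> 4   # extracts 2, i.e. V2
```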

The built-in sprites for the hex digits were shamelessly stolen from Cowgod’s Chip-8 Technical Reference. They are copied on initialization to the memory region reserved for the interpreter, which allows the LD F, Vx instruction to simply return the respective address. When implementing the extended built-in sprites for the Super CHIP-8 instructions there was no convenient resource to steal from, so instead I created upscaled versions of them with a terrible Ruby one-liner.

Basic Emulation

For debugging reasons I didn’t implement the game loop at first; instead I went for a loop that keeps executing CPU instructions indefinitely, aborted manually with C-g, then displayed the screen state with a debug function that renders it as text. This allowed me to fully concentrate on getting basic emulation right before fighting with efficiency concerns and rendering speed.

For each CPU cycle the CPU looks up the current value of the program counter, fetches the two-byte instruction in RAM at that offset, then executes it, changing the program counter and possibly more in the process. One unspecified detail is what happens if the program counter points to an invalid address and what actual ROMs do in practice when they’re done. Experimentation showed that instead of running off into an invalid address, they fall into an infinite loop that always jumps to the same address.

Since every instruction is exactly two bytes, the type and operands of each instruction are encoded inline and need to be extracted with basic bit fiddling. Emacs Lisp offers logand and ash for this, corresponding to &, << and >> in C. First the bits to be extracted are masked with logand, using an argument where all bits to be kept are set to one, then the result is shifted all the way to the right with ash using a negative argument. Take for example the JP nnn instruction, which is encoded as #x1nnn: you’d extract the type by masking the opcode with #xF000, then shift it with ash by -12. Likewise, the argument can be extracted by masking with #x0FFF, with no shift needed as the bits are already on the right side.
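Sketched in Python (whose >> plays the role of ash with a negative count), decoding the JP nnn example looks like this:

```python
def decode_jp(opcode):
    """Split a #x1nnn opcode into its type nibble and target address."""
    op_type = (opcode & 0xF000) >> 12  # mask the top nibble, shift it right
    nnn = opcode & 0x0FFF              # low 12 bits, no shift needed
    return op_type, nnn

decode_jp(0x1234)  # → (1, 0x234)
```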

A common set of patterns comes up when dissecting the opcodes, therefore the chip8-exec function saves all interesting parts of the opcode in local variables using the abbreviations as seen in Cowgod’s Chip-8 technical reference, then a big cond is used to tell which type of opcode it is and each branch modifies the state of the virtual machine as needed.

Nearly all instructions end up incrementing the program counter by one instruction. I’ve borrowed a trick from other emulators here: before executing chip8-exec, the program counter is unconditionally incremented by the opcode size. In case an instruction needs to do something different, like changing it to a jump location, it can still override the value manually.
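A minimal Python sketch of that trick, with made-up names (the dispatch here only handles JP nnn; chip8-exec dispatches on all opcode types):

```python
OPCODE_SIZE = 2

class Machine:
    def __init__(self):
        self.ram = bytearray(4096)
        self.pc = 0x200

def exec_cycle(m):
    opcode = (m.ram[m.pc] << 8) | m.ram[m.pc + 1]
    m.pc += OPCODE_SIZE              # unconditionally step past the opcode
    if (opcode & 0xF000) == 0x1000:  # JP nnn overrides the new value
        m.pc = opcode & 0x0FFF
    # all other instructions simply leave the incremented pc alone
```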

To test my progress I picked the simplest (read: smallest) ROM doing something interesting: Maze by David Winter. My debug function printed the screen by writing spaces or hashes to a buffer, separated by a newline for each screen line. After I got this one working, I repeated the process with several other ROMs that didn’t require any user input and displayed a (mostly) static screen. The most useful from the collection was “BC Test” by BestCoder, as it covered nearly all opcodes and tested them in a systematic fashion. Here’s a list of other ROMs I found useful for testing other features, in case you, the reader, should embark on a similar adventure:

  • Jumping X and O: Tests delay timer, collision detection, out of bounds drawing
  • CHIP-8 Logo: Tests CALL nnn / RET
  • Sierpinski triangle: Slow, tests emulation speed
  • Zero: Animation, tests rendering speed (look for the flicker)
  • Minimal Game: Tests SKP Vx
  • Keypad Test: Tests LD Vx, K, uncovered a bug in the main loop
  • Tetris: Tests SKP Vx, SKNP Vx, playability
  • SC Test: Tests nearly all opcodes and a few Super CHIP-8 ones
  • Font Test: Tests drawing of small and big built-in sprites
  • Robot: Tests drawing of extended sprites
  • Scroll Test: Tests scrolling to the left and right
  • Car Race Demo: Tests scrolling down
  • Car: Tests emulation speed in extended mode
  • Emutest: Tests half-pixel scroll, extended sprites in low-res

Debugging and Analysis

Surprisingly enough, errors and mistakes keep happening. Stepping through the execution of each command with edebug gets tiring after a while, even when using breakpoints to skip to the interesting parts. I therefore implemented something I’ve seen in Circe, my preferred IRC client: a logging function which only logs if logging is enabled and writes its output to a dedicated buffer. For now it just logs the current value of the program counter and the decoded instruction about to be executed. I added the same kind of logging to a different CHIP-8 emulator, chick-8 by Evan Hanson from the CHICKEN Scheme community. Comparing the two logs allowed me to quickly spot where they start to diverge, giving me a hint which instruction is faulty.

Looking through the ROM as it is executed isn’t terribly enlightening; it feels like watching through a peephole, never giving you the full picture of what’s about to happen. I started writing a simple disassembler which decodes every two bytes and writes their offset and meaning to a buffer, but stopped working on it after realizing that I had a much more powerful tool at hand to do disassembly and analysis properly: radare2. As it didn’t recognize the format, I only used its most basic featureset for analysis, the hex editor. By displaying the bytes at a width of two per row and searching for hex byte sequences with regex support I was able to easily find ROMs using specific opcodes.

Later, after I’d finished most of the emulator, I started developing a CHIP-8 disassembly and analysis plugin using radare2’s Python scripting support. I ran into a few inconsistencies with the documentation, but eventually figured everything out and got pretty disassembly, with arrows visualizing the control flow for jumps and calls.

/img/chip8-r2-graph-thumb.png

Later I discovered that radare2 actually does have CHIP-8 support in core; you need to enable it explicitly by adding -a chip8 to the command line arguments, as it cannot be auto-detected that a file is a CHIP-8 ROM. The disassembly support is decent, but the analysis part had a few omissions and mistakes leading to less nice graphs. Using my Python version as a basis I managed to improve the C version of the analysis plugin to the same level and even surpass it, as the C API allows adding extra metadata to individual instructions, such as inline commentary. There is a pending PR for this functionality now; I expect it to be merged soon.

Testing

For maximum speed I set up firestarter to recompile the file on each save, added the project directory to load-path, then always launched a new Emacs instance from which I loaded up the package and emulated a ROM file. This is fine if there isn’t much to test, but it’s hard to detect regressions this way. At some point I decided to give the great buttercup library another try and wrote a set of tests exercising every supported instruction with all the edge cases I could think of. For each test the VM is initialized, some opcodes are loaded up and chip8-cycle is called as often as needed, while the state of the registers and other affected parts of the machinery is checked. It was quite a bit of grunt work due to the repetitive nature of the code, but it gave me greater confidence in just messing around with the code, as retesting everything took less than a second.

Make no mistake here though: extensively testing the complicated parts of a package (I don’t believe it’s worth testing the simple parts) is in no way a replacement for normal usage, which can uncover completely different bugs. It’s more of a safety net to make sure code changes don’t break the most basic features.

Rendering

In retrospect, this was quite the ride. Normally you’d pick a suitable game or multimedia library and be done, but this is Emacs, no such luxuries here. Where we’re going, we don’t need libraries.

My favorite way of drawing graphics in Emacs is creating SVG on the fly using the esxml library. This turned out to be prohibitively expensive: not only did it fail to meet the performance goals, it also generated an excessive amount of garbage as trees were recursively walked and thrown away over and over again. A variation of this is having a template string resembling the target SVG, replacing parts of it and generating an image from the result. I attempted this, but quickly gave up as it was too bothersome coming up with suitable identifiers and replacing all of them correctly.

I still didn’t want to drop the SVG idea entirely. Considering this was basically tiled graphics (with each tile being an oversized pixel), I considered creating two SVG images for white and black tiles respectively, then inserting them as if they were characters on each line. The downside of this approach was Emacs’ handling of line height: I couldn’t figure out how to suppress it completely so as not to have any gaps in the rendering. gamegrid.el somehow solves this, but has rather convoluted code.

At this point I was ready to go back to plain text. I remembered that faces are a thing and used them to paint the background of the text black and white. No more annoying gaps. With this I could finally work and started figuring out how to improve the rendering. While the simple solution of always erasing the buffer contents and reinserting them did work, there were plenty of optimization possibilities. The most obvious one was dirty frame tracking to tell whether the screen even needed to be redrawn: the code sets a chip8-fb-dirty-p flag, and if the main loop discovers it’s set, it does a redraw and unsets it. Next up was redrawing only the changed parts. For this I keep a copy of the current and previous state of the screen around, compare them, repaint the changed bits and transfer the current to the previous state. To change the pixels in the buffer I’d erase them, then insert the correct ones.
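The diff-based repaint can be sketched like this in Python (the helper names are hypothetical):

```python
W, H = 64, 32
current = [0] * (W * H)   # screen state after this frame's CPU cycles
previous = [0] * (W * H)  # what is currently shown in the buffer

def repaint(paint_cell):
    """Repaint only the cells that changed since the last repaint."""
    for i in range(W * H):
        if current[i] != previous[i]:
            paint_cell(i % W, i // W, current[i])  # redraw one pixel
            previous[i] = current[i]               # transfer the state
```

Calling repaint with a function that merely records its arguments shows that only the changed pixels are touched.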

The final optimization occurred to me much later, when implementing the Super CHIP-8 instructions. It was no longer possible to play games smoothly at quadrupled resolution, so I profiled and discovered that erasing text was the bottleneck. I considered the situation hopeless, fiddled around with XBM graphics backed by a bit-vector and had little luck getting them to work nearly as well at low resolution. Only then did it occur to me that I hadn’t tried simply changing the text properties of the existing text instead of replacing it. That fixed all remaining performance issues. Another thing I realized is that anything higher-resolution than this will require extra trickery, maybe even involving C modules.

Garbage Collection Woes

Your code may be fast, your rendering impeccable, but what if every now and then your bouncing letters animation stutters? Congratulations, you’ve run into garbage collection ruining your day. In a language like C it’s much more obvious when you’re about to allocate memory from the heap; in a dynamic language it’s much harder to pin down what’s safe and what’s not. Patterns such as creating new objects on the fly are strictly forbidden, so I tried fairly hard to avoid them, but didn’t completely succeed. After staring hard at the code for a while I found that the code transferring the current to the old screen state was using copy-tree, which kept allocating vectors all the time. To avoid this I wrote a memcpy-style function that copies values from one array to another.

Another sneaky example was the initialization of the emulator state, which assigned zero-filled vectors to the variables. I noticed this one only thanks to the test runner printing the running times of tests. Most took a fraction of a millisecond, but every sixth or so took over 10 milliseconds for no obvious reason. This turned out to be garbage collection again. I rediscovered the fillarray function, which behaves much like memset in C, used it in initialization (with the vectors assigned at declaration time instead) and the pauses were gone. No guarantees that this was the last of it, but I haven’t been able to observe any other pauses.
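In Python terms, the pattern amounts to mutating preallocated arrays instead of building new ones each frame. The helpers below are illustrative stand-ins for the memcpy-style function and for fillarray:

```python
def copy_into(dst, src):
    """memcpy-style: overwrite dst's cells with src's, reusing dst."""
    for i in range(len(src)):
        dst[i] = src[i]

def fill(dst, value):
    """fillarray-style: reset every cell in place."""
    for i in range(len(dst)):
        dst[i] = value
```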

Sound

If your Emacs has been compiled with sound support there will be a play-sound function. Unfortunately it has a big flaw: as long as the sound is playing, Emacs blocks, so using it is a non-starter. I initially tried using the visual bell (which inverts parts of the screen) as a replacement, then discovered that it does the equivalent of sit-for, so calling it repeatedly in a row will, in the worst case of no pending user input, wait as long as all the intervals combined. There was therefore no easy built-in solution. To allow users to plug in their own, I defined two customizable functions defaulting to displaying and clearing a message: chip8-beep-start-function and chip8-beep-stop-function.

The idea here is that given a suitable asynchronous function you could kick off a beep, then later stop it. Spawning processes is the one thing you can easily do asynchronously, so if you had a way to make a subprocess start and stop playing a sound file, that would be a good enough solution. I then remembered that mplayer has a slave mode and that mpv improved on it in a multitude of ways, so I looked into the easiest way of remote-controlling it. It turns out that mpv did away with slave mode in favor of control via a FIFO or a socket. To my surprise I actually made it work via FIFO; the full proof of concept can be found in the README.
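A sketch of the FIFO side, assuming mpv was started with --input-file pointing at the FIFO and given a looping beep file (cycle pause is an mpv input command toggling playback; the path and helper name here are made up, not taken from the README):

```python
def send_mpv_command(fifo_path, command):
    """Write a single mpv input command to its control FIFO."""
    with open(fifo_path, "w") as fifo:
        fifo.write(command + "\n")

# The beep start and stop functions could then both boil down to:
# send_mpv_command("/tmp/mpv-fifo", "cycle pause")
```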

User input

The CHIP-8 supports two ways of checking user input: checking whether a key is (not) pressed (non-blocking) and waiting for any key to be pressed (blocking). Doing this in a game library wouldn’t be worth writing about, but this is Emacs after all: there are no key-release events, a distinction between up and down exists only for mouse events. After pondering this issue for a while I decided to fake it by keeping track of when keys were last pressed in a generic key handler function, then comparing that timestamp against the current time: if it’s below a reasonable timeout, the key is considered pressed, otherwise it isn’t.
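The timestamp trick, sketched in Python (the timeout value is a guess, not the one the emulator uses):

```python
import time

KEY_TIMEOUT = 0.1  # assumed: a key counts as held for 100ms after an event
last_pressed = {}  # hex key -> timestamp of the last key event

def on_key(key):
    """Generic key handler: record when the key was last seen."""
    last_pressed[key] = time.time()

def key_pressed_p(key):
    """A key counts as 'down' if it was seen within the timeout window."""
    return time.time() - last_pressed.get(key, 0.0) < KEY_TIMEOUT
```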

Solving the other problem required far more effort. The emulator was at this point sort of a state machine, as I’d tracked whether it was running with a boolean variable to implement a pause command. I reworked the variable and all code using it to be mindful of the current state: playing, paused or waiting for user input. This way the blocking command merely changes the current state to waiting for input, sets a global variable to the register to be filled with the pressed key and sets the stage for the generic key handler function to continue execution. If that function detects the waiting state and a valid key has been pressed, it records the key in the respective register and puts the emulator into the playing state again.

Actually testing this with a keypad demo ROM unveiled a minor bug in the interaction between the main loop and the redrawing logic. Remember that a number of CPU cycles were executed, then a redraw was triggered if needed? Well, imagine that in the middle of the CPU cycles to be executed the state changed to waiting and the redraw never happened! This would produce an inconsistent screen state, so I changed it to do a repaint immediately. Furthermore, if the state changed to waiting, the loop would still execute more cycles than needed (despite it being a blocking wait), so I had to add an extra check to the main loop’s constant amount of cycling for whether the state changed, and if so, skip the loop iteration altogether.

Super CHIP-8

At this point I was pretty much done with implementing the full CHIP-8 feature set and started playing games like Tetris, Brix and Alien.

/img/chip8-tetris-thumb.png /img/chip8-brix-thumb.png /img/chip8-alien-thumb.png

Yet I wasn’t satisfied, for some strange reason. I probably longed for more distraction and set out to implement the remaining Super CHIP-8 instructions. Unlike the main instruction set, these weren’t nearly as well documented. My main resource was a schip.txt file which briefly describes the extra instructions. The most problematic extension is the extended mode, which doubles the screen dimensions, requiring a clever way to draw a bigger or smaller screen whenever it’s toggled. There are two ways of implementing such a thing: drawing to one of two separate screen objects and painting the correct one, or alternatively always drawing to a big screen and rendering in a downscaled mode if needed. For simplicity’s sake I went with the first option.

The extra scroll extensions allow game programmers to efficiently change the viewport (though for some reason they forgot about an instruction for scrolling up). My challenge here was to change the screen’s contents in place; to do this correctly, extra care was necessary not to accidentally overwrite contents that still needed to be moved elsewhere. The trick is to iterate over the screen lines in reverse order when necessary.
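For example, scrolling down by n lines can be done in place by walking the rows bottom-up, so each source row is read before it is overwritten. A Python sketch over a flat screen array:

```python
W, H = 64, 32

def scroll_down(screen, n):
    """Shift all rows down by n in place, clearing the vacated top rows."""
    for y in range(H - 1, -1, -1):  # bottom row first
        for x in range(W):
            src_y = y - n
            screen[y * W + x] = screen[src_y * W + x] if src_y >= 0 else 0
```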

A few more instructions and optimizations later, I was ready to play the probably silliest arcade game ever conceived: Joust. The sprites in the picture below are supposed to be knights on flying ostriches trying to push each other down with their lances, but they look more like flying rabbits to me.

/img/chip8-joust-thumb.png

Other Musings

Writing an emulator gives you great insight into how a machine actually works. Details like memory mapping that you previously glossed over feel far more intuitive once you have to implement them yourself. One of the downsides is that I no longer played games for my own enjoyment, but to further improve the emulator and understand the machine.

A few games and demo ROMs revealed bugs in the emulator, such as how to deal with drawing sprites that go outside the screen boundaries. Cowgod’s Chip-8 Technical Reference tells you to wrap around, but Blitz by David Winter seems to think otherwise: when rendered with wrap-around, the player sprite immediately collides with a pixel on the edge and the “GAME OVER” screen is displayed. In this case I decided to forego that recommendation and clip the rendering at the screen edges.
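A Python sketch of sprite drawing with clipping rather than wrap-around; the XOR semantics and the VF collision flag follow Cowgod’s reference, while the clipping is the deviation described above:

```python
W, H = 64, 32

def draw_sprite(screen, x0, y0, sprite):
    """XOR an 8-pixel-wide sprite onto the screen, clipping at the edges.
    Returns 1 if any set pixel was erased (the value to store in VF)."""
    collision = 0
    for dy, row in enumerate(sprite):  # one byte per sprite line
        for dx in range(8):
            if not row & (0x80 >> dx):
                continue               # sprite bit not set, nothing to draw
            x, y = x0 + dx, y0 + dy
            if x >= W or y >= H:
                continue               # clip instead of wrapping around
            i = y * W + x
            if screen[i]:
                collision = 1          # XOR is about to erase a pixel
            screen[i] ^= 1
    return collision
```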

It’s not always easy to make such decisions. Some quirks seem fairly reasonable, such as preferably setting the VF flag to indicate an overflow/underflow condition for arithmetic, even where it’s not explicitly specified. Some quirks seem fairly obscure, such as the interpretation of Super CHIP-8 extensions in low-resolution mode: a demo insists that instead of drawing a high-resolution 16 x 16 sprite, an 8 x 16 one should be drawn instead. As this doesn’t appear to affect any game and would require significant support code, I decided against implementing it. In one case I was conflicted enough about the differing interpretations of the bit-shifting instructions that I introduced a customizable option to toggle between both, with the incorrect but popular behavior being the default.