EvilVM: Building an Assembler

This is part of my series on EvilVM, a project to create a tiny injectable VM with language support for Evil^TM agent features for pen-testers like me. In other words, a programming language and agent framework specifically designed for post-exploitation. The goal is to allow ad-hoc, reliable adaptation of malicious agents to situational needs during a pen-test. And to build it from scratch for fun (of course). The index and overview for this project can be found here.

Last time I added a few instructions to the VM so that I can start writing real code. As I manually created the machine code for a factorial() function, it became clear that I will need to write an assembler. This post is about my basic assembler – possibly getting into it a little deeper than the casual reader might want. Nevertheless, I’ve found that documenting this project in blog posts has really helped me tweak the design and find bugs, so here goes: let’s assemble!

TL;DR –> If you want to see the Assembler for EvilVM, here’s the source code

Overview

An assembler translates an assembly language into a machine language. You can get pretty complicated with an assembler. For an example of a big one, check out the NASM source code. But many of the features you find in a big-time assembler are not really necessary. My main need at this point is to avoid the hassle of encoding immediate values and calculating target addresses for jump instructions.

If you read the Wikipedia entry on assemblers, you’ll see that assemblers come in different shapes. For mental simplicity, I’ve chosen to build a multi-pass assembler. This diagram illustrates the essential process in assembling a program:

Read - Translate - Index - Jumps - Code

The main phases for my assembler are as follows:

Read – establish a representation of instruction mnemonics and labels, and code to read an assembly program in a format that suits translation into machine language.
Translate – in my assembly language, there are two kinds of instructions: (a) simple instructions, which can be immediately translated into their machine-code equivalents without requiring any contextual information, and (b) complex instructions, which require that the assembler interpret context before encoding them. Presently, only jump instructions are in the latter category, since their target locations cannot be known until the whole program is processed. This phase translates the simple instructions.
Index – in this pass, my assembler identifies labeled locations in the code and creates a data structure representing their locations.
Write Jumps – using the index created in the previous pass, in this pass the assembler creates machine-code translations for all jump instructions, since their target locations can be known.
Code – this final phase renders the listing of machine code instructions as a single sequence of bytes, which is the machine code that will be directly executable by the EvilVM.

Since I’m going through a Common Lisp phase lately, I’ll write my assembler in Lisp. Any language will obviously do, but there are several advantages in using Lisp to write my assembler. Mainly, Lisp gives me a good natural data structure for representing my assembly language, and nice high-level abstractions for implementing the phases above.

Details

My EvilVM design uses a pretty simple bytecode where each instruction’s op code is a single byte and either stands by itself or takes a 1- or 4-byte argument. So, let’s look at the following example, inspired by the classic Fizz Buzz program. This isn’t the whole thing; all it does is push a 'f' character on the stack if the number on top of the stack is divisible by 3. Otherwise, it just returns the number as given:

         dup
         pushi 3
         mod
         zerop
         jumpt :notfizz
         ret
notfizz: pop
         char #\f
         ret

We want to translate this into machine code that looks something like this:

di\x03\x00\x00\x00%zt\x06\x00\x00\x00<pcf<

Translate

The first decision to make is how I’m going to represent my assembly programs. I could go with a traditional syntax, but that requires more parsing. Since an assembly program is just a list of mnemonics, arguments, and labels, I can just use a list in Lisp. So here’s the above program represented as an input to my assembler:

((dup)
 (pushi 3)
 (mod)
 (zerop)
 (jumpt :notfizz)
 (ret)
 (label :notfizz)
 (pop)
 (char #\f)
 (ret))

Each element in the list is a mnemonic, with a couple special ones like the labels and jump targets. By representing these as keyword symbols in Common Lisp, it’ll be easy for me to recognize them as I’m translating this into machine code.

Since instruction mnemonics (e.g., the ‘names’ of the instructions, like jumpt or dup) translate to single byte op codes, we’ll start with some mapping of mnemonics to bytecodes:

(defparameter *instructions*
  (let ((table (make-hash-table))
        (pairs '((pushi . #\i)    (swap    . #\s)
                 (zerop . #\z)    (jumpt   . #\t)
                 (pop   . #\p)    (sub     . #\-)
                 (add   . #\+)    (mul     . #\*)
                 (jump  . #\j)    (halt    . #\h)
                 (dup   . #\d)    (prom    . #\^)
                 (print . #\o)    (eqi     . #\=)
                 (alloc . #\l)    (dealloc . #\L)
                 (load  . #\G)    (store   . #\S)
                 (getpc . #\@)    (call    . #\>)
                 (ret   . #\<)    (equal   . #\=)
                 (over  . #\O)    (emit    . #\e)
                 (char  . #\c)    (not     . #\!)
                 (mod   . #\%))))
    (loop for (i . c) in pairs do
      (setf (gethash i table) c))
    table))

We’ll make several passes through the list, incrementally bringing it closer to our final machine code representation. The first step will simply involve translating mnemonics to string representations of the instructions. Since jumps and labels are “hard”, we’ll just save them for later. Here’s #'translate:

(defun translate (desc)
  (loop
     for (op arg) in desc 
     for instr = (gethash op *instructions*) collect
       (cond
         ((eq op 'label) (list op arg))
         ((symbolp arg)  (list instr arg))
         ((eq nil arg)   (string instr))
         (t              (op-with-arg instr arg)))))

At its core, #'translate walks through the list of instructions and translates them to strings if possible. The special cases are when we encounter either a label or a jump – identifiable by the presence of the 'label mnemonic or an argument that is a symbol. If we #'translate our fizz program above, we’ll get something like this:

* (translate *fizz*)

("d"
 "i^C^@^@^@"
 "%"
 "z"
 (#\t :NOTFIZZ)
 "<"
 (LABEL :NOTFIZZ)
 "p"
 "cf"
 "<")

Note that the “i^C^@^@^@” is the pushi instruction, which takes in a literal argument as an integer to push on the stack. The #'op-with-arg function handles conversion of arguments to things like little-endian 32-bit integers or characters, etc.

Index

Now that the easy translation is done, we want to insert the jumps. To do this simply, we’ll make a couple passes. In the first pass, we’ll find all the labels and make a map of the labels and their indices in the list of instructions. The code to do this looks something like this:

(defun label-indices (list)
  (flet ((labelp (i) (and (listp i) (eq 'label (first i)))))
    (let ((hash (make-hash-table)))
      (loop 
         for index from 0
         for i in list
         when (labelp i)
         do (setf (gethash (second i) hash) index))
      hash)))

The helper function labelp is a predicate that determines whether or not a given element in the list of translated instructions is a label. Simple instructions are already converted into strings, so the remaining special ones are lists (hence the test with #'listp) and have the mnemonic label as their first element.

The rest of the function loops through the instructions while keeping track of their numerical index in the list so we can create a little hash table mapping label names to numeric indices. For the list above, this’ll look something like this (Common Lisp doesn’t have a literal syntax for hashes, so this is what it would look like in pseudo-Ruby):

{ :NOTFIZZ => 6 }

As you can imagine, this would be much more complex in a more useful assembly program. Once we know what these mappings look like, we want to build a list of insertions that we’re going to make to render the jump* instructions correctly. This will be done by the #'find-inserts function:

(defun find-inserts (list)
  (let* ((indices (label-indices list)))
    (loop 
       for index from 0
       for i in list
       when (jumpp i) collect
         (calculate-jump list
                         index
                         (gethash (second i) indices)))))

The result of this function will be a mapping (this time an association list) of indices where a jump needs to be inserted and the distance (positive or negative) required for the jump. The #'calculate-jump function basically takes the list of instructions and a from / to index pair, and then calculates how many bytes that jump will be in the final machine code. The guts of it are actually a function called #'distance, which takes a list of translated machine code represented by the jump and determines how long the resulting single-string representation will be.

Write Jumps

Finally, we need to walk through the code in one more pass to apply the jumps we just calculated. I could have done these two things in one pass (calculate and insert jumps), but as I wrote it I thought of it as two mental steps, and I have just left it this way:

(defun patch-addresses (list)
  (let ((inserts (find-inserts list)))
    (loop
       for index from 0
       for token in list
       for jump = (assoc index inserts)
       unless (markupp token) collect
         (if jump 
             (op-with-arg (first token) (cdr jump))
             token))))

Code

I then add the #'assemble and #'main functions, which wraps all of the above together to assemble code into machine code and provide an entrypoint to the program:

(defun assemble (list)
  (format nil "~{~A~}" (patch-addresses (translate list))))

(defun main ()
  (let* ((source (read))
         (code   (assemble source))
         (bytes  (flexi-streams:string-to-octets code)))
    (write-sequence bytes *standard-output*))
  (sb-ext:exit))

And create a shell script to wrap up my assembler:

#!/bin/bash

sbcl --noinform --load assembler.lisp --eval "(main)"

So, now returning to the factorial() example from my last post, here’s the source code:

((dup)
 (dup)
 (pushi 0)
 (equal)
 (jumpt :ret0)
 (label :loop)
 (pushi 1)
 (sub)
 (swap)
 (over)
 (pushi 0)
 (equal)
 (jumpt :ret)
 (over)
 (mul)
 (swap)
 (jump :loop)
 (label :ret0)
 (pop)
 (pop)
 (pushi 1)
 (pushi 1)
 (label :ret)
 (swap)
 (pop)
 (halt))

Storing that in a file, then here is the result of running my basic assembler:

Example showing assembly in action

Or, for inserting into my C test suite:

$ ./assemble.sh < fact.as | dk cstring
\x64\x64\x69\x00\x00\x00\x00\x3d\x74\x20\x00\x00\x00\x69\x01\x00
\x00\x00\x2d\x73\x4f\x69\x00\x00\x00\x00\x3d\x74\x19\x00\x00\x00
\x4f\x2a\x73\x6a\xea\xff\xff\xff\x70\x70\x69\x01\x00\x00\x00\x69
\x01\x00\x00\x00\x73\x70\x68

Conclusion

So in 97 lines of reasonably abstract Common Lisp, I have a working assembler so that I can easily write code for EvilVM. Of course, my intent is not to make my end-users write code in assembly, but as I build my basic language constructs, I’ll be using it quite a bit. For example (spolier alert!), here’s the if function (yes, if will be a function in my language!):

((jumpt :truequote)
 (swap)
 (pop)
 (call)
 (ret)
 (label :truequote)
 (pop)
 (call)
 (ret)))
	
=> \x74\x09\x00\x00\x00\x73\x70\x3e\x3c\x70\x3e\x3c

This, and many other primitives, for my core language will be implemented in raw EvilVM assembly. Once I have the necessary core built, then the rest of the language will be built on top of that foundation. There are a couple tidbits in that listing above that I don’t think I’ve covered on the blog yet, but I’ll get to them next time. In the next post, I’m going to talk about concatenative languages and control flow!