LLVM for Grad Students

These are some notes on doing research with the LLVM compiler infrastructure. They should be enough for a grad student to go from mostly uninterested in compilers to excited to learn more about how to use LLVM.

What is LLVM?

LLVM is a compiler. It’s a really nice ahead-of-time compiler for “native” languages like C and C++.

Of course, since LLVM is so awesome, you will also hear that it is much more than this (it can also be a JIT; it powers a great diversity of un-C-like languages; it is the bitcode format for the App Store; etc.). These are all true, but for our purposes, the above definition is what matters.

A few huge things make LLVM different from other compilers:

Why Would a Grad Student Care About LLVM?

LLVM is a great compiler, but who cares if you don’t do compilers research? Your first reaction may be that compiler hacking is mostly useful for implementing new compiler optimizations, and your research is almost certainly not on new compiler optimizations! But the same factors that make LLVM passes good for implementing optimizations also make them good for a surprisingly broad array of other things that researchers want to do.

A compiler infrastructure is useful whenever you need to do stuff with programs. Which, in my experience, is kind of a lot. You can analyze programs to see how often they exhibit a certain behavior you’re interested in, transform them to work better with your system, or change them to pretend to use your hypothetical new architecture or OS without actually fabbing a new chip or writing a kernel module. For grad students, a compiler infrastructure is more often the right tool than most people give it credit for. I encourage you to reach for LLVM by default before any of these tools unless you have a really good reason:

Even if a compiler doesn’t seem like a perfect match for your task, it can often get you 90% of the way there far more easily than, say, a source-to-source translation.

Here are some nifty examples of research projects that used LLVM to do things that are not necessarily all that compilery:

The Pieces

A picture, which could be a picture of basically any realistic compiler.

Getting Oriented

Documentation. Doxygen. Source code (from the git (GitHub?) mirror). Build instructions. Packages from Homebrew, etc.

Often, the version of LLVM that comes with your OS doesn’t have all the headers necessary to hack with it. You’ll need to install it from source. Brandon Holt has good instructions for building it “right” on OS X. There’s also a Homebrew formula, to which you’ll want to pass the --with-clang option.

So We’re Going to Write a Pass

An example template to start from, including build system.

(You may also want to check out the “Writing an LLVM Pass” tutorial. If you do, ignore the Makefile-based build system instructions and skip straight to the CMake-based “out-of-source” instructions, which are the only rational course of action.)

A Skeleton

Clone the llvm-pass-skeleton repository from GitHub. It contains a useless LLVM pass where we can do our work.

Here’s the relevant part of Skeleton.cpp:

virtual bool runOnFunction(Function &F) {
  errs() << "I saw a function called " << F.getName() << "!\n";
  return false;
}

There are several kinds of LLVM pass, and we’re using one called a function pass (it’s a good place to start). Exactly as you would expect, LLVM invokes that function above with every function it finds in the program we’re compiling. For now, all it does is print out the name.
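The return value matters too: returning false from runOnFunction tells LLVM that the pass did not modify the function. To make the "called once per function" idea concrete, here is a toy model in plain C++. None of these types are LLVM's real classes; ToyFunction, ToyFunctionPass, NameCollector, and runOnAll are made-up names, and the driver is only a sketch of what the real pass manager does for us.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in, NOT llvm::Function: a function is just a name here.
struct ToyFunction {
  std::string name;
};

// Mirrors the shape of the hook shown in Skeleton.cpp.
struct ToyFunctionPass {
  virtual ~ToyFunctionPass() = default;
  // Returns true only if the pass modified the function.
  virtual bool runOnFunction(ToyFunction &F) = 0;
};

// An analysis-only pass: it records names and changes nothing.
struct NameCollector : ToyFunctionPass {
  std::vector<std::string> seen;
  bool runOnFunction(ToyFunction &F) override {
    assert(!F.name.empty() && "toy functions must be named");
    seen.push_back(F.name);
    return false;  // Nothing was modified.
  }
};

// Hypothetical driver: call the hook once per function, in program order,
// and report whether any call changed anything.
bool runOnAll(std::vector<ToyFunction> &program, ToyFunctionPass &pass) {
  bool changed = false;
  for (auto &F : program) {
    changed |= pass.runOnFunction(F);
  }
  return changed;
}
```

A pass that only looks at the program, like the skeleton, always reports "no change"; a pass that rewrites code should return true instead.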

Details:

Build It

Build the pass with CMake:

$ cd llvm-pass-skeleton
$ mkdir build
$ cd build
$ cmake ..  # Generate the Makefile.
$ make  # Actually build the pass.

If LLVM isn’t installed globally, you will need to tell CMake where to find it. You can do that by setting the LLVM_DIR environment variable to the path of the share/llvm/cmake/ directory inside wherever LLVM resides. Here’s an example with the path from Homebrew:

$ LLVM_DIR=/usr/local/opt/llvm/share/llvm/cmake cmake ..

Run It

To run your new pass, you just have to invoke clang on some C program and use some freaky flags to get it in place:

$ clang -Xclang -load -Xclang build/skeleton/libSkeletonPass.* something.c
I saw a function called main!

(You can also run passes one at a time, independently from invoking clang, with LLVM’s opt command. I won’t cover that here.)

Understanding a Program in LLVM

An LLVM program is a hierarchy of containers: a Module holds Functions, each Function holds BasicBlocks, and each BasicBlock holds Instructions.

We can inspect all of these objects with a convenient common method in LLVM named dump(). It just prints out the human-readable representation of an object in the IR. Here’s some code to do that, which is available in the containers branch of the llvm-pass-skeleton repository:

errs() << "Function body:\n";
F.dump();

for (auto &B : F) {
  errs() << "Basic block:\n";
  B.dump();

  for (auto &I : B) {
    errs() << "Instruction: ";
    I.dump();
  }
}

Using C++11’s fancy auto and range-based for syntax makes the containment of LLVM’s object hierarchy clear.
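The loop nest above works because each container exposes begin() and end() iterators over the things it holds. Here is a toy model of that nesting in plain C++, just to show the shape. ToyInstruction, ToyBasicBlock, ToyFunction, and countOpcode are made-up names, not LLVM's classes, and an "instruction" is reduced to an opcode string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-ins for the real LLVM classes; only the nesting is modeled.
struct ToyInstruction {
  std::string opcode;
};

struct ToyBasicBlock {
  std::vector<ToyInstruction> insts;
  // begin()/end() are what make `for (auto &I : B)` work.
  std::vector<ToyInstruction>::iterator begin() { return insts.begin(); }
  std::vector<ToyInstruction>::iterator end() { return insts.end(); }
};

struct ToyFunction {
  std::string name;
  std::vector<ToyBasicBlock> blocks;
  std::vector<ToyBasicBlock>::iterator begin() { return blocks.begin(); }
  std::vector<ToyBasicBlock>::iterator end() { return blocks.end(); }
};

// Same loop nest as the pass, but counting matches instead of dumping.
std::size_t countOpcode(ToyFunction &F, const std::string &opcode) {
  assert(!opcode.empty());
  std::size_t count = 0;
  for (auto &B : F) {
    for (auto &I : B) {
      if (I.opcode == opcode) ++count;
    }
  }
  return count;
}
```

This is the "see how often a program does something" kind of analysis in miniature: walk every instruction and count the ones you care about.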

Most things are Values (including globals and constants, like 5).

The SSA graph (what is SSA?).
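SSA stands for static single assignment: every value is defined exactly once, so a use can point straight at the one thing that defines it. Here is a toy model of that idea in plain C++. ToyValue, ToyConstant, and ToyAdd are made-up names and not LLVM's classes; the only point is that a constant and an instruction are both values, and that an instruction's operands are pointers to other values.

```cpp
#include <cassert>

// Toy model: anything an instruction can use is a value.
struct ToyValue {
  virtual ~ToyValue() = default;
  virtual int eval() const = 0;
};

// A constant such as 5 is a value with no operands.
struct ToyConstant : ToyValue {
  int number;
  explicit ToyConstant(int n) : number(n) {}
  int eval() const override { return number; }
};

// An instruction is also a value. Its operands point directly at the
// values that define them, which is what forms the graph.
struct ToyAdd : ToyValue {
  const ToyValue *lhs;
  const ToyValue *rhs;
  ToyAdd(const ToyValue *l, const ToyValue *r) : lhs(l), rhs(r) {
    assert(l != nullptr && r != nullptr);
  }
  int eval() const override { return lhs->eval() + rhs->eval(); }
};
```

Because the result of one ToyAdd can be an operand of another, there are no named variables to look up: following the operand pointers is enough to find where any value came from.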

Now Make the Pass Do Something Mildly Interesting

The real magic comes in when you look for patterns in the program and, optionally, change the code when you find them. Here’s a really simple example: let’s say we want to switch the order of every binary operator in the program. So a + b will become b + a. Sounds useful, right?

for (auto &B : F) {
  for (auto &I : B) {
    if (auto *op = dyn_cast<BinaryOperator>(&I)) {
      op->swapOperands();
    }
  }
}

Details:

Now if we compile a program like this:

#include <stdio.h>
int main(int argc, const char **argv) {
    printf("%i\n", argc - 2);
}

You can see the subtraction goes the wrong way!
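To spell out what "the wrong way" means: run with no extra arguments, argc is typically 1, so argc - 2 is -1, but after the swap the program computes 2 - argc and prints 1. Here is that arithmetic as a toy model in plain C++. ToySub and subtractTwo are made-up names; this is not LLVM's BinaryOperator, just a stand-in for one subtraction with two operands.

```cpp
#include <cassert>
#include <utility>

// Toy stand-in for a single subtraction instruction.
struct ToySub {
  int lhs;
  int rhs;
  // Exchange the two operands, as the pass does to every binary operator.
  void swapOperands() { std::swap(lhs, rhs); }
  int eval() const { return lhs - rhs; }
};

// Models the expression `argc - 2` with and without the pass applied.
int subtractTwo(int argc, bool runPass) {
  assert(argc >= 0);
  ToySub op{argc, 2};
  if (runPass) {
    op.swapOperands();
  }
  return op.eval();
}
```

For a commutative operator like addition the swap would be invisible, which is why the test program uses subtraction.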

That would be incredibly challenging to do as a raw source-code transformation. It would be easier at the AST level, but do you really want to worry about templates, etc.?

Eventually, explain IRBuilder.

Linking With a Runtime Library

You will probably want the transformed program to call some code you wrote at run time. Don’t write that code by generating LLVM instructions!
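One way to follow that advice is to write the run-time code in an ordinary source file and have the pass insert nothing more than a call to it. Here is a minimal sketch of such a file. The names rt_logop and rt_logop_calls are made up for illustration, and the sketch assumes the pass refers to the function by its symbol name, which is why the functions are declared extern "C" (so the name is not mangled by the C++ compiler).

```cpp
#include <cassert>
#include <cstdio>

// How many times the instrumented program has called rt_logop.
static int g_logop_calls = 0;

// Hypothetical run-time hook: the pass would insert a call to this
// after an instruction it cares about, passing the computed result.
extern "C" void rt_logop(int value) {
  std::printf("computed: %i\n", value);
  ++g_logop_calls;
}

// Hypothetical accessor so the program (or a test) can read the count.
extern "C" int rt_logop_calls() {
  return g_logop_calls;
}
```

You would compile this file separately and link it with the instrumented program, so the logic lives in normal, debuggable C++ and the pass stays small.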

Annotations

Most projects eventually need to interact with the programmer. Some ways to do this:

Not Covered Here