These are some notes on doing research with the LLVM compiler infrastructure. They should be enough to take a grad student from mostly uninterested in compilers to excited to learn more about how to use LLVM.
LLVM is a compiler. It’s a really nice ahead-of-time compiler for “native” languages like C and C++.
Of course, since LLVM is so awesome, you will also hear that it is much more than this (it can also be a JIT; it powers a great diversity of un-C-like languages; it is the bytecode format for the App Store; etc., etc.). These claims are all true, but for most purposes, the above definition is what matters.
A few huge things make LLVM different from other compilers, chief among them its use of a single, well-documented intermediate representation (IR) that the whole toolchain revolves around.
LLVM is a great compiler, but who cares if you don’t do compilers research? Your first reaction may be that compiler hacking is mostly useful for implementing new compiler optimizations. But your research is almost certainly not on new compiler optimizations! Nonetheless, the same factors that make LLVM passes good for implementing optimizations also make them good for a surprisingly broad array of things that researchers want to do.
A compiler infrastructure is useful whenever you need to do stuff with programs. Which, in my experience, is kind of a lot. You can analyze programs to see how often they do a certain behavior you’re interested in, transform them to work better with your system, or change them to pretend to use your hypothetical new architecture or OS without actually fabbing a new chip or writing a kernel module. For grad students, a compiler infrastructure is more often the right tool than most people give it credit for. I encourage you to reach for LLVM by default before source-code transformation tools (from simple text munging with sed to complicated stuff like AST parsing and serialization) unless you have a really good reason. Even if a compiler doesn’t seem like a perfect match for your task, it can often get you 90% of the way there far more easily than, say, a source-to-source translation.
Here are some nifty examples of research projects that used LLVM to do things that are not necessarily all that compilery:
(Figure: the architecture of a typical compiler, which could be a picture of basically any realistic compiler.)
Useful resources include the documentation, the Doxygen pages, the source code (from the Git mirror on GitHub), the build instructions, and packages from Homebrew, etc.
Often, the version of LLVM that comes with your OS doesn’t have all the headers necessary to hack with it, so you’ll need to install it from source. Brandon Holt has good instructions for building it “right” on OS X. There’s also a Homebrew formula, to which you’ll want to pass the --with-clang option.
An example template to start from, including build system.
You may also want to check out the “Writing an LLVM Pass” tutorial. (If you do, ignore the Makefile-based build system instructions and skip straight to the CMake-based “out-of-source” instructions, which are the only rational course of action.)
Clone the llvm-pass-skeleton repository from GitHub. It contains a useless LLVM pass where we can do our work.
Here’s the relevant part of Skeleton.cpp:
virtual bool runOnFunction(Function &F) {
errs() << "I saw a function called " << F.getName() << "!\n";
return false;
}
There are several kinds of LLVM pass, and we’re using one called a function pass (it’s a good place to start). Exactly as you would expect, LLVM invokes the method above on every function it finds in the program we’re compiling. For now, all it does is print out the name.
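For context, runOnFunction lives inside a FunctionPass subclass, and the pass has to be registered so clang will run it. Here’s a sketch of the surrounding boilerplate, along the lines of what the skeleton repository contains (details may differ from the actual repo, and it assumes the legacy pass manager):

```cpp
#include "llvm/Pass.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Transforms/IPO/PassManagerBuilder.h"
using namespace llvm;

namespace {
  // A function pass: LLVM calls runOnFunction once per function.
  struct SkeletonPass : public FunctionPass {
    static char ID;
    SkeletonPass() : FunctionPass(ID) {}

    virtual bool runOnFunction(Function &F) {
      errs() << "I saw a function called " << F.getName() << "!\n";
      return false;  // we didn't modify anything
    }
  };
}

char SkeletonPass::ID = 0;

// Register the pass so it runs automatically when the plugin is loaded.
static void registerSkeletonPass(const PassManagerBuilder &,
                                 legacy::PassManagerBase &PM) {
  PM.add(new SkeletonPass());
}
static RegisterStandardPasses
    RegisterMyPass(PassManagerBuilder::EP_EarlyAsPossible,
                   registerSkeletonPass);
```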
Details:

- The errs() thing is an LLVM-provided C++ output stream we can use to print to the console.
- The function returns false to indicate that it didn’t modify F.

Build the pass with CMake:
$ cd llvm-pass-skeleton
$ mkdir build
$ cd build
$ cmake .. # Generate the Makefile.
$ make # Actually build the pass.
If LLVM isn’t installed globally, you will need to tell CMake where to find it. You can do that by putting the path to the share/llvm/cmake/ directory inside wherever LLVM resides into the LLVM_DIR environment variable. Here’s an example with the path from Homebrew:
$ LLVM_DIR=/usr/local/opt/llvm/share/llvm/cmake cmake ..
To run your new pass, you just have to invoke clang on some C program and use some freaky flags to get it in place:
$ clang -Xclang -load -Xclang build/skeleton/libSkeletonPass.* something.c
I saw a function called main!
(You can also run passes one at a time, independently from invoking clang, with LLVM’s opt command. I won’t cover that here.)
LLVM’s IR is organized hierarchically: a Module contains Functions, which contain BasicBlocks, which contain Instructions.
We can inspect all of these objects with a convenient common method in LLVM named dump(). It just prints out the human-readable representation of an object in the IR. Here’s some code to do that, which is available in the containers branch of the llvm-pass-skeleton repository:
errs() << "Function body:\n";
F.dump();
for (auto &B : F) {
errs() << "Basic block:\n";
B.dump();
for (auto &I : B) {
errs() << "Instruction: ";
I.dump();
}
}
Using C++11’s fancy auto and foreach syntax makes the containment of LLVM’s object hierarchy clear.
Most things in LLVM IR are Values, including globals and constants (like the number 5).
Instructions refer to their operands directly, so together they form the SSA graph. (What is SSA? Static single assignment: every value is defined exactly once, and every use points at its unique definition.)
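For a concrete taste, here’s roughly what an optimized build of a function computing (a + b) * a looks like in IR; note that each %name is defined exactly once:

```llvm
define i32 @compute(i32 %a, i32 %b) {
entry:
  %sum = add nsw i32 %a, %b      ; %sum is defined once, here
  %prod = mul nsw i32 %sum, %a   ; uses point back at definitions
  ret i32 %prod
}
```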
The real magic comes in when you look for patterns in the program and, optionally, change the code when you find them. Here’s a really simple example: let’s say we want to swap the operands of every binary operator in the program, so a + b will become b + a. Sounds useful, right?
for (auto &B : F) {
for (auto &I : B) {
if (auto *op = dyn_cast<BinaryOperator>(&I)) {
op->swapOperands();
}
}
}
Details:

- To find methods like swapOperands(), you just have to dig around. The best way is to click around in Doxygen (here’s the page for BinaryOperator). Or you can train Google to love to look in the LLVM docs (when I search for “binaryoperator”, it knows exactly what I want).
- The dyn_cast<T>(p) construct is LLVM-specific. It uses some conventions from the LLVM codebase to make type checks and such really efficient, since, in practice, compilers have to use them all the time. This particular construct returns a null pointer if I is not a BinaryOperator, so it’s perfect for special-casing like this.

Now if we compile a program like this:
#include <stdio.h>
int main(int argc, const char **argv) {
printf("%i\n", argc - 2);
}
You can see the subtraction goes the wrong way!
That would be incredibly challenging to do as a raw source-code transformation. It would be easier at the AST level, but do you really want to worry about templates, etc.?
Eventually you’ll want to create new instructions, not just inspect or rearrange existing ones; that’s what IRBuilder is for.
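As a teaser, here’s a sketch of how IRBuilder might be used inside the loop from the swap example above, assuming op is the BinaryOperator we matched: it builds a new multiply just before op and reroutes every use of op to the new value.

```cpp
// Inside runOnFunction, with op a BinaryOperator* found via dyn_cast.
IRBuilder<> builder(op);  // new instructions are inserted right before op

Value *lhs = op->getOperand(0);
Value *rhs = op->getOperand(1);
Value *mul = builder.CreateMul(lhs, rhs);

// Everywhere the old instruction was used, use the new value instead.
for (auto &U : op->uses()) {
  User *user = U.getUser();
  user->setOperand(U.getOperandNo(), mul);
}
```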
At some point, you’ll probably want the compiled program to call into some code you wrote. Don’t write that code by generating LLVM instructions by hand! Instead, write a runtime library in ordinary C or C++, compile it separately, link it with the program, and have your pass insert calls to it.
Most projects eventually need to interact with the programmer. Some ways to do this:
Annotations: Clang supports __attribute__((annotate("foo"))) on declarations, and your pass can read the annotation back out of the IR.