Building a Compiler, Part 1

I want to build a compiler. For simplicity, I will target x64 Linux only. The compiler will produce executable ELF files directly — I do not want heavy dependencies and writing a linker and maybe an assembler is kind of boring.[1] Also, I like small binaries but the usual tools produce massive binaries: with static musl, it takes ~30KiB to call printf("Hello, world!\n"), or ~15KiB if you use write instead. I will probably not get small binaries unless I get around the system assembler and linker.

So, I just read the man page for elf(5) and wrote hex. The first chunk below is the ELF header. The second chunk describes the code segment, and the third chunk describes the data segment. The second and third chunks are called program headers, and collectively called the program header table. After these comes the actual code and data. Pretty simple.

7F 45 4C 46 02 01 01 00
00 00 00 00 00 00 00 00
02 00 3E 00 01 00 00 00
B0 00 01 00 00 00 00 00  # entry address
40 00 00 00 00 00 00 00  # program header offset
00 00 00 00 00 00 00 00
00 00 00 00 40 00 38 00
02 00 40 00 00 00 00 00  # 02 00: number of segments

01 00 00 00 05 00 00 00
B0 00 00 00 00 00 00 00  # offset
B0 00 01 00 00 00 00 00  # address
00 00 00 00 00 00 00 00
24 00 00 00 00 00 00 00  # size in file
24 00 00 00 00 00 00 00  # size in memory
00 00 00 00 00 00 00 00

01 00 00 00 06 00 00 00
D4 00 00 00 00 00 00 00  # offset
D4 00 02 00 00 00 00 00  # address
00 00 00 00 00 00 00 00
0E 00 00 00 00 00 00 00  # size in file
0E 00 00 00 00 00 00 00  # size in memory
00 00 00 00 00 00 00 00

B8 01 00 00 00           # eax = 1 (write)
BF 01 00 00 00           # edi = 1 (stdout)
48 8D 35 13 00 01 00     # rsi = rip + 0x10013 = 0x200D4
BA 0E 00 00 00           # edx = 14
0F 05

B8 3C 00 00 00           # eax = 60 (exit)
BF 00 00 00 00           # edi = 0
0F 05

48 65 6C 6C 6F 2C 20
77 6F 72 6C 64 21 0A     # "Hello, world!\n"
$ sed 's/ \+#.*//' hello.hex | xxd -r -p > hello
$ chmod +x hello
$ ./hello
Hello, world!

Pretty neat, huh?

Not really. Try changing the 0x10013 to 0x13, so rsi points just after the code. It would still work, which is not good. The thing here is that the segments are mapped into memory in 4KiB pages, so the code segment actually maps the whole file to 0x10000 and the data segment maps the whole file to 0x20000. This sounds pretty bad, and I am not sure what will happen if we try to write to 0x200C2 (the 0E byte).

So, we do need to pad each segment to a 4KiB page boundary. That means the ELF header and the program header table, 176 bytes in total, would take up the first 4KiB of our binary. When we get to it, we can put read-only data after the headers, so it is not totally lost.

Page alignment also solves another nasty problem. Even if the offset of a segment is not aligned to a page boundary, it must align with its address. That is why the code segment gets mapped to 0x100B0 instead of 0x10000, which would be nicer. Without padding, the length of the code segment would influence the offset of the data segment, and thus influence the addresses of data objects. Then, we have to perform some kind of relocation after all code have been generated, which would be nasty. In this compiler, I want to avoid relocations as much as possible.[2]

It is now clear how the compiler will generate ELF files. The ELF header and program headers go first, followed by read-only data. Align to 4KiB and emit the data. Align to 4KiB again and emit the code. The bad news is that we will probably not fit a C compiler in ~11KiB.

Footnotes

  1. Actually, I once tried writing a x64 assembler and it did not go well. Also, not every instruction will be used in compiler output, anyway.

  2. Which might be "not very much", and that would be fine.