This page explains, with three simple examples, the basic structure of a PDF file and how to draw lines and simple text. We only consider PDF 1.0 and ignore embedded fonts for simplicity.
The following is a blank page of A4 paper (210 mm × 297 mm, or roughly 595.276 pt × 841.890 pt, where a point is 1/72 of an inch):
%PDF-1.0

1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj

2 0 obj
<< /Type /Pages /Count 1 /Kids [3 0 R] >>
endobj

3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595.276 841.89] >>
endobj

xref
0 4
0000000000 65535 f 
0000000010 00000 n 
0000000060 00000 n 
0000000118 00000 n 
trailer
<< /Size 4 /Root 1 0 R >>
startxref
197
%%EOF
Three objects are defined:
a /Catalog object, a /Pages object,
and a /Page object.
Objects reference each other by number, as in 2 0 R.
The /MediaBox of the /Page object
specifies the page's dimensions.
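The /MediaBox numbers follow directly from the definition of a point. A minimal sketch of the conversion (the helper name `mm_to_pt` is made up for this illustration):

```python
def mm_to_pt(mm):
    """Convert millimetres to PDF points (1 pt = 1/72 inch, 1 inch = 25.4 mm)."""
    return mm / 25.4 * 72


# A4 paper is 210 mm x 297 mm.
width = round(mm_to_pt(210), 3)
height = round(mm_to_pt(297), 3)
print(width, height)
```

Rounded to three decimals, these are the two values used in the /MediaBox above.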
The cross-reference (xref) table
contains byte offsets of the objects,
in this case 10, 60, and 118.[1]
The line after xref, 0 4, specifies that the table holds 4 entries, numbered from 0.
Object 0 (the head of the free list) and the second and third columns of the table
(the generation number and the in-use/free flag) are for incremental updates.[2]
The trailer follows the xref table.
It specifies the number of objects in the xref table
and the /Root of the document, the /Catalog object.
Below startxref is the byte offset of the xref table.
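To make the lookup order concrete, here is a rough sketch of how a reader might locate objects: find startxref at the end of the file, jump to the xref table, and read the offsets. It assumes LF line endings, a single xref section, and the blank-page file written out with one blank line after the header and after each object; that layout is an assumption, chosen because it reproduces the offsets quoted in the text. The names `PDF` and `read_xref` are invented for this example.

```python
# The blank-page file as bytes (layout is an assumption, see above).
PDF = (
    b"%PDF-1.0\n\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n\n"
    b"2 0 obj\n<< /Type /Pages /Count 1 /Kids [3 0 R] >>\nendobj\n\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R"
    b" /MediaBox [0 0 595.276 841.89] >>\nendobj\n\n"
    b"xref\n0 4\n"
    b"0000000000 65535 f \n"
    b"0000000010 00000 n \n"
    b"0000000060 00000 n \n"
    b"0000000118 00000 n \n"
    b"trailer\n<< /Size 4 /Root 1 0 R >>\n"
    b"startxref\n197\n%%EOF\n"
)


def read_xref(data):
    """Return (xref offset, {object number: (offset, generation, flag)})."""
    # The number after the last "startxref" is the offset of the xref table.
    tail = data[data.rindex(b"startxref") + len(b"startxref"):]
    xref_pos = int(tail.split()[0])

    lines = data[xref_pos:].split(b"\n")
    if lines[0] != b"xref":
        raise ValueError("startxref does not point at an xref table")

    # "0 4" means: 4 entries, the first one describing object 0.
    first, count = map(int, lines[1].split())
    entries = {}
    for i in range(count):
        offset, generation, flag = lines[2 + i].split()
        entries[first + i] = (int(offset), int(generation), flag.decode())
    return xref_pos, entries


xref_pos, entries = read_xref(PDF)
start = entries[1][0]
print(xref_pos, PDF[start:start + 7])
```

Slicing the file at an entry's offset lands exactly on the corresponding `N 0 obj` header, which is what lets a reader jump to any object without scanning the file.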
Note that each row in the xref table has a trailing space
to make it exactly 20 bytes long,[3]
and the /Pages object's /Count
is not just the length of its /Kids.[4]
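Counting byte offsets by hand is error-prone, so it can help to let a small script do it. The following is a sketch only: `build_pdf` is a hypothetical helper, and it assumes LF line endings with one blank line after the header and after each object, a layout that happens to reproduce the offsets quoted above.

```python
def build_pdf(bodies):
    """Assemble a PDF 1.0 file from object bodies numbered 1, 2, 3, ...

    Returns (file bytes, object offsets, xref offset).
    """
    out = bytearray(b"%PDF-1.0\n\n")
    offsets = []
    for num, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n\n"

    xref_pos = len(out)
    size = len(bodies) + 1  # object 0 is counted too
    out += b"xref\n0 %d\n" % size
    # Each row is exactly 20 bytes: 10 + 1 + 5 + 1 + 1 + space + LF.
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % size
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out), offsets, xref_pos


blank_page = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Count 1 /Kids [3 0 R] >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595.276 841.89] >>",
]
pdf, offsets, xref_pos = build_pdf(blank_page)
print(offsets, xref_pos)
```

Writing `pdf` to a file in binary mode should give a blank A4 page; the printed offsets are the ones that end up in the xref table and after startxref.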
Page contents are stored in streams, which are just objects with extra data:[5]
4 0 obj
<< /Length 17 >>
stream
0 0 m 200 200 l S
endstream
endobj
The stream above draws ("strokes") a line
from (0,0) to (200,200)
when used as the /Contents of a /Page.
See this cheatsheet for details of the language.
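The /Length value is simply the number of bytes between the line ending after stream and the line ending before endstream. A small sketch that wraps stream data and fills in /Length automatically (`make_stream_object` is a made-up helper, and LF line endings are assumed):

```python
def make_stream_object(num, data):
    """Wrap raw stream data in an object whose /Length is computed."""
    header = b"%d 0 obj\n<< /Length %d >>\nstream\n" % (num, len(data))
    # The LF before "endstream" is not counted in /Length.
    return header + data + b"\nendstream\nendobj\n"


line = b"0 0 m 200 200 l S"  # move to (0,0), line to (200,200), stroke
obj = make_stream_object(4, line)
print(obj.decode("ascii"))
```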
To make it work,
the page needs a /ProcSet resource called /PDF,
as shown below.
%PDF-1.0

1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj

2 0 obj
<< /Type /Pages /Count 1 /Kids [3 0 R] >>
endobj

3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595.276 841.89]
   /Resources << /ProcSet [/PDF] >> /Contents 4 0 R >>
endobj

4 0 obj
<< /Length 17 >>
stream
0 0 m 200 200 l S
endstream
endobj

xref
0 5
0000000000 65535 f 
0000000010 00000 n 
0000000060 00000 n 
0000000118 00000 n 
0000000249 00000 n 
trailer
<< /Size 5 /Root 1 0 R >>
startxref
317
%%EOF
The last topic is drawing text in the Base 14 fonts. Embedding font files is not exactly trivial, but these 14 fonts are built into every PDF reader, so nothing needs to be embedded:
Times-Roman, Times-Bold, Times-Italic, Times-BoldItalic
Helvetica, Helvetica-Bold, Helvetica-Oblique, Helvetica-BoldOblique
Courier, Courier-Bold, Courier-Oblique, Courier-BoldOblique
Symbol, ZapfDingbats
We need a few operators for text drawing:
Tf sets font and size,
Td moves the "pen",
Tj draws a line of text, and
they are wrapped between BT and ET.
For example,
BT /F 10 Tf 150 450 Td (The quick fox jumps over the lazy dog.) Tj ET
puts the sentence near the page's center.
The actual font used is bound to the name /F with a font resource,
which, for the Base 14 fonts, simply names the font.
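When the text comes from a program rather than being typed by hand, parentheses and backslashes inside the string should be escaped so they cannot end the string early. A minimal sketch, assuming ASCII text and integer coordinates (`pdf_string` and `text_stream` are hypothetical helpers):

```python
def pdf_string(text):
    """Encode text as a PDF literal string, escaping \\, ( and )."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )
    return b"(" + escaped.encode("ascii") + b")"


def text_stream(font, size, x, y, text):
    """Content stream that shows one string at (x, y) in the given font."""
    return b"BT /%s %d Tf %d %d Td %s Tj ET" % (
        font.encode("ascii"), size, x, y, pdf_string(text)
    )


stream = text_stream("F", 10, 150, 450,
                     "The quick fox jumps over the lazy dog.")
print(len(stream), stream.decode("ascii"))
```

The length printed here is the value that goes into the stream's /Length entry.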
With an additional /ProcSet resource called /Text, the complete file is:
%PDF-1.0

1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj

2 0 obj
<< /Type /Pages /Count 1 /Kids [ 3 0 R ] >>
endobj

3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595.276 841.89]
   /Resources << /ProcSet [ /PDF /Text ] /Font << /F 4 0 R >> >>
   /Contents 5 0 R >>
endobj

4 0 obj
<< /Type /Font /Subtype /Type1 /Name /F /BaseFont /Courier >>
endobj

5 0 obj
<< /Length 69 >>
stream
BT /F 10 Tf 150 450 Td (The quick fox jumps over the lazy dog.) Tj ET
endstream
endobj

xref
0 6
0000000000 65535 f 
0000000010 00000 n 
0000000060 00000 n 
0000000120 00000 n 
0000000283 00000 n 
0000000361 00000 n 
trailer
<< /Size 6 /Root 1 0 R >>
startxref
481
%%EOF
Note that the /Font object can only
be bound to the name given by its own /Name entry, here /F.
This restriction and the whole /ProcSet business
are presumably for PostScript compatibility.
[1] The xref table is not just an optimization:
without it, a stream whose length is given as a reference to an object
that appears after the stream cannot be safely parsed,
since endstream may appear in the stream.
[2] Essentially,
new and modified objects are appended to the original file,
followed by a new xref table and a new trailer,
whose /Prev entry contains the byte offset of the previous xref table.
[3] Unless the file uses CRLF line endings: a CRLF is one byte longer than an LF, so the row then has no trailing space and is still 20 bytes long.
[4] In general, /Pages objects
may reference other /Pages objects to form a tree,
and /Count is the total number of pages in a subtree.
The /Parent reference in each non-root node
enables random page access with only one object in memory at a time,
which was probably more useful in the 1990s
(when PDF was first developed) than it is now.
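As an illustration of the /Count rule, the sketch below models a page tree as nested Python dictionaries (a made-up in-memory representation, not a PDF parser) and counts the /Page leaves under a node:

```python
def count_pages(node):
    """Number of /Page leaves in the subtree rooted at node."""
    if node["Type"] == "Page":
        return 1
    # A /Pages node: add up the counts of all its kids.
    return sum(count_pages(kid) for kid in node["Kids"])


page = {"Type": "Page"}
subtree = {"Type": "Pages", "Kids": [page, page]}
root = {"Type": "Pages", "Kids": [subtree, page]}

# The root has 2 kids but its /Count would be 3.
print(len(root["Kids"]), count_pages(root))
```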
[5] The line ending that follows stream
and the one that precedes endstream are not part of the stream.
The PDF 1.1 spec requires that the former be either LF or CRLF, never a lone CR,
because otherwise a CRLF would be ambiguous:
the LF may or may not be part of the stream. Go figure.
Another quirk of PDF is that lines are limited to 255 bytes.
Also, the spec does not define a formal syntax,
so, for example, it is unclear exactly where newlines are required.