Plain ZIP: Notes

2026-03-11 / 2026-03-30

I posted a pseudo-specification of a "Plain ZIP" file format in February 2025, but then removed it for its use of MS-DOS datetimes.[1] I started thinking about it again recently and tried writing a command-line utility pz for it, but got stuck before getting anywhere, partly because I made the oddball choice to go libc-free and issue Linux system calls directly. Even though writing a blog about doing the thing is not doing the thing, I'll blog about it and see if that helps.

Motivation. I want a sane bundle file format, that is, tarballs but better. The usual choices for bundles are TAR and ZIP, but I find neither satisfying. TAR has multiple distinct extensions to work around limitations of the original format, stores numbers as zero-filled octal numbers in ASCII, and requires some effort to create reproducible archives. On the other hand, ZIP suffers from ambiguities and inconsistencies in path encoding, timezone, and program handling.[2] The ZIP APPNOTE is also disorganized and unclear. Since I cannot force everyone with one of my bundles to learn a new format, I decided to define a restricted ZIP format to (hopefully) ensure consistent behavior. I call this "new" format Plain ZIP.

Bundle, not archive. Both TAR and ZIP are archive formats and, as such, support user IDs, permissions, symbolic links, etc., so users can archive their stuff. Bundles are different: I don't care who owns a file in a source code tarball, for example. Therefore, Plain ZIP represents sensible mappings from path to data, and nothing more. Specifically, Plain ZIP only stores regular files, and does not store last modification times, ownership, or permissions.

Reproducibility / canonicality. Reproducible builds are good, and for that, we only need to sort members by path and give up compression. Compressing individual files is less efficient than compressing the whole bundle, and support for good compressors like LZMA and zstd in ZIP is fairly recent (read: not in Info-ZIP unzip, which is from 2009) anyway. No compression also means the format doesn't need to be updated to use future compressors.
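To make the recipe concrete, here is a minimal Python sketch of a writer that sorts members by path and stores them uncompressed. It is not the pz utility: the name write_plain_zip and the dict-of-bytes input are made up for illustration, bytewise sorting is an assumption (these notes only say "sort by path"), and the zeroed time and version fields follow the choices described later in these notes. It does no filename validation and no size-limit checks.

```python
import struct
import zlib


def write_plain_zip(members: dict[bytes, bytes]) -> bytes:
    """Serialize {path: data} as a stored-only ZIP, members sorted by path."""
    out = bytearray()
    central = bytearray()
    for path in sorted(members):  # bytewise order (assumed)
        data = members[path]
        crc = zlib.crc32(data)
        offset = len(out)
        # Local file header: signature, version needed, flags, method
        # (0 = stored), mod time, mod date, CRC-32, compressed size,
        # uncompressed size, name length, extra field length.
        out += struct.pack("<IHHHHHIIIHH", 0x04034B50, 0, 0, 0, 0, 0,
                           crc, len(data), len(data), len(path), 0)
        out += path + data
        # Central directory header: signature, version made by, version
        # needed, flags, method, mod time, mod date, CRC-32, sizes, name
        # length, extra length, comment length, disk number, internal
        # attributes, external attributes, local header offset.
        central += struct.pack("<IHHHHHHIIIHHHHHII", 0x02014B50,
                               0, 0, 0, 0, 0, 0,
                               crc, len(data), len(data), len(path),
                               0, 0, 0, 0, 0, offset)
        central += path
    cd_offset = len(out)
    out += central
    # End-of-central-directory record with an empty comment.
    out += struct.pack("<IHHHHIIH", 0x06054B50, 0, 0,
                       len(members), len(members),
                       len(central), cd_offset, 0)
    return bytes(out)
```

Because the output depends only on the path-to-data mapping, two calls with the same members in a different insertion order should produce byte-identical bundles.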

Alternative formats. There aren't a lot of choices, as far as I know. ar (the static library thing) is not standardized and has multiple variants. shar is weird. ARC is textual and I only know of its use in WARC. SquashFS and EROFS are full-fledged filesystems, still under development, and EROFS sounds NSFW. QOP is sane but supported nowhere. Game engines bundle resources with custom formats like WAD (Doom) and XP3 (KiriKiri), which are also supported nowhere.

No ZIP64 extensions. Since I want the format mostly for stuff like source code tarballs, blog archives, and resource bundles, I'm okay with ZIP's limit of 65535 files and ~4 GB. For reference, the Linux kernel is currently ~93k files and ~1.5 GB, gcc is ~158k files and ~0.8 GB, and llvm-project is ~172k files and ~2.0 GB, so all three fit under the size limit but exceed the file-count limit. I hope this limit won't cause too much trouble for people who want such large bundles.

No directories. Since Info-ZIP unzip creates parent directories automatically (e.g. extracting a/x creates a if necessary), storing directories breaks canonicality, in the sense that extracting a bundle that contains a and a/x is equivalent to extracting a bundle that only contains a/x. An unfortunate result of this decision is that empty directories cannot be created by extracting a Plain ZIP bundle. I considered the design option of only storing empty directories, but it would complicate bundle creation by necessitating some kind of "store this directory but don't recurse into it" option (which is the default behavior of Info-ZIP zip without -r), and it would still break canonicality in some weaker sense when extracting into a non-empty directory (there's no sane "no-clobber" or "overwrite" semantics for directories).

Filename restrictions. By default, Plain ZIP only allows ASCII encoded filenames in the POSIX portable filename character set. Users may opt to allow any byte sequence that doesn't contain the ASCII encoded NUL and slash, for practical reasons. In either case, filenames must not be empty, ., or .., but may begin with a dash (a.k.a. hyphen-minus). If non-portable filenames are allowed, consumers must properly handle filenames that are not valid UTF-8 strings.
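As a rough illustration of these rules, here is a minimal Python sketch of a validator. The names is_valid_filename and is_valid_path and the portable_only switch are hypothetical, and treating a path as slash-separated filenames with no empty components is my reading of the rules rather than something spelled out above. The length checks are left out.

```python
import re

# POSIX portable filename character set: A-Z, a-z, 0-9, period,
# underscore, and hyphen-minus.
PORTABLE = re.compile(rb"[A-Za-z0-9._-]+")


def is_valid_filename(name: bytes, portable_only: bool = True) -> bool:
    """Check one path component against the filename rules."""
    if name in (b"", b".", b".."):
        return False
    if portable_only:
        return PORTABLE.fullmatch(name) is not None
    # Opt-in mode: any bytes except NUL and slash, valid UTF-8 or not.
    return b"\0" not in name and b"/" not in name


def is_valid_path(path: bytes, portable_only: bool = True) -> bool:
    """Assumed path rule: every slash-separated component is a valid filename."""
    return all(is_valid_filename(part, portable_only)
               for part in path.split(b"/"))
```

Under this sketch, a leading dash such as -rf is accepted, while a//x, /a, and a/../x are all rejected because they contain an empty or dot-dot component.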

Filename and path length limits. Filename and path lengths are limited to 65535 bytes because ZIP stores them in 16-bit unsigned integers. It is usually expected that adding a prefix or suffix to a valid path results in a valid path, so I decided not to impose any further limit. Note that paths longer than 4095 bytes are truncated to 4095 bytes by Info-ZIP unzip while emitting a warning, and considered invalid by busybox unzip. Also note that 65535-byte filenames would not be accessible on Linux via getdents64, since the 16-bit unsigned d_reclen field includes the length of a 19-byte header and a NUL terminator, but I digress.

Deviations from ZIP. To simplify the format, every last modification time field is set to zero, which is technically not a valid MS-DOS datetime (the MS-DOS epoch, 1980-01-01 00:00:00, would be 0x00210000). Similarly, every "version needed to extract" field is set to zero, which is technically not defined in the APPNOTE. These do not seem to cause problems in practice, since ZIP utilities don't really validate anyway.
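The arithmetic behind 0x00210000 can be checked with a small Python sketch of the bitfield layout described in the first footnote. The function names are made up for illustration, and the combined 32-bit value is taken to have the date in its upper half and the time in its lower half.

```python
def pack_dos_datetime(year: int, month: int, day: int,
                      hour: int = 0, minute: int = 0, second: int = 0) -> int:
    """Pack a datetime into the 32-bit MS-DOS layout (two-second precision)."""
    return ((year - 1980) << 25 | month << 21 | day << 16
            | hour << 11 | minute << 5 | second // 2)


def unpack_dos_datetime(value: int) -> tuple[int, int, int, int, int, int]:
    """Split a 32-bit MS-DOS datetime into (year, month, day, hour, minute, second)."""
    return (1980 + (value >> 25), (value >> 21) & 0xF, (value >> 16) & 0x1F,
            (value >> 11) & 0x1F, (value >> 5) & 0x3F, (value & 0x1F) * 2)


print(hex(pack_dos_datetime(1980, 1, 1)))  # 0x210000
print(unpack_dos_datetime(0))              # (1980, 0, 0, 0, 0, 0)
```

An all-zero field decodes to month 0 and day 0, which is why it is not a valid datetime even though the year lands on 1980.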

Compatibility with unzip. Also for simplicity, every "version made by" field is set to zero. Since its upper byte identifies the host system and zero corresponds to "MS-DOS and OS/2", this creates some compatibility problems with Info-ZIP unzip on Unix-like systems:

unzip converts backslashes in paths to forward slashes when extracting, because some buggy zippers use backslashes. There seems to be no workaround.
unzip assumes that paths are encoded in the OEM charset. Use -Outf8 if necessary.
unzip converts paths to lowercase if -L is used or if unzip is older than 5.11. Since unzip 6.00 is from 2009, old versions are probably pretty rare. Don't use -L.
unzip converts CRLF in file data to LF if -aa is used. Don't use -aa.

From this investigation it appears that VFAT (host system 14) or Unix (host system 3) may be a better choice. The format is not yet finalized, and I may change my mind about this matter.
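For reference, the "version made by" field splits into two bytes, which a short Python sketch can show. The helper name and the table are illustrative, and the table only lists the three host systems mentioned in these notes. The lower byte is, as far as I understand the APPNOTE, the ZIP specification version times ten.

```python
# Host system codes mentioned in these notes (not the full APPNOTE list).
HOST_SYSTEMS = {0: "MS-DOS and OS/2", 3: "Unix", 14: "VFAT"}


def split_version_made_by(value: int) -> tuple[int, int]:
    """Return (host system, spec version) from a 16-bit 'version made by' field."""
    return value >> 8, value & 0xFF


for field in (0x0000, 0x0314, 0x0E14):
    host, spec = split_version_made_by(field)
    print(f"{field:#06x}: host={HOST_SYSTEMS[host]}, spec version={spec}")
```

Switching to Unix or VFAT would therefore only change the upper byte; the lower byte could stay zero if the other version fields stay zero too.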

Footnotes

[1] MS-DOS datetimes are little-endian 32-bit bitfields with, from the most significant to the least, 7 bits for year since 1980, 4 bits for month, 5 bits for day, 5 bits for hour, 6 bits for minute, and 5 bits for second, in some local timezone. As such, they only last until 2107 and only have two-second precision. Moreover, some software may not handle years after 2099 properly. [citation needed]
[2] Since the end-of-central-directory record at the end of the file ends with a variable-length comment field, there can be multiple valid end-of-central-directory records. Also, unzip -l uses the central directory and doesn't check the local file headers, while unzip -t and unzip -x use local file headers and don't check the central directory.
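The end-of-central-directory ambiguity can be demonstrated with a minimal Python sketch. The function name is hypothetical, and "valid" here only means that the signature matches and the comment length runs exactly to the end of the file; a real reader would check more than that.

```python
import struct

EOCD_SIG = b"PK\x05\x06"
EOCD_SIZE = 22  # fixed part of the record, before the comment


def find_eocd_candidates(blob: bytes) -> list[int]:
    """Offsets of every record whose comment ends exactly at end of file."""
    hits = []
    pos = blob.find(EOCD_SIG)
    while pos != -1:
        if pos + EOCD_SIZE <= len(blob):
            (comment_len,) = struct.unpack_from("<H", blob, pos + 20)
            if pos + EOCD_SIZE + comment_len == len(blob):
                hits.append(pos)
        pos = blob.find(EOCD_SIG, pos + 1)
    return hits


# An empty archive whose comment is itself an end-of-central-directory record.
inner = EOCD_SIG + bytes(16) + struct.pack("<H", 0)
outer = EOCD_SIG + bytes(16) + struct.pack("<H", len(inner)) + inner
print(find_eocd_candidates(outer))  # [0, 22]
```

Both offsets pass the check, so a reader that scans backward from the end and one that scans forward from the start can disagree about which record is the real one.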