Downloads of multiple files (e.g., PDFs) are usually offered to end users as a single ZIP-archive. After selecting the files, the user may have to wait for the archive to be created. Below, we present a way to assemble the archive during the download, without waiting. File names and the directory structure of the ZIP-archive can still be changed on each download.

What is a ZIP-archive?

A ZIP-archive is a concatenation of files with metadata, optionally compressed or encrypted, and an index of all files at the end. A per-file error detection code (CRC32) of the uncompressed data is included as well.

A ZIP-archive consists of:

  • Files (payload) together with their metadata
  • An index of the files (only the metadata) and their positions in the archive

Metadata for each file consists of:

  • Information for the end user: filename, modification date, …
  • Technical metadata: error detection code (CRC32), required ZIP version, …
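The layout described above can be inspected with Python's standard `zipfile` module. This is only an illustrative sketch: the file names and contents are sample values, and searching for the index by its signature is a shortcut that works here because the payloads do not contain those bytes.

```python
import io
import zipfile
import zlib

# Sample payloads (assumed values for illustration)
payloads = {"test/test": b"contents_1", "test/test2": b"contents_2"}

# Build a small, uncompressed archive in memory
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
    for name, data in payloads.items():
        archive.writestr(name, data)
raw = buffer.getvalue()

# Read the index back: one entry per file with its metadata and position
with zipfile.ZipFile(io.BytesIO(raw)) as archive:
    entries = archive.infolist()

for info in entries:
    print(
        info.filename,            # information for the end user
        info.date_time,           # modification date
        hex(info.CRC),            # error detection code of the uncompressed data
        info.extract_version,     # required ZIP version (times ten)
        info.header_offset,       # position of the file's metadata in the archive
        info.CRC == zlib.crc32(payloads[info.filename]),
    )

# The index starts after the last file; each index entry begins with "PK\x01\x02"
index_start = raw.find(b"PK\x01\x02")
print("index starts at byte", index_start, "of", len(raw))
```

Each file's own metadata block begins at `header_offset` with the signature `PK\x03\x04`, and the index follows the last payload.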

For details, see the ZIP specification (PKWARE's APPNOTE.TXT).

The CRC32 of each file can be pre-calculated together with the other metadata and persisted alongside the files intended for download. The ZIP-archive can then be assembled on demand with negligible computational effort: the files are concatenated as they are streamed out, and the index is appended at the end.
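The idea can be sketched in a few lines of Python. This is not the code of the T-SQL POC. It is a minimal illustration under stated assumptions: files are stored uncompressed and unencrypted, ZIP64 is not used (so sizes stay below 4 GiB and there are fewer than 65,535 entries), and the names `StoredFile` and `stream_zip` are hypothetical.

```python
import io
import struct
import zipfile
import zlib
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple


class StoredFile(NamedTuple):
    """Hypothetical record: a file plus its pre-calculated metadata."""
    name: str
    modified: datetime
    crc32: int                # persisted next to the file, not computed on download
    size: int                 # content length in bytes
    chunks: Iterable[bytes]   # content, read lazily from storage


def stream_zip(files: Iterable[StoredFile]) -> Iterator[bytes]:
    """Yield the archive piece by piece: header, payload, ..., index."""
    index = []
    offset = 0
    for entry in files:
        name = entry.name.encode("utf-8")
        moment = entry.modified
        dos_time = (moment.hour << 11) | (moment.minute << 5) | (moment.second // 2)
        dos_date = ((moment.year - 1980) << 9) | (moment.month << 5) | moment.day
        # Fields shared by the per-file header and the index entry:
        # version needed, flags (UTF-8 names), method 0 = stored, time, date,
        # CRC32, compressed size, uncompressed size, name length, extra length
        common = struct.pack("<HHHHHIIIHH", 20, 0x0800, 0, dos_time, dos_date,
                             entry.crc32, entry.size, entry.size, len(name), 0)
        yield b"PK\x03\x04" + common + name
        yield from entry.chunks
        # Index entry: version made by, shared fields, comment length, disk,
        # internal and external attributes, position of the per-file header
        index.append(b"PK\x01\x02" + struct.pack("<H", 20) + common
                     + struct.pack("<HHHII", 0, 0, 0, 0, offset) + name)
        offset += 30 + len(name) + entry.size
    directory = b"".join(index)
    yield directory
    # End record: disk numbers, entry counts, index size, index position, comment length
    yield b"PK\x05\x06" + struct.pack("<HHHHIIH", 0, 0, len(index), len(index),
                                      len(directory), offset, 0)


# Usage: the CRC32 is computed here only because this demo has no stored metadata
data = b"contents_1"
files = [StoredFile("test/test", datetime(2021, 1, 1), zlib.crc32(data), len(data), [data])]
archive_bytes = b"".join(stream_zip(files))

# Round trip through the standard library to check the result
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    first_bad_file = archive.testzip()    # None when all CRC checks pass
    restored = archive.read("test/test")
```

Because every header field is known before the first payload byte is sent, the generator never has to buffer a file or look ahead. In a web application, its chunks could be written directly to the response.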

Compression is often not required

In ZIP-archives, compression is applied per included file, not to the archive as a whole. Similarities between files (e.g., when copies of the same file are added multiple times) are not used for compression.
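A small experiment with Python's standard library illustrates this. It is a sketch with assumed sample data: the same random payload is added twice to a DEFLATE-compressed archive and, for contrast, both copies are compressed as one stream.

```python
import io
import os
import zipfile
import zlib

# Random bytes: incompressible on their own, but the second copy is pure redundancy
payload = os.urandom(10_000)

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("copy_1.bin", payload)
    archive.writestr("copy_2.bin", payload)
archive_size = len(buffer.getvalue())

# Compressing both copies as a single stream lets the compressor reuse the first copy
joined_size = len(zlib.compress(payload + payload))

print("ZIP with two copies:", archive_size, "bytes")
print("one compressed stream:", joined_size, "bytes")
```

The archive is about as large as both copies together, while the single stream is only slightly larger than one copy.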

Data in common formats such as JPEG for images, or MP3 and OGG for audio, is usually already compressed. Compressing PDFs without media can have a small effect. However, when a PDF with only a few pages is large enough to make compression seem worthwhile, the cause is usually embedded media, which ZIP will not shrink much.
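The effect can be reproduced with `zlib`, which implements the DEFLATE method commonly used in ZIP-archives. In this sketch, the output of a first compression pass stands in for an already compressed format. That is an assumption for illustration, not a measurement on real JPEG or MP3 files.

```python
import random
import zlib

# Pseudo-text built from a small vocabulary (seeded, so the run is repeatable)
words = ["zip", "archive", "download", "file", "index", "metadata",
         "stream", "header", "crc32", "payload", "offset", "storage"]
text = " ".join(random.Random(42).choices(words, k=10_000)).encode("ascii")

first_pass = zlib.compress(text)          # plain text shrinks considerably
second_pass = zlib.compress(first_pass)   # already compressed data does not

print("original:", len(text), "bytes")
print("compressed once:", len(first_pass), "bytes")
print("compressed twice:", len(second_pass), "bytes")
```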

A minimal POC in T-SQL

We’ve implemented a proof of concept in T-SQL, targeted at scenarios where files are stored in a database. The approach can easily be applied to other programming languages and frameworks, and to files kept in object storage as well.

Note that this is neither a recommendation to store unstructured data in a database nor advice against it.

The source code can be found on our GitHub page: https://github.com/ddunicorn/zip.schema

The code below shows how the POC can be used to create a ZIP-archive:

-- Two sample payloads standing in for stored files
declare @contents_1 varbinary(max) = convert(varbinary(max), 'contents_1');
declare @contents_2 varbinary(max) = convert(varbinary(max), 'contents_2');

-- File list with pre-calculated metadata (CRC32 and length in bytes).
-- DATALENGTH counts bytes; LEN is meant for character data.
declare @zip_input ZIP.TYPE_FILE_LIST;
insert @zip_input (file_name, content, file_date, crc32, content_length) values
    ('test/test', @contents_1, '2021-01-01', zip.f_crc32(@contents_1), DATALENGTH(@contents_1)),
    ('test/test2', @contents_2, GETDATE(), zip.f_crc32(@contents_2), DATALENGTH(@contents_2));

-- Demo output: emit a shell command that decodes the archive into sql.zip
declare @zip_file varbinary(max) = (select ZIP.F_ZIP(@zip_input));
select ' echo "'
    + CAST('' as xml).value('xs:base64Binary(sql:variable("@zip_file"))', 'varchar(max)')
    + '" | base64 -d > sql.zip';