Skip to content

Zstandard as a patching engine

Yann Collet edited this page Dec 17, 2020 · 3 revisions

Zstandard CLI is introducing a new command line option --patch-from, which leverages existing compressors, dictionaries and long range match finder to deliver a high speed engine for producing and applying patches to files.

--patch-from is based on dictionary compression. It will consider a previous version of a file as a dictionary, to better compress a new version of same file. This operation preserves fast zstd speeds at lower compression levels. To this ends, it also increases the previous maximum limit for dictionaries from 32 MB to 2 GB, and automatically uses the long range match finder when needed (though it can also be manually overruled). --patch-from can also be combined with multi-threading mode at a very minimal compression ratio loss.

Example usage:

# create the patch
zstd --patch-from=<oldfile> <newfile> -o <patchfile>

# apply the patch
zstd -d --patch-from=<oldfile> <patchfile> -o <newfile>

Benchmarks: We compared zstd to bsdiff, a popular industry grade diff engine. Our test corpus were tarballs of different versions of source code from popular repositories. Specifically:

repos = {
    # ~31mb (small file)
    "zstd": {"url": "https://github.com/facebook/zstd", "dict-branch": "refs/tags/v1.4.2", "src-branch": "refs/tags/v1.4.3"},
    # ~273mb (medium file)
    "wordpress": {"url": "https://github.com/WordPress/WordPress", "dict-branch": "refs/tags/5.3.1", "src-branch": "refs/tags/5.3.2"},
    # ~1.66gb (large file)
    "llvm": {"url": "https://github.com/llvm/llvm-project", "dict-branch": "refs/tags/llvmorg-9.0.0", "src-branch": "refs/tags/llvmorg-9.0.1"},
    # ~1gb (large file)
    "linux kernel tarball": {"url": "https://www.kernel.org/", "dict-branch": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/?h=v5.9-rc7", "src-branch": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/?h=v5.8"},
    "linux kernel tarball (larger patch)": {"url": "https://www.kernel.org/", "dict-branch": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/?h=v5.9-rc7", "src-branch": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/?h=v4.4.238"},

}

Starting with improvements introduced in zstd 1.4.7, --patch-from on level 19 is comparable with bsdiff

In terms of patch size, level 19 --patch-from and bsdiff are comparable, though --patch-from wins out particularly on larger patches.

In terms of speed, level 19 --patch-from greatly outperforms bsdiff.

In terms of memory footprint, level 19 --patch-from also significantly outperforms bsdiff.

The remainder of the benchmarks (primarily focusing on faster compression levels) were conducted on zstd 1.4.6

--patch-from at level 1 and 3 is significantly faster (>200X faster on level 1 and >100X faster on level 3) vs bsdiff

speed-bsdiff-vs-zstd-19-1

And of course, there is no change to the fast zstd decompression speed.

PatchFrom vs. SmartVersion vs. Xdelta (for wiki)

After releasing --patch-from, we were made aware of two other popular diff engines by the community: SmartVersion and Xdelta. SmartVersion is a file versioning tool that produces patches with the help of Zstd compression. Xdelta is a binary diff engine that is optimized for patch creation speed.

In our new benchmarks comparing the three, we looked at three metrics: patch size, patch creation speed, and patch extraction speed. And we used two main data sets: binary data (eg: clang) and text data (eg: clang source code). Measurements were made using the default parameters of each program and the parameters that optimize for patch creation speed. Note zstd-default is the same as Zstd on level 3.

The full benchmark with reproduction details can be found here and the highlights are visualized below. Our general takeaway is this: all three tools are excellent diff engines with clear advantages (especially in speed) over the popular bsdiff. Patch sizes for both binary and text data produced by all three are pretty comparable with Xdelta underperforming Zstd and SmartVersion only slightly [1]. For patch creation speed, Xdelta is the clear winner for text data and Zstd is the clear winner for binary data [2]. And for Patch Extraction Speed (ie. decompression), Zstd is fastest in all scenarios [3].

[1] Patch Size _ Default Parameters

[2] Patch Creation Speed (MB_s) _ Default Parameters

[3] Patch Extraction Speed (MB_s) _ Default Parameters


Improvements of --patch-from compression ratio from v1.4.5 to v1.4.7 :

long_v145_v147