
Intel® QuickAssist Technology Zstandard Plugin, an External Sequence Producer for Zstandard


Posted on behalf of:

Author: Brian Will

Contributors: David Qian, Abhishek Khade, Joel Schuetze

Introduction

Zstandard (zstd) is one of the most popular lossless compression algorithms/formats in use today, thanks to its exceptional compression and decompression speed combined with impressive compression ratios. It is a very flexible format that adapts to many types of data and applications. At a high level, the algorithm is a two-stage process. The first stage finds matches, or repetition, in the data and identifies regions that can be replaced with a more condensed representation, in the style of a dictionary coder (e.g., LZ77). The output of this stage is a set of sequences, each of which specifies an offset to a match, a match length, and potentially a literal length. The second stage encodes these sequences using Finite State Entropy (FSE) encoding or Huffman encoding.

The topic of this blog is a new feature added in zstd v1.5.4, which allows an external implementation of a sequence producer to be injected into the zstd pipeline. This enables the use of Intel® QuickAssist Technology (Intel® QAT), which can deliver up to 3.2x better throughput, a 3.8x reduction in P99 latency, and 3.3x better performance per watt compared to zstd software compression. With these improvements, Intel QAT is expected to open breakthrough new use cases, making compression practical for workloads where it was not previously feasible.

Intel QAT will be an external sequence producer for zstd, improving performance while exposing the functionality to applications through the familiar zstd interface.

External Sequence Producers

An external sequence producer searches an input buffer for repeated byte sequences within that input. These matches are represented as a list of `ZSTD_Sequence` structures, each capturing:

  • the distance (offset) to the matching sequence
  • the match length
  • the literal length (a sequence with no match carries literals only)
  • rep code information
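
For reference, the `ZSTD_Sequence` structure as defined in zstd's public header looks roughly as follows (comments paraphrased; consult zstd.h for the authoritative definition):

typedef struct {
    unsigned int offset;      /* Distance back to the match; offset == 0 together with
                               * matchLength == 0 marks a final, literals-only sequence. */
    unsigned int litLength;   /* Number of literal bytes preceding the match. */
    unsigned int matchLength; /* Length of the match. */
    unsigned int rep;         /* Repeat-offset (rep code) information for the match. */
} ZSTD_Sequence;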

We'll also cover details of the interfaces to support external sequence producers, along with an implementation for Intel QAT.

Zstandard sequence producer registration function

In zstd v1.5.4, the block-level sequence producer interface was introduced; this allowed an external ‘plugin’ to be invoked per block of data and respond with a set of sequences (literals and matches) for that data. The interface contains a registration function:

ZSTDLIB_STATIC_API void
ZSTD_registerSequenceProducer(
    ZSTD_CCtx* cctx,
    void* sequenceProducerState,
    ZSTD_sequenceProducer_F* sequenceProducer);

The registered function produces sequences for zstd and has the following signature:

#define ZSTD_SEQUENCE_PRODUCER_ERROR ((size_t)(-1))

typedef size_t ZSTD_sequenceProducer_F (
    void* sequenceProducerState,
    ZSTD_Sequence* outSeqs, size_t outSeqsCapacity,
    const void* src, size_t srcSize,
    const void* dict, size_t dictSize,
    int compressionLevel,
    size_t windowSize);

This allows an external implementation to register with a ZSTD_CCtx so that it is invoked for each block to be compressed, along with an opaque state pointer. The state gives the sequence producer a place to maintain any transactional information for this instance.
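
To make the contract concrete, here is a toy producer written purely for illustration: it finds no matches and instead emits a single literals-only sequence covering the whole block. It assumes the explicit-block-delimiter convention in which a final sequence with offset == 0 and matchLength == 0 carries the remaining literals; a real producer such as Intel QAT's would instead search `src` for matches.

#define ZSTD_STATIC_LINKING_ONLY   /* the sequence producer API is part of zstd's experimental API */
#include <zstd.h>

/* Toy sequence producer: emits one literals-only sequence for the block. */
size_t noopSequenceProducer(
    void* sequenceProducerState,
    ZSTD_Sequence* outSeqs, size_t outSeqsCapacity,
    const void* src, size_t srcSize,
    const void* dict, size_t dictSize,
    int compressionLevel,
    size_t windowSize)
{
    (void)sequenceProducerState; (void)src; (void)dict; (void)dictSize;
    (void)compressionLevel; (void)windowSize;

    if (outSeqsCapacity < 1) return ZSTD_SEQUENCE_PRODUCER_ERROR;

    /* Final (delimiter) sequence: no match, just srcSize literals. */
    outSeqs[0].offset      = 0;
    outSeqs[0].litLength   = (unsigned)srcSize;
    outSeqs[0].matchLength = 0;
    outSeqs[0].rep         = 0;
    return 1;   /* number of sequences written to outSeqs */
}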

Because registration happens on the compression context, the sequence producer operates behind the standard zstd interfaces, allowing applications to continue using zstd in the same way, through calls such as `ZSTD_compress2()` or `ZSTD_compressStream2()`. While the feature works with all zstd APIs that respect advanced parameters, there are some limitations.
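
A minimal sketch of that flow, assuming the toy producer above (any conforming producer, such as `qatSequenceProducer`, could be registered instead) and zstd's experimental `ZSTD_c_enableSeqProducerFallback` parameter to fall back to the internal match finder if the producer reports an error:

#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>

size_t compressWithProducer(void* dst, size_t dstCapacity,
                            const void* src, size_t srcSize)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    if (cctx == NULL) return (size_t)-1;   /* treated as an error code */

    /* Register the external producer; state is NULL because the toy producer keeps none. */
    ZSTD_registerSequenceProducer(cctx, NULL, noopSequenceProducer);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableSeqProducerFallback, 1);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 9);

    /* The application still calls the familiar zstd interface. */
    size_t const ret = ZSTD_compress2(cctx, dst, dstCapacity, src, srcSize);
    ZSTD_freeCCtx(cctx);
    return ret;   /* check with ZSTD_isError() */
}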

QAT Zstandard Plugin sequence producer API

The QAT-ZSTD-Plugin accelerates sequence production with Intel QAT behind the zstd APIs. The producer function to register is `qatSequenceProducer`:

size_t qatSequenceProducer(
    void *sequenceProducerState,
    ZSTD_Sequence *outSeqs, size_t outSeqsCapacity,
    const void *src, size_t srcSize,
    const void *dict, size_t dictSize,
    int compressionLevel,
    size_t windowSize);

Intel QAT takes the input data stream and searches for sequences in hardware, returning a set of output sequences. These are then processed by the zstd library, which encodes them and constructs zstd-formatted data. Intel QAT increases throughput and decreases latency for a set of zstd's compression levels. The plugin applies only to compression operations; it does not support decompression, and not all features of the plugin API are currently supported.

In the context of Intel QAT, the state variable passed through this API stores details of the device used for acceleration, along with its configuration and capabilities. As such, the state is managed outside of zstd by the producer plugin, using the functions `QZSTD_createSeqProdState` and `QZSTD_freeSeqProdState`.

The Intel QAT device must be started with `QZSTD_startQatDevice` before the sequence producer is registered, and stopped with `QZSTD_stopQatDevice` once compression is complete. This flow is captured on the plugin repo page, Integration of Intel QAT sequence producer into an Application.

These are the only changes required to integrate the QAT-ZSTD-Plugin into your application; future iterations are expected to make the integration more transparent.
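
A minimal sketch of that calling sequence is below. The lifecycle function names come from the plugin as described above, but the header name (`qatseqprod.h`) and error-handling details are assumptions, so consult the QAT-ZSTD-Plugin repository for the authoritative example.

#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>
#include "qatseqprod.h"   /* assumed header name for the QAT-ZSTD-Plugin */

size_t qatCompress(void* dst, size_t dstCapacity, const void* src, size_t srcSize)
{
    /* 1. Start the Intel QAT device before registering the producer
     *    (status checking omitted for brevity). */
    QZSTD_startQatDevice();

    /* 2. Create the producer state that holds device details and configuration. */
    void* const seqProdState = QZSTD_createSeqProdState();

    /* 3. Register the QAT sequence producer with the compression context. */
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    ZSTD_registerSequenceProducer(cctx, seqProdState, qatSequenceProducer);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 9);

    /* 4. Compress through the familiar zstd interface. */
    size_t const ret = ZSTD_compress2(cctx, dst, dstCapacity, src, srcSize);

    /* 5. Tear down in reverse order once compression is complete. */
    ZSTD_freeCCtx(cctx);
    QZSTD_freeSeqProdState(seqProdState);
    QZSTD_stopQatDevice();
    return ret;   /* check with ZSTD_isError() */
}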

zstd-ai-blog-fig01.png

 Functional calling sequence to integrate Intel® QAT-Zstd Plugin into your application.

Performance results

To compare the performance of the software implementation against acceleration with Intel QAT, a benchmark utility was developed that submits requests through the `ZSTD_compress2` interface. The utility, the QAT-ZSTD plugin benchmark, allows setting the number of threads of execution, the block size used to compress the input file, the compression level, and several other configuration parameters.

For these measurements, the command line used was:

./benchmark -m${mode} -l1 -t${threads} -c${blocksize} -L${compression_level} -E2 <input_file>
  • mode: selects whether operations run purely in software ("0") or use Intel® QAT for acceleration ("1").
  • threads: the number of pthreads used to submit compression requests simultaneously.
  • blocksize: the input file is chunked into blocks of this size and submitted to the zstd API. The size can be given in KB or MB, e.g., 16K is 16,384 bytes.
  • compression_level: the level value passed to the zstd compression API.
  • input_file: for these benchmarks, the Silesia Corpus is the input data set.
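
For example, the invocation below (also listed in the configuration section) compresses the Silesia Corpus in 16KB blocks at compression level 9, using 36 threads with Intel QAT acceleration enabled and the process pinned to specific cores via numactl:

numactl -C 1-18,105-122 ./test/benchmark -l2 -t36 -L9 -c16K -m1 -E2 silesia.concat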

QAT ZSTD Plugin Sequence Producer compared to zstd-1.5.4

The following measurements were taken using a block size of 16KB with Intel QAT HW configured for its best compression ratio.

Intel QAT delivers up to 3.2x higher throughput than zstd compression level 5 and 2.5x higher than zstd level 4. For the Silesia Corpus, the compression ratios are:

  • QAT-ZSTD level 9: 2.76
  • zstd level 4: 2.74
  • zstd level 5: 2.77

Ratio is calculated as the input size divided by the compressed output size.

Intel QAT reaches a peak throughput of 11.15 GB/s. If an application requires even more performance, zstd software can be used to continue scaling, all while the application keeps using the same interface.

zstd-ai-blog-fig02.png

Compression throughput performance (MB/s) for 16KB requests, comparing Zstandard v1.5.5 vs Intel® QAT-Zstd Plugin across various core combinations. 

If your application is latency sensitive, Intel QAT can reduce P99 request latency to roughly one quarter that of zstd in this configuration while maintaining a flat latency profile. QAT-ZSTD delivers up to 3.8x lower P99 latency than zstd level 5 and 3.2x lower than zstd level 4.

zstd-ai-blog-fig03.png

P99 16KB request latency (microseconds) comparing Zstandard v1.5.5 vs. Intel® QAT-Zstd Plugin across various core combinations. 

A detailed system configuration is below.

Power comparison

The throughput and latency gains covered above also come with a power-saving component: Intel QAT acceleration delivers up to 3.3x better performance per watt than zstd level 5 software alone.

zstd-ai-blog-fig04.png

Performance per Watt comparison for QAT ZSTD plugin and Zstandard v1.5.5 software for 16KB requests

Data is collected using the same utility; details are captured in the configuration section below.

We can also view this from the angle of cores saved:

zstd-ai-blog-fig05.png

Core savings when utilizing Intel® QAT-Zstd Plugin vs. Zstandard v1.5.5 for 16KB request sizes.

This represents a core savings of 73% for similar throughput and compression ratios when using Intel QAT acceleration, and it translates into Intel QAT drawing 90W less wall power than Zstandard software. This is a significant savings in platform power while freeing up cores for other application workloads.

Conclusion

The addition of the sequence producer interface to zstd allows for the acceleration of one of the most costly operations in the compression pipeline, namely searching for matching byte strings. Intel QAT provides hardware acceleration of this function, delivering throughput improvements of up to 3.2x over zstd software and a P99 latency reduction of 3.8x at a comparable compression ratio, all while delivering 3.3x better performance per watt.

Tight integration into zstd lets applications access Intel QAT acceleration while programming to the same zstd interfaces, with a future path to fully transparent integration.

Intel QAT provides tangible value for applications in throughput, latency, and power, leading to an overall Total Cost of Ownership benefit for applications that require compression performance. The QAT ZSTD Plugin will continue adding features to improve compression ratio (dictionary support) and performance.

A special thank you to Yann Collet (cyan@meta.com) and Elliot Gorokhovsky (embg@meta.com) for working closely with us to provide this new mode of operation and for adding the necessary interfaces and implementation to the Zstandard mainline.

Want to learn more? Please visit the following:

Configuration

  • Intel® 4xxx (Intel® QuickAssist Technology Gen 4)
  • Intel® Xeon® Platinum 8470N Processor
    • Turbo Disabled
  • Memory configuration:
    • DDR5 4800 MT/s
    • 32 GB * 16 DIMMs
  • QAT Driver:
    • QAT20.L.1.0.40
  • Hyper-Thread enabled
  • OS (Kernel):
    • Ubuntu 22.04.1 LTS
  • BIOS:
    • EGSDCRB1.SYS.9409.P01.2211280753
    • SpeedStep (Pstates) disabled
    • Turbo Mode disabled
  • QAT configuration:
    • ServicesEnabled – dc
    • NumberCyInstances – 0
    • NumberDcInstances – 1-64
    • SVM Disabled & ATS Disabled
  • Test file: Silesia Corpus (silesia.concat)
  • Software: Zstandard v1.5.5, QAT-ZSTD-Plugin v0.0.1
  • Benchmark tool: Included with Intel® QAT ZSTD plugin
    • Disable searchForExternalRepcodes
    • Test command: `./benchmark -m${mode} -l1 -t${threads} -c${blocksize} -L${compression_level} -E2`
    • Example: numactl -C 1-18,105-122 ./test/benchmark -l2 -t36 -L9 -c16K -m1 -E2 silesia.concat

Configuration Details:

Test by Intel as of 05/22/23. 1-node, 2x Intel(R) Xeon(R) Platinum 8470N, 52 cores, HT On, Turbo On, Total Memory 512GB (16x32GB DDR5 4800 MT/s [4800 MT/s]), BIOS EGSDCRB1.SYS.9409.P01.2211280753, microcode 0x2b000161, 2x 223.6G KINGSTON SUV400S37240G, 1x 447.1G INTEL SSDSC2BB480G7, 1x 240M Disk, Ubuntu 22.04.1 LTS, 5.15.0-56-generic, GCC 11.3.0, zstd: v1.5.5, QAT-ZSTD-Plugin: v0.0.1, QAT Driver: QAT20.L.1.0.40

Notices and Disclaimers

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software, or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
