Author: Eshe Pickett, Cloud Software Development Engineer, Intel
Introduction
This blog examines how Intel® Xeon® processors with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) support enhance the performance of CRC32C checksums.
PostgreSQL* has long prioritized performance and data integrity, and with the release of PostgreSQL 18, the Intel PostgreSQL team contributed a significant optimization: Intel® Advanced Vector Extensions 512 (Intel® AVX-512) for CRC32C checksum calculations. These calculations play a vital role, especially for PostgreSQL’s Write-Ahead Logging (WAL) feature.
Background
CRC32C is a widely used checksum for data integrity in storage and network protocols. One critical example of its usage is corruption detection in PostgreSQL WAL.
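To make the checksum concrete, here is a minimal bit-at-a-time sketch of CRC32C in C. It is not PostgreSQL's implementation; the function name `crc32c_bitwise` is chosen for illustration, and the code assumes the conventional CRC32C parameters (reflected Castagnoli polynomial, all-ones initial value, final inversion).

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the Castagnoli polynomial 0x1EDC6F41. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/*
 * Illustrative bit-at-a-time CRC32C (hypothetical helper, not PostgreSQL code).
 * Slow, but useful as a reference when validating faster implementations.
 */
uint32_t crc32c_bitwise(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;              /* conventional initial value */

    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
        {
            /* If the low bit is set, shift and XOR in the polynomial. */
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;                /* conventional final inversion */
}
```

Calling `crc32c_bitwise("123456789", 9)` yields 0xE3069283, the standard check value for CRC32C, which makes this routine a convenient oracle for testing optimized versions.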
Key Terms
CRC32C
Cyclic Redundancy Check 32-bit Castagnoli - A checksum algorithm using the Castagnoli polynomial for efficient error detection in data storage and transmission.
Intel® AVX-512
Intel® Advanced Vector Extensions 512 - Intel's instruction set that enables processors to perform operations on 512 bits of data simultaneously for enhanced performance.
VPCLMULQDQ
Vector Carry-less Multiplication of Quadwords - A vectorized carry-less multiplication instruction, used here with Intel® AVX-512 on 512-bit registers, that is particularly useful for cryptographic and checksum calculations.
ZMM Registers
512-bit wide registers in Intel® AVX-512 capable processors that can hold and process large amounts of data in parallel operations.
XGETBV
Get Value of Extended Control Register - A CPU instruction that reads an extended control register, used to check which processor state components (such as the ZMM registers required by Intel® AVX-512) the operating system has enabled.
Intel® SSE4.2
Intel® Streaming SIMD Extensions 4.2 (Intel® SSE4.2) - An earlier instruction set extension that includes a hardware-accelerated CRC32 instruction and was the basis of PostgreSQL's x86 CRC32C optimization before Intel® AVX-512.
What Changed?
Previously, the x86 PostgreSQL CRC32C computation used the Intel® SSE4.2 extension, which handles up to 8 bytes per instruction. In PostgreSQL 18, the database can now use Intel® AVX-512 and the VPCLMULQDQ instruction to process 64 bytes in a single loop iteration. The implementation gracefully falls back to Intel® SSE4.2 for smaller inputs or older hardware.
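For comparison with the new path, the 8-bytes-per-instruction approach can be sketched as follows. This is an illustration under stated assumptions, not PostgreSQL's source: the function names are hypothetical, the code assumes GCC or Clang on x86-64 for the intrinsic path, and it falls back to a bitwise loop elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <nmmintrin.h>
#define HAVE_SSE42_SKETCH 1

/* Hardware CRC32C: 8 bytes per instruction, then a byte-wise tail. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_sse42_update(uint32_t crc, const unsigned char *p, size_t len)
{
    uint64_t c = crc;

    while (len >= 8)
    {
        uint64_t chunk;

        memcpy(&chunk, p, 8);            /* unaligned-safe load */
        c = _mm_crc32_u64(c, chunk);
        p += 8;
        len -= 8;
    }
    crc = (uint32_t) c;
    while (len--)
        crc = _mm_crc32_u8(crc, *p++);
    return crc;
}
#endif

/* Portable fallback: one bit at a time with the reflected polynomial. */
static uint32_t crc32c_bitwise_update(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Hypothetical entry point: picks the hardware path when the CPU has it. */
uint32_t crc32c_sketch(const void *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

#ifdef HAVE_SSE42_SKETCH
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_sse42_update(crc, data, len) ^ 0xFFFFFFFFu;
#endif
    return crc32c_bitwise_update(crc, data, len) ^ 0xFFFFFFFFu;
}
```

Each `_mm_crc32_u64` call depends on the result of the previous one, so this loop is bound by instruction latency; the Intel® AVX-512 path avoids that serial dependency by folding 64 bytes per iteration.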
Key Changes:
- New Intel® AVX-512 Path:
When Intel® AVX-512 and VPCLMULQDQ instructions are detected at runtime, PostgreSQL processes larger chunks of data in parallel, accelerating checksum calculations.
- Smarter Runtime Checks:
The build system and CRC logic have been updated to detect which CPU features are available and to select the fastest compatible algorithm at runtime.
- Broader Hardware Support:
The dynamic instruction set detection and fallback ensure maximum compatibility for modern hardware and legacy systems.
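The "select the fastest compatible algorithm at runtime" idea is commonly implemented with a function pointer that initially targets a chooser routine and is overwritten on first use. The sketch below illustrates that pattern only; every name in it is hypothetical, and the chooser installs a portable routine where a real build would test CPU features and install a SIMD routine.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*crc32c_fn)(uint32_t crc, const void *data, size_t len);

static uint32_t crc32c_choose(uint32_t crc, const void *data, size_t len);

/* Starts out pointing at the chooser; rewritten on the first call. */
crc32c_fn crc32c_impl = crc32c_choose;

/* Portable routine operating on a running CRC (no init or final XOR). */
static uint32_t crc32c_portable(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/*
 * First-call chooser. A real implementation would query CPU features here
 * and install the fastest supported routine; this sketch always installs
 * the portable one (assumption made to keep the example self-contained).
 */
static uint32_t crc32c_choose(uint32_t crc, const void *data, size_t len)
{
    crc32c_impl = crc32c_portable;
    return crc32c_impl(crc, data, len);
}

/* Reports whether the pointer has been resolved to a concrete routine. */
int crc32c_is_resolved(void)
{
    return crc32c_impl != crc32c_choose;
}
```

Callers always go through `crc32c_impl`, for example `crc32c_impl(0xFFFFFFFFu, buf, n) ^ 0xFFFFFFFFu`. Only the first call pays for feature detection; every later call is a single indirect call to the chosen routine.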
Benchmark Summary
On each benchmark performed, Intel AVX-512 demonstrates a significant improvement compared to Intel SSE4.2.
CRC32C bench (Algorithm-Level Performance)
The raw CRC32C algorithm performance on random data buffers shows Intel® AVX-512 achieving 2.2x to 8.1x speedup over Intel® SSE4.2.
PostgreSQL test_crc32c (Database Integration)
Within PostgreSQL's CRC32C implementation, Intel® AVX-512 provides 1.04x to 5.60x speedup over Intel® SSE4.2 across buffer sizes from 128 bytes to 32KB.
pg_basebackup (Real-World Database Operations)
For complete database backup operations (1.5GB to 256GB), Intel AVX-512 delivers a 13-17% throughput improvement with a consistent 1.15x average speedup.
Intel® AVX-512 and VPCLMULQDQ implementation details
What is VPCLMULQDQ?
VPCLMULQDQ is a vectorized Carry-less Multiplication instruction, and carry-less multiplication is the core operation of the polynomial arithmetic behind CRCs. Used with Intel® AVX-512, it operates on 512 bits (64 bytes) per instruction. This instruction, available on Intel Xeon Scalable processors and other Intel processors with Intel AVX-512 support, is beneficial for speeding up critical computation hotspots, such as CRC computations within databases like PostgreSQL. In PostgreSQL, the new code path is paired with a strategic mix of function inlining and indirect calls so that critical paths (such as WAL) do not incur extra overhead.
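To show what a carry-less multiplication computes, here is a portable software sketch of a single 64-bit by 64-bit carry-less multiply. The type and function names are hypothetical. The hardware instruction produces the same kind of 128-bit product, and its 512-bit form performs four such multiplications at once, one per 128-bit lane.

```c
#include <stdint.h>

/* 128-bit product split into two 64-bit halves. */
typedef struct
{
    uint64_t lo;
    uint64_t hi;
} clmul128;

/*
 * Software carry-less multiply (illustration only). It works like
 * schoolbook binary multiplication, except partial products are combined
 * with XOR instead of addition, so no carries propagate between bits.
 */
clmul128 clmul64(uint64_t a, uint64_t b)
{
    clmul128 r = {0, 0};

    for (int i = 0; i < 64; i++)
    {
        if ((b >> i) & 1)
        {
            r.lo ^= a << i;
            if (i > 0)
                r.hi ^= a >> (64 - i);   /* bits shifted past bit 63 */
        }
    }
    return r;
}
```

For example, `clmul64(3, 3)` gives 5 rather than 9: treating 3 as the polynomial x + 1, the square is x² + 1 because the two middle terms cancel under XOR.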
Implementation Details
In the following sections, we review how the modifications enable PostgreSQL to utilize the most efficient instructions possible when the hardware supports them.
- Detection at Runtime and Build Time
- Updated build scripts (configure, configure.ac, meson.build) check not only for Intel® SSE4.2 and ARM* CRC, but also for the presence of Intel® AVX-512 and VPCLMULQDQ.
- At runtime, the code checks CPUID bits to see if the CPU and OS support Intel® AVX-512, VPCLMULQDQ, and the necessary ZMM registers (via XGETBV).
- Function Dispatch
- A function pointer (pg_comp_crc32c) is set up at the first use to point to the optimal implementation:
- Intel® AVX-512 (if supported)
- Intel® SSE4.2
- Slicing-by-8 fallback
- This is done in pg_crc32c_sse42_choose.c with fine-grained checks for each CPU feature.
- For critical WAL checksum cases, the function is inlined for optimal speed.
- Intel® AVX-512 Algorithm
- Derived from the MIT-licensed fast-crc32 project.
- The main loop processes 64 bytes at a time using:
- _mm512_loadu_si512 to load data
- _mm512_clmulepi64_epi128 for Carry-less Multiplication
- _mm512_ternarylogic_epi64 and other Intel® AVX-512 intrinsics for reductions
- The code aligns to cache lines and uses scalar CRC for any unaligned bytes at the beginning.
- The 512-bit result is reduced (folded) stepwise to a 32-bit CRC.
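The runtime detection step described above (CPUID feature bits plus an XGETBV check that the OS saves ZMM state) can be sketched as follows. This is an illustrative approximation, not the code in pg_crc32c_sse42_choose.c: the function names are hypothetical, the sketch assumes GCC or Clang on x86-64, and it checks only AVX512F and VPCLMULQDQ, whereas a production check may require additional feature bits.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>

/* Read extended control register 0 (XCR0) with XGETBV. */
static uint64_t read_xcr0(void)
{
    uint32_t eax, edx;

    __asm__ volatile ("xgetbv" : "=a" (eax), "=d" (edx) : "c" (0));
    return ((uint64_t) edx << 32) | eax;
}

/* Hypothetical helper: true if an AVX-512 + VPCLMULQDQ path may be used. */
bool avx512_crc32c_usable(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 1, ECX bit 27 (OSXSAVE): the OS has enabled XGETBV. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
        return false;

    /* XCR0 bits 1-2 (SSE, AVX) and 5-7 (opmask and ZMM state) must be set. */
    if ((read_xcr0() & 0xE6) != 0xE6)
        return false;

    /* Leaf 7, subleaf 0: AVX512F is EBX bit 16, VPCLMULQDQ is ECX bit 10. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (ebx & (1u << 16)) && (ecx & (1u << 10));
}
#else
/* Non-x86 or unsupported compiler: report the path as unavailable. */
bool avx512_crc32c_usable(void)
{
    return false;
}
#endif
```

The order matters: XGETBV must not be executed unless OSXSAVE is reported, and the CPUID feature bits alone are not sufficient because the operating system must also have enabled saving of the ZMM registers.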
Code Example (Excerpt)
pg_crc32c
pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
{
    // ...alignment and setup...
    while (len >= 64) {
        // Load 64 bytes
        __m512i x0 = _mm512_loadu_si512(buf);
        // Polynomial folding with carryless multiplication
        __m512i y0 = clmul_lo(x0, k);
        x0 = clmul_hi(x0, k);
        x0 = _mm512_ternarylogic_epi64(x0, y0, ...);
        // Advance to next chunk
        buf += 64;
        len -= 64;
    }
    // Reduce to 32 bits and finish
    // ...
    return crc;
}
Why it Matters
Operations in PostgreSQL, such as backup, restore, and WAL, often process data in blocks of 8KB or larger. Many network protocols and storage systems utilize large buffers (4 KB or greater) for checksums. Modern file systems perform integrity checks using large blocks of data. Because Intel® AVX-512 is designed to provide speedup for larger contiguous data blocks, these operations will perform better on a system with the instruction set available. The CRC32C implementation using Intel® AVX-512 improves PostgreSQL in the following ways:
- Performance:
Faster checksums mean less CPU time spent on integrity checks, especially during high-throughput operations or bulk loads.
- Modernization:
By adopting Intel® AVX-512, PostgreSQL stays on the leading edge of hardware-accelerated computing.
- Reliability:
The fallback logic ensures robust operation on all supported hardware, maintaining accuracy and safety.
Benchmarks and Results
We performed all benchmarks on an Intel® 4th Generation Xeon® Scalable processor with PostgreSQL 18 RC1, compiled with and without the Intel AVX-512 instruction set for comparison. We ran each benchmark multiple times, and report results from a representative run that exhibited typical performance observed across all executions. This approach ensured consistency and allowed us to rule out transient system effects. We created build and run scripts for automation and reproducibility.
The results below show the performance of the CRC32C implementation from three different approaches:
- CRC32C Bench (crc32c bench): This benchmark measures the execution time of various CRC32C algorithms on random data buffers of multiple sizes.
- PostgreSQL (test_crc32c): This benchmark is a PostgreSQL extension that measures the performance of CRC32C computations within PostgreSQL by repeatedly calculating the checksum over a randomly initialized buffer.
- PostgreSQL Backup (pg_basebackup): This benchmark evaluates throughput during a PostgreSQL database backup operation, with databases ranging in size from 1.5 GB to 256 GB.
CRC32C bench: Algorithm-Level Performance Results
The crc32c microbenchmark provides an algorithm-level assessment. The table below compares the performance of scalar (simple, non-SIMD), Intel SSE4.2-optimized, Intel AVX-512-optimized, and Corsix implementations of CRC32C.
For high-performance databases, AVX-512 CRC32C delivers superior throughput for large data blocks, reducing overhead for integrity checks and speeding up write/read paths.
On legacy hardware, Intel SSE4.2 and scalar implementations continue to deliver solid performance for smaller buffers.
The significance is greatest in scenarios where integrity checks are a bottleneck, such as heavy I/O, replication, or backup verification.
PostgreSQL test_crc32c: Database Integration Results
Within the test_crc32c extension, the drive_crc32c function provides an assessment at the PostgreSQL application level. It allocates a buffer of size num, fills it with random data, and then computes the CRC32C checksum of this buffer count times, returning the final CRC32C value as a bigint. Timing was captured with the psql \timing meta-command. For all but the smallest buffer sizes, Intel® AVX-512 is markedly more efficient.
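The shape of that measurement loop can be sketched as a standalone C function. This is a hypothetical analogue written for illustration, not the extension's source: the name `drive_crc32c_sketch`, the fixed random seed, and the bitwise CRC routine are all assumptions, and the real extension calls PostgreSQL's own CRC32C macros.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Reference CRC32C used by the sketch (bit at a time, reflected polynomial). */
static uint32_t crc32c_bitwise(const unsigned char *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Hypothetical analogue of the benchmark driver: fill a buffer of num bytes
 * with pseudo-random data, checksum it count times, return the last CRC.
 * Returns -1 if the buffer cannot be allocated.
 */
int64_t drive_crc32c_sketch(int count, int num)
{
    unsigned char *buf;
    uint32_t crc = 0;

    if (count <= 0 || num <= 0)
        return 0;
    buf = malloc((size_t) num);
    if (buf == NULL)
        return -1;

    srand(42);                               /* fixed seed for repeatable runs */
    for (int i = 0; i < num; i++)
        buf[i] = (unsigned char) (rand() & 0xFF);

    for (int i = 0; i < count; i++)
        crc = crc32c_bitwise(buf, (size_t) num);

    free(buf);
    return (int64_t) crc;
}
```

Because the buffer is filled once and checksummed many times, the elapsed time is dominated by the CRC routine itself rather than by allocation or data generation, which is what makes the per-buffer-size comparison meaningful.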
At very large buffer sizes (16384, 32768), Intel AVX-512 maintains substantial performance advantages, achieving a speedup of 4.8x or more.
PostgreSQL Backup (pg_basebackup): Real-World Database Operations Results
To determine performance in a real-world scenario (as opposed to algorithm and application level), we performed an end-to-end backup workflow on a PostgreSQL server using pg_basebackup. This was done on progressively larger database sizes, scaling from small to enterprise size. Analysis of the CRC32C operation patterns revealed that in most cases (99.99%), PostgreSQL processes database backups in 32KB chunks, that is, four pages at a time at the default page size of 8KB. This provides consistent computational workload characteristics across different database sizes.
The graph illustrates the performance of Intel SSE4.2 compared to Intel AVX-512.
A large database backup requires a substantial number of CRC32C computations, so the underlying implementation has a direct impact on backup time. Compared to Intel® SSE4.2, Intel® AVX-512 improves performance by 15% on average, with real-world implications for time savings:
| Database Size | Intel® SSE4.2 Time | Intel® AVX-512 Time | Time Saved |
| 15.0 GB | 10.4 sec | 9.2 sec | 1.2 sec |
| 75.0 GB | 51.4 sec | 44.2 sec | 7.2 sec |
| 150.0 GB | 99.5 sec | 85.2 sec | 14.3 sec |
| 256.3 GB | 173.0 sec | 152.6 sec | 20.4 sec |
When translated to production benefits:
- Enterprise backups (256GB): Save 20+ seconds per backup
- Daily backups: 20 seconds × 365 days = 2+ hours annually
- Multiple databases: Savings multiply across database instances
This represents meaningful time savings for production environments with large databases and frequent backup requirements.
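The arithmetic behind these figures can be reproduced directly from the backup table. The short C sketch below is an illustration with hypothetical names; it encodes the four table rows and derives the per-row speedup, the mean speedup, and the annual hours saved for a given number of backups per year.

```c
#include <stddef.h>

/* One row of the pg_basebackup results table (times in seconds). */
typedef struct
{
    double size_gb;
    double sse42_sec;
    double avx512_sec;
} backup_row;

static const backup_row backup_rows[] = {
    {15.0, 10.4, 9.2},
    {75.0, 51.4, 44.2},
    {150.0, 99.5, 85.2},
    {256.3, 173.0, 152.6},
};

#define BACKUP_ROW_COUNT (sizeof backup_rows / sizeof backup_rows[0])

/* Speedup of the AVX-512 run relative to the SSE4.2 run for one row. */
double backup_speedup(const backup_row *r)
{
    return r->sse42_sec / r->avx512_sec;
}

/* Arithmetic mean of the per-row speedups. */
double mean_backup_speedup(void)
{
    double sum = 0.0;

    for (size_t i = 0; i < BACKUP_ROW_COUNT; i++)
        sum += backup_speedup(&backup_rows[i]);
    return sum / (double) BACKUP_ROW_COUNT;
}

/* Hours saved per year if this backup runs backups_per_year times. */
double annual_hours_saved(const backup_row *r, int backups_per_year)
{
    return (r->sse42_sec - r->avx512_sec) * backups_per_year / 3600.0;
}
```

With the table values, the per-row speedups fall between roughly 1.13x and 1.17x, the mean is about 1.15x, and one daily backup of the 256.3 GB database saves a little over two hours per year.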
Conclusion
These results demonstrate that enabling Intel® AVX-512 in your PostgreSQL CRC32C implementation yields a clear performance improvement across the tested buffer sizes (128 bytes to 32KB), with the largest gains at the larger sizes. This patch is a state-of-the-art example of utilizing modern SIMD instructions to accelerate critical database primitives, with a focus on portability, correctness, and runtime efficiency. PostgreSQL users can apply the patch to backport the functionality on modern hardware or upgrade to PostgreSQL 18 for native support.
Configuration
This section provides the system and software configuration(s) required to reproduce the results provided.
Hardware and Software Configurations
Intel Xeon Platinum 8480: 1-node, 2x Intel(R) Xeon(R) Platinum 8480+, 56 cores, 350W TDP, HT On, Turbo On, Total Memory 512GB (16x32GB DDR5 4800 MT/s [4800 MT/s]), BIOS 3A06, microcode 0x2b000639, 2x Ethernet Controller X710 for 10GBASE-T, 1x 1.8T Samsung SSD 970 EVO Plus 2TB, 1x 3.7T addlink M.2 PCIE G4x4 NVMe, 1x 953.9G KBG40ZNS1T02 TOSHIBA MEMORY, Ubuntu 22.04.5 LTS, 6.8.0-60-generic. Test by Intel as of 09/23/25.
PostgreSQL 18RC1
Minimal requirements:
- Intel processor with Intel SSE4.2 support (Core i7/i5 2nd gen+, Xeon 5500+)
- PCLMUL instruction support
- Operating System: Linux variant with kernel 6.x or later
Optimal Performance:
- Intel® 4th Generation Xeon® Scalable processor or newer (for Intel AVX-512 VPCLMULQDQ)
- Intel® 3rd Generation Xeon® Scalable processor or newer (for improved Intel AVX-512 performance)
Modifications
- PostgreSQL (Intel AVX-512 natively enabled): 18.0 (see installation requirements)
- PostgreSQL (Intel SSE4.2 enforcement): 18.0
- If using Meson:
# Patch meson.build to skip AVX512 detection
sed -i '/cdata.set.*USE_AVX512_CRC32C_WITH_RUNTIME_CHECK.*1/s/1/0/' meson.build
sed -i '/cdata.set.*USE_AVX512_POPCNT_WITH_RUNTIME_CHECK.*1/s/1/0/' meson.build
- If using Make:
# Patch configure file to skip AVX512 detection
sed -i '18293s/.*/# \$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h/' configure
Acknowledgments
This patch was co-authored and reviewed by contributors from Intel and the PostgreSQL community, with benchmarking and discussion in related PostgreSQL mailing lists.
Credits to John Naylor (AWS), Nathan Bossart (AWS), Raghuveer Devulapalli (Intel), Paul Amonson (Intel), Matthew Sterrett (Intel), Kelly Mckeighan (Intel), Eshe Pickett (Intel), and Akash Shankaran for various contributions and feedback to this effort.
Commit: 3c6e8c123896584f1be1fe69aaf68dcb5eb094d5
Date: April 6, 2025
References
- Postgres SQL function for crc32c benchmarking
- PostgreSQL 18 Release Announcement
- fast-crc32 on GitHub
- Intel’s “Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction” (2009)
- PostgreSQL Turns To AVX-512 for CRC32 Computations: Up To 3x Faster
- Which AVX-512 Instructions Are Supported by Intel Xeon Scalable Processors?
- Raw CRC32C Algorithm Benchmark
Notices and Disclaimers
Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. *Other names and brands may be claimed as the property of others.