HSM versus software PQC: latency benchmarks under realistic financial transaction load

May 6, 2025 · Daniel Pryce, Head of Hardware Engineering · 9 min read

Figure: Benchmark chart comparing latency of HSM hardware PQC versus software-only PQC

The question we hear most often from payment network security architects evaluating post-quantum migration options is: "Can software PQC running on our existing authorization servers handle peak load?" An honest answer requires actual benchmark data, not marketing assertions. This article presents the methodology and numbers from Cryptrig's internal comparative testing of software PQC versus CQ1 hardware under payment authorization workload profiles.

The comparison is not between Cryptrig and a named competitor's product. It is between the general class of software PQC implementations running on modern server hardware and CQ1. We include OpenSSL's liboqs integration, the PQClean reference implementation compiled for x86-64 with AVX2, and a commercial PKCS#11 software module (unnamed) as representative software benchmarks.

METHODOLOGY NOTE

All benchmarks were conducted on internal test infrastructure using CQ1 pre-production evaluation hardware. Software PQC benchmarks used a dual-socket server with Intel Xeon processors and 256 GB RAM. Payment transaction profiles were synthetic, based on published ISO 8583 transaction size distributions for card authorization environments. These are internal test results, not third-party certified benchmarks. Independent verification is available to qualified institutions under NDA as part of the CQ1 evaluation program.

Baseline: what financial transaction load looks like

A mid-size regional bank processing card transactions typically sees authorization volumes of 800–2,500 transactions per second during peak periods (Friday evening, pre-holiday). Each authorization flow involves at minimum:

One symmetric decryption operation (PIN block or payload)
One MAC generation or verification (message authentication)
Potentially one key diversification operation (per-card key derivation)

In a post-quantum migration scenario, the RSA or ECDH key exchange used to establish session keys between acquirer HSMs, network switches, and issuer HSMs would be replaced with Kyber-1024 key encapsulation (standardized by NIST as ML-KEM-1024 in FIPS 203). The question is whether software Kyber execution on the authorization server itself can handle this without adding latency that breaks the authorization network's end-to-end round-trip requirement (typically under 200 ms).

Benchmark 1: Kyber-1024 key encapsulation throughput

Test: Maximum sustained Kyber-1024 key encapsulations per second under 100% CPU utilization.

Implementation	Peak ops/sec	P50 latency	P99 latency
OpenSSL liboqs (AVX2)	~1,850	0.51ms	2.8ms
PQClean x86-64 AVX2	~2,200	0.44ms	2.3ms
Software PKCS#11 module	~1,200	0.82ms	4.1ms
CQ1 (FPGA hardware)	>18,000	0.18ms	<0.8ms

The peak throughput gap is approximately 8–15x in favor of hardware (more than 18,000 ops/sec against 1,200–2,200 ops/sec). More importantly, P99 latency for the software implementations at high load climbs into the 2–4 ms range. For authorization flows where the total cryptographic budget (HSM operations) is 10–15 ms, a 4 ms P99 on the key exchange step alone consumes roughly 27–40% of that budget.
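The ratios quoted above follow directly from the table. A minimal sketch of the arithmetic, assuming the table values are taken at face value and treating CQ1's ">18,000" as a lower bound of 18,000 (the function names are illustrative, not part of any benchmark tooling):

```python
# Peak ops/sec copied from the Benchmark 1 table.
SOFTWARE_PEAK_OPS = {
    "OpenSSL liboqs (AVX2)": 1850,
    "PQClean x86-64 AVX2": 2200,
    "Software PKCS#11 module": 1200,
}
# The table lists CQ1 as ">18,000", so this is a lower bound (assumption).
CQ1_PEAK_OPS_LOWER_BOUND = 18_000


def throughput_ratios(software, hardware_ops):
    """Return the hardware/software peak-throughput ratio per implementation."""
    return {name: hardware_ops / ops for name, ops in software.items()}


def budget_share(added_latency_ms, budget_ms):
    """Fraction of a latency budget consumed by one added step."""
    return added_latency_ms / budget_ms


for name, ratio in throughput_ratios(SOFTWARE_PEAK_OPS, CQ1_PEAK_OPS_LOWER_BOUND).items():
    print(f"{name}: at least {ratio:.1f}x")

for budget_ms in (10, 15):
    share = budget_share(4, budget_ms)
    print(f"4 ms P99 against a {budget_ms} ms crypto budget: {share:.0%}")
```

Because the hardware figure is a lower bound, the computed ratios are lower bounds as well.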

Benchmark 2: mixed-algorithm workload at sustained load

Real authorization environments do not run a single operation type. A more realistic test combines Kyber-1024 encapsulation, Dilithium-3 signing, AES-256-GCM decryption, and SHA-3 hashing in a ratio representative of authorization message flows.

Test configuration: 5,000 transactions/second, each requiring one Kyber-1024 operation and one Dilithium-3 sign. 60-second sustained run, measuring P99 latency across all operations.

Configuration	Achievable TPS at <5ms P99	P99 at 5,000 TPS	CPU utilization at 5,000 TPS
Software PQC (AVX2, 32 cores)	~2,100	18.4ms	97%
CQ1 hardware offload	>12,000	1.1ms	4% (host)

At 5,000 TPS with mixed Kyber + Dilithium operations, software PQC on a 32-core server runs at 97% CPU utilization with a P99 latency of 18.4 ms, which exceeds the 10–15 ms cryptographic budget discussed above and leaves no headroom for burst. CQ1 handling the same workload leaves host CPU at 4% utilization with a P99 latency of 1.1 ms. That host CPU headroom means authorization server capacity is not dictated by cryptographic workload.

The shared-core contention problem

The benchmarks above test ideal conditions — a server dedicated entirely to PQC operations. In practice, authorization servers also run application logic: ISO 8583 parsing, fraud scoring, routing decisions, database calls. The NTT operations in Kyber compete for the same CPU cache lines and memory bandwidth as application code running on adjacent cores.

We observed a consistent 15–25% degradation in software PQC throughput when running concurrently with a simulated authorization application (database queries, in-memory cache operations). This is not a software bug — it is a fundamental consequence of shared-memory architecture. NTT polynomial multiplication has high memory access density, which creates L3 cache pressure that affects all cores on the socket.
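To see what that degradation means in absolute terms, the sketch below applies the 15–25% range to the Benchmark 1 peak figures and compares the result with the 800–2,500 TPS peak range cited earlier. It assumes the degradation applies uniformly to each implementation's peak throughput, which is an extrapolation for illustration rather than a measured result:

```python
# Peak ops/sec from the Benchmark 1 table (dedicated-server conditions).
PEAK_OPS = {
    "OpenSSL liboqs (AVX2)": 1850,
    "PQClean x86-64 AVX2": 2200,
    "Software PKCS#11 module": 1200,
}
DEGRADATION_RANGE = (0.15, 0.25)   # observed throughput loss under contention
PEAK_TPS_RANGE = (800, 2500)       # mid-size regional bank, peak periods


def degraded_range(peak_ops, degradation=DEGRADATION_RANGE):
    """Return (worst_case, best_case) ops/sec after applying the loss range."""
    low_loss, high_loss = degradation
    return peak_ops * (1 - high_loss), peak_ops * (1 - low_loss)


for name, ops in PEAK_OPS.items():
    worst, best = degraded_range(ops)
    covers_peak = worst >= PEAK_TPS_RANGE[1]
    print(f"{name}: {worst:.0f}-{best:.0f} ops/sec, "
          f"covers {PEAK_TPS_RANGE[1]} TPS peak: {covers_peak}")
```

Under these assumptions none of the software configurations sustains the upper end of the peak range at one encapsulation per transaction.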

FPGA execution is physically isolated from the host CPU. CQ1's NTT pipeline has its own local memory, its own clock domain, and no shared path to host DRAM during key operations. Application code running on the host server experiences zero cache pressure from CQ1 key operations. This is not a performance optimization — it is a different security model. The side-channel attack surface for FPGA key operations is structurally different from CPU-resident key operations.

Side-channel implications beyond throughput

Latency and throughput are the metrics that procurement conversations focus on. But the security case for hardware PQC is not primarily a performance argument — it is a side-channel argument.

Software PQC implementations that pass NIST's Known Answer Tests (KATs) may still be vulnerable to:

Cache-timing attacks: NTT table lookups have address patterns that can be observed by colocated virtual machines or processes via shared L1/L2 cache access timing. Constant-time implementation helps but does not eliminate all cache-timing vectors in shared-CPU environments.
Power analysis: In physical access scenarios (server room access, supply chain interception), differential power analysis (DPA) of CPU power consumption during NTT operations can leak information about the secret polynomial coefficients.
Fault injection: Glitching attacks on CPU voltage rails during signing operations can induce faults that leak the secret key in some signature schemes.

CQ1's physical security boundary — the FIPS 140-3 Level 3-designed tamper-detection mesh and power conditioning — addresses these vectors at the hardware level. Active power-line filtering reduces DPA signal. The tamper mesh detects physical access attempts. FPGA execution in a boundary-isolated fabric eliminates shared-memory side channels.

Financial institutions deploying PQC in environments subject to PCI HSM requirements cannot substitute a software PKCS#11 module for a hardware HSM regardless of throughput characteristics. The PCI HSM standard requires hardware key boundaries for transaction key operations. That constraint does not change when the algorithm changes from RSA to Kyber.

When software PQC is appropriate

Software PQC implementations are the correct choice for some use cases:

Client-side TLS — browser and mobile application connections where hardware HSMs are impractical
Development and testing environments where hardware evaluation units are not yet available
Low-volume internal systems with sub-100 ops/sec cryptographic requirements
Systems where physical key protection is not a regulatory requirement

Cryptrig is not a software library vendor. We do not produce or support software PQC stacks. Our assessment of software PQC performance and security properties is based on technical analysis of published implementations and internal testing — not competitive positioning. The honest assessment is that software PQC is appropriate for a large fraction of use cases and inadequate for the specific use case our product addresses: high-throughput, tamper-boundary-required, payment authorization infrastructure.

Practical conclusions for payment network architects

For security architects evaluating PQC deployment options for financial transaction infrastructure:

Software PQC throughput on modern servers is sufficient for sub-1,000 TPS environments with relaxed latency budgets. Above that, P99 latency degradation under burst becomes problematic without dedicated hardware.
CPU utilization at high PQC load leaves no headroom for application logic. In authorization environments where server capacity is sized to 70–80% utilization at peak, adding full-server software PQC will cause capacity failures.
PCI HSM requirements mandate hardware key boundaries regardless of algorithm or throughput capacity.
Side-channel isolation for PQC key operations in shared-CPU environments requires constant-time implementation discipline and is not achievable at the same assurance level as physical FPGA isolation.
The migration from classical HSMs to PQC-capable HSMs via PKCS#11 drop-in is the path of least operational disruption. Software PQC requires application code changes to integrate the PQC library — hardware HSM replacement does not, for PKCS#11-conformant applications.
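The capacity point in the list above can be checked with back-of-the-envelope arithmetic from Benchmark 2. This sketch assumes CPU cost scales linearly with transaction rate and that the application baseline runs on the same 32-core server as the PQC workload; both are simplifying assumptions, not measured results:

```python
# Benchmark 2: software PQC at 5,000 TPS used 97% of a 32-core server.
BENCH_TPS = 5000
BENCH_CPU_UTILIZATION = 0.97


def added_utilization(tps):
    """CPU fraction consumed by software PQC at `tps` (linear-scaling assumption)."""
    return BENCH_CPU_UTILIZATION * tps / BENCH_TPS


def total_utilization(baseline, tps):
    """Application baseline plus the estimated software PQC share."""
    return baseline + added_utilization(tps)


for baseline in (0.70, 0.80):          # typical peak sizing targets
    for tps in (800, 2500):            # peak range for a mid-size regional bank
        total = total_utilization(baseline, tps)
        status = "over capacity" if total > 1.0 else "fits"
        print(f"baseline {baseline:.0%}, {tps} TPS: {total:.1%} ({status})")
```

A total above 100% means the server cannot carry both workloads at that rate; real behavior would degrade earlier because of the contention effects described above.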

Understanding the P99 cliff: why averages mislead

The benchmark tables above show P50 and P99 latency separately because the gap between them is the operationally relevant number. Software PQC at P50 looks tolerable; at P99 under burst it does not. This gap is not a measurement artifact — it reflects fundamental properties of software execution on shared hardware.

On a modern Linux server running a PKCS#11 application, Kyber-1024 operations are scheduled by the OS kernel alongside other processes. Under average load, a Kyber operation gets CPU time quickly and P50 latency stays close to the single-threaded benchmark. Under burst (a sudden spike in authorization requests, an end-of-hour settlement batch running simultaneously, a monitoring agent triggering a CPU-intensive scan), scheduler queue depth increases and the slowest operations wait far longer for CPU time. On systems running near capacity this queueing effect is non-linear: at 85–90% CPU utilization, the 99th percentile wait time can be 10–20x the 50th percentile wait time.
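The non-linearity can be illustrated with a textbook queueing model. The sketch below uses a single-server M/M/1 queue, which is a deliberate simplification: a real authorization server has many cores, non-exponential service times, and OS scheduling effects, so this model does not reproduce the measured P99-to-P50 ratios. It only shows how tail latency grows with 1/(1 - utilization). The 0.44 ms mean service time is borrowed from the PQClean P50 figure as a stand-in:

```python
import math


def response_time_quantile(p, mean_service_ms, utilization):
    """p-quantile of response time in an M/M/1 first-come-first-served queue.

    In M/M/1, response time is exponentially distributed with rate
    (1 - utilization) / mean_service, so the quantile has a closed form.
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return -math.log(1 - p) * mean_service_ms / (1 - utilization)


MEAN_SERVICE_MS = 0.44  # stand-in value (assumption), not a measured mean

for utilization in (0.50, 0.70, 0.85, 0.90, 0.97):
    p99 = response_time_quantile(0.99, MEAN_SERVICE_MS, utilization)
    print(f"utilization {utilization:.0%}: modeled P99 = {p99:.1f} ms")
```

Moving from 50% to 90% utilization multiplies the modeled P99 by five, and each further step toward saturation costs more than the last. That is the shape of the cliff, even though the absolute numbers in a production system will differ.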

The hardware approach changes this dynamic structurally. CQ1 processes operations from a dedicated hardware queue implemented in the FPGA control fabric. Operations arrive via PCIe from the host, are enqueued in FPGA-local memory, and are dispatched to NTT pipeline instances without any OS scheduling involvement. There is no shared scheduler between CQ1 and host application processing. The P99-to-P50 ratio for CQ1 at maximum throughput is under 1.5x — latency is nearly uniform regardless of host CPU load state.

For payment authorization architectures where SLA violations have direct revenue impact — declined transactions due to timeout, settlement failures due to missed batch windows — the P99 cliff of software PQC at high load is not an acceptable operational risk. The cost of a hardware HSM that eliminates this risk is justified on operational reliability grounds alone, independent of security considerations.

The PKCS#11 integration path: a more realistic migration comparison

The benchmark comparison above tests maximum throughput in isolation. A migration-relevant comparison also needs to account for integration effort: how much code change is required to deploy hardware HSM versus software PQC in a production payment environment?

Software PQC library integration (liboqs, BouncyCastle PQC, or a commercial PQC SDK) requires:

Integrating the PQC library into the application build
Modifying key exchange call sites to use the new API (liboqs has a different API than OpenSSL's classical functions)
Updating buffer allocation for larger PQC key sizes and ciphertexts
Updating TLS configuration to advertise hybrid key exchange groups
Implementing key management for PQC key pairs (generation, storage, rotation)
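The buffer-allocation item in the list above is easy to underestimate. As a rough illustration, the sketch below compares Kyber-1024 (ML-KEM-1024) object sizes with two classical sizes and flags buffers that are too small. The `legacy_buffers` allocation and the helper name are hypothetical, chosen only to show the check:

```python
# Kyber-1024 / ML-KEM-1024 object sizes in bytes (FIPS 203 parameter set).
KYBER1024_SIZES = {
    "public_key": 1568,
    "secret_key": 3168,
    "ciphertext": 1568,
    "shared_secret": 32,
}

# Classical sizes in bytes, for comparison.
CLASSICAL_SIZES = {
    "ECDH P-256 public key (uncompressed point)": 65,
    "RSA-2048 ciphertext": 256,
}


def undersized_buffers(allocated, required=KYBER1024_SIZES):
    """Return {field: required_bytes} for every buffer too small for Kyber-1024."""
    return {
        field: required[field]
        for field, size in allocated.items()
        if field in required and size < required[field]
    }


# Hypothetical legacy allocation sized around RSA-2048 key transport.
legacy_buffers = {"public_key": 512, "ciphertext": 256, "shared_secret": 32}

for field, needed in undersized_buffers(legacy_buffers).items():
    print(f"{field}: allocated {legacy_buffers[field]} bytes, "
          f"Kyber-1024 needs {needed}")
```

Fixed-size message fields and database columns deserve the same audit as in-memory buffers, since a Kyber-1024 ciphertext is roughly six times the size of an RSA-2048 ciphertext.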

Hardware HSM replacement (classical HSM → CQ1 via PKCS#11) requires:

Installing CQ1 and its PKCS#11 provider library
Updating the PKCS#11 provider configuration file to point to CQ1 rather than the existing HSM
Updating buffer allocation for PQC key sizes (same requirement as software path)
Running existing PKCS#11 application test suites against CQ1 in classical mode to validate drop-in compatibility
Enabling hybrid mode via policy configuration once classical-mode testing passes

The software PQC path requires changing API call sites across the application codebase. The hardware HSM replacement path changes the provider configuration and leaves API call sites intact (since they already use PKCS#11 abstractions). For institutions with large authorization codebases or multiple application teams that each own part of the payment stack, the provider-swap approach is substantially less risky than a library-swap approach — it reduces the blast radius of the migration to the HSM integration layer rather than spreading it across all application code that performs cryptographic operations.
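The provider-swap idea can be sketched as a configuration diff. The INI layout, section name, and library paths below are hypothetical and do not describe CQ1's actual configuration format; the point is only that the application reads the PKCS#11 module path from configuration, so call sites such as `C_Sign` or `C_GenerateKeyPair` do not change:

```python
import configparser

# Hypothetical configuration before migration (path is illustrative only).
CLASSICAL_CONFIG = """
[pkcs11]
module = /opt/legacy-hsm/lib/libpkcs11.so
slot = 0
"""

# Hypothetical configuration after migration (path is illustrative only).
CQ1_CONFIG = """
[pkcs11]
module = /opt/cq1/lib/libcq1-pkcs11.so
slot = 0
"""


def load_section(config_text):
    """Parse the [pkcs11] section of an INI-style configuration string."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return dict(parser["pkcs11"])


def changed_keys(before_text, after_text):
    """Return the sorted list of settings that differ between two configs."""
    before, after = load_section(before_text), load_section(after_text)
    return sorted(k for k in before.keys() | after.keys()
                  if before.get(k) != after.get(k))


print("settings changed by the provider swap:",
      changed_keys(CLASSICAL_CONFIG, CQ1_CONFIG))
```

In this sketch the only changed setting is the module path. A real migration would still need the buffer-size and test-suite steps listed above.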

This is not a universal argument against software PQC in all environments. For institutions building new payment infrastructure from scratch, software PQC with proper abstraction layers is a reasonable design. For institutions migrating existing payment infrastructure — which describes most of the market — the hardware HSM replacement path minimizes code change risk and integration testing scope while providing the throughput and security properties that regulated payment environments require.

A note on benchmarking methodology and reproducibility

Latency benchmarks for cryptographic operations are highly sensitive to test conditions. Numbers cited in vendor materials — including our own — should be contextualized against the test environment. Single-threaded peak throughput on an idle machine tells you the theoretical ceiling for that algorithm on that hardware; it does not tell you the throughput when the machine is also running authorization middleware at 60% CPU capacity.

When evaluating vendor benchmark claims for both software PQC and hardware HSMs, the useful questions are: What was the CPU and memory utilization of non-cryptographic workloads during the test? Was the measurement P50, P95, or P99? What was the test duration — 10 seconds of burst, or 60 minutes of sustained load? Was the test conducted on dedicated hardware or shared infrastructure? Without answers to these questions, latency numbers are useful for rough ordering (hardware is faster than software) but not for detailed capacity planning.
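A minimal measurement harness makes these questions concrete. The sketch below times a placeholder operation and reports P50, P95, and P99 using the nearest-rank method. The SHA-3 call is only a stand-in so the example runs with the standard library; a real evaluation would substitute the actual key encapsulation call and run for far longer under representative background load:

```python
import hashlib
import time


def percentile(sorted_samples, p):
    """Nearest-rank percentile for an integer p in 1..100 over sorted samples."""
    if not sorted_samples:
        raise ValueError("no samples")
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (p * len(sorted_samples) + 99) // 100)
    return sorted_samples[rank - 1]


def measure(operation, iterations):
    """Time `operation` repeatedly and return P50/P95/P99 latency in ms."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


def placeholder_operation():
    """Stand-in workload (assumption); replace with the operation under test."""
    hashlib.sha3_256(b"\x00" * 1024).digest()


results = measure(placeholder_operation, iterations=2000)
for name, value_ms in results.items():
    print(f"{name}: {value_ms:.4f} ms")
```

Reporting all three percentiles together with the iteration count, test duration, and concurrent load lets two sets of numbers be compared on equal terms.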

Cryptrig makes evaluation hardware available to qualified institutions for on-site benchmarking in their own environments. Our benchmark numbers reflect internal testing conditions as documented in the methodology note above; independent third-party verification is available through the evaluation program. For institutions that need production-accurate latency data, running CQ1 against your own authorization workload profile, rather than our test configuration, is the only way to get numbers you can size infrastructure around with confidence.

The benchmark exercise itself has value beyond capacity planning. Running PQC operations against your actual PKCS#11 application code surfaces integration issues — buffer sizes, API contract changes, thread safety assumptions — that don't appear in specification review. Institutions that complete benchmark testing in evaluation environments before committing to production migration timelines consistently report faster Phase 1 integration cycles than those that skip this step. The benchmarking is part of the migration, not a preliminary to it.
