The Number Theoretic Transform is the inner loop of CRYSTALS-Kyber. Every encapsulation and decapsulation operation — the cryptographic work at the center of every post-quantum TLS handshake, every key wrapping operation, every zone working key establishment — invokes the NTT multiple times on 256-coefficient polynomials over the ring Z_q[x]/(x^256 + 1) where q = 3329.
When I describe CQ1's architecture to security architects who are evaluating hardware versus software PQC, the NTT is where the conversation gets specific. Software NTT implementations run in roughly 5,000–8,000 clock cycles on an optimized x86-64 core. That sounds fast — and for a single operation, it is. The problem is the interaction between NTT's memory access pattern, modern CPU cache architecture, and the concurrent workloads that live on the same physical cores in a payment processing server. This post explains exactly what happens in hardware and why the results are categorically different.
NTT over Z_3329: the mathematical structure
The NTT over a finite field Z_q is structurally identical to the Discrete Fourier Transform — it transforms a polynomial from the coefficient domain to the evaluation domain using "twiddle factors" drawn from the multiplicative group of Z_q. For Kyber's parameter q = 3329, the primitive 512th root of unity is ζ = 17 (since 17 generates a cyclic subgroup of order 512 in the multiplicative group of Z_3329).
The standard Cooley-Tukey butterfly structure applies: the 256-coefficient input polynomial is divided recursively into even- and odd-indexed coefficients, and at each level a butterfly computes:
a' = a + ζ^k · b (mod q)
b' = a − ζ^k · b (mod q)
For a 256-point NTT, this requires log2(256) = 8 levels, each with 128 butterfly operations, for a total of 1,024 butterfly operations per NTT call. Each butterfly involves one multiplication in Z_3329 and two additions. With Montgomery reduction for the modular multiplication, the arithmetic is efficient — but the twiddle factor ζ^k for each butterfly must be read from a precomputed table of 256 values.
In a Kyber-1024 encapsulation, the NTT is called on each of the 4×4 matrix entries (from the public key), plus the 4 secret and error polynomial vectors — roughly 32 NTT calls total per encapsulation. Each call reads 256 coefficients (512 bytes at 2-byte int16 representation) plus 256 twiddle factors (512 bytes). The total working set per encapsulation is approximately 32 KB of NTT-related data.
Why CPUs struggle: the cache coherency problem
32 KB of working set fits comfortably in L1 cache on most modern server CPUs (typically 32–64 KB per core). When a single thread is doing nothing but Kyber encapsulation in a tight loop, the NTT working set stays hot in L1 and performance is excellent — this is the benchmark condition under which published "fast software PQC" numbers are measured.
Payment authorization servers do not run in this benchmark condition. A payment server core is running: the application thread, TLS session ticket lookups, I/O interrupt handlers, timer callbacks, and frequently JVM garbage collection threads (if the authorization logic is Java). Each of these workloads evicts cache lines. The 32 KB Kyber NTT working set — which is access-pattern-irregular due to the butterfly stride pattern — is particularly vulnerable to eviction by competing workloads.
The consequence in practice: measured Kyber software performance at 1,000 concurrent connections per second on a production payment server is 3–5× worse than single-threaded microbenchmark performance, because the NTT coefficient array and twiddle factor table are being evicted from L1 cache between calls. We measured this directly during CQ1 development, profiling a liboqs Kyber-1024 implementation on server hardware representative of mid-size payment processing deployments. At 1,000 req/sec, L1 cache miss rate on the NTT twiddle table climbed to 23%. At 2,500 req/sec with background I/O load, it reached 61%.
CQ1's NTT pipeline: the architecture
CQ1's NTT implementation lives entirely within FPGA fabric. There is no shared cache, no interrupt contention, no competing workloads. The architectural choices that flow from this:
On-fabric polynomial memory: The 256-coefficient polynomial and the 256-entry twiddle factor table are stored in dedicated BRAM blocks on the FPGA fabric, not in external DRAM. BRAM access latency on the target FPGA family is 2 clock cycles, compared to L1 cache hit latency of 4 cycles on current Intel server CPUs — but more importantly, BRAM latency is deterministic. There is no cache-miss pathway. Every NTT coefficient and twiddle factor access takes exactly 2 cycles, regardless of what else is happening on the fabric or the host system.
Parallel butterfly stages: The 8-level butterfly computation is pipelined across all 8 stages simultaneously. In a software NTT, levels must execute sequentially because level n+1 reads outputs from level n. In the CQ1 fabric, each of the 8 pipeline stages has its own register file and operates on its own subset of the in-flight polynomial. New butterfly input can enter stage 1 while stages 2–8 are processing earlier butterflies, achieving full pipeline occupancy within 8 clock cycles of the first butterfly dispatch. At 200 MHz fabric clock, the 1,024-butterfly NTT computation completes in approximately 512 clock cycles (pipelined) rather than the 1,024 cycles a non-pipelined implementation would require — and critically, there is no stall for twiddle factor lookup because the pipeline pre-fetches twiddle values one stage ahead.
Montgomery modular multiplication in dedicated fabric: The modular multiplication a · b mod q in Z_3329 is implemented as a dedicated multiply-accumulate unit in the FPGA fabric. The Montgomery form reduction for q = 3329 uses a correction factor q' = 3327 (the Montgomery constant for this specific prime). The multiply-reduce takes 3 fabric clock cycles with full throughput at one result per cycle. This is lower latency than a software Montgomery multiplication, which requires 4–6 instructions on x86-64 and is subject to the same cache pressure issues as the rest of the NTT.
Keccak-f[1600] co-processor: The Kyber key generation and encapsulation algorithms make heavy use of SHA-3 (specifically SHAKE-128 and SHAKE-256 for the A matrix expansion and hash operations). CQ1 includes a dedicated Keccak-f[1600] permutation block in hardware, processing the 1,600-bit state in a fixed 24-round pipeline. This eliminates the CPU cost of SHA-3 XOF operations that appear in the Kyber key generation path — in software, SHA-3/SHAKE is notoriously slow on hardware without AES-NI equivalent acceleration.
Side-channel resistance in hardware NTT
Side-channel resistance is where the architectural difference between hardware and software NTT becomes most significant. A software NTT running on a shared CPU core is, from a side-channel perspective, a nightmare: variable execution time due to cache misses, shared micro-architectural state (branch predictor, prefetch units, TLB) with other threads, and power consumption that varies with the specific polynomial coefficients being processed (due to data-dependent memory access patterns in the butterfly stride).
CQ1's NTT pipeline runs at a fixed clock frequency with a fixed number of fabric cells active per butterfly operation. Power consumption is dominated by fabric switching activity — and because the pipeline always executes the same 8-stage butterfly sequence regardless of the input polynomial coefficients, the switching activity is nearly constant across all inputs. We measured power variation across a sample of 100,000 random Kyber encapsulation operations at less than 1.2% peak-to-peak. This is not sufficient to qualify as DPA-resistant by itself (full DPA resistance requires active countermeasures including power randomization and temporal masking, which are implemented in CQ1's physical security layer), but it establishes a narrow power side-channel that an attacker must work within.
For Differential Power Analysis specifically: a DPA attack on the NTT requires correlation between the hypothesis about a polynomial coefficient and the measured power trace. The constant-power NTT pipeline reduces the correlation coefficient the attacker can measure, increasing the number of traces required for a successful DPA recovery. Combined with the active power conditioning in CQ1's physical security layer (which adds controlled noise to the power supply at the fabric level), the DPA resistance target for CQ1 is to require more traces than an adversary can practically collect during the module's operational lifetime under normal deployment conditions.
Importantly, software NTT implementations on shared server hardware cannot offer any of this. CPU power management, frequency scaling, and shared workload power variation make DPA correlation straightforward on any server-resident software implementation. For payment infrastructure where HSM-resident key material is a regulatory requirement, this is not a theoretical concern — it is the reason PCI HSM v3 mandates hardware key protection rather than software key protection.
Integration: what the PKCS#11 interface looks like
From the application's perspective, the NTT pipeline is entirely hidden. The PKCS#11 interface exposes standard Cryptoki functions extended with Kyber-specific mechanisms. Encapsulation is invoked via a vendor mechanism CKM_KYBER_1024 using C_EncapsulateKey (a vendor-extended function), passing the recipient's Kyber-1024 public key as a CKA_VALUE attribute of a CKO_PUBLIC_KEY object and receiving back the ciphertext and shared secret. The application does not interact with the NTT, the twiddle factor table, the Montgomery multiplier, or the Keccak co-processor — it makes a PKCS#11 call and receives a 32-byte shared secret within the CQ1's latency budget.
The JCA/JCE provider for CQ1 (available in the evaluation unit SDK) wraps the PKCS#11 interface in a standard Java KeyAgreement implementation, allowing Java-based payment middleware to use Kyber encapsulation without modifying the application code beyond the algorithm name change from ECDH to KYBER_1024.
The hardware complexity — the on-fabric memory, the pipelined butterfly stages, the Montgomery multiplier, the Keccak co-processor — exists so that application code doesn't have to think about it. That is, ultimately, the point of a hardware security module: the cryptographic primitive is correct, fast, and physically protected, and the application calls an interface.