STWO ML Prover

The STWO ML Prover is a GPU-accelerated proving library that generates zero-knowledge proofs for neural network inference using Circle STARKs over the Mersenne-31 field. It powers both the ZKML verification system and the VM31 privacy protocol.

Prove time: 3.04s/layer
FFT speedup: 50-112x
Proof size: 17 MB
Security: 96-bit

Architecture

The prover has two modes of operation:

🧠

ZKML Mode

Prove neural network forward passes. The GKR protocol walks the computation graph layer by layer and generates a sumcheck proof for each operation.

🔐

VM31 Privacy Mode

Generate STARK proofs for UTXO transactions. Deposit, withdraw, and spend are each expressed as a circuit with Poseidon2-M31 constraints.

Proof Pipeline

Model Compilation

Neural network (ONNX, HuggingFace SafeTensors) → ComputationGraph DAG. Each node is a typed operation: MatMul, Activation, LayerNorm, RoPE, Attention, etc.
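
As a rough illustration of the compilation target, the sketch below models a computation graph as a DAG of typed nodes. The `Node` and `ComputationGraph` classes here are hypothetical stand-ins written for this example, not the prover's actual types; only the idea of typed operations connected by data dependencies comes from the description above.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # Python 3.9+


@dataclass
class Node:
    name: str
    op: str                                      # e.g. "MatMul", "Activation"
    inputs: list = field(default_factory=list)   # names of upstream nodes


class ComputationGraph:
    """Illustrative DAG container; not the prover's actual type."""

    def __init__(self):
        self.nodes = {}

    def add(self, name, op, inputs=()):
        for dep in inputs:
            if dep not in self.nodes:
                raise KeyError(f"unknown input node: {dep}")
        self.nodes[name] = Node(name, op, list(inputs))
        return name

    def forward_order(self):
        # TopologicalSorter takes {node: predecessors} and yields inputs first.
        sorter = TopologicalSorter({n.name: n.inputs for n in self.nodes.values()})
        return list(sorter.static_order())

    def proving_order(self):
        # The proving walk goes from output back to input.
        return list(reversed(self.forward_order()))


g = ComputationGraph()
g.add("x", "Input")
g.add("fc1", "MatMul", ["x"])
g.add("act", "Activation", ["fc1"])
g.add("fc2", "MatMul", ["act"])
g.add("res", "Add", ["fc2", "x"])   # residual connection makes this a DAG

print(g.forward_order())
print(g.proving_order())
```

The forward order is what the execution step would follow, and the reversed order matches the output-to-input walk used during proving.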

Forward Pass + Trace

Execute the model on GPU, recording all intermediate values. Execution is dual-track: f32 values produce the model output, while M31 field values feed the prover.
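
The dual-track idea can be sketched with a simple fixed-point encoding. The scale factor, the sign convention, and the helper names below are assumptions made for this example; the page does not specify how the prover quantizes values into M31, and Python floats stand in for f32 here.

```python
P = 2**31 - 1        # Mersenne-31 modulus
SCALE = 1 << 8       # hypothetical fixed-point scale (8 fractional bits)


def to_m31(x: float) -> int:
    """Encode a real value as a fixed-point M31 element (negatives wrap mod P)."""
    return round(x * SCALE) % P


def from_m31(v: int) -> float:
    """Decode, treating the upper half of the field as negative values."""
    signed = v - P if v > P // 2 else v
    return signed / SCALE


def dot_dual(xs, ws):
    """Compute one dot product on both tracks: float for output, M31 for proving."""
    float_out = sum(x * w for x, w in zip(xs, ws))
    field_out = sum(to_m31(x) * to_m31(w) for x, w in zip(xs, ws)) % P
    return float_out, field_out


xs = [1.5, -2.0, 0.25]
ws = [2.0, 0.5, -4.0]
float_out, field_out = dot_dual(xs, ws)

# The field result carries SCALE twice (one factor per operand),
# so one extra division brings it back to the float track's units.
decoded = from_m31(field_out) / SCALE
```

With inputs that are exactly representable at this scale, the two tracks agree. In general the field track is a quantized approximation of the float track.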

GKR Proving

Walk the computation graph output→input. Generate sumcheck proofs per layer. MatMul gets 42-255x trace reduction via multilinear extension sumcheck.
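
To make the per-layer sumcheck step concrete, here is a minimal sketch of a sumcheck for an inner product of two multilinear evaluation tables over M31, with prover and verifier run in one loop. It is a textbook version written for illustration, not the prover's implementation: challenges come from a local random generator rather than a Fiat-Shamir transcript, and all function names are invented for this example.

```python
import random

P = 2**31 - 1
INV2 = pow(2, -1, P)


def fold(table, r):
    """Fix the lowest variable of a multilinear evaluation table to r.
    The table length must be a power of two."""
    return [(table[i] + r * (table[i + 1] - table[i])) % P
            for i in range(0, len(table), 2)]


def inner(a, b):
    return sum(x * y for x, y in zip(a, b)) % P


def eval_quadratic(g0, g1, g2, r):
    """Evaluate the degree-2 polynomial through (0,g0), (1,g1), (2,g2) at r."""
    l0 = (r - 1) * (r - 2) * INV2
    l1 = -r * (r - 2)
    l2 = r * (r - 1) * INV2
    return (g0 * l0 + g1 * l1 + g2 * l2) % P


def run_sumcheck(a, b, claim, rng):
    """Return (accepted, rounds_completed) for the claim sum_x a[x]*b[x] = claim."""
    a, b = list(a), list(b)
    rounds = 0
    while len(a) > 1:
        # Prover: send the round polynomial as its values at t = 0, 1, 2.
        g = [inner(fold(a, t), fold(b, t)) for t in (0, 1, 2)]
        # Verifier: the round polynomial must be consistent with the claim.
        if (g[0] + g[1]) % P != claim % P:
            return False, rounds
        r = rng.randrange(P)                       # verifier challenge
        claim = eval_quadratic(g[0], g[1], g[2], r)
        a, b = fold(a, r), fold(b, r)
        rounds += 1
    # Final check: one evaluation of each table at the challenge point.
    return inner(a, b) == claim % P, rounds


rng = random.Random(0)
a = [rng.randrange(P) for _ in range(8)]
b = [rng.randrange(P) for _ in range(8)]
honest = run_sumcheck(a, b, inner(a, b), rng)
cheating = run_sumcheck(a, b, (inner(a, b) + 1) % P, rng)
```

A table of 8 entries needs 3 rounds, one per variable. The round polynomial has degree 2 because each round multiplies two linear functions of the bound variable.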

STARK Aggregation

Aggregate the non-matmul components (activations, LayerNorm, RMSNorm) into a single STARK proof, then combine it with the GKR layer proofs.

On-Chain Submission

Serialize the proof to Cairo-compatible calldata and submit it to the SumcheckVerifier contract. The contract replays the Fiat-Shamir transcript to verify the entire computation on-chain.
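
The replay idea can be illustrated with a toy transcript: prover and verifier absorb the same messages in the same order and therefore derive the same challenges. This sketch uses BLAKE2s from Python's standard library as a stand-in hash, and the `Transcript` class and its byte layout are assumptions for this example, not the on-chain verifier's format.

```python
import hashlib

P = 2**31 - 1


class Transcript:
    """Toy Fiat-Shamir transcript (BLAKE2s used as a stand-in hash)."""

    def __init__(self, label: bytes):
        self.state = hashlib.blake2s(label).digest()

    def absorb(self, value: int):
        data = self.state + value.to_bytes(8, "little")
        self.state = hashlib.blake2s(data).digest()

    def challenge(self) -> int:
        self.state = hashlib.blake2s(self.state + b"challenge").digest()
        return int.from_bytes(self.state[:8], "little") % P


def derive_challenges(label, messages):
    """Absorb each prover message, then squeeze one challenge per message."""
    t = Transcript(label)
    out = []
    for m in messages:
        t.absorb(m)
        out.append(t.challenge())
    return out


prover_msgs = [11, 22, 33]
prover_side = derive_challenges(b"demo", prover_msgs)
verifier_side = derive_challenges(b"demo", prover_msgs)      # replay
tampered = derive_challenges(b"demo", [11, 23, 33])          # one message altered
```

Because every challenge depends on all earlier messages, changing any message changes every challenge after it, which is what lets a verifier detect a proof that was not produced against the replayed transcript.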

Supported Layer Types

Layer	Proof Method	Trace Reduction	GPU Accelerated
MatMul	Sumcheck (degree-2)	42-255x	Yes
Activation (ReLU, GELU, Sigmoid)	LogUp lookup table	1x (constant)	Yes
LayerNorm	Eq-sumcheck + LogUp rsqrt	1x	Yes
RMSNorm	Eq-sumcheck + LogUp rsqrt	1x	Yes
Attention (MHA/GQA/MQA)	Composed sub-matmuls	varies	Yes
RoPE	LogUp rotation table	1x	No
Embedding	LogUp sparse lookup	1x	No
Dequantize	LogUp 2D table	1x	No
Add/Mul	Linear split / Eq-sumcheck	0 rounds / log(n)	No
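
Several rows above rely on LogUp lookups. As a hedged illustration, the sketch below checks the standard LogUp identity, sum over witness values of 1/(α - v) equals sum over table entries of multiplicity/(α - t), for a small ReLU table over M31. The table range, the pair encoding, and the fixed `ALPHA` and `BETA` constants are assumptions for this example; a real protocol draws both challenges at random after the witness is committed.

```python
from collections import Counter

P = 2**31 - 1
BETA = 65537          # demo constant for combining (input, output) pairs
ALPHA = 123456789     # demo constant for the LogUp challenge


def encode(x, y):
    """Compress an (input, output) pair into one field element."""
    return (x + BETA * y) % P


def inv(v):
    return pow(v % P, -1, P)


# Lookup table: every valid (x, relu(x)) pair for x in [-8, 8).
RELU_TABLE = [encode(x, max(x, 0)) for x in range(-8, 8)]


def logup_holds(pairs):
    """Check sum 1/(ALPHA - v) over witnesses == sum mult/(ALPHA - t) over table."""
    values = [encode(x, y) for x, y in pairs]
    mult = Counter(values)                       # multiplicity of each value
    lhs = sum(inv(ALPHA - v) for v in values) % P
    rhs = sum(mult[t] * inv(ALPHA - t) for t in RELU_TABLE) % P
    return lhs == rhs


valid = logup_holds([(3, 3), (-2, 0), (3, 3), (0, 0)])
invalid = logup_holds([(3, 3), (-3, 3)])         # (-3, 3) is not a ReLU pair
```

A witness pair that is missing from the table contributes a term on the left that nothing on the right can cancel, so the check fails.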

Sumcheck MatMul Innovation

The key insight: a matrix product C[i,j] = Σ_t A[i,t] · B[t,j], with A of shape m×k and B of shape k×n, can be verified via sumcheck over the multilinear extensions (MLEs) of A and B. This reduces the STARK trace from O(m×k×n) rows in the naive encoding to O(m×n + m×k + k×n), a 42-255x reduction for production-sized matrices.
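
The claim that sumcheck proves here is the standard MLE identity: at a random point (r_i, r_j), the MLE of C equals the sum over the inner index x of A~(r_i, x) · B~(x, r_j). The sketch below checks that identity numerically over M31 and computes the row-count ratio from the formula above. The function names and the 4×8×2 shape are chosen for this example only.

```python
import random

P = 2**31 - 1


def eq_weights(point):
    """Lagrange weights eq(point, b) for every b in {0,1}^len(point)."""
    w = [1]
    for r in point:
        w = [x * (1 - r) % P for x in w] + [x * r % P for x in w]
    return w


def fold_rows(M, point):
    """Evaluate the MLE of M in its row index at `point`, per column."""
    w = eq_weights(point)
    return [sum(w[i] * M[i][c] for i in range(len(M))) % P
            for c in range(len(M[0]))]


def fold_cols(M, point):
    """Evaluate the MLE of M in its column index at `point`, per row."""
    w = eq_weights(point)
    return [sum(w[c] * row[c] for c in range(len(row))) % P for row in M]


def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) % P
             for j in range(len(B[0]))] for i in range(len(A))]


def matmul_claim_holds(A, B, C, r_i, r_j):
    a_vec = fold_rows(A, r_i)                        # A~(r_i, x)
    b_vec = fold_cols(B, r_j)                        # B~(x, r_j)
    c_val = fold_cols([fold_rows(C, r_i)], r_j)[0]   # C~(r_i, r_j)
    return c_val == sum(a * b for a, b in zip(a_vec, b_vec)) % P


def trace_reduction(m, k, n):
    """Ratio of naive rows (m*k*n) to reduced rows (m*n + m*k + k*n)."""
    return (m * k * n) / (m * n + m * k + k * n)


rng = random.Random(1)
m, k, n = 4, 8, 2                                    # powers of two
A = [[rng.randrange(P) for _ in range(k)] for _ in range(m)]
B = [[rng.randrange(P) for _ in range(n)] for _ in range(k)]
C = matmul(A, B)
r_i = [rng.randrange(2, P) for _ in range(2)]        # log2(m) coordinates
r_j = [rng.randrange(2, P) for _ in range(1)]        # log2(n) coordinates

ok = matmul_claim_holds(A, B, C, r_i, r_j)
C_bad = [row[:] for row in C]
C_bad[0][0] = (C_bad[0][0] + 1) % P
bad = matmul_claim_holds(A, B, C_bad, r_i, r_j)
```

For a 128×128×128 product, `trace_reduction` gives about 42.7, which sits at the low end of the range quoted above. The matrix shapes behind the quoted range are not stated on this page, so treat this only as a consistency check.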

GPU Backend

The prover uses STWO's GPU backend with CUDA kernels:

Operation	GPU Speedup	Threshold
Circle FFT	50-112x	≥ 1M elements
Sumcheck rounds	10-50x	≥ 16K elements
MLE restriction	5-20x	≥ 16K elements
Merkle hashing (Blake2s)	2-4x	≥ 64K leaves
FRI folding	5-15x	≥ 16K elements (log_size ≥ 14)
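
One way to read the threshold column is as a dispatch rule: below the threshold the work stays on the CPU, at or above it the GPU kernel is used. The sketch below encodes that reading. The operation keys, the `choose_backend` function, and the interpretation of 1M, 16K, and 64K as powers of two are assumptions for this example, not the library's configuration API.

```python
# Minimum input sizes at which the GPU path is assumed to pay off.
GPU_THRESHOLDS = {
    "circle_fft": 1 << 20,        # 1M elements
    "sumcheck_round": 1 << 14,    # 16K elements
    "mle_restriction": 1 << 14,   # 16K elements
    "merkle_blake2s": 1 << 16,    # 64K leaves
    "fri_fold": 1 << 14,          # log_size >= 14
}


def choose_backend(op: str, n_elements: int, gpu_available: bool = True) -> str:
    """Return "gpu" when the input is large enough, otherwise "cpu"."""
    if op not in GPU_THRESHOLDS:
        raise KeyError(f"unknown operation: {op}")
    if gpu_available and n_elements >= GPU_THRESHOLDS[op]:
        return "gpu"
    return "cpu"


small_fft = choose_backend("circle_fft", 1 << 16)
large_fft = choose_backend("circle_fft", 1 << 22)
no_gpu = choose_backend("circle_fft", 1 << 22, gpu_available=False)
```

Small inputs stay on the CPU because kernel launch and transfer overhead would dominate the actual work.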

Multi-GPU Support

Throughput mode: Independent proofs across GPUs
Distributed mode: Single proof split across GPUs
Device detection: Auto-detect compute capability, SM count, tensor cores
Memory management: 80% safety margin on device memory, chunked processing for polynomials of size 2^26 and above

Model Support

🤖

HuggingFace Loader

Direct loading from HuggingFace SafeTensors format. Supports f16, bf16, f32, INT4, INT8 weights.

📦

ONNX Compiler

Import models exported from PyTorch or TensorFlow to ONNX via tract-onnx. Activations and layer types are detected automatically.

💾

Tile Streaming

For models that exceed GPU memory, a double-buffered tile pipeline loads the next weight tile while the current one is being proved, so background I/O is hidden behind sumcheck computation.
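
A double-buffered pipeline can be sketched with one background worker that prefetches the next tile while the current tile is processed. `load_tile` and `prove_tile` below are hypothetical placeholders for disk I/O and sumcheck work; only the overlap pattern is the point of the example.

```python
from concurrent.futures import ThreadPoolExecutor


def load_tile(index):
    """Hypothetical loader: stands in for reading one weight tile from disk."""
    return [index * 10 + j for j in range(4)]


def prove_tile(tile):
    """Hypothetical prover step: stands in for sumcheck work on one tile."""
    return sum(tile)


def stream_tiles(n_tiles):
    """Process tiles in order while the next tile loads in the background."""
    results = []
    if n_tiles == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_tile, 0)
        for i in range(n_tiles):
            tile = pending.result()                     # wait for the current tile
            if i + 1 < n_tiles:
                pending = io.submit(load_tile, i + 1)   # prefetch the next one
            results.append(prove_tile(tile))            # overlaps with the prefetch
    return results


outputs = stream_tiles(3)
```

Only two tiles are alive at any time, the one being proved and the one being loaded, which is what keeps memory bounded for models larger than the device.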

🔄

SIMD Block Batching

Prove N identical transformer blocks in one GKR pass with only log(N) extra sumcheck rounds per layer.
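
The round-count saving is simple arithmetic: batching N blocks adds about log2(N) variables to each layer's sumcheck, instead of repeating the whole sumcheck N times. The sketch below assumes N is padded up to a power of two and uses a hypothetical base round count; neither detail is specified on this page.

```python
def extra_rounds(n_blocks: int) -> int:
    """ceil(log2(n_blocks)): extra sumcheck rounds from the block index."""
    if n_blocks < 1:
        raise ValueError("n_blocks must be positive")
    return (n_blocks - 1).bit_length()


def rounds_per_layer(base_rounds: int, n_blocks: int, batched: bool) -> int:
    """Total sumcheck rounds for one layer type across all blocks."""
    if batched:
        return base_rounds + extra_rounds(n_blocks)
    return base_rounds * n_blocks


BASE = 12   # hypothetical rounds for one layer in a single block
separate = rounds_per_layer(BASE, 40, batched=False)
batched = rounds_per_layer(BASE, 40, batched=True)
```

For 40 blocks the batched pass needs 6 extra rounds per layer, compared with 40 separate runs of the base sumcheck.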

Verified Models

Model	Parameters	Layers	Prove Time	Status
Single MatMul	—	1	Under 100ms	On-chain verified
MLP Network	~100K	3	~500ms	On-chain verified
LayerNorm Chain	~500K	3	~1s	On-chain verified
Residual Network (DAG)	~1M	4	~2s	On-chain verified
Qwen3-14B	14B	40	37.64s	On-chain verified (18-TX streaming)

VM31 Transaction Proving

For the VM31 UTXO privacy pool, the prover generates STARK proofs for three transaction types:

Transaction	Poseidon2 Perms	Trace Width	Description
Deposit	2	~1,372 cols	Public → UTXO commitment
Withdraw	32 (padded)	~20,932 cols	UTXO → public (Merkle + nullifier)
Spend	64 (padded)	~41,864 cols	2-in/2-out private transfer

Batch Proving

The VM31 relayer batches up to 1,000 transactions into a single GKR proof:

Proving time: ~2 seconds on GPU (H100)
Verification: ~300K gas on Starknet
Cost per transaction: ~0.0003 STRK
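
The per-transaction figures follow from amortizing one verification over the batch. The sketch below redoes that arithmetic with the approximate numbers quoted above, under the assumption that verification gas is roughly constant regardless of how full the batch is; actual fees depend on network conditions.

```python
from decimal import Decimal

BATCH_SIZE = 1000                       # maximum transactions per batch
VERIFY_GAS = 300_000                    # approximate gas per batch verification
FEE_PER_TX_STRK = Decimal("0.0003")     # approximate quoted cost per transaction


def amortized_gas(batch_size: int, verify_gas: int = VERIFY_GAS) -> float:
    """Verification gas attributed to each transaction in the batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    return verify_gas / batch_size


gas_full_batch = amortized_gas(BATCH_SIZE)      # full batch
gas_small_batch = amortized_gas(10)             # mostly empty batch
implied_batch_fee = FEE_PER_TX_STRK * BATCH_SIZE
```

A full batch works out to about 300 gas per transaction, and the quoted per-transaction cost implies roughly 0.3 STRK per full batch. A batch of 10 would carry 100 times more gas per transaction under the same assumption.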

Security Parameters

Parameter	Value
pow_bits	26
n_queries	70
log_blowup	1
Security level	96-bit
Field	M31 (p = 2³¹ - 1)
Extension field	QM31 (degree-4 extension of M31, ~2^124 elements)
Hash	Poseidon2-M31 (Fiat-Shamir)
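
The 96-bit figure is consistent with the usual conjectured-soundness estimate for FRI-based proofs, where each query contributes log_blowup bits and proof-of-work grinding adds pow_bits on top. The sketch below assumes that estimate applies here; the helper names are invented for this example.

```python
def security_bits(pow_bits: int, n_queries: int, log_blowup: int) -> int:
    """Conjectured security estimate: grinding bits plus bits from FRI queries."""
    return pow_bits + n_queries * log_blowup


def queries_needed(target_bits: int, pow_bits: int, log_blowup: int) -> int:
    """Smallest query count that reaches target_bits under the same estimate."""
    remaining = max(target_bits - pow_bits, 0)
    return -(-remaining // log_blowup)      # ceiling division


current = security_bits(pow_bits=26, n_queries=70, log_blowup=1)
with_blowup_2 = queries_needed(target_bits=96, pow_bits=26, log_blowup=2)
```

With the table's values this gives 26 + 70 × 1 = 96. Under the same estimate, doubling log_blowup would halve the number of queries needed, at the cost of a larger evaluation domain.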

Confidential Computing

For models requiring data privacy, the prover supports TEE-based confidential computing on NVIDIA GPUs:

H100/H200: Hardware attestation via SPDM protocol
CPU attestation: TDX/SEV-SNP via nvTrust SDK
Memory encryption: AES-GCM-256 matching GPU DMA
Key derivation: HKDF with SHA-256
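
For reference, HKDF with SHA-256 (RFC 5869) can be written with Python's standard library alone. The sketch below derives a 32-byte key, the size AES-GCM-256 requires. The input secret, salt, and `info` labels are placeholders for this example, and the encryption step itself is omitted because the standard library has no AES-GCM.

```python
import hashlib
import hmac

HASH_LEN = 32   # SHA-256 output size in bytes


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF extract-then-expand as described in RFC 5869."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested key material is too long")
    if not salt:
        salt = b"\x00" * HASH_LEN
    # Extract: concentrate the input keying material into a pseudorandom key.
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output is produced.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


shared_secret = b"placeholder shared secret from attestation"   # hypothetical input
salt = b"placeholder-salt"
dma_key = hkdf_sha256(shared_secret, salt, b"dma-encryption")    # hypothetical label
other_key = hkdf_sha256(shared_secret, salt, b"other-purpose")   # hypothetical label
```

Distinct `info` labels give independent keys from the same secret, so one attested session can use separate keys for separate purposes.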

Next Steps

GKR Protocol — interactive proof protocol details
GPU Acceleration — CUDA kernel specifics
On-Chain Verification — Cairo verifier contract
Transformer Proving — proving Llama/Qwen blocks