Admin · 20 Mar 2026 11:26 · 401 views · 📌 Pinned

QubGPU Benchmarks — Share Your Performance Results

QubGPU (Neural Datagram Protocol) is designed for extreme throughput on modern GPU clusters. This thread is the community benchmark board.

When sharing results, please include:

  • GPU model and VRAM
  • Driver version and CUDA/ROCm version
  • Batch size and precision (FP32 / BF16 / INT8)
  • Throughput (tokens/sec or TFLOPS)
  • Latency P50 / P95 / P99
  • Any custom kernel patches applied

Comparing implementations? Use the standardised qubgpu-bench CLI tool included in the SDK — it ensures reproducible results across environments.
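If you are collecting raw per-request timings yourself before posting, here is a minimal sketch of how the latency percentiles and throughput figures requested above can be derived. This is illustrative only — the helper names are made up for this post and are not part of the SDK; use qubgpu-bench for official numbers.

```python
import statistics

def latency_percentiles(samples_ms):
    """Derive P50 / P95 / P99 from raw per-request latencies in ms."""
    # quantiles(n=100) returns the 1st..99th percentile cut points
    q = statistics.quantiles(samples_ms, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

def tokens_per_sec(total_tokens, wall_seconds):
    """Aggregate throughput over the whole benchmark run."""
    return total_tokens / wall_seconds

if __name__ == "__main__":
    # Hypothetical timings: mostly ~12 ms with a couple of stragglers
    samples = [12.0, 13.5, 11.8, 40.2, 12.4, 12.9, 13.1, 75.0, 12.2, 12.7] * 10
    print(latency_percentiles(samples))
    print(tokens_per_sec(total_tokens=512 * 2048, wall_seconds=3.1))
```

Reporting percentiles from raw samples (rather than averages) is what makes runs comparable — a mean hides exactly the tail behaviour that P95/P99 expose.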

1 reply
Admin · 20 Mar 2026 11:26

Reference baseline from our lab: RTX 4090 24GB, CUDA 12.3, BF16, batch size 512 → 847 TFLOPS effective throughput with the default NDP kernel. Post your results below and let's push the limit! 🔥

Category: QubGPU
Rules
  1. This category is for QubGPU / Neural Datagram Protocol (NDP) discussions.
  2. Benchmark posts must include hardware specs, driver version, and methodology.
  3. Do not publish GPL/proprietary code without proper licence attribution.
  4. Discussions of GPU overclocking or unofficial kernel patches are at your own risk.
  5. QubitPage does not endorse unofficial modifications discussed in this forum.