Admin · 20 Mar 2026 11:26 · 401 views · 📌 Pinned

QubGPU Benchmarks — Share Your Performance Results

QubGPU (Neural Datagram Protocol) is designed for extreme throughput on modern GPU clusters. This thread is the community benchmark board.

When sharing results, please include:

  • GPU model and VRAM
  • Driver version and CUDA/ROCm version
  • Batch size and precision (FP32 / BF16 / INT8)
  • Throughput (tokens/sec or TFLOPS)
  • Latency P50 / P95 / P99
  • Any custom kernel patches applied

Comparing implementations? Use the standardised qubgpu-bench CLI tool included in the SDK — it ensures reproducible results across environments.
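If you are collecting raw per-request timings yourself before posting, here is a minimal sketch of how the latency percentiles and throughput figures requested above can be derived. This is illustrative only — the helper names are made up for this post and are not part of the SDK; use qubgpu-bench for official numbers.

```python
import statistics

def latency_percentiles(samples_ms):
    """Derive P50 / P95 / P99 from raw per-request latencies in ms."""
    # quantiles(n=100) returns the 1st..99th percentile cut points
    q = statistics.quantiles(samples_ms, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

def tokens_per_sec(total_tokens, wall_seconds):
    """Aggregate throughput over the whole benchmark run."""
    return total_tokens / wall_seconds

if __name__ == "__main__":
    # Hypothetical timings: mostly ~12 ms with a couple of stragglers
    samples = [12.0, 13.5, 11.8, 40.2, 12.4, 12.9, 13.1, 75.0, 12.2, 12.7] * 10
    print(latency_percentiles(samples))
    print(tokens_per_sec(total_tokens=512 * 2048, wall_seconds=3.1))
```

Reporting percentiles from raw samples (rather than averages) is what makes runs comparable — a mean hides exactly the tail behaviour that P95/P99 expose.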

1 reply
Admin · 20 Mar 2026 11:26

Reference baseline from our lab: RTX 4090 24GB, CUDA 12.3, BF16, batch size 512 → 847 TFLOPS effective throughput with the default NDP kernel. Post your results below and let's push the limit! 🔥

Category: QubGPU
Rules
  1. This category is for QubGPU / Neural Datagram Protocol (NDP) discussions.
  2. Benchmark posts must include hardware specs, driver version, and methodology.
  3. Do not publish GPL/proprietary code without proper licence attribution.
  4. Discussions of GPU overclocking or unofficial kernel patches are at your own risk.
  5. QubitPage does not endorse unofficial modifications discussed in this forum.