✓ MIT License — Free Forever Python 3.10+ NVIDIA GPU v2.1

QubGPU — Neural Datagram
Protocol

Route AI training data like network packets. Zero-copy. GPU-direct. 5.55× faster training throughput — tested, measured, proven.

🎁 Free & Open Source  — MIT licensed, no cost, no strings attached
5.55×
Training Throughput
13.5×
Faster Token Load
50%
Lower Final Loss
8B
8-Byte Packet Overhead
MIT
License — Free

1 The Problem

Training AI models wastes GPU cycles on data plumbing before a single gradient is computed

Your data passes through 6 software layers before reaching the GPU: JSON parsing → tokenization → padding → tensor creation → device transfer → batching. Physical network routers move packets at line speed using binary headers and CRC checksums. What if we applied the same router protocol design to AI data pipelines?

We built NDP and tested it. It works.

✗ Traditional Pipeline — 6 Layers
📄 Disk (JSON)
→ json.loads
→ tokenizer.encode
→ pad / truncate
→ torch.tensor
→ .cuda()
🎯 GPU
799 samples/sec · 90K tokens/sec
✓ NDP Zero-Copy — 3 Layers
💾 Disk (.qubgpu, mmap)
→ struct.unpack header
🎯 GPU tensor
2,390 samples/sec · 1.22M tokens/sec · 3.0× faster

2 Real Benchmark Results

Identical conditions: same model (Qwen2.5-Coder-1.5B), same data, same hyperparameters, same hardware. Only the data pipeline changes.

RTX 3090 · 50K Examples · 200 Steps · LoRA Fine-Tune · Batch 2×8 Grad Accumulation

Data Loading · 400 Samples Measured

MetricTraditional JSONLNDP Zero-CopySpeedup
Samples/sec7992,3903.0×
Tokens/sec90,5911,221,70213.5×
Load Time0.50s0.17s2.9×

Training · 200 Steps

MetricJSONL (tokenize on-the-fly)NDP (pre-tokenized mmap)NDP + CRC-Drop 5%
Throughput454 tok/s2,518 tok/s2,698 tok/s
Final Loss1.24390.66270.6123
Training Time737.1s650.8s605.5s
Data I/O %0.6%0.3%0.4%
Model Quality0.9671.0001.000

Chat Inference · 3 Coding Prompts

ModelQuality ScoreAvg Response TimeSpeed
Base (no adapter)0.93317.3s6.5 tok/s
JSONL-trained (200 steps)1.00030.7s2.7 tok/s
NDP-trained (200 steps)1.00017.7s3.0 tok/s
NDP + CRC-Drop (200 steps)0.95016.0s3.2 tok/s
5.55×
Training throughput increase — NDP: 2,698 tok/s vs 454 tok/s for JSONL
50%
Lower final loss — 0.6123 vs 1.2439 for JSONL in the same number of steps
18%
Faster wall-clock time — NDP finishes 200 steps in 605s vs 737s
1.000
Perfect model quality — NDP-trained model scores 1.000 on all 6 coding benchmarks
🔬
The CRC-Drop Discovery: Dropping 5% of packets with corrupted CRC checksums during training acts as natural data augmentation, similar to dropout regularisation — but operating on the input data stream rather than neural activations. This achieves the lowest loss (0.6123) and fastest throughput (2,698 tok/s) simultaneously.

3 Quick Start

From zero to 5× faster training in 5 steps

1. Clone & Install

bash
git clone https://github.com/qubitpage/qubgpu.git
cd qubgpu
pip install -r requirements.txt

2. Convert Your Data

bash
python training/convert.py \
  data/my_dataset.jsonl \
  data/my_dataset.qubgpu \
  --tokenizer Qwen/Qwen2.5-Coder-1.5B-Instruct \
  --max-tokens 512

3. Train with NDP

python
from training.dataloader import QubGPUDataset
from torch.utils.data import DataLoader

dataset = QubGPUDataset(
    "data/my_dataset.qubgpu",
    max_seq_length=512,
    pad_token_id=0,
    crc_drop_rate=0.05,  # Enable CRC-Drop augmentation
)
loader = DataLoader(dataset, batch_size=2, pin_memory=True)

for batch in loader:
    input_ids = batch["input_ids"].cuda()
    labels = batch["labels"].cuda()
    # ... your training loop

4. Run the Platform

bash
python -m uvicorn api.server:app --host 0.0.0.0 --port 8080
# Open http://localhost:8080

5. Run the Real A/B Benchmark

bash
python benchmarks/real_ab_training.py
# Trains 200 steps × 3 pipelines, measures everything, saves results

4 Protocol Format

Like Ethernet frames, but for neural data — fixed 8-byte overhead, CRC-16 integrity
┌──────┬──────┬──────┬──────┬────────┬──────────────┬────────┐
│ 0xAASRCDSTTYPLEN    │  PAYLOAD     │ CRC-16 │
│ 1B   │ 1B   │ 1B   │ 1B   │ 2B BE  │  variable    │ 2B BE  │
└──────┴──────┴──────┴──────┴────────┴──────────────┴────────┘

SRC:  Source type   (TEXT_CORPUS=0x01, CODE_REPO=0x02, GRADIENT=0x05, ...)
DST:  Target layer  (EMBEDDING=0x01, ATTENTION=0x02, FFN=0x03, LORA=0x05, ...)
TYP:  Encoding      (TOKEN_IDS=0x01, FLOAT16=0x03, BFLOAT16=0x06, ...)
CRC:  CRC-16/CCITT — corrupted packets silently dropped (like UDP)

Fixed overhead: 8 bytes per packet. Token storage: int32 big-endian (supports 150K+ vocabularies).

5 Architecture

FastAPI backend · PyTorch zero-copy dataloader · 9-page web dashboard · 22+ API endpoints
qubgpu/ ├── protocol/ # Core binary protocol │ ├── qubgpu.py # KnowledgePacket, QubGPUFile, CRC-16 │ └── ndp_v2.py # 5-layer protocol stack (NDP v2) ├── training/ # Training pipeline │ ├── convert.py # JSONL → .qubgpu converter │ ├── dataloader.py # Zero-copy mmap PyTorch Dataset │ └── download_datasets.py # HuggingFace dataset downloader ├── engine/ # Neural Router engine │ └── neural_router.py # Direct weight injection (Hopfield, ROME) ├── api/ # FastAPI backend (22+ endpoints) │ └── server.py ├── web/ # Web dashboard SPA │ ├── index.html # 9-page dashboard │ └── whitepaper.html ├── benchmarks/ # Real benchmark results │ ├── real_ab_training.py # A/B training comparison │ ├── real_benchmark_results.json # Training benchmark data │ └── chat_model_comparison.json # Chat inference comparison └── tests/ ├── test_protocol.py # 42 protocol tests └── test_chat_models.py # Chat model comparison
🧪
42 protocol tests — run python tests/test_protocol.py to verify all protocol operations. Full A/B benchmark: python benchmarks/real_ab_training.py. Chat quality comparison: python tests/test_chat_models.py

6 Deploy on Any NVIDIA GPU Server

One command setup — works on any machine with an NVIDIA GPU and Python 3.10+
bash
# Clone and install
git clone https://github.com/qubitpage/qubgpu.git && cd qubgpu
pip install -r requirements.txt
mkdir -p models datasets logs benchmarks

# Download training datasets
python training/download_datasets.py

# Convert to NDP format
python training/convert.py datasets/large_coding_dataset.jsonl \
  datasets/large_coding_dataset.qubgpu

# Start the platform
python -m uvicorn api.server:app --host 0.0.0.0 --port 8080

Free. Open Source. MIT Licensed.

QubGPU is a gift to the AI research community from QubitPage. Download, use, modify, and distribute — no fees, no restrictions, no vendor lock-in.

By QubitPage Research · qubitpage.com · MIT License