QubGPU — Neural Datagram Protocol
Route AI training data like network packets. Zero-copy. GPU-direct. 5.55× faster training throughput — tested, measured, proven.
1 The Problem
Your data passes through 6 software layers before reaching the GPU: JSON parsing → tokenization → padding → tensor creation → device transfer → batching. Physical network routers move packets at line speed using binary headers and CRC checksums. What if we applied the same router protocol design to AI data pipelines?
We built NDP and tested it. It works.
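To make the layer-count argument concrete, here is a toy comparison of the two parse paths. The record layouts below are illustrative only, not NDP's actual wire format: the same four token IDs once as JSON text and once as a fixed binary record with a 2-byte count prefix.

```python
import json
import struct

# The same sample as JSON text vs. a fixed binary record (illustrative layouts).
json_record = b'{"tokens": [101, 2009, 2003, 102]}'
binary_record = struct.pack(">H4i", 4, 101, 2009, 2003, 102)  # count + int32 BE IDs

# JSON path: parse, validate, build dict and list objects (allocations throughout).
tokens_json = json.loads(json_record)["tokens"]

# Binary path: two fixed-format unpacks, no intermediate objects beyond the tuples.
(count,) = struct.unpack_from(">H", binary_record, 0)
tokens_bin = list(struct.unpack_from(f">{count}i", binary_record, 2))

assert tokens_json == tokens_bin  # same data, very different parse cost
```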
2 Real Benchmark Results
RTX 3090 · 50K Examples · 200 Steps · LoRA Fine-Tune · Batch 2×8 Grad Accumulation
Data Loading · 400 Samples Measured
| Metric | Traditional JSONL | NDP Zero-Copy | Speedup |
|---|---|---|---|
| Samples/sec | 799 | 2,390 | 3.0× |
| Tokens/sec | 90,591 | 1,221,702 | 13.5× |
| Load Time | 0.50s | 0.17s | 2.9× |
Training · 200 Steps
| Metric | JSONL (tokenize on-the-fly) | NDP (pre-tokenized mmap) | NDP + CRC-Drop 5% |
|---|---|---|---|
| Throughput | 454 tok/s | 2,518 tok/s | 2,698 tok/s |
| Final Loss | 1.2439 | 0.6627 | 0.6123 |
| Training Time | 737.1s | 650.8s | 605.5s |
| Data I/O % | 0.6% | 0.3% | 0.4% |
| Model Quality | 0.967 | 1.000 | 1.000 |
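One way to read the CRC-Drop column above: if CRC-Drop behaves like independent per-sample packet loss (which matches the UDP analogy), each epoch trains on a slightly different ~95% subset of the data, a mild regularizer. This is a sketch of that assumption; the real dataloader drops at the packet level and its mechanics may differ.

```python
import random

def crc_drop(samples, drop_rate=0.05, seed=None):
    """Simulate UDP-style loss as augmentation: each sample is independently
    skipped with probability drop_rate, so every epoch sees a different subset.
    (Illustrative sketch, not QubGPU's dataloader.)"""
    rng = random.Random(seed)
    return [s for s in samples if rng.random() >= drop_rate]

survivors = crc_drop(list(range(10_000)), drop_rate=0.05, seed=0)
print(len(survivors))  # roughly 9,500 of 10,000 samples survive
```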
Chat Inference · 3 Coding Prompts
| Model | Quality Score | Avg Response Time | Speed |
|---|---|---|---|
| Base (no adapter) | 0.933 | 17.3s | 6.5 tok/s |
| JSONL-trained (200 steps) | 1.000 | 30.7s | 2.7 tok/s |
| NDP-trained (200 steps) | 1.000 | 17.7s | 3.0 tok/s |
| NDP + CRC-Drop (200 steps) | 0.950 | 16.0s | 3.2 tok/s |
3 Quick Start
1. Clone & Install
```bash
git clone https://github.com/qubitpage/qubgpu.git
cd qubgpu
pip install -r requirements.txt
```
2. Convert Your Data
```bash
python training/convert.py \
  data/my_dataset.jsonl \
  data/my_dataset.qubgpu \
  --tokenizer Qwen/Qwen2.5-Coder-1.5B-Instruct \
  --max-tokens 512
```
3. Train with NDP
```python
from torch.utils.data import DataLoader

from training.dataloader import QubGPUDataset

dataset = QubGPUDataset(
    "data/my_dataset.qubgpu",
    max_seq_length=512,
    pad_token_id=0,
    crc_drop_rate=0.05,  # Enable CRC-Drop augmentation
)
loader = DataLoader(dataset, batch_size=2, pin_memory=True)

for batch in loader:
    input_ids = batch["input_ids"].cuda()
    labels = batch["labels"].cuda()
    # ... your training loop
```
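Under the hood, "pre-tokenized mmap" loading can look like the sketch below: memory-map a file of fixed-length int32 rows and index samples directly, so the OS page cache serves the bytes and nothing is parsed or re-tokenized per step. The file name and row layout here are illustrative assumptions, not the actual .qubgpu container format.

```python
import os
import tempfile

import numpy as np

# Hypothetical pre-tokenized file: fixed-length rows of int32 token IDs.
SEQ_LEN, N_ROWS = 8, 4  # tiny for illustration; convert.py uses --max-tokens 512
path = os.path.join(tempfile.gettempdir(), "pretokenized_demo.bin")
np.arange(N_ROWS * SEQ_LEN, dtype=np.int32).reshape(N_ROWS, SEQ_LEN).tofile(path)

# Zero-copy read path: memory-map the file and index rows directly.
mm = np.memmap(path, dtype=np.int32, mode="r", shape=(N_ROWS, SEQ_LEN))
sample = mm[2]  # a view into the mapped file, no copy
# torch.from_numpy(np.asarray(sample)) would wrap a row for training.
```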
4. Run the Platform
```bash
python -m uvicorn api.server:app --host 0.0.0.0 --port 8080
# Open http://localhost:8080
```
5. Run the Real A/B Benchmark
```bash
python benchmarks/real_ab_training.py
# Trains 200 steps × 3 pipelines, measures everything, saves results
```
4 Protocol Format
```
┌──────┬──────┬──────┬──────┬────────┬──────────────┬────────┐
│ 0xAA │ SRC  │ DST  │ TYP  │  LEN   │   PAYLOAD    │ CRC-16 │
│  1B  │  1B  │  1B  │  1B  │ 2B BE  │   variable   │ 2B BE  │
└──────┴──────┴──────┴──────┴────────┴──────────────┴────────┘

SRC: Source type (TEXT_CORPUS=0x01, CODE_REPO=0x02, GRADIENT=0x05, ...)
DST: Target layer (EMBEDDING=0x01, ATTENTION=0x02, FFN=0x03, LORA=0x05, ...)
TYP: Encoding (TOKEN_IDS=0x01, FLOAT16=0x03, BFLOAT16=0x06, ...)
CRC: CRC-16/CCITT — corrupted packets silently dropped (like UDP)
```
Fixed overhead: 8 bytes per packet. Token storage: int32 big-endian (supports 150K+ vocabularies).
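As a sketch of how a packet in this layout can be framed and checked, the code below packs the 6-byte header, appends the payload, and trails a CRC-16/CCITT-FALSE over everything before it; a failed check returns None, mirroring the silent UDP-style drop. The function names are ours for illustration, not QubGPU's API.

```python
import struct

def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bitwise CRC-16/CCITT-FALSE (poly 0x1021, init 0xFFFF)."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1) & 0xFFFF
    return crc

MAGIC = 0xAA

def pack_packet(src: int, dst: int, typ: int, payload: bytes) -> bytes:
    # 6-byte header: magic, SRC, DST, TYP, LEN (2B BE), then payload + CRC.
    body = struct.pack(">BBBBH", MAGIC, src, dst, typ, len(payload)) + payload
    return body + struct.pack(">H", crc16_ccitt(body))

def unpack_packet(packet: bytes):
    """Returns (src, dst, typ, payload), or None if the CRC fails (silent drop)."""
    body, (crc,) = packet[:-2], struct.unpack(">H", packet[-2:])
    if crc16_ccitt(body) != crc:
        return None  # corrupted packet is dropped, as the protocol specifies
    magic, src, dst, typ, length = struct.unpack(">BBBBH", body[:6])
    return src, dst, typ, body[6:6 + length]

# TOKEN_IDS payload: int32 big-endian token IDs, per the format above.
payload = struct.pack(">3i", 101, 2009, 102)
pkt = pack_packet(src=0x01, dst=0x05, typ=0x01, payload=payload)
assert unpack_packet(pkt) == (0x01, 0x05, 0x01, payload)

# Flip one payload byte: the CRC catches it and the packet is dropped.
corrupted = pkt[:6] + bytes([pkt[6] ^ 0xFF]) + pkt[7:]
assert unpack_packet(corrupted) is None
```

Total framing cost is 8 bytes per packet (6-byte header plus the 2-byte CRC), matching the fixed overhead stated above.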
5 Architecture
Run python tests/test_protocol.py to verify all protocol operations. Full A/B benchmark: python benchmarks/real_ab_training.py. Chat quality comparison: python tests/test_chat_models.py.

6 Deploy on Any NVIDIA GPU Server
```bash
# Clone and install
git clone https://github.com/qubitpage/qubgpu.git && cd qubgpu
pip install -r requirements.txt
mkdir -p models datasets logs benchmarks

# Download training datasets
python training/download_datasets.py

# Convert to NDP format
python training/convert.py datasets/large_coding_dataset.jsonl \
  datasets/large_coding_dataset.qubgpu

# Start the platform
python -m uvicorn api.server:app --host 0.0.0.0 --port 8080
```
Free. Open Source. MIT Licensed.
QubGPU is a gift to the AI research community from QubitPage. Download, use, modify, and distribute — no fees, no restrictions, no vendor lock-in.
By QubitPage Research · qubitpage.com · MIT License