Benchmarks

Benchmarks, with receipts

Credible technical evidence, not marketing numbers. Throughput is relative to Ray Data on identical hardware. Every row lists its dataset, hardware, command and environment — and links to a reproducible script.

Summary

Benchmark summary

Workload | Dataset | Hardware | Vane | Ray Data | Daft | Notes
vLLM batch inference | 66K rows | 2× A100 80GB | 3.1× | 1.0× | 1.6× | stable
Text embedding | 480M chunks | 8× A10G | 2.4× | 1.0× | 1.3× | stable
Image decode + CLIP | 12M images | 4× A10G | 1.9× | 1.0× | 1.7× | experimental
Audio transcription | 120K clips | 4× A100 | 2.2× | 1.0× | | experimental

Higher is better. Vane / Ray Data / Daft columns are relative throughput; baseline = Ray Data 1.0×.
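
To make the relative-throughput columns concrete, here is a minimal sketch of the normalization. It assumes every engine processes the same dataset, so throughput is proportional to 1 / wall-clock, and it assumes the "was 127 min" figure quoted for the vLLM run is the Ray Data baseline. The function name `relative_throughput` is hypothetical and is not part of the benchmark scripts.

```python
def relative_throughput(wall_clock_min, baseline="Ray Data"):
    """Return each engine's throughput relative to the baseline engine.

    Assumes all engines processed the same dataset, so the ratio is
    baseline_time / engine_time.
    """
    base = wall_clock_min[baseline]
    return {engine: base / minutes for engine, minutes in wall_clock_min.items()}


# Wall-clock minutes for the vLLM batch inference run.
# Assumption: 127 min is the Ray Data baseline.
ratios = relative_throughput({"Vane": 41.0, "Ray Data": 127.0})
for engine, ratio in ratios.items():
    print(f"{engine}: {ratio:.1f}x")
```

Under these assumptions the ratio for Vane rounds to the 3.1× shown in the table, and the baseline is 1.0× by construction.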

Detail

vLLM batch inference

Prefix bucketing groups similar-length prompts to cut padding waste, raising effective batch utilization on the same GPUs.

3.1× throughput vs Ray Data
41 min wall-clock (was 127 min)
92% mean GPU utilization

Throughput (higher is better): Vane 3.1×, Daft 1.6×, Ray Data 1.0×

Dataset: 66K prompt rows · s3://bench/prompts-66k.parquet
Hardware: AWS p4d · 2× A100 80GB · NVLink
Command: python bench_vllm.py --gpus 2 --bucketing prefix
Env: ray==2.40 · vllm==0.6.3 · CUDA 12.4
Runtime: 41 min (median of 3, warm cache excluded)
Throughput: 3.1× vs Ray Data baseline
Notes: Gains shrink to ~2.4× without prefix bucketing.

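
The idea of grouping similar-length prompts to cut padding can be illustrated with a small sketch. This is not Vane's implementation, which the page does not describe. It only shows the general technique of sorting prompts by length before batching and measuring how much of each padded batch is wasted. The function names and the token counts are hypothetical.

```python
from typing import Iterator, List, Sequence


def bucket_by_length(lengths: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield batches of prompt indices, grouping prompts of similar length."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


def padding_waste(lengths: Sequence[int], batches: List[List[int]]) -> float:
    """Fraction of token slots that hold padding when each batch is padded
    to the length of its longest prompt."""
    padded = sum(max(lengths[i] for i in batch) * len(batch) for batch in batches)
    return 1.0 - sum(lengths) / padded


# Hypothetical token counts: short and long prompts interleaved.
lengths = [12, 480, 15, 510, 9, 495, 14, 470]

# Batching in arrival order mixes short and long prompts.
naive = [list(range(i, i + 4)) for i in range(0, len(lengths), 4)]
# Bucketing by length keeps short prompts together.
bucketed = list(bucket_by_length(lengths, batch_size=4))

print(f"arrival order: {padding_waste(lengths, naive):.0%} padding")
print(f"bucketed:      {padding_waste(lengths, bucketed):.0%} padding")
```

With mixed lengths, batching in arrival order pads every short prompt up to the longest one in its batch, while bucketing confines long prompts to their own batches.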
Multimodal

Multimodal pipeline benchmarks

Image, audio, document and video workloads. Image-decode + CLIP and audio transcription are measured below; document and video are in progress.

Status: Image · CLIP (stable) | Audio · transcription (stable) | Document · extraction (experimental) | Video · frames (in progress)

Workload: Image decode + CLIP features
Dataset: 12M images · LAION subset
Hardware: 4× A10G
Command: python bench_image.py --gpus 4
Throughput: 1.9× vs Ray Data
Notes: Experimental; decode path still CPU-bound.

Workload: Audio transcription
Dataset: 120K call recordings
Hardware: 4× A100
Command: python bench_audio.py --gpus 4
Throughput: 2.2× vs Ray Data
Notes: Whisper-large-v3; batch tuned to 8.

Methodology

A benchmark you can't reproduce is a marketing number. Everything here is pinned and scripted.

Datasets: Common Crawl segments, RedPajama, LAION subset, internal call set. Manifests pinned by SHA.
Hardware: AWS p4d / g5 instances. CUDA 12.4, driver 550, NVLink where noted.
Environment: ray==2.40, vllm==0.6.3, pyarrow==14, pinned in benchmarks/requirements.lock.
Measurement: Median of 3 runs, warm cache excluded. Wall-clock from first read to last write.
Baselines: Ray Data and Daft on identical hardware, same dataset, same output target.

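
The measurement rule above (median of 3 runs, wall-clock from first read to last write) could be implemented along these lines. This is a sketch under stated assumptions, not the harness in the benchmark scripts: `median_wall_clock` and `fake_pipeline` are hypothetical names, and the sketch does not model the cache handling mentioned in the methodology.

```python
import statistics
import time
from typing import Callable, List


def median_wall_clock(run: Callable[[], None], runs: int = 3) -> float:
    """Run the pipeline `runs` times and return the median wall-clock in seconds."""
    durations: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()  # taken just before the first read
        run()                        # full pipeline: read, process, write
        durations.append(time.perf_counter() - start)  # just after the last write
    return statistics.median(durations)


def fake_pipeline() -> None:
    """Hypothetical stand-in for one benchmark run."""
    time.sleep(0.01)


print(f"median wall-clock: {median_wall_clock(fake_pipeline):.3f} s")
```

Reporting the median rather than the mean keeps a single slow or fast run from skewing the figure when only three runs are taken.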
Reproduce

reproduce.sh
# clone, pin the environment, run
git clone https://github.com/AstroVela/vane
cd vane/benchmarks
pip install -r requirements.lock

# vLLM batch inference benchmark
python bench_vllm.py \
    --dataset s3://bench/prompts-66k.parquet \
    --gpus 2 --bucketing prefix --runs 3
