Benchmarks

Benchmarks, with receipts

Credible technical evidence, not marketing numbers. Throughput is relative to Ray Data on identical hardware. Every row lists its dataset, hardware, command and environment — and links to a reproducible script.

Summary

Benchmark summary

Workload	Dataset	Hardware	Vane	Ray Data	Daft	Notes
vLLM batch inference	66K rows	2× A100 80GB	3.1×	1.0×	1.6×	stable
Text embedding	480M chunks	8× A10G	2.4×	1.0×	1.3×	stable
Image decode + CLIP	12M images	4× A10G	1.9×	1.0×	1.7×	experimental
Audio transcription	120K clips	4× A100	2.2×	1.0×	—	experimental

Higher is better. Vane / Ray Data / Daft columns are relative throughput; baseline = Ray Data 1.0×.
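
As a rough illustration of how a relative-throughput figure can be derived, the sketch below divides baseline wall-clock by candidate wall-clock for the same job. It assumes the "was 127 min" figure quoted in the vLLM detail is the Ray Data run on the same dataset and hardware, which the page implies but does not state outright. The function name is hypothetical and not part of Vane.

```python
def relative_throughput(baseline_minutes: float, candidate_minutes: float) -> float:
    """Throughput relative to a baseline that processed the same amount of work.

    For a fixed dataset, throughput is inversely proportional to wall-clock,
    so the ratio is baseline time divided by candidate time.
    """
    if baseline_minutes <= 0 or candidate_minutes <= 0:
        raise ValueError("wall-clock times must be positive")
    return baseline_minutes / candidate_minutes


# Assumption: 127 min is the Ray Data baseline and 41 min is the Vane run.
ratio = relative_throughput(baseline_minutes=127, candidate_minutes=41)
print(f"{ratio:.1f}x")
```

With these two wall-clock numbers the ratio rounds to the 3.1× reported in the vLLM row.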

Detail

vLLM batch inference

Prefix bucketing groups similar-length prompts to cut padding waste, raising effective batch utilization on the same GPUs.

3.1× throughput vs Ray Data
41 min wall-clock (was 127 min)
92% mean GPU utilization

Chart: relative throughput (higher is better). Vane 3.1×, Daft 1.6×, Ray Data 1.0× (baseline).

Dataset	66K prompt rows · s3://bench/prompts-66k.parquet
Hardware	AWS p4d · 2× A100 80GB · NVLink
Command	python bench_vllm.py --gpus 2 --bucketing prefix
Env	ray==2.40 · vllm==0.6.3 · CUDA 12.4
Runtime	41 min (median of 3, warm cache excluded)
Throughput	3.1× vs Ray Data baseline
Notes	Gains shrink to ~2.4× without prefix bucketing.
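
The padding argument behind bucketing can be illustrated with a small, self-contained sketch. It is not Vane's implementation: it assumes each batch is padded to its longest prompt, approximates bucketing by sorting prompts by length before batching, and uses made-up token lengths. The helper names `padded_tokens` and `padding_waste` are hypothetical.

```python
from typing import Sequence


def padded_tokens(lengths: Sequence[int], batch_size: int) -> int:
    """Total tokens processed when every batch is padded to its longest prompt."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


def padding_waste(lengths: Sequence[int], batch_size: int) -> float:
    """Fraction of processed tokens that are padding rather than prompt."""
    padded = padded_tokens(lengths, batch_size)
    if padded == 0:
        return 0.0
    return 1.0 - sum(lengths) / padded


# Illustrative token lengths only: short and long prompts interleaved.
lengths = [12, 480, 15, 500, 10, 510, 14, 490]

arrival_order = padding_waste(lengths, batch_size=4)
bucketed = padding_waste(sorted(lengths), batch_size=4)  # similar lengths share a batch

print(f"arrival order: {arrival_order:.1%} padding")
print(f"bucketed:      {bucketed:.1%} padding")
```

In this toy input, batching in arrival order pads every short prompt up to the longest one in its batch, while the sorted order keeps short prompts together and removes most of that padding.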

Multimodal

Multimodal pipeline benchmarks

Image, audio, document and video workloads. Image decode + CLIP and audio transcription are measured below; document and video are in progress.

Image · CLIP: stable
Audio · transcription: stable
Document · extraction: experimental
Video · frames: in progress

Workload	Image decode + CLIP features
Dataset	12M images · LAION subset
Hardware	4× A10G
Command	python bench_image.py --gpus 4
Throughput	1.9× vs Ray Data
Notes	Experimental; decode path still CPU-bound.

Workload	Audio transcription
Dataset	120K call recordings
Hardware	4× A100
Command	python bench_audio.py --gpus 4
Throughput	2.2× vs Ray Data
Notes	Whisper-large-v3; batch tuned to 8.

Methodology

A benchmark you can't reproduce is a marketing number. Everything here is pinned and scripted.

Datasets	Common Crawl segments, RedPajama, LAION subset, internal call set. Manifests pinned by SHA.
Hardware	AWS p4d / g5 instances. CUDA 12.4, driver 550, NVLink where noted.
Environment	ray==2.40, vllm==0.6.3, pyarrow==14 — pinned in benchmarks/requirements.lock.
Measurement	Median of 3 runs, warm cache excluded. Wall-clock from first read to last write.
Baselines	Ray Data and Daft on identical hardware, same dataset, same output target.
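
The measurement rule above (median of 3 runs, wall-clock around the whole job) could be implemented along the following lines. This is a minimal sketch, not the harness in the benchmarks directory: `median_wall_clock` and its `clock` parameter are hypothetical, and the cache handling mentioned in the table is not modelled here.

```python
import statistics
import time
from typing import Callable, List


def median_wall_clock(
    job: Callable[[], None],
    runs: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Run `job` `runs` times and return the median wall-clock duration in seconds.

    `job` is expected to cover the full pipeline, from first read to last write.
    `clock` is injectable so the timing logic can be checked deterministically.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations: List[float] = []
    for _ in range(runs):
        start = clock()
        job()
        durations.append(clock() - start)
    return statistics.median(durations)


# Usage with a placeholder job; a real run would invoke the pipeline instead.
seconds = median_wall_clock(lambda: sum(range(100_000)), runs=3)
print(f"median wall-clock: {seconds:.4f} s")
```

Reporting the median rather than the mean keeps a single slow or fast outlier among the three runs from shifting the result.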

Reproduce

reproduce.sh

# clone, pin the environment, run
git clone https://github.com/AstroVela/vane
cd vane/benchmarks
pip install -r requirements.lock

# vLLM batch inference benchmark
python bench_vllm.py \
    --dataset s3://bench/prompts-66k.parquet \
    --gpus 2 --bucketing prefix --runs 3
