Build SQL, Python UDF, preprocessing, embedding, and model inference pipelines on Ray — with DuckDB-compatible APIs.
import vane
from vane.ai import describe, embed

vane.configure(runner="ray")

media = vane.read("media/*")
media = describe(
    media,
    columns=["video", "audio", "text"],
    output="understanding",
    schema=["summary", "objects", "topics", "actions"],
)
media = embed(media, "understanding.summary")
media.write("ai_ready_media")
Vane unifies multimodal data processing, long-running agents, and reinforcement learning on a single execution core that runs on a laptop or a Ray cluster.
Sensors, metadata, lineage, and model artifacts share one set of execution semantics.
Continuous flow for large objects with adaptive batching and pressure control.
CPU, GPU, IO, and model inference overlap through asynchronous scheduling.
The same pipeline runs across local devices and Ray clusters.
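To make "adaptive batching with pressure control" concrete, here is a minimal sketch in plain Python. It is not Vane's implementation: the function name `adaptive_batches` and the `max_bytes` parameter are hypothetical, chosen only for illustration. Batches are sized by a byte budget rather than a fixed row count, so large objects produce smaller batches. Because the function is a generator, the consumer pulls batches at its own pace, which is a simple form of backpressure.

```python
from typing import Iterable, Iterator, List


def adaptive_batches(items: Iterable[bytes], max_bytes: int) -> Iterator[List[bytes]]:
    """Group items into batches whose total size stays within max_bytes.

    Hypothetical helper for illustration only. An item larger than
    max_bytes is emitted alone in its own batch rather than dropped.
    """
    batch: List[bytes] = []
    batch_bytes = 0
    for item in items:
        size = len(item)
        # Flush the current batch first if this item would exceed the budget.
        if batch and batch_bytes + size > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(item)
        batch_bytes += size
    # Emit whatever is left after the input is exhausted.
    if batch:
        yield batch


payloads = [b"a" * 10, b"b" * 10, b"c" * 50, b"d" * 5, b"e" * 200]
batch_sizes = [[len(x) for x in b] for b in adaptive_batches(payloads, max_bytes=64)]
print(batch_sizes)
```

With a 64-byte budget, the two small payloads travel together, while the 200-byte payload gets a batch of its own. A real engine would likely also adapt the budget to observed memory and throughput, which this sketch leaves out.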
Text, image, audio, and video pipelines usually scatter SQL, preprocessing, inference, and output across separate systems. Vane unifies them on Ray behind DuckDB-compatible APIs.
Clean Common Crawl pages, chunk text, and generate embeddings.
Embed large text datasets and match related records.
Normalize text, compute MinHash signatures, and remove near-duplicates.
Read, decode, analyze, and transform image data with batch UDFs.
Run prompt-to-image generation across batches and GPUs.
Run VLM evaluation with images, JSON responses, and judge passes.
Transcribe, summarize, subtitle, and embed audio segments.
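Of the use cases above, the near-duplicate removal recipe (normalize text, compute MinHash signatures, drop near-duplicates) is easy to sketch with the standard library alone. The code below is an illustration under stated assumptions, not Vane's API: the helper names (`shingles`, `minhash`, `estimated_jaccard`, `dedupe`), the 3-word shingle size, the 64 hash functions, and the 0.8 threshold are all hypothetical choices. It compares every document against every kept document, which is fine for a demo; at scale, signatures are usually bucketed with locality-sensitive hashing first.

```python
import hashlib
import re
from typing import List, Set


def shingles(text: str, k: int = 3) -> Set[str]:
    """Normalize text (lowercase, strip punctuation) and return k-word shingles."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(shingle_set: Set[str], num_perm: int = 64) -> List[int]:
    """One minimum per salted hash function; the salt stands in for a permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "big",
                )
                for s in shingle_set
            )
        )
    return signature


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedupe(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Keep the first document of each near-duplicate group (greedy, O(n^2))."""
    kept: List[str] = []
    kept_signatures: List[List[int]] = []
    for doc in docs:
        sig = minhash(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_signatures):
            kept.append(doc)
            kept_signatures.append(sig)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog!",
    "An entirely different sentence about data pipelines",
]
unique_docs = dedupe(docs)
print(unique_docs)
```

The first two documents differ only in case and punctuation, so they normalize to the same shingles and the second is dropped. The third shares no shingles with the first and is kept.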
One credible, fully reproducible number: vLLM batch inference over 66K rows on 2 GPUs, measured against Ray Data and Daft.