Vane

The multimodal-native data engine for AI workloads

Build SQL, Python UDF, preprocessing, embedding, and model inference pipelines on Ray — with DuckDB-compatible APIs.

$ pip install vane-ai
pre-release · Apache-2.0

multimodal.py
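import vane
from vane.ai import describe, embed

# Execute the pipeline on Ray.
vane.configure(runner="ray")
# Read every file that matches the glob.
media = vane.read("media/*")

# Describe the video, audio, and text columns and store the structured
# result under "understanding" with the four listed fields.
media = describe(
    media,
    columns=["video", "audio", "text"],
    output="understanding",
    schema=["summary", "objects", "topics", "actions"],
)

# Embed the generated summary, then write the enriched dataset.
media = embed(media, "understanding.summary")
media.write("ai_ready_media")

Platform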

Data, agents, and RL on one always-on core.

Vane unifies multimodal data processing, long-running agents, and reinforcement learning on a single execution core that runs on a laptop or a Ray cluster.

Multimodal Inputs
Sensors
Tables
Documents
Images
Video
Audio
Events
Embeddings

VANE

Vane Data

Available now: Multimodal data processing
Ingest · Parse · Transform · Infer · Enrich · Package

Vane Agent

Coming soon: Always-on agent framework
Observe · Reason · Act · Memory · Long-running Tasks

Vane RL

Coming soon: RL for embodied AI
Rollout · Trajectory · Reward · Training · Evaluation

Outputs / Outcomes
Model-ready Multimodal Assets
Grounded Context Packages
Agent Actions & Recommendations
Trajectory & Learning Updates

Vane Core

Local Runtime + Ray Runtime

Unified Multimodal Data Type

Sensors, metadata, lineage, and model artifacts under one execution semantics.

Streaming + Backpressure + Dynamic Batching

Continuous flow for large objects with adaptive batching and pressure control.
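As a rough illustration of the idea, and not Vane's actual implementation, the sketch below uses only the Python standard library. A bounded queue makes the producer wait when the consumer falls behind (backpressure), and the consumer halves or doubles its batch size depending on how many bytes the last batch carried (dynamic batching). The names `produce`, `consume_in_batches`, `run`, and the `byte_budget` parameter are hypothetical.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def produce(items, q):
    """Push items into a bounded queue; put() blocks while the queue is full."""
    for item in items:
        q.put(item)  # backpressure: waits until the consumer frees a slot
    q.put(SENTINEL)


def consume_in_batches(q, process, start_size=2, max_size=16, byte_budget=1_000):
    """Drain the queue in batches whose size adapts to the payload size."""
    batch_size = start_size
    results, batch_sizes = [], []
    done = False
    while not done:
        batch = []
        while len(batch) < batch_size:
            item = q.get()
            if item is SENTINEL:
                done = True
                break
            batch.append(item)
        if not batch:
            break
        results.extend(process(batch))
        batch_sizes.append(len(batch))
        # Adapt: shrink after an over-budget batch, otherwise grow.
        if sum(len(x) for x in batch) > byte_budget:
            batch_size = max(1, batch_size // 2)
        else:
            batch_size = min(max_size, batch_size * 2)
    return results, batch_sizes


def run(items, queue_slots=4):
    """Wire one producer thread to one batching consumer."""
    q = queue.Queue(maxsize=queue_slots)
    worker = threading.Thread(target=produce, args=(items, q))
    worker.start()
    results, batch_sizes = consume_in_batches(q, lambda b: [len(x) for x in b])
    worker.join()
    return results, batch_sizes
```

Calling `run()` with a mix of small and large byte strings shows the batch size growing while payloads are small and shrinking once large objects arrive. A production engine would more likely adapt to measured latency or memory pressure than to a fixed byte budget.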

Overlapped Heterogeneous Execution

CPU, GPU, IO, and model inference overlap through asynchronous scheduling.
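One way to picture this overlap is a chain of stages connected by small queues, so that one item can be loading while an earlier item is in inference. The sketch below is an assumption-laden illustration using `asyncio`, not Vane's scheduler: `load`, `preprocess`, and `infer` are hypothetical stand-ins that only sleep, and real CPU-bound or GPU work would have to run in threads, processes, or Ray tasks instead of on the event loop.

```python
import asyncio


async def stage(fn, inbox, outbox):
    """Pull items from inbox, apply an async function, push to outbox."""
    while True:
        item = await inbox.get()
        if item is None:  # end-of-stream marker, forwarded downstream
            await outbox.put(None)
            return
        await outbox.put(await fn(item))


async def load(path):  # hypothetical stand-in for IO
    await asyncio.sleep(0.01)
    return f"bytes:{path}"


async def preprocess(blob):  # hypothetical stand-in for CPU work
    await asyncio.sleep(0.01)
    return blob.upper()


async def infer(text):  # hypothetical stand-in for model inference
    await asyncio.sleep(0.01)
    return len(text)


async def run_pipeline(paths, depth=2):
    """Run the three stages concurrently, linked by bounded queues."""
    q0, q1, q2, q3 = (asyncio.Queue(maxsize=depth) for _ in range(4))
    wiring = ((load, q0, q1), (preprocess, q1, q2), (infer, q2, q3))
    stages = [asyncio.create_task(stage(f, i, o)) for f, i, o in wiring]

    async def feed():
        for path in paths:
            await q0.put(path)
        await q0.put(None)

    feeder = asyncio.create_task(feed())
    results = []
    while (item := await q3.get()) is not None:
        results.append(item)
    await asyncio.gather(feeder, *stages)
    return results
```

Because each stage awaits independently, the total wall time approaches the time of the slowest stage multiplied by the number of items, instead of the sum of all stage times. The bounded queues also keep a fast stage from running arbitrarily far ahead of a slow one.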

Edge-Cloud Coordination

The same pipeline runs across local devices and Ray clusters.

Why Vane Data

Why multimodal AI workloads need a data engine.

Text, image, audio, and video pipelines usually scatter SQL, preprocessing, inference, and output across separate systems. Vane unifies them on Ray behind DuckDB-compatible APIs.

The old way

  • SQL in one system
  • Preprocessing in Python scripts
  • Inference in separate Ray jobs
  • Embeddings written through glue code
  • Images, audio & video handled separately

With Vane Data

  • DuckDB-compatible SQL
  • Python map_batches UDFs
  • Ray task & actor execution
  • AI functions for embedding & prompting
  • Parquet / S3 output in one pipeline

How it works

From data lake to results, in one graph.

Sources
Parquet
S3 / object store
Data lake
vane-data on Ray
query: DuckDB-compatible SQL / Relation
transform: map_batches · flat_map · UDFs
execute: Ray tasks · actors
Results
Arrow
Parquet
S3
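The transform stage in the flow above names map_batches and flat_map. For readers unfamiliar with these operators, here is a minimal plain-Python sketch of their usual semantics. It does not use or reproduce Vane's API: the helper functions and the example UDFs `add_length` and `split_sentences` are hypothetical, and rows are modeled as simple dicts.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield consecutive lists of at most batch_size rows."""
    it = iter(rows)
    while batch := list(islice(it, batch_size)):
        yield batch


def map_batches(rows, fn, batch_size=2):
    """Apply fn to each batch of rows; fn returns a list of output rows."""
    for batch in iter_batches(rows, batch_size):
        yield from fn(batch)


def flat_map(rows, fn):
    """Apply fn to each row; fn returns zero or more output rows."""
    for row in rows:
        yield from fn(row)


def add_length(batch):
    """Batch UDF: add a character count to every row in the batch."""
    return [{**row, "n_chars": len(row["text"])} for row in batch]


def split_sentences(row):
    """Row UDF: emit one output row per non-empty sentence."""
    parts = (s.strip() for s in row["text"].split("."))
    return [{"id": row["id"], "sentence": s} for s in parts if s]
```

The batch form exists so that per-call overhead, such as loading a model or moving data to a GPU, is paid once per batch, while flat_map covers cases where one input row expands into several outputs or none.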
01 DuckDB-compatible API
02 Ray distributed execution
03 Python-native AI UDFs
04 Multimodal preprocessing

Benchmarks

Built for real batch inference workloads.

One credible, fully reproducible number: vLLM batch inference over 66K rows on 2× A100 GPUs, measured against Ray Data and Daft.

vLLM batch inference · 66K rows · 2× A100
3.1× throughput vs Ray Data, with prefix bucketing on identical hardware.
Throughput: vLLM batch inference (higher is better)

Engine      Relative throughput
Vane        3.1×
Daft        1.6×
Ray Data    1.0×

66K rows · 2× A100 · prefix bucketing
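The page does not define prefix bucketing. One plausible reading, assumed here, is that prompts sharing a leading prefix are grouped so they reach the inference engine together, which can improve reuse of a prefix cache such as vLLM's automatic prefix caching. The sketch below illustrates that reading only. It is not the benchmark code, and `bucket_by_prefix`, `prefix_ordered_batches`, and the character-based `prefix_len` are hypothetical choices (a real system would more likely bucket on tokens).

```python
from collections import defaultdict


def bucket_by_prefix(prompts, prefix_len=32):
    """Map each leading prefix to the indices of the prompts that share it."""
    buckets = defaultdict(list)
    for idx, prompt in enumerate(prompts):
        buckets[prompt[:prefix_len]].append(idx)
    return buckets


def prefix_ordered_batches(prompts, batch_size=4, prefix_len=32):
    """Yield batches of row indices with same-prefix prompts kept adjacent."""
    buckets = bucket_by_prefix(prompts, prefix_len)
    # Largest buckets first; sorted() is stable, so ties keep arrival order.
    keys = sorted(buckets, key=lambda k: -len(buckets[k]))
    order = [idx for key in keys for idx in buckets[key]]
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]
```

Yielding indices instead of prompt strings lets the caller write each result back to its original row after inference, so the reordering stays invisible downstream.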
AstroVela/vane

Build your first multimodal AI pipeline with Vane.