What Is Vane Data?
Vane Data is a Python-first data processing engine for building AI-ready datasets. It combines DuckDB-compatible SQL, Python/Arrow UDFs, Ray-backed execution, and AI helper functions in one relation workflow.
Use Vane Data when your data is already structured or can be represented as rows, but the pipeline still needs Python libraries, model inference, embeddings, prompting, media processing, or distributed execution before the data is ready for training, search, analytics, or application systems.
The distribution package is vane-ai; the Python import package is vane.
The role Vane Data plays
Vane Data sits between storage and downstream systems.
It is designed to prepare datasets before they move into:
- Model training or fine-tuning jobs.
- Search and retrieval systems.
- Vector databases.
- Analytical warehouses or lakehouse tables.
- Model serving and application storage.
It is not an online model server, vector index, workflow scheduler, or transactional database. It focuses on the batch and analytical data processing step where SQL, Python, and AI model work need to stay in one pipeline.
The core idea
Vane Data keeps work table-oriented.
A pipeline starts with SQL, files, Arrow data, or another DuckDB-compatible input. Each stage returns a relation. You can keep adding relational operations, Python UDFs, or AI helpers, then materialize the result by showing, fetching, converting, or writing it.
That gives the pipeline one consistent shape:
- Use SQL to scan, filter, project, join, aggregate, and write tabular data.
- Use Python UDFs for logic that should not be forced into SQL.
- Use AI helpers for common embedding, classification, and prompting tasks.
- Use Ray-backed execution when selected stages need distributed workers or GPU placement.
A small example
The example below starts with SQL, runs a Python Arrow batch UDF, and returns another relation.
```python
import pyarrow as pa
import vane

con = vane.connect()

# Stage 1: SQL builds the input relation.
docs = con.sql("""
    select *
    from (values
        (1, 'Quarterly revenue report'),
        (2, 'Customer support policy'),
        (3, 'Model evaluation notes')
    ) as t(id, text)
    where text is not null
""")


def add_text_features(batch: pa.Table) -> pa.Table:
    # Pass id and text through so later stages can still use them.
    ids = batch.column("id").to_pylist()
    texts = batch.column("text").to_pylist()
    return pa.table({
        "id": ids,
        "text": texts,
        "text_length": [len(value) for value in texts],
        "mentions_policy": ["policy" in value.lower() for value in texts],
    })


# Stage 2: the Arrow batch UDF returns another relation.
features = docs.map_batches(
    add_text_features,
    schema={
        "id": "INTEGER",
        "text": "VARCHAR",
        "text_length": "BIGINT",
        "mentions_policy": "BOOLEAN",
    },
    batch_size=1024,
    execution_backend="subprocess_task",
)

features.show()
```
map_batches returns the columns produced by the UDF. If the next stage needs input columns such as id or text, return them from the UDF and include them in schema, as shown above.
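One way to catch a missing pass-through column early is to keep the column logic in a plain function and compare its output keys with the declared schema before wiring it into map_batches. The sketch below is only an illustration: text_feature_columns and check_schema_keys are hypothetical helper names, not part of Vane Data, and the code works on plain Python lists so it runs without Arrow or Vane installed.

```python
# Hypothetical helpers for testing UDF logic outside the engine.
# Nothing in this block is part of the Vane Data API.

SCHEMA = {
    "id": "INTEGER",
    "text": "VARCHAR",
    "text_length": "BIGINT",
    "mentions_policy": "BOOLEAN",
}


def text_feature_columns(ids, texts):
    """Return the output columns as a dict of equal-length lists."""
    return {
        "id": list(ids),
        "text": list(texts),
        "text_length": [len(value) for value in texts],
        "mentions_policy": ["policy" in value.lower() for value in texts],
    }


def check_schema_keys(columns, schema):
    """Raise ValueError if produced columns and declared schema disagree."""
    produced, declared = set(columns), set(schema)
    if produced != declared:
        missing = sorted(declared - produced)
        extra = sorted(produced - declared)
        raise ValueError(f"missing columns: {missing}; undeclared columns: {extra}")
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have unequal lengths: {sorted(lengths)}")


columns = text_feature_columns(
    [1, 2],
    ["Quarterly revenue report", "Customer support policy"],
)
check_schema_keys(columns, SCHEMA)  # passes silently when the two agree
print(columns["text_length"], columns["mentions_policy"])
```

Inside the Arrow UDF, the same function can be fed the to_pylist() values of each input column and its result passed to pa.table(...), as in the example above.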
Core capabilities
DuckDB-compatible SQL
SQL is a first-class interface through con.sql(...) and con.execute(...).
Use SQL for relational work:
- Read files and tables supported by DuckDB-compatible APIs and configured extensions.
- Filter and project rows before expensive Python or model stages.
- Join metadata, labels, and feature tables.
- Aggregate quality metrics and dataset summaries.
- Write curated tabular outputs.
Vane Data does not define a separate SQL dialect. Treat the SQL surface as DuckDB-compatible SQL with Vane Data execution features around it.
Python and Arrow UDFs
Use UDFs when a step needs Python libraries, custom code, model state, or row expansion.
| API | Shape | Typical use |
|---|---|---|
| map_batches(...) | Arrow batch to Arrow batch | Batch transforms, model inference, decoding, enrichment |
| flat_map(...) | Row to zero, one, or many rows | Chunking, normalization, expansion |
| map(...) | Input values to one scalar value | Small row-wise enrichments |
UDFs declare output schemas at the relation boundary. This keeps columns explicit even when Python code creates the data.
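As an illustration of the row-expansion shape in the table above, the sketch below splits one document row into zero or more chunk rows. chunk_text is a hypothetical function, and the exact calling convention that flat_map expects (argument order, tuples versus dicts) is an assumption that should be checked against the API reference.

```python
# Hypothetical row-expansion function. The signature flat_map expects
# is an assumption here and may differ in the real API.

def chunk_text(doc_id, text, max_chars=20):
    """Yield (doc_id, chunk_index, chunk) rows; empty text yields no rows."""
    current, index = "", 0
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chars:
            # The chunk is full: emit it and start a new one with this word.
            yield (doc_id, index, current)
            current, index = word, index + 1
        else:
            # A single word longer than max_chars becomes its own chunk.
            current = candidate
    if current:
        yield (doc_id, index, current)


rows = list(chunk_text(1, "Quarterly revenue report for the finance team"))
for row in rows:
    print(row)
```

Each yielded tuple would correspond to one output row, so the declared output schema would need three columns matching doc_id, chunk_index, and chunk.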
Use task-style backends for stateless functions. Use actor-style backends when a model, tokenizer, client, or other expensive object should be initialized once and reused across batches.
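The difference between the two styles can be sketched with a callable class that pays its setup cost once in __init__ and reuses that state for every batch. KeywordTagger is a hypothetical stand-in for a model or tokenizer wrapper; how such a class is handed to an actor-style backend is not shown here and would need to follow the Vane Data API reference.

```python
# Hypothetical stateful UDF class. Registering it with an actor-style
# backend is not shown; this only illustrates "initialize once, reuse".

class KeywordTagger:
    """Builds its keyword set once, then tags every batch with it."""

    load_count = 0  # counts how many times the expensive setup ran

    def __init__(self, keywords):
        # Stand-in for expensive setup such as loading a model or tokenizer.
        KeywordTagger.load_count += 1
        self.keywords = frozenset(word.lower() for word in keywords)

    def __call__(self, texts):
        # Per-batch work reuses the state built in __init__.
        return [
            sorted(self.keywords & set(text.lower().split()))
            for text in texts
        ]


tagger = KeywordTagger(["policy", "revenue"])
first = tagger(["Quarterly revenue report"])
second = tagger(["Customer support policy", "Model evaluation notes"])
print(first, second, KeywordTagger.load_count)
```

A stateless function would repeat the setup on every call, which is acceptable for cheap logic but wasteful when the setup loads model weights or opens a client connection.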
AI helper functions
Vane Data exposes relation methods for common AI data enrichment tasks:
- rel.embed_text(...) for text embeddings.
- rel.classify_text(...) for text classification.
- rel.prompt(...) for text and multimodal prompting.
These helpers use the same relation and UDF execution model as custom Python code. Use them when the built-in provider pattern fits the task. Use custom UDFs when you need full control over model loading, batching, retry behavior, output shape, or provider-specific logic.
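As one example of the control a custom UDF gives, the sketch below wraps a provider call in a retry loop with exponential backoff. call_with_retry and FlakyEmbedder are hypothetical names introduced for illustration; neither is part of Vane Data, and the fake embedder stands in for whatever client a real UDF would call.

```python
import time


def call_with_retry(fn, batch, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(batch), retrying failed calls with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn(batch)
        except Exception:
            # Real code should catch only the provider's transient errors.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


class FlakyEmbedder:
    """Stand-in provider that fails once, then returns fake vectors."""

    def __init__(self):
        self.calls = 0

    def __call__(self, texts):
        self.calls += 1
        if self.calls == 1:
            raise RuntimeError("transient provider error")
        return [[float(len(text))] for text in texts]


embedder = FlakyEmbedder()
# base_delay=0.0 keeps the demonstration instant.
vectors = call_with_retry(embedder, ["abc", "hello"], base_delay=0.0)
print(vectors, embedder.calls)
```

Retrying at the batch level like this keeps the output row count stable, because a batch either returns a full set of results or raises.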
Local and Ray-backed execution
The native runner is the local execution path. It is the right starting point for development, testing, lightweight SQL, and single-node UDF jobs.
Configure Ray when the workload needs distributed scans, distributed writes, multiple Python workers, or GPU-aware model execution:
```python
import vane

vane.configure(runner="ray")
```
The pipeline shape stays the same. The execution mode changes how selected work is scheduled and where Python UDFs can run.
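A common way to keep one script usable in both modes is to decide the runner from the environment. The sketch below assumes a hypothetical VANE_RUNNER variable, which is a convention invented for this example and not something Vane Data reads itself. It also assumes the native runner stays in effect when vane.configure is never called, which this page implies but does not state.

```python
import os


def ray_requested(environ=None):
    """Return True when the hypothetical VANE_RUNNER variable is 'ray'."""
    environ = os.environ if environ is None else environ
    return environ.get("VANE_RUNNER", "").strip().lower() == "ray"


def configure_runner(environ=None):
    """Switch to Ray only when requested; otherwise change nothing."""
    if ray_requested(environ):
        # Imported here so the helper can be unit-tested without Vane.
        import vane

        vane.configure(runner="ray")
        return "ray"
    return "local"


print(configure_runner({}))
```

Calling configure_runner() once at startup, before any connection is created, keeps the rest of the pipeline code identical across local and cluster runs.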
Common workflow patterns
AI-ready dataset preparation
Use SQL to select source rows and normalize metadata, then add embeddings, labels, prompts, scores, or model-derived features before writing a curated dataset.
This is the typical pattern for training data, retrieval corpora, offline evaluation sets, and analytics-ready enrichment tables.
Multimodal data processing
Represent documents, images, audio, video, or other media as table rows with paths, bytes, metadata, and derived columns. SQL keeps metadata work inspectable, while Python UDFs handle decoding, parsing, OCR, transcription, image processing, or model inference.
Vane Data does not require a universal multimodal file format. The table schema is the contract between stages.
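Treating the schema as a contract can be made concrete with a small check that compares what one stage produces with what the next stage needs. The sketch below is hypothetical: contract_problems and the column names are invented for illustration, and the type names are compared as plain strings rather than through any Vane Data type API.

```python
# Hypothetical contract check between two pipeline stages.

def contract_problems(produced_schema, required_schema):
    """Return a sorted list of missing columns and type mismatches."""
    problems = []
    for name, expected_type in required_schema.items():
        actual_type = produced_schema.get(name)
        if actual_type is None:
            problems.append(f"{name}: missing")
        elif actual_type.upper() != expected_type.upper():
            problems.append(
                f"{name}: expected {expected_type}, got {actual_type}"
            )
    return sorted(problems)


# Columns a decoding stage declares, and columns a captioning stage needs.
decode_stage_output = {"path": "VARCHAR", "width": "INTEGER", "height": "INTEGER"}
caption_stage_input = {"path": "VARCHAR", "image_bytes": "BLOB", "width": "BIGINT"}

problems = contract_problems(decode_stage_output, caption_stage_input)
for problem in problems:
    print(problem)
```

Running a check like this before an expensive model stage surfaces contract breaks without decoding any media.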
Local-to-distributed scaling
Start locally with the native runner and subprocess UDF backends. After the data contract and UDF behavior are correct, move expensive stages to Ray-backed execution.
This keeps early development simple while leaving a path to larger scans, distributed writes, and model workers across a Ray cluster.
When Vane Data is a good fit
Choose Vane Data when:
- SQL is useful for the relational parts of the pipeline.
- Python libraries or custom code are required for the expensive parts.
- The output should remain tabular and inspectable.
- Model calls, embeddings, or prompting are part of the data preparation path.
- A local workflow may later need distributed execution.
- You want one relation pipeline instead of separate SQL jobs, Python scripts, and ad hoc model enrichment steps.
Use another primary system when:
- The main requirement is online, low-latency inference.
- You need a persistent transactional database.
- You need a vector index or retrieval service as the serving layer.
- You need a workflow scheduler, lineage system, or cross-job orchestration platform.
- The required connector or storage feature is not available through DuckDB-compatible APIs, configured extensions, or your own UDFs.
Import and compatibility model
New Vane Data projects should use:
```python
import vane

con = vane.connect()
```
Vane Data exposes DuckDB-compatible Python APIs and relation behavior where compatibility applies. Existing DuckDB-style code can often continue to use familiar connection, SQL, relation, and type APIs, while Vane-specific configuration and AI helpers are available through vane.