What Is Vane?

Vane Data is a Python-first data processing engine for building AI-ready datasets. It combines DuckDB-compatible SQL, Python/Arrow UDFs, Ray-backed execution, and AI helper functions in a single relation-based workflow.

Use Vane Data when your data is already structured or can be represented as rows, but the pipeline still needs Python libraries, model inference, embeddings, prompting, media processing, or distributed execution before the data is ready for training, search, analytics, or application systems.

The distribution package is vane-ai; the Python import package is vane.

The role Vane Data plays

Vane Data sits between storage and downstream systems.

It is designed to prepare datasets before they move into:

  • Model training or fine-tuning jobs.
  • Search and retrieval systems.
  • Vector databases.
  • Analytical warehouses or lakehouse tables.
  • Model serving and application storage.

It is not an online model server, vector index, workflow scheduler, or transactional database. It focuses on the batch and analytical data processing step where SQL, Python, and AI model work need to stay in one pipeline.

The core idea

Vane Data keeps work table-oriented.

A pipeline starts with SQL, files, Arrow data, or another DuckDB-compatible input. Each stage returns a relation. You can keep adding relational operations, Python UDFs, or AI helpers, then materialize the result by showing, fetching, converting, or writing it.

That gives the pipeline one consistent shape:

  1. Use SQL to scan, filter, project, join, aggregate, and write tabular data.
  2. Use Python UDFs for logic that should not be forced into SQL.
  3. Use AI helpers for common embedding, classification, and prompting tasks.
  4. Use Ray-backed execution when selected stages need distributed workers or GPU placement.

A small example

The example below starts with SQL, runs a Python Arrow batch UDF, and returns another relation.

example.py
import pyarrow as pa
import vane


con = vane.connect()


docs = con.sql("""
    select *
    from (values
        (1, 'Quarterly revenue report'),
        (2, 'Customer support policy'),
        (3, 'Model evaluation notes')
    ) as t(id, text)
    where text is not null
""")


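def add_text_features(batch: pa.Table) -> pa.Table:
    # Pull the input columns out of the Arrow batch as Python lists.
    ids = batch.column("id").to_pylist()
    texts = batch.column("text").to_pylist()

    # Return the input columns together with the derived ones so that
    # later stages can still use id and text.
    return pa.table({
        "id": ids,
        "text": texts,
        "text_length": [len(value) for value in texts],
        "mentions_policy": ["policy" in value.lower() for value in texts],
    })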


features = docs.map_batches(
    add_text_features,
    schema={
        "id": "INTEGER",
        "text": "VARCHAR",
        "text_length": "BIGINT",
        "mentions_policy": "BOOLEAN",
    },
    batch_size=1024,
    execution_backend="subprocess_task",
)


features.show()

map_batches returns only the columns produced by the UDF; input columns are not carried through automatically. If the next stage needs input columns such as id or text, return them from the UDF and include them in schema, as in the example above.
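During local development, a small check can catch a mismatch between the columns a UDF produces and the columns declared in schema. The sketch below is plain Python and is not part of Vane Data: check_columns is a hypothetical helper, and the dict-of-lists input stands in for whatever the UDF is about to return.

```python
def check_columns(produced: dict, schema: dict) -> None:
    """Raise if produced column names differ from the declared schema names.

    Hypothetical development helper, not a Vane Data API.
    """
    missing = [name for name in schema if name not in produced]
    extra = [name for name in produced if name not in schema]
    if missing or extra:
        raise ValueError(
            f"missing columns: {missing}; undeclared columns: {extra}"
        )


schema = {"id": "INTEGER", "text": "VARCHAR", "text_length": "BIGINT"}

# A UDF result that returns every declared column passes silently.
check_columns({"id": [1], "text": ["a"], "text_length": [1]}, schema)

# A UDF result that drops the input columns is reported.
try:
    check_columns({"text_length": [1]}, schema)
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```

Calling a check like this just before the UDF returns keeps the failure close to the code that caused it.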

Core capabilities

DuckDB-compatible SQL

SQL is a first-class interface through con.sql(...) and con.execute(...).

Use SQL for relational work:

  • Read files and tables supported by DuckDB-compatible APIs and configured extensions.
  • Filter and project rows before expensive Python or model stages.
  • Join metadata, labels, and feature tables.
  • Aggregate quality metrics and dataset summaries.
  • Write curated tabular outputs.

Vane Data does not define a separate SQL dialect. Treat the SQL surface as DuckDB-compatible SQL with Vane Data execution features around it.

Python and Arrow UDFs

Use UDFs when a step needs Python libraries, custom code, model state, or row expansion.

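API              | Shape                            | Typical use
-----------------+----------------------------------+--------------------------------------------------------
map_batches(...) | Arrow batch to Arrow batch       | Batch transforms, model inference, decoding, enrichment
flat_map(...)    | Row to zero, one, or many rows   | Chunking, normalization, expansion
map(...)         | Input values to one scalar value | Small row-wise enrichments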
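The flat_map shape described above (one row in, zero or more rows out) can be illustrated with a plain Python chunking function. This is only a sketch of the function body. The name chunk_row, the chunk_size parameter, the output column names, and the dict-based row representation are all assumptions for illustration; how flat_map passes rows to the function and how the function is registered are not shown here.

```python
from typing import Iterator


def chunk_row(row: dict, chunk_size: int = 20) -> Iterator[dict]:
    """Split one input row into zero or more chunk rows (hypothetical helper)."""
    text = row.get("text") or ""
    # Empty or missing text yields no rows: the "zero" case of row expansion.
    for index, start in enumerate(range(0, len(text), chunk_size)):
        yield {
            "id": row["id"],
            "chunk_index": index,
            "chunk_text": text[start:start + chunk_size],
        }


rows = [
    {"id": 1, "text": "Quarterly revenue report"},
    {"id": 2, "text": None},
]

# Row 1 expands into several chunk rows; row 2 produces none.
chunks = [chunk for row in rows for chunk in chunk_row(row, chunk_size=10)]
```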

UDFs declare output schemas at the relation boundary. This keeps columns explicit even when Python code creates the data.

Use task-style backends for stateless functions. Use actor-style backends when a model, tokenizer, client, or other expensive object should be initialized once and reused across batches.
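The difference between the two styles comes down to where the expensive setup lives. A minimal sketch of the actor-style shape, in plain Python: a callable class builds its state once in __init__ and reuses it for every batch. KeywordScorer and its vocabulary are hypothetical stand-ins for a real model or client, and the way Vane Data constructs and schedules such objects is not shown here.

```python
class KeywordScorer:
    """Stand-in for a UDF that holds expensive state (hypothetical example)."""

    load_count = 0  # counts how often the expensive setup runs

    def __init__(self) -> None:
        # Expensive setup (model, tokenizer, client) happens once per instance.
        KeywordScorer.load_count += 1
        self.vocabulary = {"policy", "report", "notes"}

    def __call__(self, texts: list[str]) -> list[int]:
        # Per-batch work reuses the state built in __init__.
        return [
            sum(word in self.vocabulary for word in text.lower().split())
            for text in texts
        ]


scorer = KeywordScorer()
batches = [
    ["Quarterly revenue report"],
    ["Customer support policy", "Model evaluation notes"],
]
scores = [scorer(batch) for batch in batches]
```

A stateless function, by contrast, would need to rebuild the vocabulary (or reload the model) on every call, which is acceptable only when that setup is cheap.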

AI helper functions

Vane Data exposes relation methods for common AI data enrichment tasks:

  • rel.embed_text(...) for text embeddings.
  • rel.classify_text(...) for text classification.
  • rel.prompt(...) for text and multimodal prompting.

These helpers use the same relation and UDF execution model as custom Python code. Use them when the built-in provider pattern fits the task. Use custom UDFs when you need full control over model loading, batching, retry behavior, output shape, or provider-specific logic.
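As one example of logic that might justify a custom UDF, the sketch below wraps a provider call in a retry loop with exponential backoff. It is plain Python and makes no claim about how the built-in helpers handle retries. call_with_retry, flaky_embed, and the parameter names are hypothetical, and flaky_embed only simulates a provider that fails once before succeeding.

```python
import time


def call_with_retry(func, *args, attempts=3, base_delay=0.5,
                    retry_on=(RuntimeError,)):
    """Call func, retrying listed exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func(*args)
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))


calls = {"count": 0}


def flaky_embed(texts):
    # Hypothetical provider call that fails on the first attempt only.
    calls["count"] += 1
    if calls["count"] < 2:
        raise RuntimeError("transient provider error")
    return [[float(len(text))] for text in texts]


# base_delay=0.0 keeps the demonstration instant; real jobs would wait.
vectors = call_with_retry(flaky_embed, ["abc", "de"], base_delay=0.0)
```

Inside a batch UDF, a wrapper like this would sit around the model or API call so that one transient failure does not fail the whole batch.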

Local and Ray-backed execution

The native runner is the local execution path. It is the right starting point for development, testing, lightweight SQL, and single-node UDF jobs.

Configure Ray when the workload needs distributed scans, distributed writes, multiple Python workers, or GPU-aware model execution:

example.py
import vane


vane.configure(runner="ray")

The pipeline shape stays the same. The execution mode changes how selected work is scheduled and where Python UDFs can run.

Common workflow patterns

AI-ready dataset preparation

Use SQL to select source rows and normalize metadata, then add embeddings, labels, prompts, scores, or model-derived features before writing a curated dataset.

This is the typical pattern for training data, retrieval corpora, offline evaluation sets, and analytics-ready enrichment tables.

Multimodal data processing

Represent documents, images, audio, video, or other media as table rows with paths, bytes, metadata, and derived columns. SQL keeps metadata work inspectable, while Python UDFs handle decoding, parsing, OCR, transcription, image processing, or model inference.

Vane Data does not require a universal multimodal file format. The table schema is the contract between stages.
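To make the "media as table rows" idea concrete, the sketch below turns a file into a row-shaped dict using only the Python standard library. The column names (path, media_type, size_bytes, content) are illustrative assumptions and not a schema defined by Vane Data; a real pipeline would pick the columns its later stages need.

```python
import mimetypes
import tempfile
from pathlib import Path


def file_to_row(path: Path) -> dict:
    """Describe one media file as a table row (column names are illustrative)."""
    mime_type, _ = mimetypes.guess_type(path.name)
    return {
        "path": str(path),
        "media_type": mime_type or "application/octet-stream",
        "size_bytes": path.stat().st_size,
        "content": path.read_bytes(),
    }


# Demonstrate with a temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "note.txt"
    sample.write_bytes(b"hello")
    row = file_to_row(sample)
```

Rows of this shape can be filtered on metadata columns in SQL before any decoding or model stage reads the content column.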

Local-to-distributed scaling

Start locally with the native runner and subprocess UDF backends. After the data contract and UDF behavior are correct, move expensive stages to Ray-backed execution.

This keeps early development simple while leaving a path to larger scans, distributed writes, and model workers across a Ray cluster.

When Vane Data is a good fit

Choose Vane Data when:

  • SQL is useful for the relational parts of the pipeline.
  • Python libraries or custom code are required for the expensive parts.
  • The output should remain tabular and inspectable.
  • Model calls, embeddings, or prompting are part of the data preparation path.
  • A local workflow may later need distributed execution.
  • You want one relation pipeline instead of separate SQL jobs, Python scripts, and ad hoc model enrichment steps.

Use another primary system when:

  • The main requirement is online, low-latency inference.
  • You need a persistent transactional database.
  • You need a vector index or retrieval service as the serving layer.
  • You need a workflow scheduler, lineage system, or cross-job orchestration platform.
  • The required connector or storage feature is not available through DuckDB-compatible APIs, configured extensions, or your own UDFs.

Import and compatibility model

New Vane Data projects should use:

example.py
import vane


con = vane.connect()

Vane Data exposes DuckDB-compatible Python APIs and relation behavior where compatibility applies. Existing DuckDB-style code can often continue to use familiar connection, SQL, relation, and type APIs, while Vane-specific configuration and AI helpers are available through vane.
