Quickstart: Python

This quickstart shows the Python relation API, Arrow batch UDFs, and the path from local execution to Ray-backed execution.

The examples use PyArrow. If your environment does not already include it, install vane-ai[all] or install pyarrow directly.

1. Create a relation

example.py
import pyarrow as pa
import vane


con = vane.connect()


table = pa.table({
    "id": [1, 2, 3],
    "text": ["hello Vane Data", "batch UDF", "distributed SQL"],
})


rel = con.from_arrow(table)
rel.show()

A relation represents a table-shaped computation. You can create one from SQL, Arrow data, files, or other DuckDB-compatible inputs.

2. Add a batch UDF

Use map_batches(...) when a Python function should process Arrow batches and return declared output columns:

example.py
def add_features(batch: pa.Table) -> pa.Table:
    ids = batch.column("id").to_pylist()
    text = batch.column("text").to_pylist()
    return pa.table({
        "id": ids,
        "text": text,
        "length": [len(value) for value in text],
        "upper_text": [value.upper() for value in text],
    })


features = rel.map_batches(
    add_features,
    schema={
        "id": "BIGINT",
        "text": "VARCHAR",
        "length": "BIGINT",
        "upper_text": "VARCHAR",
    },
    batch_size=1024,
    execution_backend="subprocess_task",
)


features.show()

schema declares the columns returned by the UDF. map_batches returns only the UDF output columns, so include input columns such as id or text in the returned table when later stages still need them.

3. Reuse state with an actor backend

Use an actor backend when setup is expensive and should be reused across batches. Pass a callable class to actor backends:

example.py
class Prefixer:
    def __init__(self) -> None:
        self.prefix = "vane:"

    def __call__(self, batch: pa.Table) -> pa.Table:
        ids = batch.column("id").to_pylist()
        values = batch.column("text").to_pylist()
        return pa.table({
            "id": ids,
            "prefixed": [self.prefix + value for value in values],
        })


prefixed = rel.map_batches(
    Prefixer,
    schema={"id": "BIGINT", "prefixed": "VARCHAR"},
    batch_size=1024,
    execution_backend="subprocess_actor",
    concurrency=2,
)


prefixed.show()

Use task backends for plain functions. Use actor backends for callable classes that hold reusable state, such as models or service clients.

4. Optional: run UDFs on Ray

Switch to Ray once the UDF runs correctly locally and the workload needs distributed execution:

example.py
import vane


vane.configure(runner="ray")


con = vane.connect()
rel = con.sql("select * from read_parquet('s3://bucket/data/*.parquet')")

For distributed stateless UDF work, use ray_task:

example.py
processed = rel.map_batches(
    add_features,
    schema={
        "id": "BIGINT",
        "text": "VARCHAR",
        "length": "BIGINT",
        "upper_text": "VARCHAR",
    },
    batch_size=1024,
    execution_backend="ray_task",
)

For distributed model reuse or GPU-backed inference, use ray_actor with a callable class:

example.py
embedded = rel.map_batches(
    MyModelBatch,
    schema={"id": "BIGINT", "embedding": "FLOAT[]"},
    batch_size=64,
    execution_backend="ray_actor",
    gpus=1,
    concurrency=4,
)

Replace MyModelBatch with your own callable class. Return identifiers or metadata from the UDF when downstream stages need to join model outputs back to source rows.

Every Ray worker must be able to import the UDF code, access the same storage, and see any model files or credentials required by the callable.

5. Optional: use AI helpers

Vane Data exposes common AI operations as relation methods:

  • rel.embed_text(...)
  • rel.classify_text(...)
  • rel.prompt(...)

Example:

example.py
embedding_only = rel.embed_text(
    "text",
    provider="transformers",
    model="sentence-transformers/all-MiniLM-L6-v2",
    output_column="embedding",
    execution_backend="subprocess_actor",
)

AI helpers return only the configured output column. If the final dataset needs source columns as well, use a custom map_batches UDF that returns the complete schema, or explicitly combine helper output with source rows in a validated step.

Install the provider libraries required by the provider and model you choose.

Next steps