Architecture

Vane Data is a layered data engine. It exposes a DuckDB-compatible SQL and Python surface, adds relation-level Python and AI operations, and can execute selected work locally or through Ray.

The design goal is simple: keep pipelines table-shaped while allowing SQL, Python libraries, model inference, and distributed execution to live in one workflow.

System view

A Vane Data pipeline has four user-visible layers:

  1. API surface: import vane, vane.connect(), DuckDB-compatible SQL, relation methods, UDFs, and AI helpers.
  2. Relation pipeline: SQL operations, projections, filters, joins, UDF stages, and writes are composed as relation steps.
  3. Execution runner: local execution handles development and single-node jobs; the Ray runner distributes suitable scans, writes, and UDF work.
  4. UDF runtime: Python functions and callable classes run through task or actor backends.

This separation lets a pipeline start locally, then move selected work to Ray when data volume, model throughput, or resource placement requires it.

Public user surface

Use these as the primary public entry points:

  • import vane
  • vane.connect()
  • vane.configure(...)
  • vane.current_config()
  • vane.env
  • DuckDB-compatible SQL through con.sql(...) and con.execute(...)
  • DuckDB-compatible relation APIs
  • rel.map_batches(...)
  • rel.flat_map(...)
  • rel.map(...)
  • rel.embed_text(...)
  • rel.classify_text(...)
  • rel.prompt(...)

Existing DuckDB-style imports remain available for compatible code, but new Vane Data projects should prefer import vane.

Runner modes

Vane Data has two main execution modes.

Local mode:

  • Runs on the local process and local machine.
  • Is the right starting point for development, testing, lightweight SQL, and single-node UDF jobs.
  • Supports SQL, relation operations, and local UDF backends.

Ray mode:

  • Selected with vane.configure(runner="ray") or VANE_RUNNER=ray.
  • Uses Ray for distributed scans, distributed writes, distributed UDFs, and actor-based model workers.
  • Requires every worker to have access to the same Python dependencies, storage systems, model files, and credentials needed by the pipeline.

If the process already has Ray initialized, runner selection can interact with that environment. For predictable behavior in production scripts, configure the intended runner explicitly before creating connections and running pipelines.

UDF execution layer

Python UDFs are relation operations. They receive table-shaped input, run user code, and return declared output columns.

Choose the backend based on callable shape and resource needs:

  Backend           Callable shape   Use case
  subprocess_task   Python function  Local stateless batch transforms
  subprocess_actor  Callable class   Local model or client reuse
  ray_task          Python function  Distributed stateless transforms
  ray_actor         Callable class   Distributed model reuse or GPU-backed inference

Task backends are for functions without long-lived state. Actor backends are for classes that should be constructed once and reused across batches.
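
The selection rule can be sketched in plain Python. In the example below, pick_backend is a hypothetical helper and not part of Vane Data; it only maps a callable's shape and a placement flag to the four backend names listed above. The two sample callables are illustrative, and how a backend name is actually passed to a UDF call is not shown here.

```python
import inspect


def normalize_text(batch):
    """Stateless function: the shape that fits the task backends."""
    return [value.strip().lower() for value in batch]


class KeywordTagger:
    """Callable class: constructed once, then reused across batches."""

    def __init__(self):
        # Stand-in for expensive setup such as loading a model or client.
        self.keywords = {"error", "timeout"}

    def __call__(self, batch):
        return [any(word in value for word in self.keywords) for value in batch]


def pick_backend(udf, distributed):
    """Hypothetical helper: map callable shape and placement to a backend name."""
    shape = "actor" if inspect.isclass(udf) else "task"
    prefix = "ray" if distributed else "subprocess"
    return f"{prefix}_{shape}"


print(pick_backend(normalize_text, distributed=False))  # subprocess_task
print(pick_backend(KeywordTagger, distributed=True))    # ray_actor
```

The class form earns its keep when construction is costly: the setup in __init__ runs once per worker rather than once per batch.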

AI helper layer

AI helpers are relation methods built on the same UDF execution model:

  • embed_text produces embedding output columns.
  • classify_text produces label output columns.
  • prompt produces response output columns from text or multimodal prompts.

Use AI helpers for common provider-backed operations. Use custom UDFs when you need full control over model loading, batching, output schema, or provider behavior.

Configuration model

Programmatic configuration:

example.py
import vane


vane.configure(runner="ray")

Environment configuration:

shell
export VANE_RUNNER=ray

vane.configure(...) applies registered VANE_* settings through the process environment. Use vane.current_config() or vane.env.as_dict() to inspect the public configuration snapshot.
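
When a launcher or scheduler starts the pipeline, the shell export can be reproduced from Python. The following is a minimal sketch that uses only the standard library: run_with_runner is a hypothetical helper, and the inline child command is a placeholder that echoes the variable where a real job would run a pipeline script.

```python
import os
import subprocess
import sys


def run_with_runner(runner, command):
    """Run a command with VANE_RUNNER set in the child's environment only."""
    env = dict(os.environ)  # copy, so the parent process stays unchanged
    env["VANE_RUNNER"] = runner
    return subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )


# Placeholder child: a real job would run a pipeline script instead.
child = [sys.executable, "-c", "import os; print(os.environ['VANE_RUNNER'])"]
result = run_with_runner("ray", child)
print(result.stdout.strip())  # ray
```

Setting the value in the child's environment keeps the choice of runner out of the pipeline code, so the same script can run locally or on Ray.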

Worker environment

Ray workers need the same runtime context as the driver for any work they execute. In practice, confirm that workers can access:

  • Python modules used by UDFs and AI providers.
  • Object storage credentials and endpoint settings.
  • Model files or model download locations.
  • Relevant VANE_*, AWS_*, S3FS_*, DUCKDB_*, PYTHONPATH, and RAY_ADDRESS environment values.

Set the required environment before creating connections or running pipelines.
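
One way to audit this is to collect the relevant values on the driver and compare them with what a worker sees. The sketch below is an illustration only: collect_worker_env is a hypothetical helper, and the prefixes and names are taken from the list above.

```python
import os

# Prefixes and exact names taken from the worker checklist.
PREFIXES = ("VANE_", "AWS_", "S3FS_", "DUCKDB_")
EXACT_NAMES = ("PYTHONPATH", "RAY_ADDRESS")


def collect_worker_env(environ=None):
    """Return the subset of environment values a worker is likely to need."""
    source = os.environ if environ is None else environ
    return {
        name: value
        for name, value in source.items()
        if name.startswith(PREFIXES) or name in EXACT_NAMES
    }


sample = {
    "VANE_RUNNER": "ray",
    "AWS_REGION": "us-east-1",
    "HOME": "/home/user",
    "RAY_ADDRESS": "auto",
}
# Print names only so that credential values do not end up in logs.
print(sorted(collect_worker_env(sample)))
# ['AWS_REGION', 'RAY_ADDRESS', 'VANE_RUNNER']
```

With Ray, a mapping like this is the kind of value that can be supplied as env_vars in a runtime environment. Whether Vane Data forwards any of these values automatically is not stated here, so treat that as something to verify in your deployment.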

Data movement

Common data paths:

  • SQL reads local or remote data into DuckDB-compatible relations.
  • Relations stream Arrow batches into Python UDFs.
  • map_batches processes Arrow tables and returns the declared output columns. Return input columns from the UDF as well when downstream stages still need them.
  • Ray mode can materialize partitions through Ray workers.
  • Writes use DuckDB-compatible and Vane Data relation write paths, depending on runner and target.

Place filters and column projections before expensive Python or model stages. This reduces the amount of data crossing into Python and, in Ray mode, reduces cluster data movement.
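
The effect is easy to see in a small, library-free sketch. Here plain dicts of lists stand in for Arrow tables, measure_batch is a hypothetical stand-in for an expensive Python or model stage, and no Vane Data API is used. The filter and projection run first, and the stage passes id through because a downstream step still needs it.

```python
def measure_batch(batch):
    """Stand-in for an expensive stage; batch maps column name to a list."""
    return {
        "id": batch["id"],  # passed through for downstream stages
        "text_length": [len(text) for text in batch["text"]],
    }


rows = [
    {
        "id": i,
        "lang": "en" if i % 4 == 0 else "de",
        "text": f"doc {i}",
        "raw_html": "<p>...</p>",  # wide column the stage never reads
    }
    for i in range(1000)
]

# Filter and project first, as a SQL WHERE clause and column list would.
selected = [row for row in rows if row["lang"] == "en"]
batch = {
    "id": [row["id"] for row in selected],
    "text": [row["text"] for row in selected],
}

result = measure_batch(batch)
print(len(rows), len(result["id"]), sorted(result))
# 1000 250 ['id', 'text_length']
```

Only a quarter of the rows and two of the four columns reach the stage, which is the reduction the guidance above is after.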

Extension boundary

The configured source build bundles the following DuckDB extensions: core_functions, json, parquet, icu, jemalloc, and httpfs.

Other DuckDB extensions depend on the build and runtime environment. Treat extension availability as a deployment property: verify that the extension is bundled, installable, or loadable where the job runs.