Multimodal Ingest

Vane Data treats multimodal records as table rows. Paths, bytes, metadata, decoded features, labels, embeddings, and model outputs can move through one relation as explicit columns.

Use this pattern when you need SQL filtering and table outputs, but part of the pipeline must call Python libraries for decoding, feature extraction, or model inference.

Ingest pattern

Build a relation with stable IDs, source locations, metadata, and optional bytes.
Filter and project with SQL before expensive Python work.
Use map_batches for decoding, feature extraction, or model inference.
Keep UDF output schemas explicit.
Write curated outputs to Parquet or another DuckDB-supported target.

Vane Data does not define a universal multimodal file format. Choose column names and output formats based on the downstream system.
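The first step of the pattern asks for stable IDs. One possible approach, sketched here with the standard library only, is to hash the source URI (and optionally the bytes) so the same record gets the same ID on every run. The helper name stable_id and the 16-character length are assumptions made for illustration; they are not part of Vane Data.

```python
import hashlib
from typing import Optional


def stable_id(source_uri: str, content: Optional[bytes] = None) -> str:
    """Derive a deterministic ID from the source URI and, optionally, the bytes.

    Hypothetical helper for illustration; not a Vane Data API.
    """
    digest = hashlib.sha256()
    digest.update(source_uri.encode("utf-8"))
    if content is not None:
        # Separator byte so the URI and the content cannot run together.
        digest.update(b"\x00")
        digest.update(content)
    # 16 hex characters (64 bits) keeps the column compact; lengthen if needed.
    return digest.hexdigest()[:16]


first = stable_id("images/cat.png", b"fake-png-bytes")
second = stable_id("images/cat.png", b"fake-png-bytes")
other = stable_id("images/dog.png", b"fake-png-bytes")
```

An ID built this way is a string, so it would be declared as a VARCHAR column rather than the BIGINT used in the examples on this page.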

Images

For local image files, start with metadata and bytes:

example.py

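from pathlib import Path

import pyarrow as pa
import vane

# Sort the paths so each file gets the same index on every run over the
# same files; Path.glob does not guarantee an order.
rows = []
for idx, path in enumerate(sorted(Path("images").glob("*.png"))):
    rows.append({
        "id": idx,
        "source_uri": str(path),
        "image_bytes": path.read_bytes(),
    })

# Build one relation with the ID, the source location, and the raw bytes.
con = vane.connect()
rel = con.from_arrow(pa.table({
    "id": [row["id"] for row in rows],
    "source_uri": [row["source_uri"] for row in rows],
    "image_bytes": [row["image_bytes"] for row in rows],
}))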

Decode or inspect the images with a batch UDF:

example.py

import io

import pyarrow as pa
from PIL import Image


def inspect_images(batch: pa.Table) -> pa.Table:
    # Carry id and source_uri through so later stages can join on them.
    ids = batch.column("id").to_pylist()
    source_uris = batch.column("source_uri").to_pylist()
    sizes = []
    modes = []
    for data in batch.column("image_bytes").to_pylist():
        # Image.open is lazy: it reads the header, which is enough for
        # size and mode. The with-block releases the image afterwards.
        with Image.open(io.BytesIO(data)) as image:
            sizes.append(f"{image.width}x{image.height}")
            modes.append(image.mode)
    return pa.table({
        "id": ids,
        "source_uri": source_uris,
        "image_size": sizes,
        "image_mode": modes,
    })


# Declare the output schema explicitly; it mirrors the columns returned above.
out = rel.map_batches(
    inspect_images,
    schema={
        "id": "BIGINT",
        "source_uri": "VARCHAR",
        "image_size": "VARCHAR",
        "image_mode": "VARCHAR",
    },
    batch_size=32,
    execution_backend="subprocess_task",
)

out.show()

For model inference, keep decoding and inference in separate UDF stages when possible. That makes failures easier to isolate and lets you tune batch sizes independently.

Audio

Represent audio as a source URI plus either bytes or a resolved local path. Use a UDF for transcription or acoustic feature extraction.

Recommended output columns:

audio_id
source_uri
transcript
language
segment_json or one row per segment
start_time and end_time when segment rows are used

Keep provider-specific transcription code inside the UDF. Vane Data manages the relation, batching, schema, and execution backend; it does not provide a built-in transcription engine.
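As a rough sketch of the "one row per segment" layout, the function below flattens a transcription result into rows that use the recommended column names. The shape of the provider result (a "language" key plus a "segments" list with "start", "end", and "text") is an assumption made for illustration; real providers differ, and the function name segment_rows is hypothetical.

```python
from typing import Any, Dict, Iterator


def segment_rows(audio_id: str, source_uri: str,
                 result: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one row per segment from a provider result (assumed shape)."""
    language = result.get("language")
    for segment in result.get("segments", []):
        yield {
            "audio_id": audio_id,
            "source_uri": source_uri,
            "transcript": segment["text"].strip(),
            "language": language,
            "start_time": float(segment["start"]),
            "end_time": float(segment["end"]),
        }


# Hypothetical provider output, invented for illustration.
example_result = {
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Hello there."},
        {"start": 2.5, "end": 4.0, "text": " General update."},
    ],
}
audio_rows = list(segment_rows("a1", "audio/call.wav", example_result))
```

Inside a batch UDF, rows like these would still need to be collected into columns before being returned.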

Documents

For PDFs, office documents, HTML, or long text files, keep extraction, chunking, and embedding as separate stages:

Extract text and document metadata.
Split text into chunks with stable IDs.
Embed or classify chunks.
Write document-level and chunk-level outputs.

Recommended columns:

document_id
source_uri
page_number
chunk_id
text
embedding
model/provider metadata when model outputs are stored
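The chunking stage could look roughly like the sketch below, which splits page text into fixed-size character windows and derives each chunk_id from the document ID, page number, and chunk index. The function name chunk_text, the character-based splitting, and the ID format are assumptions chosen for illustration; a real pipeline might split on sentences or tokens instead.

```python
from typing import Dict, List


def chunk_text(document_id: str, page_number: int, text: str,
               max_chars: int = 200, overlap: int = 20) -> List[Dict[str, object]]:
    """Split text into overlapping character windows with deterministic IDs."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    step = max_chars - overlap
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "document_id": document_id,
            "page_number": page_number,
            # Same inputs and parameters always produce the same chunk_id.
            "chunk_id": f"{document_id}:{page_number}:{index}",
            "text": text[start:start + max_chars],
        })
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + max_chars >= len(text):
            break
    return chunks


chunks = chunk_text("doc-1", 1, "a" * 450, max_chars=200, overlap=20)
```

These IDs stay stable only while max_chars and overlap stay fixed, so it may be worth storing the chunking parameters alongside the outputs.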

Video

Video is usually the most expensive modality to decode and process, so treat video ingest as a staged pipeline:

Read metadata and file paths.
Decode frames or clips in controlled batches.
Run model UDFs over frames or clips.
Store frame-level or clip-level outputs with timestamps.

Avoid returning large Python objects from UDFs. Convert outputs to Arrow-compatible values such as strings, numbers, lists, bytes, or structured JSON strings.
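To illustrate the conversion, the sketch below turns a per-frame model result into a flat row: the timestamp is computed from the frame index and frame rate, and the detection objects are serialized to a JSON string. The Detection class, the frame_row helper, and the column names are hypothetical and only stand in for whatever a real model returns.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class Detection:
    """Hypothetical per-frame model result, invented for illustration."""
    label: str
    score: float
    box: List[int]  # [x, y, width, height] in pixels


def frame_row(video_id: str, frame_index: int, fps: float,
              detections: List[Detection]) -> Dict[str, object]:
    """Convert rich Python objects into plain, column-friendly values."""
    return {
        "video_id": video_id,
        "frame_index": frame_index,
        # Timestamp in seconds, derived from the frame index and frame rate.
        "timestamp_s": frame_index / fps,
        # Structured output stored as a JSON string instead of Python objects.
        "detections_json": json.dumps(
            [asdict(d) for d in detections], sort_keys=True
        ),
    }


row = frame_row("v1", 50, 25.0, [Detection("cat", 0.91, [10, 20, 64, 48])])
```

Every value in the resulting row is a string, an integer, or a float, so it maps directly onto typed columns.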

Local vs Ray

Start locally:

example.py

out = rel.map_batches(
    decode_batch,
    schema={"id": "BIGINT", "feature": "VARCHAR"},
    batch_size=32,
    execution_backend="subprocess_task",
)

Move to Ray when input size, decoding cost, model throughput, or GPU placement requires it:

example.py

import vane

vane.configure(runner="ray")

out = rel.map_batches(
    ModelBatch,
    schema={"id": "BIGINT", "label": "VARCHAR"},
    batch_size=64,
    execution_backend="ray_actor",
    gpus=1,
    concurrency=4,
)

Replace decode_batch and ModelBatch with your own function or callable class. Return stable IDs from each UDF when later stages need to join model outputs back to source records. Every Ray worker must be able to import the callable and access the same storage, model files, and credentials.

Practical checks

Assign stable IDs before decoding or model inference, and carry them through every stage.
Store source URI and enough metadata to reproduce a row.
Filter and project before loading bytes when possible.
Keep UDF outputs small and typed.
Validate a small sample before running a large media job.
Write outputs to a filesystem that all workers can access when using Ray.
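Validating a small sample can be partly automated. The sketch below checks a handful of output rows against the columns and Python types you expect before a large job starts. The check_rows helper and the mapping from column names to Python types are assumptions for illustration; this is not a Vane Data feature, and it checks plain Python values rather than the declared SQL schema.

```python
from typing import Dict, List


def check_rows(rows: List[dict], expected: Dict[str, type]) -> List[str]:
    """Return a list of problems; an empty list means the sample looks consistent."""
    problems = []
    for i, row in enumerate(rows):
        missing = expected.keys() - row.keys()
        extra = row.keys() - expected.keys()
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
        if extra:
            problems.append(f"row {i}: unexpected columns {sorted(extra)}")
        for name, py_type in expected.items():
            value = row.get(name)
            # None is treated as a null value and is not type-checked.
            if value is not None and not isinstance(value, py_type):
                problems.append(
                    f"row {i}: {name} is {type(value).__name__}, "
                    f"expected {py_type.__name__}"
                )
    return problems


expected = {"id": int, "label": str}
good = check_rows([{"id": 1, "label": "cat"}], expected)
bad = check_rows([{"id": "1", "label": "cat", "debug": object()}], expected)
```

Running a check like this on the first batch or two is cheap compared with discovering a schema mismatch partway through a large media job.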