Multimodal Ingest
Vane Data treats multimodal records as table rows. Paths, bytes, metadata, decoded features, labels, embeddings, and model outputs can move through one relation as explicit columns.
Use this pattern when you need SQL filtering and table outputs, but part of the pipeline must call Python libraries for decoding, feature extraction, or model inference.
Ingest pattern
- Build a relation with stable IDs, source locations, metadata, and optional bytes.
- Filter and project with SQL before expensive Python work.
- Use map_batches for decoding, feature extraction, or model inference.
- Keep UDF output schemas explicit.
- Write curated outputs to Parquet or another DuckDB-supported target.
Vane Data does not define a universal multimodal file format. Choose column names and output formats based on the downstream system.
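The first two steps of the pattern (stable IDs, then filtering before expensive work) can be prepared with the standard library alone. The following is a minimal sketch, not part of Vane Data: the helper names stable_id and build_manifest and the columns size_bytes and suffix are hypothetical. It derives each ID from the source URI, so IDs stay the same across runs only as long as the URIs do, and it records cheap metadata without reading file contents.

```python
import hashlib
from pathlib import Path


def stable_id(source_uri: str) -> str:
    """Return a short, deterministic ID derived from the source URI (hypothetical helper)."""
    return hashlib.sha256(source_uri.encode("utf-8")).hexdigest()[:16]


def build_manifest(root: str, pattern: str) -> list[dict]:
    """List matching files with IDs and cheap metadata, without reading their bytes."""
    manifest = []
    # Sort so that row order does not depend on filesystem listing order.
    for path in sorted(Path(root).glob(pattern)):
        if not path.is_file():
            continue
        uri = path.as_posix()
        manifest.append({
            "id": stable_id(uri),
            "source_uri": uri,
            "size_bytes": path.stat().st_size,
            "suffix": path.suffix.lower(),
        })
    return manifest
```

A manifest like this can be loaded into a relation and filtered with SQL (for example by size or suffix) before any stage reads the bytes.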
Images
For local image files, start with metadata and bytes:
    from pathlib import Path

    import pyarrow as pa
    import vane

    # Collect one row per image: an ID, the source path, and the raw bytes.
    # Sorting the paths keeps the IDs reproducible across runs.
    rows = []
    for idx, path in enumerate(sorted(Path("images").glob("*.png"))):
        rows.append({
            "id": idx,
            "source_uri": str(path),
            "image_bytes": path.read_bytes(),
        })

    # Build a relation with one explicit column per field.
    con = vane.connect()
    rel = con.from_arrow(pa.table({
        "id": [row["id"] for row in rows],
        "source_uri": [row["source_uri"] for row in rows],
        "image_bytes": [row["image_bytes"] for row in rows],
    }))
Decode or inspect the images with a batch UDF:
    import io

    import pyarrow as pa
    from PIL import Image


    def inspect_images(batch: pa.Table) -> pa.Table:
        ids = batch.column("id").to_pylist()
        source_uris = batch.column("source_uri").to_pylist()
        sizes = []
        modes = []
        for data in batch.column("image_bytes").to_pylist():
            image = Image.open(io.BytesIO(data))
            sizes.append(f"{image.width}x{image.height}")
            modes.append(image.mode)
        return pa.table({
            "id": ids,
            "source_uri": source_uris,
            "image_size": sizes,
            "image_mode": modes,
        })


    out = rel.map_batches(
        inspect_images,
        schema={
            "id": "BIGINT",
            "source_uri": "VARCHAR",
            "image_size": "VARCHAR",
            "image_mode": "VARCHAR",
        },
        batch_size=32,
        execution_backend="subprocess_task",
    )
    out.show()
For model inference, keep decoding and inference in separate UDF stages when possible. That makes failures easier to isolate and lets you tune batch sizes independently.
Audio
Represent audio as a source URI plus either bytes or a resolved local path. Use a UDF for transcription or acoustic feature extraction.
Recommended output columns:
- audio_id
- source_uri
- transcript
- language
- segment_json or one row per segment
- start_time and end_time when segment rows are used
Keep provider-specific transcription code inside the UDF. Vane Data manages the relation, batching, schema, and execution backend; it does not provide a built-in transcription engine.
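The two output layouts (one row per segment, or a single segment_json value) might be produced inside the UDF along these lines. This is a sketch under assumptions: the segment dictionaries with "start", "end", and "text" keys stand in for whatever your transcription provider returns, and the segment_index column and both helper names are hypothetical.

```python
import json


def segments_to_rows(audio_id: str, source_uri: str, language: str,
                     segments: list[dict]) -> list[dict]:
    """Flatten provider segments into one output row per segment."""
    rows = []
    for index, seg in enumerate(segments):
        rows.append({
            "audio_id": audio_id,
            "source_uri": source_uri,
            "segment_index": index,  # hypothetical column, keeps segment order
            "language": language,
            "start_time": float(seg["start"]),
            "end_time": float(seg["end"]),
            "transcript": seg["text"].strip(),
        })
    return rows


def segments_to_json(segments: list[dict]) -> str:
    """Serialize all segments into one string for a segment_json column."""
    return json.dumps(segments, sort_keys=True)


# Example input shaped like the assumed provider output.
example_segments = [
    {"start": 0.0, "end": 1.5, "text": " hello "},
    {"start": 1.5, "end": 3.0, "text": "world"},
]
example_rows = segments_to_rows("a1", "audio/a1.wav", "en", example_segments)
```

Segment rows are easier to filter and join in SQL; the JSON form keeps one row per audio file at the cost of parsing later.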
Documents
For PDFs, office documents, HTML, or long text files, keep extraction, chunking, and embedding as separate stages:
- Extract text and document metadata.
- Split text into chunks with stable IDs.
- Embed or classify chunks.
- Write document-level and chunk-level outputs.
Recommended columns:
- document_id
- source_uri
- page_number
- chunk_id
- text
- embedding
- model/provider metadata when model outputs are stored
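The chunking stage with stable IDs can be sketched as follows. This is one simple approach (fixed-size character windows with overlap), not a method prescribed by Vane Data; the function name chunk_text, the chunk_id format, and the char_start column are hypothetical.

```python
def chunk_text(document_id: str, text: str,
               max_chars: int = 200, overlap: int = 20) -> list[dict]:
    """Split text into overlapping chunks with deterministic chunk IDs."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "document_id": document_id,
            # Same document and same settings always yield the same chunk_id.
            "chunk_id": f"{document_id}-{index:05d}",
            "char_start": start,
            "text": text[start:start + max_chars],
        })
        # Stop once a chunk reaches the end, to avoid a trailing overlap-only chunk.
        if start + max_chars >= len(text):
            break
    return chunks
```

Because chunk IDs depend on the chunking settings, store those settings alongside the outputs if you expect to re-chunk later.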
Video
Video ingest is heavier than the other media types because each file decodes into many frames. Stage the pipeline:
- Read metadata and file paths.
- Decode frames or clips in controlled batches.
- Run model UDFs over frames or clips.
- Store frame-level or clip-level outputs with timestamps.
Avoid returning large Python objects from UDFs. Convert outputs to Arrow-compatible values such as strings, numbers, lists, bytes, or structured JSON strings.
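As an illustration of that conversion, the sketch below turns per-frame model results into plain values before they leave the UDF. The Detection class, the frame_row helper, and the column names are hypothetical stand-ins for your own model output; the timestamp is derived from the frame index and an assumed constant frame rate.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Detection:
    """Hypothetical per-frame model output."""
    label: str
    score: float
    box: tuple[int, int, int, int]


def frame_row(video_id: str, frame_index: int, fps: float,
              detections: list[Detection]) -> dict:
    """Convert one frame's results into strings, numbers, and lists only."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return {
        "video_id": video_id,
        "frame_index": frame_index,
        # Assumes a constant frame rate; use container timestamps if available.
        "timestamp_s": frame_index / fps,
        "labels": [d.label for d in detections],
        # Nested structure is kept as a JSON string rather than Python objects.
        "detections_json": json.dumps([asdict(d) for d in detections], sort_keys=True),
    }
```

Rows shaped like this contain no custom objects, so they can be assembled into typed columns without pickling.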
Local vs Ray
Start locally:
    out = rel.map_batches(
        decode_batch,
        schema={"id": "BIGINT", "feature": "VARCHAR"},
        batch_size=32,
        execution_backend="subprocess_task",
    )
Move to Ray when input size, decoding cost, model throughput, or GPU placement requires it:
    import vane

    vane.configure(runner="ray")

    out = rel.map_batches(
        ModelBatch,
        schema={"id": "BIGINT", "label": "VARCHAR"},
        batch_size=64,
        execution_backend="ray_actor",
        gpus=1,
        concurrency=4,
    )
Replace decode_batch and ModelBatch with your own function or callable class. Return stable IDs from each UDF when later stages need to join model outputs back to source records. Every Ray worker must be able to import the callable and access the same storage, model files, and credentials.
Practical checks
- Assign stable IDs before decoding or model inference.
- Store source URI and enough metadata to reproduce a row.
- Filter and project before loading bytes when possible.
- Keep UDF outputs small and typed.
- Validate a small sample before running a large media job.
- Write outputs to a filesystem that all workers can access when using Ray.
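Several of these checks can be automated on a small sample before a large run. The sketch below assumes you have already converted a sample of UDF output into a list of Python dictionaries (how you fetch rows from a relation is not shown here); the function name check_sample and its arguments are hypothetical.

```python
def check_sample(input_ids: list, output_rows: list[dict],
                 expected_types: dict[str, type]) -> list[str]:
    """Return a list of problems found in a sample of UDF output rows."""
    problems = []
    seen = [row.get("id") for row in output_rows]

    # Every input ID should come back, and only once.
    missing = set(input_ids) - set(seen)
    if missing:
        problems.append(f"missing ids: {sorted(missing, key=str)}")
    if len(seen) != len(set(seen)):
        problems.append("duplicate ids in output")

    for i, row in enumerate(output_rows):
        # Columns must match the declared schema exactly.
        if set(row) != set(expected_types):
            problems.append(f"row {i}: columns {sorted(row)} do not match schema")
            continue
        # Non-null values must have the declared Python type.
        for name, expected in expected_types.items():
            value = row[name]
            if value is not None and not isinstance(value, expected):
                problems.append(f"row {i}: {name} is {type(value).__name__}")
    return problems
```

An empty result means the sample kept its IDs and matched the declared columns and types; anything else is worth fixing before scaling up.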