Vane Data Docs

Vane Data helps teams build AI-ready datasets with DuckDB-compatible SQL, Python/Arrow UDFs, Ray-backed execution, and AI helper functions.

Use it when structured or semi-structured data needs SQL transformations plus Python libraries, model inference, embeddings, prompting, or distributed execution before it moves to training, search, analytics, or serving systems.

SQL is a first-class interface through con.sql(...) and con.execute(...); Python is available through UDFs and AI helpers when the pipeline needs custom logic.

Install the vane-ai package, then import it in Python as vane. Note that the package name and the module name differ.

Vane Data is a data processing layer, not a model server, vector database, workflow scheduler, or transactional database.

Choose a starting point

A. Python and performance workflows

For pipelines where Python code, model inference, GPU work, or provider-backed AI calls dominate cost and runtime.

B. SQL and lightweight workflows

For relational pipelines that should stay close to DuckDB-compatible SQL and add Python only where it is useful.

Advanced: distributed execution

For workloads where data size, scan parallelism, GPU placement, or multi-node execution becomes the main design question.

Documentation map

Getting started: What is Vane Data?, Installation, SQL quickstart, Python quickstart
Concepts: Architecture, Execution model, SQL vs Python, UDFs, AI functions
Task guides: Multimodal ingest, Embeddings at scale, Dedup and clean, SQL multimodal pipeline, Doris integration, Iceberg and lakehouse, GPU inference UDF, Performance tuning
Examples: Examples index, Training data pipeline, Insurance document audit, Tender compliance check, Multimodal data lake
Deploy: Single node, Ray cluster, Sizing
Contributing: Development
Machine-readable entries: docs/llms.txt, docs/llms-full.txt