Vane Data Docs
Vane Data helps teams build AI-ready datasets with DuckDB-compatible SQL, Python/Arrow UDFs, Ray-backed execution, and AI helper functions.
Use it when structured or semi-structured data needs SQL transformations combined with Python libraries, model inference, embeddings, prompting, or distributed execution before it moves on to training, search, analytics, or serving systems.
SQL is a first-class interface through con.sql(...) and con.execute(...); Python is available through UDFs and AI helpers when the pipeline needs custom logic.
Install the vane-ai package, then import the vane module in Python.
Vane Data is a data processing layer, not a model server, vector database, workflow scheduler, or transactional database.
Choose a starting point
A. Python and performance workflows
For pipelines where Python code, model inference, GPU work, or provider-backed AI calls dominate cost and runtime.
B. SQL and lightweight workflows
For relational pipelines that should stay close to DuckDB-compatible SQL and add Python only where it is useful.
Advanced: distributed execution
For workloads where data size, scan parallelism, GPU placement, or multi-node execution becomes the main design question.
Documentation map
- Getting started: What is Vane Data?, Installation, SQL quickstart, Python quickstart
- Concepts: Architecture, Execution model, SQL vs Python, UDFs, AI functions
- Task guides: Multimodal ingest, Embeddings at scale, Dedup and clean, SQL multimodal pipeline, Doris integration, Iceberg and lakehouse, GPU inference UDF, Performance tuning
- Examples: Examples index, Training data pipeline, Insurance document audit, Tender compliance check, Multimodal data lake
- Deploy: Single node, Ray cluster, Sizing
- Contributing: Development
- Machine-readable entries: docs/llms.txt, docs/llms-full.txt