Examples

The examples show how to compose Vane Data SQL, Python UDFs, AI helpers, and Ray-backed execution into practical data preparation workflows.

Start with the documentation pages below when you want an overview of each workflow's shape. Read the source scripts when you need runnable arguments, sample-data behavior, or implementation details.

Example pages

Example	Use it for	Typical resources
Training data pipeline	Filtering text, chunking, embeddings, and duplicate-removal patterns	CPU by default; optional embedding model dependencies
Insurance document audit	SQL-first document checks with optional structured LLM review	CPU; optional AI provider
Tender compliance check	Rule-table joins and structured compliance prompts	CPU; optional AI provider
Multimodal data lake	Media enrichment before lakehouse or warehouse handoff	CPU or GPU depending on UDFs

Source scripts

The repository includes script-first examples that can be adapted into your own projects:

examples/common_crawl.py
examples/minhash_dedupe.py
examples/llms_red_pajamas.py
examples/querying_images.py
examples/image_generation.py
examples/voice_ai_analytics.py
examples/multimodal_structured_outputs.py

Several scripts provide built-in sample data modes so you can test the pipeline shape before connecting external data.

How to adapt an example

When adapting an example:

Start with the sample or synthetic source mode when the script provides one.
Keep source loading separate from transformation.
Keep every UDF output schema explicit.
Preserve stable identifiers so model outputs can be joined back to source rows.
Move to real data only after the local sample works.
Treat benchmark scripts as performance references, not as product APIs.
Do not infer unsupported connectors from example names.
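
The first few guidelines above can be sketched in plain Python. This is a minimal illustration, not code from the repository: `SourceRow`, `ChunkRow`, `load_rows`, and `chunk_words` are hypothetical names, and no Vane Data API is used. It shows a sample source mode, loading kept apart from transformation, an explicit output schema, and a stable identifier that lets outputs be joined back to source rows.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class SourceRow:
    doc_id: str  # stable identifier carried through every stage
    text: str


@dataclass(frozen=True)
class ChunkRow:
    # Explicit output schema for the transformation step (hypothetical).
    doc_id: str
    chunk_index: int
    chunk_text: str


def load_rows(mode: str = "sample") -> List[SourceRow]:
    """Source loading only; no transformation happens here."""
    if mode == "sample":
        return [
            SourceRow("doc-1", "alpha beta gamma delta epsilon"),
            SourceRow("doc-2", "one two three"),
        ]
    raise NotImplementedError("Connect real data after the sample mode works.")


def chunk_words(rows: Iterable[SourceRow], size: int = 2) -> Iterator[ChunkRow]:
    """Transformation only; every output row is a ChunkRow."""
    for row in rows:
        words = row.text.split()
        for start in range(0, len(words), size):
            yield ChunkRow(row.doc_id, start // size, " ".join(words[start:start + size]))


rows = load_rows("sample")
chunks = list(chunk_words(rows))

# Because doc_id is preserved, each chunk joins back to its source row.
source_by_id = {row.doc_id: row for row in rows}
joined = [(chunk.chunk_text, source_by_id[chunk.doc_id].text) for chunk in chunks]
```

Swapping `load_rows` for a real reader later leaves `chunk_words` untouched, which is the point of keeping the two steps separate.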

Scaling path

Most examples should be developed locally first. Move selected stages to Ray only after the local data contract is correct:

example.py

import vane

# Select the Ray runner for subsequent execution.
vane.configure(runner="ray")

Then use ray_task or ray_actor on the UDF or AI helper stage that needs distributed execution.
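
One way to confirm the local data contract before moving a stage to Ray is to run the UDF body on a handful of sample rows and compare its output against the expected schema. The sketch below assumes a dictionary-per-row UDF; `EXPECTED_SCHEMA`, `count_tokens`, and `check_contract` are hypothetical names for illustration, and the code calls neither Vane Data nor Ray.

```python
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical contract: output column names mapped to Python types.
EXPECTED_SCHEMA: Dict[str, type] = {"doc_id": str, "token_count": int}


def count_tokens(row: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical UDF body: plain Python, run locally."""
    return {"doc_id": row["doc_id"], "token_count": len(row["text"].split())}


def check_contract(
    udf: Callable[[Dict[str, Any]], Dict[str, Any]],
    rows: Iterable[Dict[str, Any]],
    schema: Dict[str, type],
) -> List[str]:
    """Run the UDF on each row and return a list of contract violations."""
    problems: List[str] = []
    for row in rows:
        out = udf(row)
        if set(out) != set(schema):
            problems.append(f"{row['doc_id']}: columns {sorted(out)} != {sorted(schema)}")
            continue
        for name, expected in schema.items():
            if not isinstance(out[name], expected):
                problems.append(f"{row['doc_id']}: {name} is {type(out[name]).__name__}")
    return problems


sample = [{"doc_id": "a", "text": "hello world"}, {"doc_id": "b", "text": ""}]
violations = check_contract(count_tokens, sample, EXPECTED_SCHEMA)
```

An empty `violations` list on the sample rows is a reasonable signal that the stage is ready to be distributed; any entry points to a schema mismatch that is cheaper to fix locally.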