Use Cases

AI pipelines Vane is built for

Real user scenarios, not just examples. Each one has the same shape: the problem, the pipeline, the code, what goes in and out, and when to reach for it.

Web Text to Embeddings

embeddings
Pipeline: read_parquet → filter SQL → chunk_text → embed_text → write_parquet

Problem

Turning web-scale crawl dumps into clean, chunked embeddings usually means stitching SQL filtering, Python chunking, a GPU embedding model and Parquet output across separate systems.

Input / Output

Input: 2.1 TB Common Crawl Parquet
Output: 480M chunk embeddings · 768-d

When to use it

Building a retrieval corpus or a pretraining filter from raw crawl data.

common_crawl.py
docs = conn.sql("SELECT url, text FROM read_parquet('s3://cc/*.parquet')")
chunks = docs.map_batches(chunk_text, execution_backend="ray_task")
emb = embed_text(chunks, "text", provider="transformers", batch_size=64)
emb.write_parquet("s3://corpus/embeddings/")
Example: examples/common_crawl.py
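
The snippet above passes chunk_text to map_batches without showing it. The following is a minimal sketch of what such a batch function might look like, not Vane's implementation. It assumes a batch arrives as a dict of column lists (Vane's actual batch type is not shown on this page) and that chunks are fixed-size word windows with a small overlap. The helper chunk_words and the size and overlap parameters are hypothetical names introduced for illustration.

```python
from typing import Dict, List


def chunk_words(text: str, size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def chunk_text(batch: Dict[str, List[str]], size: int = 200,
               overlap: int = 20) -> Dict[str, List[str]]:
    """Hypothetical batch function: each input row yields zero or more chunk rows.

    Assumes the batch is a dict of equal-length column lists with
    'url' and 'text' columns, matching the SELECT in the example.
    """
    out: Dict[str, List[str]] = {"url": [], "text": []}
    for url, text in zip(batch["url"], batch["text"]):
        for chunk in chunk_words(text, size, overlap):
            out["url"].append(url)   # repeat the source URL for every chunk
            out["text"].append(chunk)
    return out
```

Overlapping windows keep a sentence that straddles a boundary visible in both neighbouring chunks, which tends to help retrieval. A token-based splitter tied to the embedding model's tokenizer would be a reasonable alternative.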

Text Deduplication

preprocessing
Pipeline: normalize → minhash → lsh_bands (flat_map) → keep one

Problem

Near-duplicate dedup at scale needs MinHash signatures and LSH bucketing wired into your data pipeline, not a one-off notebook.

Input / Output

Input: 900M documents
Output: 612M unique (32% removed)

When to use it

Cleaning a training set before tokenization or embedding.

minhash_dedupe.py
rel = conn.sql("SELECT id, text FROM read_parquet('s3://raw/*.parquet')")
sig = rel.map_batches(minhash, num_perm=128)
buckets = sig.flat_map(lsh_bands, bands=16)
buckets.map_batches(keep_one_per_cluster).write_parquet("s3://clean/")
Example: examples/minhash_dedupe.py
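
The minhash, lsh_bands and keep_one_per_cluster functions are referenced above but not shown. As a rough, single-machine sketch of the technique (not the code in examples/minhash_dedupe.py), the steps might look like the following. Assumptions made for illustration: documents are shingled into 5-character windows, seeded BLAKE2b hashes stand in for the random permutations, and the function signatures here take single documents rather than Vane batches.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

NUM_PERM = 128  # signature length, matching num_perm=128 in the example
BANDS = 16      # 16 bands of 8 rows each, matching bands=16


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, k: int = 5) -> Set[str]:
    """Set of overlapping k-character windows of the normalized text."""
    text = normalize(text)
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _seeded_hash(seed: int, shingle: str) -> int:
    # The seed goes in as a salt, giving one hash function per signature slot.
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str, num_perm: int = NUM_PERM) -> List[int]:
    """MinHash signature: the minimum hash of any shingle, per seed."""
    sh = shingles(text)
    return [min(_seeded_hash(seed, s) for s in sh) for seed in range(num_perm)]


def lsh_bands(doc_id: str, sig: List[int],
              bands: int = BANDS) -> List[Tuple[Tuple[int, Tuple[int, ...]], str]]:
    """Emit one (bucket_key, doc_id) row per band; this is the flat_map step."""
    rows = len(sig) // bands
    return [((b, tuple(sig[b * rows:(b + 1) * rows])), doc_id)
            for b in range(bands)]


def keep_one_per_cluster(docs: Dict[str, str]) -> List[str]:
    """Return the smallest doc id from each cluster of near-duplicates."""
    buckets: Dict[Tuple[int, Tuple[int, ...]], List[str]] = defaultdict(list)
    for doc_id, text in docs.items():
        for key, _ in lsh_bands(doc_id, minhash(text)):
            buckets[key].append(doc_id)

    parent = {d: d for d in docs}  # union-find over doc ids

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Documents that share any band bucket are merged into one cluster.
    for ids in buckets.values():
        for other in ids[1:]:
            a, b = find(ids[0]), find(other)
            if a != b:
                parent[max(a, b)] = min(a, b)
    return sorted({find(d) for d in docs})
```

With 16 bands of 8 rows, two documents with Jaccard similarity s share at least one bucket with probability 1 - (1 - s^8)^16, so the effective threshold sits near (1/16)^(1/8) ≈ 0.71. More bands of fewer rows lower that threshold and catch looser duplicates at the cost of more false candidates.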

Image Pipelines

vision
Pipeline: manifest SQL → decode_image → DetectFeatures (actor) → write

Problem

Decoding millions of images and running a vision model means juggling IO, CPU decode and GPU inference with the right batch sizes by hand.

Input / Output

Input: 12M images
Output: detections + CLIP features

When to use it

Tagging, filtering, or feature-extracting a large image dataset.

querying_images.py
rel = conn.sql("SELECT path FROM read_parquet('s3://images/manifest.parquet')")
imgs = rel.map_batches(decode_image, batch_size=128)
feats = imgs.map_batches(DetectFeatures, num_gpus=1, batch_size=64)
feats.write_parquet("s3://features/")
Example: examples/querying_images.py
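
The pipeline marks DetectFeatures as an actor, while decode_image is a plain function. A minimal sketch of why that split matters, under the assumption that an actor-style stage is a class constructed once per worker and then called once per batch (how Vane actually schedules it is not described on this page). The "model" below is a stand-in that averages pixel intensities, and the batched and run helpers are hypothetical.

```python
from typing import Dict, Iterator, List


class DetectFeatures:
    """Actor-style stage: expensive setup in __init__, cheap work per batch."""

    loads = 0  # counts constructions so the sketch can show setup runs once

    def __init__(self) -> None:
        # A real stage would load detector and CLIP weights onto the GPU here.
        DetectFeatures.loads += 1

    def __call__(self, batch: Dict[str, List[List[int]]]) -> Dict[str, List[float]]:
        # Stand-in model: mean 8-bit pixel intensity per image, scaled to [0, 1].
        feats = [sum(px) / (255 * len(px)) for px in batch["pixels"]]
        return {"feature": feats}


def batched(rows: List[List[int]], batch_size: int) -> Iterator[List[List[int]]]:
    """Yield consecutive slices of at most batch_size rows."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]


def run(images: List[List[int]], batch_size: int) -> List[float]:
    """Drive one actor instance over every batch, the way a worker might."""
    stage = DetectFeatures()  # constructed once, reused for every batch
    out: List[float] = []
    for chunk in batched(images, batch_size):
        out.extend(stage({"pixels": chunk})["feature"])
    return out
```

Keeping the weights in the instance means the load cost is paid once per worker rather than once per batch, which is why the decode step and the inference step can sensibly use different batch sizes.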

Image Generation

generation
Pipeline: prompts SQL → Diffusion (model actor) → write

Problem

Generating images for a whole prompt table means managing a GPU model actor, batching, and writing results back, over and over.

Input / Output

Input: 50K prompts
Output: 50K images · 1024²

When to use it

Synthetic-data generation or bulk creative rendering.

image_generation.py
prompts = conn.sql("SELECT id, prompt FROM read_parquet('s3://prompts.parquet')")
images = prompts.map_batches(
    Diffusion, num_gpus=1, batch_size=16, steps=30)
images.write_parquet("s3://generated/")
Example: examples/image_generation.py

Multimodal Structured Output

multimodal
Pipeline: image+text SQL → VLM (schema) → Judge → write

Problem

Getting reliable structured fields out of a vision-language model, and grading them, needs schema enforcement plus a second judge pass.

Input / Output

Input: 300K document images
Output: typed JSON + judge score

When to use it

Extracting structured data from documents or images at scale.

multimodal_structured_outputs.py
rel = conn.sql("SELECT id, image, question FROM 's3://docs/*.parquet'")
ans = rel.map_batches(VLM, schema=Receipt, num_gpus=1)
graded = ans.map_batches(Judge, batch_size=32)
graded.write_parquet("s3://extracted/")
Example: examples/multimodal_structured_outputs.py
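
The example passes schema=Receipt without defining it. One way to picture schema enforcement plus a judge pass, as a sketch under stated assumptions: the Receipt fields below are hypothetical (the page does not define them), parse_receipt rejects any model output that does not match the schema exactly, and judge is a stand-in made of simple sanity checks where the real pipeline would call a second model.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Receipt:
    # Hypothetical fields; the real Receipt schema is not shown in the source.
    merchant: str
    total: float
    currency: str


def parse_receipt(raw: str) -> Optional[Receipt]:
    """Return a typed Receipt, or None if the output breaks the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    expected = {f.name: f.type for f in fields(Receipt)}
    if set(data) != set(expected):
        return None  # missing or extra keys
    for name, typ in expected.items():
        value = data[name]
        if typ is float:
            # Accept JSON integers for float fields, but never booleans.
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        else:
            ok = isinstance(value, typ)
        if not ok:
            return None
    return Receipt(merchant=data["merchant"], total=float(data["total"]),
                   currency=data["currency"])


def judge(receipt: Optional[Receipt]) -> float:
    """Stand-in judge: fraction of sanity checks passed, 0.0 if parsing failed."""
    if receipt is None:
        return 0.0
    checks = [
        bool(receipt.merchant.strip()),   # merchant is not blank
        receipt.total >= 0,               # totals are non-negative
        len(receipt.currency) == 3,       # looks like a 3-letter currency code
    ]
    return sum(checks) / len(checks)
```

Returning None for malformed output, rather than raising, lets a batch stage keep going and leaves the decision to retry or drop the row to a later step.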

Voice AI Analytics

audio
Pipeline: audio SQL → Transcribe → Summarize → embed_text

Problem

A voice-analytics pipeline chains transcription, summarization, captioning and embedding — each a different model, each needing batching on GPUs.

Input / Output

Input: 120K call recordings
Output: transcript · summary · embedding

When to use it

Call analytics, meeting summaries, or audio search.

voice_ai_analytics.py
rel = conn.sql("SELECT id, audio FROM read_parquet('s3://calls/*.parquet')")
out = (rel
   .map_batches(Transcribe, num_gpus=1)
   .map_batches(Summarize, batch_size=32))
embed_text(out, "summary").write_parquet("s3://analytics/")
Example: examples/voice_ai_analytics.py

Build your first multimodal AI pipeline with Vane.