Use Cases

AI pipelines Vane is built for

Real user scenarios, not just examples. Each one has the same shape: the problem, the pipeline, the code, what goes in and out, and when to reach for it.

Web Text to Embeddings · Semantic Search · Text Deduplication · Image Pipelines · Image Generation · Multimodal Structured Output · Voice AI Analytics

Web Text to Embeddings

embeddings

read_parquet→filter SQL→chunk_text→embed_text→write_parquet

Problem

Turning web-scale crawl dumps into clean, chunked embeddings usually means stitching SQL filtering, Python chunking, a GPU embedding model and Parquet output across separate systems.

Input / Output

input: 2.1 TB Common Crawl Parquet
output: 480M chunk embeddings · 768-d

When to use it

Building a retrieval corpus or a pretraining filter from raw crawl data.

common_crawl.py

docs = conn.sql("SELECT url, text FROM read_parquet('s3://cc/*.parquet')")
chunks = docs.map_batches(chunk_text, execution_backend="ray_task")
emb = embed_text(chunks, "text", provider="transformers", batch_size=64)
emb.write_parquet("s3://corpus/embeddings/")

Example: examples/common_crawl.py
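The body of chunk_text is not shown in the snippet above. As a rough sketch, one way to write it is a word-window chunker with overlap. It assumes batches arrive as a dict of column lists, which is an assumption here and not a documented Vane batch format; chunk_words, max_words and overlap are hypothetical names.

```python
from typing import Dict, List


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into overlapping windows of at most max_words words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


def chunk_text(batch: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Hypothetical batch function: one output row per chunk, keeping the source url."""
    out: Dict[str, List[str]] = {"url": [], "text": []}
    for url, text in zip(batch["url"], batch["text"]):
        for chunk in chunk_words(text):
            out["url"].append(url)
            out["text"].append(chunk)
    return out
```

The overlap keeps a sentence that straddles a window boundary visible in both neighbouring chunks, at the cost of embedding some words twice.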

Semantic Search

retrieval

sql→embed_text→write index→cosine query

Problem

You need an offline index of a large Q&A corpus and a way to match related records without standing up a vector DB just to experiment.

Input / Output

input: 14M StackExchange questions
output: top-k similar per query

When to use it

Prototyping retrieval or near-duplicate matching over a static corpus.

semantic_search.py

rel = conn.sql("SELECT id, title, body FROM read_parquet('s3://qa/*.parquet')")
idx = embed_text(rel, "body", provider="transformers")
idx.write_parquet("s3://index/qa/")
# Assumes cosine() returns a distance (smaller = closer); add DESC if it returns a similarity.
hits = conn.sql("SELECT id FROM 's3://index/qa/' ORDER BY cosine(vec,$q) LIMIT 10")

Example: examples/semantic_search.py
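For a small prototype, the cosine query step can be reproduced without SQL. The following is a minimal sketch, assuming embeddings are plain lists of floats and that a higher cosine similarity means a closer match. The helpers cosine and top_k are illustrative and are not Vane functions.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    index: List[Tuple[int, Sequence[float]]],
    k: int = 10,
) -> List[Tuple[int, float]]:
    """Return the k (id, score) pairs most similar to query, best first."""
    scored = ((doc_id, cosine(query, vec)) for doc_id, vec in index)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

This is a brute-force scan, which is usually fine for experiments on a static corpus and avoids building an approximate-nearest-neighbour index up front.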

Text Deduplication

preprocessing

normalize→minhash→lsh_bands (flat_map)→keep one

Problem

Near-duplicate dedup at scale needs MinHash signatures and LSH bucketing wired into your data pipeline — not a one-off notebook.

Input / Output

input: 900M documents
output: 612M unique (32% removed)

When to use it

Cleaning a training set before tokenization or embedding.

minhash_dedupe.py

rel = conn.sql("SELECT id, text FROM read_parquet('s3://raw/*.parquet')")
sig = rel.map_batches(minhash, num_perm=128)
buckets = sig.flat_map(lsh_bands, bands=16)
buckets.map_batches(keep_one_per_cluster).write_parquet("s3://clean/")

Example: examples/minhash_dedupe.py
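The minhash, lsh_bands and keep_one_per_cluster steps are named in the snippet but not defined there. Below is a single-machine sketch of the general technique, using word 3-gram shingles and seeded BLAKE2b hashes. The function names mirror the snippet, but the signatures and bodies are assumptions for illustration, not the implementation in examples/minhash_dedupe.py.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def shingles(text: str, n: int = 3) -> Set[str]:
    """Lowercased word n-grams; short texts fall back to a single shingle."""
    words = text.lower().split()
    if len(words) <= n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash(text: str, num_perm: int = 128) -> List[int]:
    """Signature of num_perm minima, one per seeded hash function."""
    grams = [s.encode("utf-8") for s in shingles(text)]
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        sig.append(min(
            int.from_bytes(hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "little")
            for g in grams
        ))
    return sig


def lsh_bands(sig: List[int], bands: int = 16) -> List[Tuple[int, Tuple[int, ...]]]:
    """Split a signature into bands; each (band index, rows) pair is a bucket key."""
    rows = len(sig) // bands
    return [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]


def keep_one_per_cluster(docs: Dict[str, str]) -> List[str]:
    """Keep one representative id per group of documents that share any bucket."""
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_in_bucket: Dict[Tuple[int, Tuple[int, ...]], str] = {}
    for doc_id, text in docs.items():
        for key in lsh_bands(minhash(text)):
            other = first_in_bucket.setdefault(key, doc_id)
            if other != doc_id:
                parent[find(doc_id)] = find(other)  # merge the two clusters
    return [doc_id for doc_id in docs if find(doc_id) == doc_id]
```

With 128 hashes split into 16 bands of 8 rows, two documents land in the same cluster when all 8 values in at least one band agree. Using more bands with fewer rows each catches looser matches but also produces more false positives.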

Image Pipelines

vision

manifest sql→decode_image→DetectFeatures (actor)→write

Problem

Decoding millions of images and running a vision model means juggling IO, CPU decode and GPU inference with the right batch sizes by hand.

Input / Output

input: 12M images
output: detections + CLIP features

When to use it

Tagging, filtering, or feature-extracting a large image dataset.

querying_images.py

rel = conn.sql("SELECT path FROM read_parquet('s3://images/manifest.parquet')")
imgs = rel.map_batches(decode_image, batch_size=128)
feats = imgs.map_batches(DetectFeatures, num_gpus=1, batch_size=64)
feats.write_parquet("s3://features/")

Example: examples/querying_images.py
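The pipeline marks DetectFeatures as an actor, which suggests a class that is set up once and then called per batch. The sketch below shows that pattern in plain Python. It makes several assumptions: the "pixels" column, the dict-of-lists batch shape, and the stand-in feature (mean pixel value) are all hypothetical, and a real actor would load detector and CLIP weights instead.

```python
from typing import Dict, Iterator, List

Batch = Dict[str, List[List[float]]]


def iter_batches(images: List[List[float]], batch_size: int) -> Iterator[Batch]:
    """Group decoded images into fixed-size batches (the last one may be smaller)."""
    for start in range(0, len(images), batch_size):
        yield {"pixels": images[start:start + batch_size]}


class DetectFeatures:
    """Hypothetical actor: expensive setup in __init__, cheap per-batch work in __call__."""

    load_count = 0  # how many times the stand-in "model" has been loaded

    def __init__(self) -> None:
        # A real actor would load model weights onto the GPU here, once per worker.
        DetectFeatures.load_count += 1

    def _embed(self, pixels: List[float]) -> List[float]:
        # Stand-in for inference: mean pixel value as a one-element feature vector.
        return [sum(pixels) / len(pixels)] if pixels else [0.0]

    def __call__(self, batch: Batch) -> Batch:
        out = dict(batch)
        out["features"] = [self._embed(p) for p in batch["pixels"]]
        return out
```

Because setup happens in the constructor, the loading cost is paid once per actor instead of once per batch, which is the main reason to pass a class and not a plain function for GPU stages.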

Image Generation

generation

prompts sql→Diffusion (model actor)→write

Problem

Generating images for a whole prompt table means managing a GPU model actor, batching, and writing results back — repeatedly.

Input / Output

input: 50K prompts
output: 50K images · 1024²

When to use it

Synthetic-data generation or bulk creative rendering.

image_generation.py

prompts = conn.sql("SELECT id, prompt FROM read_parquet('s3://prompts.parquet')")
images = prompts.map_batches(
    Diffusion, num_gpus=1, batch_size=16, steps=30)
images.write_parquet("s3://generated/")

Example: examples/image_generation.py

Multimodal Structured Output

multimodal

image+text sql→VLM (schema)→Judge→write

Problem

Getting reliable structured fields out of a vision-language model — and grading them — needs schema enforcement plus a second judge pass.

Input / Output

input: 300K document images
output: typed JSON + judge score

When to use it

Extracting structured data from documents or images at scale.

multimodal_structured_outputs.py

rel = conn.sql("SELECT id, image, question FROM 's3://docs/*.parquet'")
ans = rel.map_batches(VLM, schema=Receipt, num_gpus=1)
graded = ans.map_batches(Judge, batch_size=32)
graded.write_parquet("s3://extracted/")

Example: examples/multimodal_structured_outputs.py
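The snippet passes schema=Receipt without showing what Receipt contains or how enforcement happens. One simple way to enforce a schema is to validate the model's raw JSON after generation, sketched below. The Receipt fields (merchant, total, currency) and the parse_receipt helper are hypothetical, chosen only for illustration.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Receipt:
    """Hypothetical schema; the real Receipt fields are not shown in the source."""
    merchant: str
    total: float
    currency: str


def parse_receipt(raw: str) -> Optional[Receipt]:
    """Return a Receipt if raw is a JSON object with exactly the expected typed fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if set(data) != {f.name for f in fields(Receipt)}:
        return None  # missing or unexpected keys
    if not isinstance(data["merchant"], str) or not isinstance(data["currency"], str):
        return None
    total = data["total"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        return None
    return Receipt(data["merchant"], float(total), data["currency"])
```

Rows that come back as None can be retried or routed to the judge pass with a failing score, so malformed outputs never reach the written dataset.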

Voice AI Analytics

audio

audio sql→Transcribe→Summarize→embed_text

Problem

A voice-analytics pipeline chains transcription, summarization, captioning and embedding — each a different model, each needing batching on GPUs.

Input / Output

input: 120K call recordings
output: transcript · summary · embedding

When to use it

Call analytics, meeting summaries, or audio search.

voice_ai_analytics.py

rel = conn.sql("SELECT id, audio FROM read_parquet('s3://calls/*.parquet')")
out = (rel
   .map_batches(Transcribe, num_gpus=1)
   .map_batches(Summarize, batch_size=32))
embed_text(out, "summary").write_parquet("s3://analytics/")

Example: examples/voice_ai_analytics.py

Build your first multimodal AI pipeline with Vane.
