Real user scenarios, not just examples. Each one is the same shape: the problem, the pipeline, the code, what goes in and out, and when to reach for it.
Turning web-scale crawl dumps into clean, chunked embeddings usually means stitching SQL filtering, Python chunking, a GPU embedding model and Parquet output across separate systems.
Building a retrieval corpus or a pretraining filter from raw crawl data.
docs = conn.sql("SELECT url, text FROM read_parquet('s3://cc/*.parquet')")
chunks = docs.map_batches(chunk_text, execution_backend="ray_task")
emb = embed_text(chunks, "text", provider="transformers", batch_size=64)
emb.write_parquet("s3://corpus/embeddings/")
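The pipeline above passes a `chunk_text` function to `map_batches` without showing it. A minimal sketch of what such a function might look like follows, assuming a batch arrives as a list of row dicts with `url` and `text` keys. The batch format, the `chunk_words` helper, the `chunk_id` column, and the `size` and `overlap` parameters are all illustrative assumptions, not the library's documented interface.

```python
from typing import Dict, List


def chunk_words(text: str, size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def chunk_text(batch: List[Dict[str, str]], size: int = 200,
               overlap: int = 20) -> List[Dict[str, object]]:
    """Hypothetical batch function: one input row fans out to one row per chunk."""
    out: List[Dict[str, object]] = []
    for row in batch:
        for i, piece in enumerate(chunk_words(row["text"], size, overlap)):
            out.append({"url": row["url"], "chunk_id": i, "text": piece})
    return out
```

Word-based windows with a small overlap keep sentences that straddle a boundary retrievable from either chunk. A token-based splitter matched to the embedding model would be the natural next step.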
You need an offline index of a large Q&A corpus and a way to match related records without standing up a vector DB just to experiment.
Prototyping retrieval or near-duplicate matching over a static corpus.
rel = conn.sql("SELECT id, title, body FROM read_parquet('s3://qa/*.parquet')")
idx = embed_text(rel, "body", provider="transformers")
idx.write_parquet("s3://index/qa/")
hits = conn.sql("SELECT id FROM 's3://index/qa/' ORDER BY cosine(vec,$q) LIMIT 10")
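To make the final query concrete, here is a brute-force sketch of ranking stored vectors against a query vector in plain Python. It assumes `cosine` denotes similarity (higher means closer), so results are sorted in descending order; if the function is a distance, the order flips. The `top_k` helper and the in-memory `index` dict are hypothetical stand-ins for the Parquet index.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(index: Dict[str, Sequence[float]], query: Sequence[float],
          k: int = 10) -> List[Tuple[str, float]]:
    """Score every record against the query and keep the k most similar."""
    scored = [(doc_id, cosine(vec, query)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A full scan like this costs one pass over the corpus per query, which is usually acceptable for experiments on a static corpus.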
Near-duplicate dedup at scale needs MinHash signatures and LSH bucketing wired into your data pipeline — not a one-off notebook.
Cleaning a training set before tokenization or embedding.
rel = conn.sql("SELECT id, text FROM read_parquet('s3://raw/*.parquet')")
sig = rel.map_batches(minhash, num_perm=128)
buckets = sig.flat_map(lsh_bands, bands=16)
buckets.map_batches(keep_one_per_cluster).write_parquet("s3://clean/")
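The three helpers named in the pipeline (`minhash`, `lsh_bands`, `keep_one_per_cluster`) are not shown, so the following is a standard-library sketch of the technique rather than the library's implementation. The signatures, the 3-word shingling, the seeded BLAKE2b hash family, and the union-find clustering are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[str]:
    """Overlapping k-word shingles; short texts yield a single shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed: int, token: str) -> int:
    """One member of a seeded 64-bit hash family (salted BLAKE2b)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str, num_perm: int = 128) -> Tuple[int, ...]:
    """Signature = minimum hash of the shingle set under each seeded hash."""
    toks = shingles(text)
    if not toks:
        return tuple([0] * num_perm)  # assumption: empty texts share one signature
    return tuple(min(_hash(seed, t) for t in toks) for seed in range(num_perm))


def lsh_bands(sig: Tuple[int, ...], bands: int = 16) -> List[Tuple[int, Tuple[int, ...]]]:
    """Cut the signature into `bands` equal slices; each slice is a bucket key."""
    rows = len(sig) // bands
    return [(b, sig[b * rows:(b + 1) * rows]) for b in range(bands)]


def keep_one_per_cluster(docs: Dict[str, str], num_perm: int = 128,
                         bands: int = 16) -> List[str]:
    """Union documents that share any band bucket; keep the smallest id per cluster."""
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    buckets: Dict[Tuple[int, Tuple[int, ...]], List[str]] = defaultdict(list)
    for doc_id, text in docs.items():
        for key in lsh_bands(minhash(text, num_perm), bands):
            buckets[key].append(doc_id)
    for ids in buckets.values():
        for other in ids[1:]:
            ra, rb = find(ids[0]), find(other)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)
    return sorted({find(doc_id) for doc_id in docs})
```

With 128 hashes in 16 bands of 8 rows, two documents with Jaccard similarity s become candidates with probability 1 - (1 - s^8)^16, so the band count sets how aggressive the dedup is. A production pass would normally verify candidate pairs before dropping anything.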
Decoding millions of images and running a vision model means juggling IO, CPU decode and GPU inference with the right batch sizes by hand.
Tagging, filtering, or feature-extracting a large image dataset.
rel = conn.sql("SELECT path FROM read_parquet('s3://images/manifest.parquet')")
imgs = rel.map_batches(decode_image, batch_size=128)
feats = imgs.map_batches(DetectFeatures, num_gpus=1, batch_size=64)
feats.write_parquet("s3://features/")
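For comparison, this is roughly the batching logic you would otherwise write by hand: a generator that groups a stream into fixed-size batches, and a helper that chains stages with different batch sizes. The names `batched` and `run_in_batches` are hypothetical, and the decode and inference steps are replaced by trivial stand-in lambdas.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to `batch_size` items; the last batch may be smaller."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_in_batches(items: Iterable[T], fn: Callable[[List[T]], List[U]],
                   batch_size: int) -> Iterator[U]:
    """Apply a batch function lazily and flatten its results back into a stream."""
    for batch in batched(items, batch_size):
        yield from fn(batch)


# Two stages with independent batch sizes, mirroring decode (128) then inference (64).
paths = [f"img_{i}.jpg" for i in range(300)]
decoded = run_in_batches(paths, lambda b: [p.upper() for p in b], batch_size=128)
features = list(run_in_batches(decoded, lambda b: [len(p) for p in b], batch_size=64))
```

Because both stages are generators, the second stage starts consuming as soon as the first yields, but everything still runs on one thread. Overlapping IO, CPU decode, and GPU inference is the part the pipeline is meant to handle for you.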
Generating images for a whole prompt table means managing a GPU model actor, batching, and writing results back — repeatedly.
Synthetic-data generation or bulk creative rendering.
prompts = conn.sql("SELECT id, prompt FROM read_parquet('s3://prompts.parquet')")
images = prompts.map_batches(
    Diffusion, num_gpus=1, batch_size=16, steps=30)
images.write_parquet("s3://generated/")
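A common shape for a class passed to a batch mapper is a callable that loads the model once in its constructor and then handles one batch per call. The sketch below assumes that pattern applies to `Diffusion`; `FakeDiffusion`, its row-dict batch format, and the fake "model" are hypothetical stand-ins, with no real GPU work involved.

```python
from typing import Callable, Dict, List


class FakeDiffusion:
    """Hypothetical stand-in for a model actor: expensive setup once, cheap calls after."""

    load_count = 0  # counts model loads across instances, for demonstration only

    def __init__(self, steps: int = 30) -> None:
        self.steps = steps
        self.model = self._load_model()

    def _load_model(self) -> Callable[[str, int], bytes]:
        # A real actor would load weights onto the GPU here.
        FakeDiffusion.load_count += 1
        return lambda prompt, steps: f"<image:{prompt}:{steps}>".encode("utf-8")

    def __call__(self, batch: List[Dict[str, object]]) -> List[Dict[str, object]]:
        """Render one image per prompt row, carrying the id through for the write-back."""
        return [
            {"id": row["id"], "image": self.model(str(row["prompt"]), self.steps)}
            for row in batch
        ]
```

Keeping the load in `__init__` means the cost is paid once per worker rather than once per batch, which is what makes repeated runs over a prompt table practical.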
Getting reliable structured fields out of a vision-language model — and grading them — needs schema enforcement plus a second judge pass.
Extracting structured data from documents or images at scale.
rel = conn.sql("SELECT id, image, question FROM 's3://docs/*.parquet'")
ans = rel.map_batches(VLM, schema=Receipt, num_gpus=1)
graded = ans.map_batches(Judge, batch_size=32)
graded.write_parquet("s3://extracted/")
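Schema enforcement here means rejecting any model output that does not parse into the expected fields and types. The source does not define `Receipt`, so the sketch below assumes a hypothetical three-field version and validates raw JSON text against it using only the standard library. `parse_receipt` is an illustrative helper, not part of the pipeline's API.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Receipt:
    """Hypothetical schema; the real Receipt fields are not given in the source."""
    merchant: str
    total: float
    currency: str


def parse_receipt(raw: str) -> Optional[Receipt]:
    """Return a Receipt if `raw` is JSON with exactly the expected fields, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != {"merchant", "total", "currency"}:
        return None
    merchant, total, currency = data["merchant"], data["total"], data["currency"]
    if not isinstance(merchant, str) or not isinstance(currency, str):
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        return None
    return Receipt(merchant, float(total), currency)
```

Rows that come back as `None` can be retried or routed to the judge pass with a failing grade, so malformed output never reaches the Parquet sink silently.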
A voice-analytics pipeline chains transcription, summarization, captioning and embedding — each a different model, each needing batching on GPUs.
Call analytics, meeting summaries, or audio search.
rel = conn.sql("SELECT id, audio FROM read_parquet('s3://calls/*.parquet')")
out = (rel
    .map_batches(Transcribe, num_gpus=1)
    .map_batches(Summarize, batch_size=32))
embed_text(out, "summary").write_parquet("s3://analytics/")