Dedup and Clean
Use Vane Data for deduplication and cleaning when you want SQL filtering, Python text processing, and table outputs in one pipeline.
Vane Data does not provide a built-in MinHash or data-quality framework. It gives you relation management, SQL, UDF execution, and write paths; the cleaning and deduplication logic remains explicit Python or SQL code.
Pipeline shape
A typical text-cleaning pipeline follows this shape:
- Load text rows with stable IDs and source metadata.
- Normalize text with SQL or a Python UDF.
- Remove exact duplicates when normalized text is identical.
- Use approximate similarity logic, such as MinHash, only when exact dedup is not enough.
- Keep one representative per duplicate group.
- Write retained rows and audit columns.
Minimal cleaning UDF
import re import pyarrow as pa def clean_text(batch: pa.Table) -> pa.Table: ids = batch.column("id").to_pylist() text = batch.column("text").to_pylist() cleaned = [] for value in text: s = str(value or "") s = re.sub(r"\s+", " ", s).strip().lower() cleaned.append(s) return pa.table({ "id": ids, "text": text, "clean_text": cleaned, }) cleaned = rel.map_batches( clean_text, schema={ "id": "BIGINT", "text": "VARCHAR", "clean_text": "VARCHAR", }, batch_size=1024, execution_backend="subprocess_task", )
Keep the original text column available until validation is complete. Cleaning rules can remove useful information if they are too aggressive.
Exact dedup with SQL
For small or highly normalized datasets, SQL exact dedup is often enough:
cleaned.to_table("cleaned") deduped = con.sql(""" select any_value(id) as id, any_value(text) as text, clean_text, count(*) as duplicate_count from cleaned group by clean_text """)
This keeps one representative per normalized text value and records how many rows collapsed into each group.
Near-duplicate strategy
For near-duplicate text, use a Python UDF to compute signatures, then use SQL for grouping and scoring.
Common approach:
- Normalize text.
- Compute a signature, such as MinHash, in map_batches.
- Generate candidate pairs with bounded LSH buckets.
- Score candidates.
- Build duplicate components.
- Keep one representative per component.
Keep candidate expansion bounded. Approximate dedup pipelines can become expensive when a bucket grows too large.
Columns to keep
For training data and retrieval datasets, keep audit-friendly columns:
- stable source ID
- original URI or file path
- raw text
- normalized text
- duplicate component ID
- representative flag
- duplicate count or similarity score
- filter reason or quality score
These columns make it possible to explain why a row was retained or removed.
Scaling
Start with local execution:
signatures = cleaned.map_batches( signature_batch, schema={ "id": "BIGINT", "clean_text": "VARCHAR", "signature": "UBIGINT[]", }, batch_size=1024, execution_backend="subprocess_task", )
Move CPU-heavy signature generation to Ray when distribution helps:
import vane vane.configure(runner="ray") signatures = cleaned.map_batches( signature_batch, schema={ "id": "BIGINT", "clean_text": "VARCHAR", "signature": "UBIGINT[]", }, batch_size=2048, execution_backend="ray_task", )
Replace signature_batch with your own deterministic signature function. Return stable IDs and normalized text along with the signature so later SQL stages can audit and group rows.
Practical checks
- Sample removed rows before trusting a rule.
- Track row counts at each stage.
- Keep raw text until the final validation step.
- Make normalization deterministic.
- Keep representative selection deterministic.
- Write retained rows and rejected-row summaries separately when auditing matters.