Vane Data / Guides

Dedup and Clean

Use Vane Data for deduplication and cleaning when you want SQL filtering, Python text processing, and table outputs in one pipeline.

Vane Data does not provide a built-in MinHash or data-quality framework. It gives you relation management, SQL, UDF execution, and write paths; the cleaning and deduplication logic remains explicit Python or SQL code.

Pipeline shape

A typical text-cleaning pipeline follows this shape:

Load text rows with stable IDs and source metadata.
Normalize text with SQL or a Python UDF.
Remove exact duplicates when normalized text is identical.
Use approximate similarity logic, such as MinHash, only when exact dedup is not enough.
Keep one representative per duplicate group.
Write retained rows and audit columns.

Minimal cleaning UDF

example.py

import re


import pyarrow as pa


def clean_text(batch: pa.Table) -> pa.Table:
    ids = batch.column("id").to_pylist()
    text = batch.column("text").to_pylist()
    cleaned = []


    for value in text:
        s = str(value or "")
        s = re.sub(r"\s+", " ", s).strip().lower()
        cleaned.append(s)


    return pa.table({
        "id": ids,
        "text": text,
        "clean_text": cleaned,
    })


cleaned = rel.map_batches(
    clean_text,
    schema={
        "id": "BIGINT",
        "text": "VARCHAR",
        "clean_text": "VARCHAR",
    },
    batch_size=1024,
    execution_backend="subprocess_task",
)

Keep the original text column available until validation is complete. Cleaning rules can remove useful information if they are too aggressive.

Exact dedup with SQL

For small or highly normalized datasets, SQL exact dedup is often enough:

example.py

cleaned.to_table("cleaned")


deduped = con.sql("""
    select
        any_value(id) as id,
        any_value(text) as text,
        clean_text,
        count(*) as duplicate_count
    from cleaned
    group by clean_text
""")

This keeps one representative per normalized text value and records how many rows collapsed into each group.

Near-duplicate strategy

For near-duplicate text, use a Python UDF to compute signatures, then use SQL for grouping and scoring.

Common approach:

Normalize text.
Compute a signature, such as MinHash, in map_batches.
Generate candidate pairs with bounded LSH buckets.
Score candidates.
Build duplicate components.
Keep one representative per component.

Keep candidate expansion bounded. Approximate dedup pipelines can become expensive when a bucket grows too large.

Columns to keep

For training data and retrieval datasets, keep audit-friendly columns:

stable source ID
original URI or file path
raw text
normalized text
duplicate component ID
representative flag
duplicate count or similarity score
filter reason or quality score

These columns make it possible to explain why a row was retained or removed.

Scaling

Start with local execution:

example.py

signatures = cleaned.map_batches(
    signature_batch,
    schema={
        "id": "BIGINT",
        "clean_text": "VARCHAR",
        "signature": "UBIGINT[]",
    },
    batch_size=1024,
    execution_backend="subprocess_task",
)

Move CPU-heavy signature generation to Ray when distribution helps:

example.py

import vane


vane.configure(runner="ray")


signatures = cleaned.map_batches(
    signature_batch,
    schema={
        "id": "BIGINT",
        "clean_text": "VARCHAR",
        "signature": "UBIGINT[]",
    },
    batch_size=2048,
    execution_backend="ray_task",
)

Replace signature_batch with your own deterministic signature function. Return stable IDs and normalized text along with the signature so later SQL stages can audit and group rows.

Practical checks

Sample removed rows before trusting a rule.
Track row counts at each stage.
Keep raw text until the final validation step.
Make normalization deterministic.
Keep representative selection deterministic.
Write retained rows and rejected-row summaries separately when auditing matters.