Sizing

Sizing Vane Data jobs is mostly about data volume, UDF cost, and model resources.

Start with a small run

Record:

Then scale one variable at a time.
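One way to keep a sweep honest is to generate run configurations that differ from the small baseline run in exactly one setting. The sketch below is only illustrative: the setting names (`concurrency`, `batch_size`) and their values are hypothetical placeholders, not Vane Data parameters, and the helper `one_at_a_time` is not part of any library.

```python
def one_at_a_time(baseline, candidates):
    """Yield configs that differ from the baseline in exactly one setting."""
    for name, values in candidates.items():
        for value in values:
            if value == baseline[name]:
                continue  # skip the baseline value itself
            # Copy the baseline and override a single setting.
            yield {**baseline, name: value}


# Hypothetical baseline from the small run, plus values to try per setting.
baseline = {"concurrency": 1, "batch_size": 32}
candidates = {"concurrency": [1, 2, 4], "batch_size": [32, 64]}

runs = list(one_at_a_time(baseline, candidates))
for run in runs:
    print(run)
```

Because each generated run changes a single setting, any difference in the recorded measurements can be attributed to that setting alone.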

Use:

example.py

execution_backend="subprocess_task"

Start with:

Adjust based on memory and latency.

Use:

example.py

execution_backend="subprocess_actor"

Start with:

Increase concurrency only if CPU/GPU resources remain idle.

Use:

example.py

execution_backend="ray_task"

Good for CPU-heavy stateless work. Watch for scheduling overhead when batches are too small.

Use:

example.py

execution_backend="ray_actor"

Good for stateful model inference.

Sizing rule:

required GPUs = concurrency * gpus

If concurrency=4 and gpus=1, the Ray cluster needs at least 4 available GPUs for full parallelism.
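The sizing rule can be checked with a few lines of arithmetic before submitting a job. This is a minimal sketch: `required_gpus` and `max_full_concurrency` are hypothetical helper names, and `concurrency` and `gpus` here are plain numbers mirroring the settings above rather than calls into Vane Data or Ray.

```python
import math


def required_gpus(concurrency, gpus):
    """GPUs needed for every actor to run at once (concurrency * gpus)."""
    return concurrency * gpus


def max_full_concurrency(available_gpus, gpus):
    """Largest concurrency the cluster can run fully in parallel."""
    if gpus <= 0:
        raise ValueError("gpus must be positive")
    return math.floor(available_gpus / gpus)


# The example from the text: 4 actors with 1 GPU each.
needed = required_gpus(concurrency=4, gpus=1)
print(needed)

# Working backwards: a cluster with 8 GPUs and 2 GPUs per actor.
print(max_full_concurrency(available_gpus=8, gpus=2))
```

If the cluster has fewer GPUs available than `required_gpus` returns, expect less than full parallelism.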

Relevant public settings:

Use scan task grouping when many small files would otherwise create too many scan tasks.

Embeddings can dominate output size.

Estimate:

rows * dimensions * 4 bytes

For 10 million rows and 768 float32 dimensions, raw vector values alone are about 30.7 GB (roughly 28.6 GiB) before file format overhead.
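The estimate is easy to reproduce for other row counts and dimensions. The helper below is a hypothetical sketch, not a Vane Data function; it assumes float32 values (4 bytes each) by default and ignores file format overhead and compression.

```python
def embedding_bytes(rows, dimensions, bytes_per_value=4):
    """Raw size of the vector values: rows * dimensions * bytes per value."""
    return rows * dimensions * bytes_per_value


# The example from the text: 10 million rows of 768 float32 dimensions.
size = embedding_bytes(rows=10_000_000, dimensions=768)

print(f"{size / 1e9:.2f} GB")     # decimal gigabytes
print(f"{size / 2**30:.2f} GiB")  # binary gibibytes
```

Passing `bytes_per_value=2` gives the corresponding figure for float16 vectors, which halves the raw size.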

Fix projection, actor reuse, shared storage, provider limits, and partitioning before adding more machines.