Performance Tuning

Start with a correct local pipeline, then tune the smallest number of variables.

Runner

Local:

example.py
import vane
vane.configure(runner="native")

Ray:

example.py
import vane
vane.configure(runner="ray")

The native runner streams Arrow record batches. The Ray runner initializes Ray when needed and sends a logical plan to a Ray query driver.

Batch size

For map_batches, batch_size controls how many rows are passed to one UDF call.

Small batches:

  • lower memory
  • worse model throughput
  • more scheduling overhead

Large batches:

  • better throughput
  • higher memory
  • slower retries, since a failed batch repeats more work

Tune batch_size on representative data. Row width and per-call model cost determine where the trade-off lands.
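One way to compare batch sizes is to time the UDF on its own against an in-memory sample before running the full pipeline. The sketch below is plain Python and does not call Vane. The names fake_model, iter_batches, and sweep_batch_sizes are hypothetical stand-ins for your own UDF and data.

```python
import time
from typing import Callable, Iterator, Sequence


def iter_batches(rows: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def fake_model(batch: Sequence[int]) -> list:
    """Hypothetical stand-in for a model UDF; replace with your own."""
    return [value * 2 for value in batch]


def sweep_batch_sizes(rows: Sequence, udf: Callable, batch_sizes: Sequence[int]) -> list:
    """Time udf over rows once per batch size and return one dict per size."""
    results = []
    for batch_size in batch_sizes:
        started = time.perf_counter()
        calls = 0
        output_rows = 0
        for batch in iter_batches(rows, batch_size):
            output_rows += len(udf(batch))
            calls += 1
        results.append({
            "batch_size": batch_size,
            "udf_calls": calls,
            "output_rows": output_rows,
            "elapsed_seconds": time.perf_counter() - started,
        })
    return results


if __name__ == "__main__":
    sample = list(range(10_000))
    for result in sweep_batch_sizes(sample, fake_model, [1, 64, 1024]):
        print(result)
```

This measures only the UDF call pattern. Scheduling overhead and memory pressure still need to be measured in the real runner.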

Actor reuse

Use actor backends for model inference:

  • subprocess_actor for local stateful reuse
  • ray_actor for distributed stateful reuse

Use task backends for light stateless work:

  • subprocess_task
  • ray_task

CPU and GPU resources

map_batches accepts:

  • cpus
  • gpus
  • concurrency

Ray uses resource requests for placement. Make sure the Ray cluster advertises matching resources before increasing concurrency.
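Before raising concurrency, it can help to estimate how many workers the cluster could place at all. The helper below is a hypothetical sketch, not part of Vane. It assumes a resource dictionary shaped like the one returned by Ray's ray.cluster_resources(), with keys such as "CPU" and "GPU", and per-worker requests matching the cpus and gpus arguments.

```python
import math


def max_placeable_concurrency(cluster: dict, cpus: float, gpus: float = 0.0) -> int:
    """Return how many workers fit, given per-worker cpus and gpus requests.

    cluster is a mapping such as {"CPU": 16.0, "GPU": 2.0}.
    """
    limits = []
    if cpus > 0:
        limits.append(math.floor(cluster.get("CPU", 0.0) / cpus))
    if gpus > 0:
        limits.append(math.floor(cluster.get("GPU", 0.0) / gpus))
    if not limits:
        raise ValueError("request at least one of cpus or gpus")
    return min(limits)


if __name__ == "__main__":
    # In a live Ray session this dict could come from ray.cluster_resources().
    resources = {"CPU": 16.0, "GPU": 2.0}
    print(max_placeable_concurrency(resources, cpus=2, gpus=1))
```

The result is an upper bound. It treats the cluster as one pool, while Ray places each task or actor on a single node, so fragmented nodes can fit fewer workers.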

Ray scan task grouping

Public config:

example.py
import vane
vane.configure(ray_scan_task_size_grouping=True)

Environment variable:

shell
export VANE_RAY_SCAN_TASK_SIZE_GROUPING=true

The registry describes this as size-based scan task grouping, useful for merging small files into larger scan tasks.

Related public config:

  • ray_max_task_backlog
  • ray_scan_task_open_cost_bytes
  • ray_scan_task_min_partition_num
  • ray_init_sql

Backpressure

VANE_RAY_MAX_TASK_BACKLOG limits the number of pending Ray tasks when set to a positive value. The public registry documents 0 as unlimited.

Use it when the driver can submit tasks faster than workers can consume them.
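The idea behind a backlog limit can be illustrated with the standard library. This is a sketch of the general technique, not Vane's or Ray's implementation. The function run_with_backlog and its parameters are hypothetical, and a thread pool stands in for Ray workers.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_with_backlog(items, fn, max_backlog: int = 0, workers: int = 4):
    """Apply fn to items, keeping at most max_backlog unfinished tasks.

    max_backlog <= 0 disables the limit, mirroring "0 means unlimited".
    Returns (results, peak_unfinished). Result order is not guaranteed.
    """
    results = []
    pending = set()
    peak = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for item in items:
            # Block the submitter while the backlog is full.
            while max_backlog > 0 and len(pending) >= max_backlog:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                results.extend(future.result() for future in done)
            pending.add(pool.submit(fn, item))
            peak = max(peak, len(pending))
        # Drain whatever is still unfinished.
        done, _ = wait(pending)
        results.extend(future.result() for future in done)
    return results, peak


if __name__ == "__main__":
    squares, peak = run_with_backlog(range(100), lambda x: x * x, max_backlog=8)
    print(len(squares), peak)
```

With a limit, the submitting loop pauses until a task finishes, which bounds the memory held by queued work. Without one, every item is submitted up front.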

Native runner batch size

VANE_NATIVE_RUNNER_BATCH_SIZE overrides the native runner batch size when you need an explicit value for measurement or memory control.

shell
export VANE_NATIVE_RUNNER_BATCH_SIZE=4096

Avoid common bottlenecks

  • Project only columns needed by the UDF.
  • Filter before model calls.
  • Keep image/audio/video decoding separate from model inference when tuning.
  • Write to a filesystem all workers can access.
  • Avoid one row per UDF call for model workloads.
  • Use structured output schemas instead of Python object columns.

Measure

Keep a simple run record:

  • input rows
  • input bytes
  • batch size
  • backend
  • concurrency
  • GPUs per actor
  • elapsed time
  • output rows
  • failure count
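The fields above map naturally onto a small record type. The sketch below is one hypothetical way to keep it, using only the standard library. RunRecord, to_json_line, and append_record are illustrative names, not Vane APIs, and the rows-per-second figure is an added convenience derived from the listed fields.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunRecord:
    input_rows: int
    input_bytes: int
    batch_size: int
    backend: str
    concurrency: int
    gpus_per_actor: float
    elapsed_seconds: float
    output_rows: int
    failure_count: int

    def rows_per_second(self) -> float:
        """Output rows divided by elapsed time, or 0.0 if no time was recorded."""
        if self.elapsed_seconds <= 0:
            return 0.0
        return self.output_rows / self.elapsed_seconds


def to_json_line(record: RunRecord) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)


def append_record(path: str, record: RunRecord) -> None:
    """Append the record to a JSON Lines file so runs can be compared later."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(to_json_line(record) + "\n")


if __name__ == "__main__":
    record = RunRecord(
        input_rows=1000, input_bytes=4096, batch_size=64, backend="ray_actor",
        concurrency=2, gpus_per_actor=1.0, elapsed_seconds=2.0,
        output_rows=1000, failure_count=0,
    )
    print(to_json_line(record), record.rows_per_second())
```

Changing one field per run keeps the records comparable.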

Benchmark workloads can be useful references for environment-driven performance experiments. Do not treat their file locations, storage endpoints, or model defaults as general deployment defaults.