GPU Inference UDF

Use GPU UDFs when a model should be loaded once, kept warm, and reused across many Arrow batches.

Vane Data provides the relation boundary, actor-style UDF execution, and Ray resource requests. Your UDF or AI provider is responsible for loading the model, selecting the device, batching inputs, and returning Arrow-compatible columns.

When to use this pattern

Use a GPU actor when:

  • Model initialization is expensive.
  • Each batch needs the same model or tokenizer state.
  • The model library benefits from batched inputs.
  • Ray should place model actors on GPU workers.

Do not use a GPU actor for lightweight row-wise logic, simple SQL transforms, or provider calls that are rate-limit bound rather than compute bound.

Actor contract

Pass a callable class, not an already-created instance, to map_batches(...). Vane Data constructs the class inside the executor so each actor owns its model state.

example.py
import pyarrow as pa


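class ModelBatch:
    def __init__(self) -> None:
        # Runs once per actor. load_model_on_worker is a placeholder for
        # your own model-loading code.
        self.model = load_model_on_worker()

    def __call__(self, batch: pa.Table) -> pa.Table:
        ids = batch.column("id").to_pylist()
        values = batch.column("input").to_pylist()
        outputs = self.model.predict(values)

        # Return the ids alongside the outputs so rows can be matched later.
        return pa.table({
            "id": ids,
            "output": outputs,
        })


# rel is an existing relation with "id" and "input" columns.
out = rel.map_batches(
    ModelBatch,  # pass the class itself, not ModelBatch()
    schema={
        "id": "BIGINT",
        "output": "VARCHAR",
    },
    batch_size=32,
    execution_backend="ray_actor",
    gpus=1,
    concurrency=4,
)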

gpus=1 asks Ray to reserve one GPU per actor. It does not automatically move tensors or models to the GPU; do that inside the model initialization code for the library you use.
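
Because the GPU reservation does not place the model on a device, the actor's __init__ has to choose one. The following is a minimal sketch, assuming a PyTorch-based model library. pick_device and DeviceAwareBatch are hypothetical names for illustration, not part of Vane Data, and the helper falls back to CPU when PyTorch or CUDA is unavailable.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


class DeviceAwareBatch:
    """Hypothetical actor skeleton: the device is chosen once per actor."""

    def __init__(self) -> None:
        self.device = pick_device()
        # Load your model here and move it to self.device,
        # for example model.to(self.device) with a PyTorch module.


actor = DeviceAwareBatch()
print(actor.device)
```

Falling back to CPU keeps local debugging possible on machines without a GPU. In production you may prefer to raise an error instead, so that a misplaced actor fails loudly.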

SentenceTransformer example

This example uses a stateful actor and explicitly selects a CUDA device. Adapt the model name, output type, and device selection to your runtime.

example.py
import pyarrow as pa


class SentenceEmbedder:
    def __init__(self) -> None:
        # Import inside __init__ so the library is loaded on the worker,
        # not at module import time.
        from sentence_transformers import SentenceTransformer

        self.model = SentenceTransformer(
            "sentence-transformers/all-MiniLM-L6-v2",
            device="cuda",
        )

    def __call__(self, batch: pa.Table) -> pa.Table:
        ids = batch.column("id").to_pylist()
        texts = batch.column("text").to_pylist()
        # encode returns a NumPy array; tolist() turns it into nested lists.
        vectors = self.model.encode(texts, convert_to_numpy=True).tolist()

        return pa.table({
            "id": ids,
            "embedding": vectors,
        })


embedded = rel.map_batches(
    SentenceEmbedder,
    schema={
        "id": "BIGINT",
        "embedding": "FLOAT[]",
    },
    batch_size=64,
    execution_backend="ray_actor",
    gpus=1,
    concurrency=2,
)

Return identifiers such as id when downstream stages need to join model outputs back to source rows.

vLLM provider

The vllm AI provider supports prompt(...) and maps provider options into UDF execution options.

example.py
response_only = rel.prompt(
    "prompt",
    provider="vllm",
    execution_backend="ray_actor",
    batch_size=32,
    concurrency=2,
    gpus_per_actor=1,
)

gpus_per_actor is a vLLM provider option. The provider maps it to the UDF GPU request internally. Validate the selected model, vLLM options, and GPU memory requirements with a small batch before scaling.

Local first

Debug the batch contract locally before moving to Ray:

example.py
out = rel.map_batches(
    ModelBatch,
    schema={"id": "BIGINT", "output": "VARCHAR"},
    batch_size=8,
    execution_backend="subprocess_actor",
)

Then configure Ray before creating and executing the distributed pipeline:

example.py
import vane


vane.configure(runner="ray")

Use execution_backend="ray_actor" for the GPU actor stage.

Tuning order

Tune in this order:

  1. Project only the columns required by the model.
  2. Choose a batch size that keeps the GPU busy without exhausting memory.
  3. Set gpus to the number of GPUs each actor needs.
  4. Increase concurrency only when the cluster has enough available GPUs.
  5. Keep the output schema narrow and Arrow-compatible.

Sizing rule:

required GPUs = concurrency * gpus

If concurrency=4 and gpus=1, the Ray cluster needs at least 4 available GPUs for full parallelism.
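
The sizing rule is simple enough to check in code before submitting a job. This is a small sketch with hypothetical helper names. It assumes whole-number GPU requests and that you already know how many GPUs the cluster has free.

```python
def required_gpus(concurrency: int, gpus: int) -> int:
    """GPUs needed for every actor to run at the same time."""
    return concurrency * gpus


def max_concurrency(available_gpus: int, gpus: int) -> int:
    """Largest concurrency the cluster can run at full parallelism."""
    if gpus <= 0:
        raise ValueError("gpus must be positive")
    return available_gpus // gpus


print(required_gpus(concurrency=4, gpus=1))       # 4
print(max_concurrency(available_gpus=8, gpus=2))  # 4
```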

Failure boundaries

Avoid loading models at module import time. Load them in __init__ so workers own their model state.

Avoid returning Python objects that Arrow cannot serialize. Convert outputs to strings, numbers, lists, bytes, or Arrow arrays.
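
One way to apply this is to normalize every model output before building the result table. The helper below is a sketch and not part of Vane Data. Its name and its choice to serialize mappings as JSON strings are assumptions; a struct column would be another valid choice.

```python
import json
from typing import Any


def to_arrow_friendly(value: Any) -> Any:
    """Convert one model output into a plain value Arrow can store."""
    if value is None or isinstance(value, (str, bytes, bool, int, float)):
        return value
    if hasattr(value, "tolist"):  # e.g. NumPy arrays and scalars
        return to_arrow_friendly(value.tolist())
    if isinstance(value, (list, tuple)):
        return [to_arrow_friendly(item) for item in value]
    if isinstance(value, dict):
        # Serialize mappings to a JSON string so the column stays text.
        return json.dumps(value, sort_keys=True, default=str)
    return str(value)  # last resort: keep a readable representation


outputs = [{"label": "cat", "score": 0.9}, (1, 2), None]
print([to_arrow_friendly(o) for o in outputs])
```

Whatever conversion you choose, keep it consistent within a column so every batch matches the declared schema.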

Make sure every Ray worker can access:

  • The Python module that defines the UDF class.
  • Model files or model registry credentials.
  • Storage credentials for input and output data.
  • The same package versions needed by the model library.
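
Part of this checklist can be verified with a small preflight function that uses only the Python standard library. This is a sketch. preflight is a hypothetical helper, the module and package names are placeholders, and running it on each worker is left to your deployment tooling. It does not check credentials or model files.

```python
import importlib.util
from importlib import metadata


def preflight(modules: list[str], packages: dict[str, str]) -> list[str]:
    """Return a list of problems found in the current Python environment."""
    problems = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False  # e.g. the parent package of a dotted name is missing
        if not found:
            problems.append(f"module not importable: {name}")
    for package, expected in packages.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"package not installed: {package}")
            continue
        if installed != expected:
            problems.append(f"{package}: expected {expected}, found {installed}")
    return problems


# "my_udfs" and the pinned version are hypothetical; substitute your own.
print(preflight(["json", "my_udfs"], {"sentence-transformers": "3.0.0"}))
```

An empty list means the checked modules and package versions match. Any other result names the item to fix on that worker.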