GPU Inference UDF
Use GPU UDFs when a model should be loaded once, kept warm, and reused across many Arrow batches.
Vane Data provides the relation boundary, actor-style UDF execution, and Ray resource requests. Your UDF or AI provider is responsible for loading the model, selecting the device, batching inputs, and returning Arrow-compatible columns.
When to use this pattern
Use a GPU actor when:
- Model initialization is expensive.
- Each batch needs the same model or tokenizer state.
- The model library benefits from batched inputs.
- Ray should place model actors on GPU workers.
Do not use a GPU actor for lightweight row-wise logic, simple SQL transforms, or provider calls that are rate-limit bound rather than compute bound.
Actor contract
Pass a callable class, not an already-created instance, to map_batches(...). Vane Data constructs the class inside the executor so each actor owns its model state.
    import pyarrow as pa


    class ModelBatch:
        def __init__(self) -> None:
            # Runs once per actor. load_model_on_worker() stands in for
            # your own model-loading code.
            self.model = load_model_on_worker()

        def __call__(self, batch: pa.Table) -> pa.Table:
            # Runs once per batch and reuses the already-loaded model.
            ids = batch.column("id").to_pylist()
            values = batch.column("input").to_pylist()
            outputs = self.model.predict(values)
            return pa.table({
                "id": ids,
                "output": outputs,
            })


    # Pass the class itself; the executor constructs it on each actor.
    out = rel.map_batches(
        ModelBatch,
        schema={
            "id": "BIGINT",
            "output": "VARCHAR",
        },
        batch_size=32,
        execution_backend="ray_actor",
        gpus=1,
        concurrency=4,
    )
gpus=1 asks Ray to reserve one GPU per actor. It does not automatically move tensors or models to the GPU; do that inside the model initialization code for the library you use.
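A minimal sketch of that device selection, assuming the model library is built on PyTorch. select_device and DeviceAwareModel are hypothetical names used for illustration and are not part of Vane Data. By default Ray exposes the reserved GPU to the actor through CUDA_VISIBLE_DEVICES, so "cuda" inside the actor should refer to the GPU Ray assigned.

```python
def select_device(prefer: str = "cuda") -> str:
    """Return `prefer` when CUDA is usable in this process, else "cpu"."""
    if prefer == "cpu":
        return "cpu"
    try:
        import torch  # assumption: the model library is built on PyTorch
    except ImportError:
        return "cpu"
    return prefer if torch.cuda.is_available() else "cpu"


class DeviceAwareModel:
    """Hypothetical actor skeleton: the device is chosen once, in __init__."""

    def __init__(self) -> None:
        self.device = select_device()
        # Load the model here and move it to self.device, for example
        # model.to(self.device) for a torch.nn.Module.
```

Falling back to "cpu" keeps the same class usable for local debugging on a machine without a GPU. If a silent CPU fallback would hide a misconfigured cluster, raise an error instead.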
SentenceTransformer example
This example uses a stateful actor and explicitly selects a CUDA device. Adapt the model name, output type, and device selection to your runtime.
    import pyarrow as pa


    class SentenceEmbedder:
        def __init__(self) -> None:
            # Imported here so the library is only loaded on the worker.
            from sentence_transformers import SentenceTransformer

            self.model = SentenceTransformer(
                "sentence-transformers/all-MiniLM-L6-v2",
                device="cuda",
            )

        def __call__(self, batch: pa.Table) -> pa.Table:
            ids = batch.column("id").to_pylist()
            texts = batch.column("text").to_pylist()
            # Convert the NumPy result to nested Python lists for Arrow.
            vectors = self.model.encode(texts, convert_to_numpy=True).tolist()
            return pa.table({
                "id": ids,
                "embedding": vectors,
            })


    embedded = rel.map_batches(
        SentenceEmbedder,
        schema={
            "id": "BIGINT",
            "embedding": "FLOAT[]",
        },
        batch_size=64,
        execution_backend="ray_actor",
        gpus=1,
        concurrency=2,
    )
Return identifiers such as id when downstream stages need to join model outputs back to source rows.
vLLM provider
The vllm AI provider supports prompt(...) and maps provider options into UDF execution options.
    response_only = rel.prompt(
        "prompt",
        provider="vllm",
        execution_backend="ray_actor",
        batch_size=32,
        concurrency=2,
        gpus_per_actor=1,
    )
gpus_per_actor is a vLLM provider option. The provider maps it to the UDF GPU request internally. Validate the selected model, vLLM options, and GPU memory requirements with a small batch before scaling.
Local first
Debug the batch contract locally before moving to Ray:
    out = rel.map_batches(
        ModelBatch,
        schema={"id": "BIGINT", "output": "VARCHAR"},
        batch_size=8,
        execution_backend="subprocess_actor",
    )
Then configure Ray before creating and executing the distributed pipeline:
    import vane

    vane.configure(runner="ray")
Use execution_backend="ray_actor" for the GPU actor stage.
Tuning order
Tune in this order:
- Project only the columns required by the model.
- Choose a batch size that keeps the GPU busy without exhausting memory.
- Set gpus to the number of GPUs each actor needs (gpus_per_actor for the vLLM provider).
- Increase concurrency only when the cluster has enough available GPUs.
- Keep the output schema narrow and Arrow-compatible.
Sizing rule:
    required GPUs = concurrency * gpus
If concurrency=4 and gpus=1, the Ray cluster needs at least 4 available GPUs for full parallelism.
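The sizing rule can be written as a small check to run before you raise concurrency. This is an illustrative sketch: required_gpus and max_concurrency are hypothetical helpers, not Vane Data or Ray APIs, and they assume a whole number of GPUs per actor.

```python
def required_gpus(concurrency: int, gpus: int) -> int:
    """GPUs the cluster must have free for every actor to run at once."""
    return concurrency * gpus


def max_concurrency(available_gpus: int, gpus: int) -> int:
    """Largest concurrency the available GPUs can serve in parallel."""
    if gpus <= 0:
        raise ValueError("gpus must be a positive whole number")
    return available_gpus // gpus


needed = required_gpus(concurrency=4, gpus=1)
fits = max_concurrency(available_gpus=3, gpus=2)
```

Here needed is 4, matching the case above, and fits is 1 because three GPUs can host only one actor that reserves two.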
Failure boundaries
Avoid loading models at module import time. Load them in __init__ so workers own their model state.
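A small sketch of why __init__ is the right place for the load: it runs once per constructed instance, and every later batch reuses the result. FakeModel and its load counter are stand-ins for a real model library, and the batches are plain lists rather than Arrow tables to keep the example self-contained.

```python
class FakeModel:
    """Stand-in for an expensive model; counts how many times it is loaded."""

    loads = 0

    def __init__(self) -> None:
        FakeModel.loads += 1

    def predict(self, values):
        return [v.upper() for v in values]


class UppercaseUDF:
    def __init__(self) -> None:
        # Runs once, when the instance is constructed.
        self.model = FakeModel()

    def __call__(self, values):
        # Runs once per batch and reuses the already-loaded model.
        return self.model.predict(values)


udf = UppercaseUDF()
results = [udf(batch) for batch in (["a", "b"], ["c"], ["d", "e"])]
```

After three batches FakeModel.loads is still 1. Defining the classes loads nothing, so importing the module that contains them stays cheap.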
Avoid returning Python objects that Arrow cannot serialize. Convert outputs to strings, numbers, lists, bytes, or Arrow arrays.
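One way to apply this is to normalize model outputs before building the result table. The helper below is a hypothetical sketch, not a Vane Data API: it passes plain scalars through, unwraps anything with a tolist() method (NumPy arrays and torch tensors both have one), serializes dicts to JSON strings, and falls back to str() for everything else.

```python
import json


def to_arrow_friendly(value):
    """Convert a model output into plain strings, numbers, lists, or bytes."""
    if value is None or isinstance(value, (bool, int, float, str, bytes)):
        return value
    if hasattr(value, "tolist"):
        # NumPy arrays and scalars, torch tensors, and similar containers.
        return to_arrow_friendly(value.tolist())
    if isinstance(value, (list, tuple)):
        return [to_arrow_friendly(v) for v in value]
    if isinstance(value, dict):
        # Design choice for this sketch: store mappings as JSON text.
        return json.dumps(value, sort_keys=True, default=str)
    # Last resort for custom objects: keep their string form.
    return str(value)
```

Call it on each output column before constructing the table, and declare the matching type in the schema passed to map_batches(...), for example VARCHAR for values that end up as strings.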
Make sure every Ray worker can access:
- The Python module that defines the UDF class.
- Model files or model registry credentials.
- Storage credentials for input and output data.
- The same package versions needed by the model library.
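Part of this checklist can be checked in code on the worker itself, for example at the start of the UDF's __init__. The sketch below uses only the Python standard library; preflight is a hypothetical helper, and the module and package names you pass are your own. It covers importability and package versions only, not model files or storage credentials.

```python
import importlib.util
from importlib import metadata


def preflight(modules, packages):
    """Return a list of human-readable problems found in this process.

    modules:  importable module names, such as the module defining the UDF.
    packages: mapping of distribution name to expected version, or None
              to require only that the package is installed.
    """
    problems = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is missing.
            found = False
        if not found:
            problems.append(f"module not importable: {name}")
    for dist, expected in packages.items():
        try:
            installed = metadata.version(dist)
        except metadata.PackageNotFoundError:
            problems.append(f"package not installed: {dist}")
            continue
        if expected is not None and installed != expected:
            problems.append(f"{dist}: expected {expected}, found {installed}")
    return problems
```

Raising an error when the returned list is not empty makes a misconfigured worker fail at actor startup rather than partway through a batch.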