Pipeline & Performance

Pipeline batching

By default, each kernel’s sync method (apply, compute, detect) creates its own command buffer and waits for completion. For multi-stage pipelines, this means N GPU round-trips.

Pipeline batches everything into a single command buffer:

#![allow(unused)]
fn main() {
use vx_vision::Pipeline;

let pipe = Pipeline::begin(&ctx)?;
let cmd = pipe.cmd_buf();

let s1 = blur.encode(&ctx, cmd, &input, &temp1, &blur_cfg)?;
bilateral.encode(cmd, &temp1, &temp2, &bilateral_cfg)?;
morph.encode_dilate(cmd, &temp2, &output, &morph_cfg)?;

let _retained = pipe.commit_and_wait();
}

Encoded state (like s1 above) holds intermediate textures that must outlive the command buffer.

Which kernels support encoding?

Encodable	Not encodable (multi-pass)
Gaussian, Bilateral, Sobel, Canny, Morphology, Threshold, Color, Warp, Integral, Dense Flow, FAST, Harris, NMS, ORB, KLT, Resize, Undistort, DoG (subtract only)	Histogram, Homography, Connected Components, Hough

Multi-pass kernels require CPU readback between GPU passes, so they can’t be batched.

TexturePool

GPU texture allocation is expensive. Reuse textures across frames:

#![allow(unused)]
fn main() {
use vx_vision::TexturePool;

let mut pool = TexturePool::new();

for frame in frames {
    let temp = pool.acquire_gray8(&ctx, w, h)?;  // reuses cached texture
    blur.apply(&ctx, &frame, &temp, &cfg)?;
    // ... process ...
    pool.release(temp);  // return to pool
}

println!("Hit rate: {:.0}%", pool.hit_rate() * 100.0);
}

The pool keys by (width, height, format). All pool textures have ShaderRead | ShaderWrite flags.

Optimization tips

Reuse kernel structs. Creating a kernel compiles the Metal pipeline. Do it once at startup.

#![allow(unused)]
fn main() {
let blur = GaussianBlur::new(&ctx)?;  // once
for frame in frames {
    blur.apply(&ctx, &frame, &output, &cfg)?;  // reuse
}
}

Avoid unnecessary readbacks. read_gray8() forces GPU sync. If the output feeds another kernel, pass the texture directly.

Downsample first. Run feature detection on half-resolution images when full resolution isn’t needed:

#![allow(unused)]
fn main() {
let levels = pyr.build(&ctx, &input, 3)?;
let corners = fast.detect(&ctx, &levels[0], &cfg)?;  // half-res
}

Batch with Pipeline. One command buffer is faster than five:

#![allow(unused)]
fn main() {
let pipe = Pipeline::begin(&ctx)?;
// encode 5 kernels into pipe.cmd_buf()
pipe.commit_and_wait();
}

Memory model

On Apple Silicon (UMA), CPU and GPU share physical memory. VX uses MTLStorageModeShared — no copies, no uploads, no downloads. waitUntilCompleted() is the only synchronization needed.

GpuGuard<T> in vx-gpu prevents CPU reads of a UnifiedBuffer<T> while the GPU is using it, catching race conditions at runtime.

Benchmarking

Run the built-in criterion benchmarks:

cargo bench -p vx-vision

Benchmarks include:

FAST at 752x480 and 1920x1080
Full FAST → Harris → NMS → ORB pipeline at both resolutions
Pipeline vs individual dispatch comparison (3x Gaussian)

Keyboard shortcuts