VX

GPU-accelerated computer vision for Rust on Apple Silicon.

VX talks directly to the Metal GPU through compute shaders, using Apple Silicon’s Unified Memory Architecture to eliminate the CPU-GPU copy overhead that plagues traditional vision libraries.

What it does

42 GPU kernels covering classical computer vision and 3D reconstruction: feature detection (FAST, Harris, ORB, SIFT), image processing (Gaussian, bilateral, Canny, morphology, thresholding), geometry (pyramids, warping, homography), motion (KLT tracking, dense flow), stereo matching, 3D reconstruction (SGM stereo, TSDF fusion, marching cubes, point cloud processing), visualization (mesh and point cloud renderers), and analysis (Hough lines, template matching, distance transforms, connected components).

Why it exists

OpenCV and similar libraries treat the GPU as a separate device. Data gets copied from CPU memory to GPU memory and back, repeatedly. On Apple Silicon this is wasteful — the CPU and GPU share the same physical memory. VX skips the copies entirely.

The library uses Rust bindings to Metal via objc2-metal, giving type-safe GPU access with Rust’s ownership model enforcing buffer safety at compile time. Metal Shading Language (MSL) kernels run the pixel-level computation on the GPU while Rust handles orchestration and the public API.

Quick taste

#![allow(unused)]
fn main() {
use vx_vision::Context;
use vx_vision::kernels::fast::{FastDetector, FastDetectConfig};

let ctx = Context::new()?;
let texture = ctx.texture_gray8(&pixels, width, height)?;

let fast = FastDetector::new(&ctx)?;
let result = fast.detect(&ctx, &texture, &FastDetectConfig::default())?;

println!("Found {} corners", result.corners.len());
}

No unsafe in user code. No Metal imports. No GPU boilerplate.

Getting Started

Requirements

  • macOS on Apple Silicon (M1/M2/M3/M4) or any Mac with a Metal-capable GPU
  • Rust stable toolchain
  • Xcode command line tools: xcode-select --install

Installation

[dependencies]
vx-vision = "0.3"

For 3D reconstruction and visualization, enable the feature flags:

[dependencies]
vx-vision = { version = "0.3", features = ["full"] }

Core concepts

Context

Context initializes the Metal device, command queue, and shader library. Create one at startup and pass references to kernels.

#![allow(unused)]
fn main() {
let ctx = vx_vision::Context::new()?;
}

Texture

Texture wraps a Metal texture with a known format. Three formats are supported:

| Format | Create from data | Create empty | Read back |
|---|---|---|---|
| R8Unorm (grayscale) | ctx.texture_gray8(&pixels, w, h) | ctx.texture_output_gray8(w, h) | tex.read_gray8() |
| R32Float | ctx.texture_r32float(&data, w, h) | ctx.texture_output_r32float(w, h) | tex.read_r32float() |
| RGBA8Unorm (color) | ctx.texture_rgba8(&pixels, w, h) | ctx.texture_output_rgba8(w, h) | tex.read_rgba8() |

On Apple Silicon, textures live in unified memory — no hidden copies between CPU and GPU.

Kernels

Each GPU operation is a struct. The pattern is always:

  1. Create the kernel (compiles the Metal pipeline — do this once)
  2. Call the kernel method (dispatches GPU work)
  3. Read results back

#![allow(unused)]
fn main() {
let blur = GaussianBlur::new(&ctx)?;
let output = ctx.texture_output_gray8(w, h)?;
blur.apply(&ctx, &input, &output, &GaussianConfig::default())?;
let result = output.read_gray8();
}

First program

Load an image, blur it, detect edges:

use vx_vision::Context;
use vx_vision::kernels::gaussian::{GaussianBlur, GaussianConfig};
use vx_vision::kernels::sobel::SobelFilter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let ctx = Context::new()?;

    let img = image::open("input.png")?.to_luma8();
    let (w, h) = img.dimensions();
    let texture = ctx.texture_gray8(img.as_raw(), w, h)?;

    let blur = GaussianBlur::new(&ctx)?;
    let sobel = SobelFilter::new(&ctx)?;

    let blurred = ctx.texture_output_gray8(w, h)?;
    blur.apply(&ctx, &texture, &blurred, &GaussianConfig::default())?;

    let result = sobel.compute(&ctx, &blurred)?;
    let edges = result.magnitude.read_r32float();

    let max_val = edges.iter().cloned().fold(0.0f32, f32::max);
    let output: Vec<u8> = edges.iter()
        .map(|&v| ((v / max_val) * 255.0) as u8)
        .collect();

    image::save_buffer("edges.png", &output, w, h, image::ColorType::L8)?;
    Ok(())
}

Add image = "0.25" to your Cargo.toml dependencies.

Running examples

cargo run --example fast_demo -- path/to/image.png
cargo run --example edge_detection_demo -- path/to/image.png
cargo run --example threshold_demo -- path/to/image.png
cargo run --example advanced_cv_demo -- path/to/image.png
cargo run --example feature_matching_demo -- path/to/image.png
cargo run --example pipeline_pool_demo -- path/to/image.png

Running tests

cargo test                            # everything
cargo test -p vx-vision               # kernel tests only
cargo test -p vx-vision -- gaussian   # specific test

Architecture

Three-layer stack

VX Architecture

Memory layer (vx-gpu)

The vx-core/ directory, published as the vx-gpu crate. Manages shared GPU/CPU buffers.

  • UnifiedBuffer<T> — Type-safe wrapper around MTLBuffer with StorageModeShared. Provides write(), as_slice(), as_mut_slice(). All element types must implement bytemuck::Pod + Zeroable.
  • GpuGuard<T> — RAII guard that prevents CPU mutation while a buffer is in-flight on the GPU. Create before commit(), drop after waitUntilCompleted().
  • Device helpers — default_device(), new_queue(), load_library_from_bytes().

Kernel layer (vx-vision)

The vx-vision/ directory. Contains Rust bindings for each Metal shader.

  • Context — Holds the Metal device, command queue, and compiled shader library. Entry point for everything.
  • Texture — GPU texture with tracked dimensions and format. Provides readback methods and zero-copy wrapping of external Metal textures.
  • Pipeline — Batches multiple kernel dispatches into a single Metal command buffer.
  • TexturePool — Recycles textures by (width, height, format) to avoid repeated allocation.
  • Kernel structs — One per algorithm (e.g., FastDetector, GaussianBlur, CannyDetector). Each holds compiled MTLComputePipelineState objects, constructed once and reused.

Shader-to-kernel contract

Each algorithm has two sides:

| Component | Location | Naming |
|---|---|---|
| Metal shader | vx-vision/shaders/PascalCase.metal | kernel function: snake_case |
| Rust binding | vx-vision/src/kernels/snake_case.rs | struct: PascalCase |

Example: FastDetect.metal defines kernel void fast_detect(...), and fast.rs defines FastDetector which compiles that function into a pipeline at construction.

Parameter structs

GPU parameter structs live in vx-vision/src/types.rs with #[repr(C)] layout. They must match the MSL struct field-by-field:

| Rust | Metal |
|---|---|
| u32 | uint |
| i32 | int |
| f32 | float |
| [f32; 2] | float2 |
| [f32; 3] + _pad: f32 | float3 (16-byte aligned) |
| [f32; 4] | float4 |

Any mismatch causes silent data corruption.
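As an illustration of the layout rule — BlurParams here is a hypothetical struct, not one from the library — a Rust parameter struct mirroring an MSL struct that contains a float3 needs explicit padding to reproduce Metal's 16-byte vector alignment:

```rust
/// Mirrors a hypothetical MSL struct:
///   struct BlurParams { uint radius; float sigma; float3 tint; };
/// MSL aligns float3 to 16 bytes and gives it a 16-byte size,
/// so the Rust side pads explicitly on both sides of the vector.
#[repr(C)]
#[derive(Clone, Copy)]
struct BlurParams {
    radius: u32,     // MSL: uint   (offset 0)
    sigma: f32,      // MSL: float  (offset 4)
    _pad0: [f32; 2], // bring the next field to a 16-byte boundary
    tint: [f32; 3],  // MSL: float3 (offset 16)
    _pad1: f32,      // float3 occupies 16 bytes in MSL
}

/// A size assertion like this catches layout drift in tests
/// instead of as silent corruption on the GPU.
fn blur_params_size() -> usize {
    std::mem::size_of::<BlurParams>()
}
```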

Build system

vx-vision/build.rs auto-discovers all .metal files in vx-vision/shaders/, compiles each to .air via xcrun metal, links into vx.metallib via xcrun metallib, and embeds it via include_bytes!. Adding a new .metal file triggers automatic recompilation.

Thread dispatch patterns

  • 2D per-pixel (image filters): grid = (width, height, 1), threadgroup computed from threadExecutionWidth() and maxTotalThreadsPerThreadgroup()
  • 1D per-element (feature operations): grid = (n, 1, 1), threadgroup = (threadExecutionWidth, 1, 1)
  • Always uses dispatchThreads:threadsPerThreadgroup: (non-uniform dispatch)
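The 2D threadgroup computation described above reduces to simple arithmetic. A sketch, where exec_width and max_threads stand in for what threadExecutionWidth() and maxTotalThreadsPerThreadgroup() would report (the values in the test are typical for Apple GPUs, not queried from a device):

```rust
/// Compute a 2D threadgroup size: one SIMD-group wide,
/// tall enough to fill the threadgroup's thread budget.
fn threadgroup_2d(exec_width: u64, max_threads: u64) -> (u64, u64) {
    let w = exec_width;               // e.g. 32 on Apple GPUs
    let h = (max_threads / w).max(1); // e.g. 1024 / 32 = 32
    (w, h)
}
```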

Thread safety

All kernel structs, Context, and Texture implement Send + Sync. Metal pipeline state objects are immutable after creation. MTLCommandQueue is thread-safe, but each thread should create its own command buffers.

Memory model

On Apple Silicon (UMA), CPU and GPU share physical memory. VX uses MTLStorageModeShared for all buffers:

  • No copies — data written by CPU is immediately visible to GPU and vice versa
  • Synchronization — waitUntilCompleted() on the command buffer is sufficient
  • Safety — GpuGuard<T> prevents CPU mutation while GPU is reading
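The guard idea can be illustrated with a minimal RAII sketch (not the library's actual implementation): while the guard holds a shared borrow of the buffer, the borrow checker statically rejects any &mut access from the CPU side.

```rust
/// Minimal sketch of the GpuGuard idea: a shared borrow held for the
/// duration of GPU work makes CPU mutation a compile error.
struct Guard<'a, T> {
    _buf: &'a [T], // stands in for a borrow of UnifiedBuffer<T>
}

fn begin_gpu_work<T>(buf: &[T]) -> Guard<'_, T> {
    Guard { _buf: buf } // created before commit()
}

fn sum_after_gpu(buf: &[u32]) -> u32 {
    let guard = begin_gpu_work(buf);
    // ... GPU would run here; mutating `buf` would not compile ...
    drop(guard); // dropped after waitUntilCompleted()
    buf.iter().sum()
}
```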

Context & Texture

Context

Entry point for all GPU operations. Holds the Metal device, command queue, and compiled shader library.

#![allow(unused)]
fn main() {
use vx_vision::Context;

let ctx = Context::new()?;
}

Texture creation

#![allow(unused)]
fn main() {
// From pixel data (ShaderRead)
let gray  = ctx.texture_gray8(&pixels, w, h)?;
let float = ctx.texture_r32float(&data, w, h)?;
let color = ctx.texture_rgba8(&pixels, w, h)?;

// Empty output (ShaderWrite)
let out = ctx.texture_output_gray8(w, h)?;
let out = ctx.texture_output_r32float(w, h)?;
let out = ctx.texture_output_rgba8(w, h)?;

// Pipeline intermediates (ShaderRead | ShaderWrite)
let tmp = ctx.texture_intermediate_gray8(w, h)?;
let tmp = ctx.texture_intermediate_r32float(w, h)?;
}

Use output_* when a texture is only written to by a kernel. Use intermediate_* when a texture is written by one kernel and read by the next in a pipeline chain.

Texture

Wraps a Metal texture with tracked dimensions and format.

Readback

#![allow(unused)]
fn main() {
let pixels: Vec<u8>  = tex.read_gray8();      // R8Unorm
let data:   Vec<f32> = tex.read_r32float();    // R32Float
let pixels: Vec<u8>  = tex.read_rgba8();       // RGBA8Unorm (4 bytes/pixel)
}

Call readback only after the GPU command buffer has completed. Reading while the GPU is still writing produces undefined results.

Properties

#![allow(unused)]
fn main() {
let w = tex.width();       // u32
let h = tex.height();      // u32
let f = tex.format();      // TextureFormat enum
}

External textures

For AVFoundation or Core Video integration, wrap an existing Metal texture without copying:

#![allow(unused)]
fn main() {
use vx_vision::{Texture, TextureFormat};

let tex = Texture::from_metal_texture(metal_tex, w, h, TextureFormat::RGBA8Unorm);
}

Pipeline

Batches multiple kernel dispatches into a single Metal command buffer.

#![allow(unused)]
fn main() {
use vx_vision::Pipeline;

let pipe = Pipeline::begin(&ctx)?;
let cmd = pipe.cmd_buf();

let _s1 = blur.encode(&ctx, cmd, &input, &temp, &cfg)?;
sobel.encode(&ctx, cmd, &temp)?;

let _retained = pipe.commit_and_wait();
}

Intermediate textures and encoded state must outlive the command buffer. The commit_and_wait() return value holds retained textures.

For CPU/GPU overlap:

#![allow(unused)]
fn main() {
let mut pipe = Pipeline::begin(&ctx)?;
blur.encode(&ctx, pipe.cmd_buf(), &input, &output, &cfg)?;
pipe.commit();       // non-blocking
// ... CPU work ...
pipe.wait();         // block until GPU done
}

TexturePool

Recycles GPU textures by (width, height, format) to avoid repeated allocation.

#![allow(unused)]
fn main() {
use vx_vision::TexturePool;

let mut pool = TexturePool::new();
let tex = pool.acquire_gray8(&ctx, 1920, 1080)?;
// ... use tex ...
pool.release(tex);

// Second acquire reuses the cached texture
let tex = pool.acquire_gray8(&ctx, 1920, 1080)?;
}

All pool textures have ShaderRead | ShaderWrite usage flags.

#![allow(unused)]
fn main() {
let pool = TexturePool::with_capacity(4);   // max 4 per bucket
pool.hit_rate();                             // cache efficiency
pool.cached_count();                         // total cached
pool.clear();                                // free all
}

Error handling

All fallible operations return Result<T, vx_vision::Error>. Error variants:

| Variant | Meaning |
|---|---|
| DeviceNotFound | No Metal GPU available |
| ShaderMissing(String) | Named shader function not in metallib |
| PipelineCompile(String) | Metal failed to compile a pipeline |
| BufferAlloc { bytes } | GPU buffer allocation failed |
| TextureSizeMismatch | Texture dimensions don’t match |
| InvalidConfig(String) | Parameter out of range |
| Gpu(String) | Runtime GPU error |

Feature Detection

FAST-9 Corner Detector

Detects corners using the FAST-9 algorithm. Tests 16 pixels on a Bresenham circle — a pixel is a corner if 9 contiguous pixels are all brighter or darker than the center by a threshold.

#![allow(unused)]
fn main() {
use vx_vision::kernels::fast::{FastDetector, FastDetectConfig};

let fast = FastDetector::new(&ctx)?;
let mut cfg = FastDetectConfig::default();  // threshold: 20, max_corners: 2048
cfg.threshold = 30;

let result = fast.detect(&ctx, &input, &cfg)?;
for corner in &result.corners {
    println!("({}, {}) score={}", corner.position[0], corner.position[1], corner.response);
}
}

Also supports pipeline encoding via fast.encode().
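The contiguity test at the heart of FAST-9 reduces to bit twiddling. A CPU sketch (hypothetical helper, assuming the 16 circle pixels have already been packed into a u16 mask of "brighter than center + threshold" bits):

```rust
/// True if `mask` (16 circle pixels, one bit each) contains at least
/// 9 contiguous set bits, treating the circle as circular.
fn has_9_contiguous(mask: u16) -> bool {
    // Duplicate the ring so circular runs become linear runs.
    let ring = (mask as u32) | ((mask as u32) << 16);
    let mut run = 0;
    for i in 0..32 {
        if (ring >> i) & 1 == 1 {
            run += 1;
            if run >= 9 {
                return true;
            }
        } else {
            run = 0;
        }
    }
    false
}
```

The same test would be run twice per pixel, once for the "brighter" mask and once for the "darker" mask.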

Harris Corner Response

Computes the Harris response R = det(M) - k * trace(M)^2 for each keypoint. Use after FAST to rank corners by quality.

#![allow(unused)]
fn main() {
use vx_vision::kernels::harris::{HarrisScorer, HarrisConfig};

let harris = HarrisScorer::new(&ctx)?;
let scored = harris.compute(&ctx, &input, &corners, &HarrisConfig::default())?;
// scored: Vec<CornerPoint> with updated response values
}

Config: k (sensitivity, default 0.04), patch_radius (neighborhood size, default 3).
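The response formula itself is a one-liner once the structure tensor entries Ixx = ΣIx², Iyy = ΣIy², Ixy = ΣIxIy have been accumulated over the patch. A scalar sketch:

```rust
/// Harris response R = det(M) - k * trace(M)^2 for the 2x2 structure
/// tensor M = [[ixx, ixy], [ixy, iyy]].
fn harris_response(ixx: f32, iyy: f32, ixy: f32, k: f32) -> f32 {
    let det = ixx * iyy - ixy * ixy;
    let trace = ixx + iyy;
    det - k * trace * trace
}
```

Corners have two large eigenvalues (large det, positive R); edges have one dominant eigenvalue, driving det toward zero and R negative.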

Non-Maximum Suppression

Filters keypoints so no two are within min_distance of each other. Keeps the highest-response point in each neighborhood.

#![allow(unused)]
fn main() {
use vx_vision::kernels::nms::{NmsSuppressor, NmsConfig};

let nms = NmsSuppressor::new(&ctx)?;
let filtered = nms.run(&ctx, &corners, &NmsConfig::default())?;
}

Config: min_distance (default 10.0 pixels).
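On the CPU the same filtering rule is a short greedy pass (a sketch of the rule, not the GPU kernel): sort by response, then keep a point only if it is at least min_distance away from every point already kept.

```rust
/// Greedy min-distance NMS over (x, y, response) points.
fn nms(points: &[(f32, f32, f32)], min_distance: f32) -> Vec<(f32, f32, f32)> {
    let mut sorted: Vec<_> = points.to_vec();
    // Highest response first.
    sorted.sort_by(|a, b| b.2.partial_cmp(&a.2).unwrap());
    let mut kept: Vec<(f32, f32, f32)> = Vec::new();
    for p in sorted {
        let far_enough = kept.iter().all(|q| {
            let (dx, dy) = (p.0 - q.0, p.1 - q.1);
            (dx * dx + dy * dy).sqrt() >= min_distance
        });
        if far_enough {
            kept.push(p);
        }
    }
    kept
}
```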

ORB Descriptors

Computes 256-bit binary descriptors for keypoints using oriented BRIEF test pairs.

#![allow(unused)]
fn main() {
use vx_vision::kernels::orb::{OrbDescriptor, OrbConfig};

let orb = OrbDescriptor::new(&ctx)?;
let result = orb.compute(&ctx, &input, &keypoints, &pattern, &OrbConfig::default())?;
// result.descriptors: Vec<ORBOutput> (256-bit descriptors as 8 x u32)
// result.orientations: Vec<f32>
}

The pattern is 1024 i32 values (256 test pairs, each with 4 offsets: dx1, dy1, dx2, dy2).

DoG Keypoint Detector

Difference-of-Gaussians scale-space extrema detection.

#![allow(unused)]
fn main() {
use vx_vision::kernels::dog::{DoGDetector, DoGConfig};
use vx_vision::kernels::gaussian::GaussianBlur;

let blur = GaussianBlur::new(&ctx)?;
let dog = DoGDetector::new(&ctx)?;
let mut cfg = DoGConfig::default();
cfg.n_levels = 5;

let keypoints = dog.detect(&ctx, &blur, &input, &cfg)?;
}

Each keypoint has position, scale, and response. Full pipelining isn’t practical due to the iterative blur-subtract-extrema pattern, but encode_subtract() exposes the subtraction step for custom pipelines.

SIFT Pipeline

Full SIFT-like pipeline: multi-octave pyramid, DoG detection, orientation assignment, 128-dimensional descriptors.

#![allow(unused)]
fn main() {
use vx_vision::kernels::sift::{SiftPipeline, SiftConfig};

let sift = SiftPipeline::new(&ctx)?;
let features = sift.detect_and_describe(&ctx, &input, &SiftConfig::default())?;

for f in &features {
    println!("({}, {}) scale={:.2} orient={:.2}", f.x, f.y, f.scale, f.orientation);
    // f.descriptor: [f32; 128]
}
}

Matching between two feature sets:

#![allow(unused)]
fn main() {
let matches = SiftPipeline::match_features(&features_a, &features_b, 0.75);
}

Typical detection pipeline

A common pattern chains FAST → Harris → NMS → ORB:

#![allow(unused)]
fn main() {
let corners = fast.detect(&ctx, &texture, &fast_cfg)?;
let scored  = harris.compute(&ctx, &texture, &corners.corners, &harris_cfg)?;
let best    = nms.run(&ctx, &scored, &nms_cfg)?;
let descs   = orb.compute(&ctx, &texture, &best, &pattern, &orb_cfg)?;
}

For single-submission batching, each of these kernels provides an encode() method that writes into a shared command buffer via Pipeline.

Image Processing

Gaussian Blur

Separable two-pass blur (horizontal then vertical).

#![allow(unused)]
fn main() {
use vx_vision::kernels::gaussian::{GaussianBlur, GaussianConfig};

let blur = GaussianBlur::new(&ctx)?;
let output = ctx.texture_output_gray8(w, h)?;
let mut cfg = GaussianConfig::default();  // sigma: 1.0, radius: 3
cfg.sigma = 2.0;

blur.apply(&ctx, &input, &output, &cfg)?;
}

Config: sigma (standard deviation), radius (kernel half-width, full kernel = 2*radius + 1).
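The weights follow directly from sigma and radius. A CPU sketch (hypothetical helper) of building the normalized 1D kernel each pass would use:

```rust
/// Build a normalized 1D Gaussian kernel of length 2*radius + 1.
fn gaussian_kernel(sigma: f32, radius: i32) -> Vec<f32> {
    let mut w: Vec<f32> = (-radius..=radius)
        .map(|x| (-((x * x) as f32) / (2.0 * sigma * sigma)).exp())
        .collect();
    let sum: f32 = w.iter().sum();
    for v in &mut w {
        *v /= sum; // weights sum to 1 so brightness is preserved
    }
    w
}
```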

Pipeline encoding returns a GaussianEncodedState that holds the intermediate texture:

#![allow(unused)]
fn main() {
let state = blur.encode(&ctx, cmd_buf, &input, &output, &cfg)?;
// state must outlive the command buffer
}

Bilateral Filter

Edge-preserving smoothing. Smooths flat regions while keeping edges sharp.

#![allow(unused)]
fn main() {
use vx_vision::kernels::bilateral::{BilateralFilter, BilateralConfig};

let bilateral = BilateralFilter::new(&ctx)?;
let output = ctx.texture_output_gray8(w, h)?;
bilateral.apply(&ctx, &input, &output, &BilateralConfig::new(5, 10.0, 0.1))?;
}

Config: radius, sigma_spatial, sigma_range. Larger sigma_range allows more intensity variation.

Sobel Edge Detection

Computes gradient magnitude and direction.

#![allow(unused)]
fn main() {
use vx_vision::kernels::sobel::SobelFilter;

let sobel = SobelFilter::new(&ctx)?;
let result = sobel.compute(&ctx, &input)?;
// result.magnitude: R32Float texture
// result.direction: R32Float texture (radians)
// result.grad_x, result.grad_y: R32Float gradient components
}

Canny Edge Detection

Multi-stage: Gaussian blur → Sobel → non-maximum suppression → hysteresis thresholding.

#![allow(unused)]
fn main() {
use vx_vision::kernels::canny::{CannyDetector, CannyConfig};

let canny = CannyDetector::new(&ctx)?;
let mut cfg = CannyConfig::default();
cfg.low_threshold = 0.04;
cfg.high_threshold = 0.12;

let edges = canny.detect(&ctx, &input, &cfg)?;
// edges: R32Float texture (1.0 = edge, 0.0 = non-edge)
}

Config: low_threshold, high_threshold (hysteresis), blur_sigma, blur_radius.

Supports pipeline encoding via canny.encode().

Morphology

Binary operations with a rectangular structuring element.

#![allow(unused)]
fn main() {
use vx_vision::kernels::morphology::{Morphology, MorphConfig};

let morph = Morphology::new(&ctx)?;
let cfg = MorphConfig::default();  // radius_x: 1, radius_y: 1 (3x3 kernel)
let output = ctx.texture_output_gray8(w, h)?;

morph.erode(&ctx, &input, &output, &cfg)?;
morph.dilate(&ctx, &input, &output, &cfg)?;
morph.open(&ctx, &input, &output, &cfg)?;    // erode then dilate
morph.close(&ctx, &input, &output, &cfg)?;   // dilate then erode
}

All four operations support pipeline encoding: encode_erode, encode_dilate, encode_open, encode_close.

Threshold

Binary, adaptive, and automatic (Otsu) thresholding.

#![allow(unused)]
fn main() {
use vx_vision::kernels::threshold::{Threshold, AdaptiveThresholdConfig};

let thresh = Threshold::new(&ctx)?;
let output = ctx.texture_output_gray8(w, h)?;

// Fixed binary (normalized 0.0-1.0 threshold)
thresh.binary(&ctx, &input, &output, 0.5, false)?;

// Otsu's method (auto-selects threshold, returns it)
let value = thresh.otsu(&ctx, &input, &output)?;

// Adaptive (requires integral image)
let cfg = AdaptiveThresholdConfig::new(15, 0.03, false);
thresh.adaptive_auto(&ctx, &input, &output, &cfg)?;
}

Pipeline encoding: encode_binary(), encode_adaptive().

Histogram

Compute 256-bin histogram and equalize contrast.

#![allow(unused)]
fn main() {
use vx_vision::kernels::histogram::Histogram;

let hist = Histogram::new(&ctx)?;
let bins: [u32; 256] = hist.compute(&ctx, &input)?;
let output = ctx.texture_output_gray8(w, h)?;
hist.equalize(&ctx, &input, &output)?;
}

Not pipeline-encodable — requires CPU readback of bin counts.

Color Conversion

Convert between RGBA, grayscale, and HSV.

#![allow(unused)]
fn main() {
use vx_vision::kernels::color::ColorConvert;

let color = ColorConvert::new(&ctx)?;
color.rgba_to_gray(&ctx, &rgba, &gray)?;
color.gray_to_rgba(&ctx, &gray, &rgba)?;
color.rgba_to_hsv(&ctx, &rgba, &hsv)?;
color.hsv_to_rgba(&ctx, &hsv, &rgba)?;
}

All four conversions support pipeline encoding.

Geometry & Transforms

Image Pyramid

Builds successive half-resolution levels in a single GPU submission.

#![allow(unused)]
fn main() {
use vx_vision::kernels::pyramid::PyramidBuilder;

let pyr = PyramidBuilder::new(&ctx)?;
let levels = pyr.build(&ctx, &input, 4)?;
// levels[0] = half, levels[1] = quarter, levels[2] = eighth

let half = pyr.downsample(&ctx, &input)?;  // single level
}

Resize

Bilinear interpolation resize to arbitrary dimensions.

#![allow(unused)]
fn main() {
use vx_vision::kernels::resize::ImageResize;

let resizer = ImageResize::new(&ctx)?;
let output = resizer.apply(&ctx, &input, new_w, new_h)?;
}

Warp

Affine and perspective warping.

#![allow(unused)]
fn main() {
use vx_vision::kernels::warp::ImageWarp;

let warp = ImageWarp::new(&ctx)?;

// Affine: 2x3 matrix as [f32; 6]
let output = ctx.texture_output_gray8(out_w, out_h)?;
warp.affine(&ctx, &input, &output, &matrix_2x3)?;

// Perspective: 3x3 matrix as [f32; 9]
warp.perspective(&ctx, &input, &output, &matrix_3x3)?;
}

Both support pipeline encoding via encode_affine() and encode_perspective().

Lens Undistortion

Corrects radial and tangential lens distortion using camera intrinsics.

#![allow(unused)]
fn main() {
use vx_vision::kernels::undistort::Undistorter;

let undistort = Undistorter::new(&ctx)?;
let output = undistort.apply(&ctx, &input, &camera_params)?;
}

Homography Estimation

RANSAC-based homography from point correspondences. GPU-accelerated scoring with CPU-side model selection.

#![allow(unused)]
fn main() {
use vx_vision::kernels::homography::{HomographyEstimator, RansacConfig};

let estimator = HomographyEstimator::new(&ctx)?;
let mut cfg = RansacConfig::default();
cfg.max_iterations = 1000;
cfg.inlier_threshold = 3.0;

let result = estimator.estimate(&ctx, &point_pairs, &cfg)?;
// result.homography: [f32; 9]
// result.n_inliers: u32
// result.inlier_mask: Vec<bool>
}

Not pipeline-encodable — RANSAC iterates with CPU readback between GPU scoring passes.

Motion & Stereo

KLT Optical Flow

Sparse Kanade-Lucas-Tomasi tracker. Tracks keypoints across two frames using iterative Lucas-Kanade with image pyramids.

#![allow(unused)]
fn main() {
use vx_vision::kernels::klt::{KltTracker, KltConfig};

let klt = KltTracker::new(&ctx)?;
let mut cfg = KltConfig::default();
cfg.max_iterations = 30;
cfg.win_radius = 7;
cfg.max_level = 3;

let tracked = klt.track(&ctx, &prev_frame, &curr_frame, &keypoints, &cfg)?;
// tracked: Vec<KltResult> with new positions, status, and error
}

Config: max_iterations, epsilon (convergence threshold), win_radius, max_level (pyramid levels), min_eigenvalue.

Dense Optical Flow

Horn-Schunck per-pixel flow estimation using iterative Jacobi relaxation.

#![allow(unused)]
fn main() {
use vx_vision::kernels::dense_flow::{DenseFlow, DenseFlowConfig};

let flow = DenseFlow::new(&ctx)?;
let mut cfg = DenseFlowConfig::default();
cfg.alpha = 0.012;
cfg.iterations = 50;

let result = flow.compute(&ctx, &frame0, &frame1, &cfg)?;
// result.flow_u: R32Float texture (horizontal displacement)
// result.flow_v: R32Float texture (vertical displacement)
}

Supports pipeline encoding via flow.encode().

Stereo Matching

Matches ORB features between rectified stereo image pairs using Hamming distance, epipolar constraints, and disparity bounds. Triangulates 3D positions from disparities.

#![allow(unused)]
fn main() {
use vx_vision::kernels::stereomatch::{StereoMatcher, StereoConfig};

let stereo = StereoMatcher::new(&ctx)?;
let mut cfg = StereoConfig::default();
cfg.max_disparity = 64.0;
cfg.baseline = 0.12;     // meters between cameras
cfg.fx = 500.0;          // focal length in pixels

let result = stereo.run(
    &ctx,
    &left_kpts, &right_kpts,
    &left_descs, &right_descs,
    &cfg,
)?;

for m in &result.matches {
    println!("3D: ({:.2}, {:.2}, {:.2})", m.point_3d[0], m.point_3d[1], m.point_3d[2]);
}
}

Config: max_epipolar, min_disparity, max_disparity, max_hamming, ratio_thresh, fx, fy, cx, cy, baseline.

Brute-Force Descriptor Matching

Matches ORB binary descriptors using Hamming distance with Lowe’s ratio test.

#![allow(unused)]
fn main() {
use vx_vision::kernels::matcher::{BruteMatcher, MatchConfig};

let matcher = BruteMatcher::new(&ctx)?;
let mut cfg = MatchConfig::default();
cfg.max_hamming = 64;
cfg.ratio_thresh = 0.75;

let matches = matcher.match_descriptors(&ctx, &query_desc, &train_desc, &cfg)?;
for m in &matches {
    println!("query[{}] → train[{}] dist={}", m.query_idx, m.train_idx, m.distance);
}
}

Descriptors are flat &[u32] arrays where every 8 consecutive values form one 256-bit ORB descriptor.
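Hamming distance between two such descriptors is XOR plus popcount over the 8 words. A CPU sketch of the per-pair computation:

```rust
/// Hamming distance between two 256-bit ORB descriptors,
/// each stored as 8 consecutive u32 words.
fn hamming_256(a: &[u32], b: &[u32]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```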

Analysis

Template Matching

Normalized cross-correlation (NCC). Finds the best match location for a small template within a larger image.

#![allow(unused)]
fn main() {
use vx_vision::kernels::template_match::TemplateMatcher;

let tm = TemplateMatcher::new(&ctx)?;
let result = tm.match_template(&ctx, &image, &template)?;
println!("Best at ({}, {}) score={:.4}", result.best_x, result.best_y, result.best_score);
}

The template must have non-zero variance (not a uniform color) for NCC to produce meaningful results.

Hough Line Detection

Detects lines via the Hough transform. Works best on binary edge images (e.g., output of Canny).

#![allow(unused)]
fn main() {
use vx_vision::kernels::hough::{HoughLines, HoughConfig};

let hough = HoughLines::new(&ctx)?;
let mut cfg = HoughConfig::default();
cfg.vote_threshold = 50;
cfg.max_lines = 100;

let lines = hough.detect(&ctx, &edge_image, &cfg)?;
for line in &lines {
    println!("rho={:.1} theta={:.1}° votes={}", line.rho, line.theta.to_degrees(), line.votes);
}
}

Config: n_theta, edge_threshold, vote_threshold, max_lines, nms_radius.

Not pipeline-encodable — requires CPU readback of the accumulator between voting and peak-finding.

Integral Image

Summed area table for O(1) region-sum queries.

#![allow(unused)]
fn main() {
use vx_vision::kernels::integral::IntegralImage;

let integral = IntegralImage::new(&ctx)?;
let sat = integral.compute(&ctx, &input)?;
// sat: R32Float texture
}

Supports pipeline encoding via integral.encode(). Used internally by adaptive thresholding.
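The O(1) query works by inclusion–exclusion on four SAT corners. A CPU sketch (hypothetical helpers, not the library API) that builds a table and answers a rectangle-sum query:

```rust
/// Build a summed area table: sat[y][x] = sum of img over [0..=x, 0..=y].
fn build_sat(img: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let (h, w) = (img.len(), img[0].len());
    let mut sat = vec![vec![0.0f32; w]; h];
    for y in 0..h {
        let mut row = 0.0;
        for x in 0..w {
            row += img[y][x];
            sat[y][x] = row + if y > 0 { sat[y - 1][x] } else { 0.0 };
        }
    }
    sat
}

/// Sum over the inclusive rectangle (x0, y0)..(x1, y1) in O(1):
/// D - B - C + A on the four corner entries.
fn region_sum(sat: &[Vec<f32>], x0: usize, y0: usize, x1: usize, y1: usize) -> f32 {
    let d = sat[y1][x1];
    let b = if y0 > 0 { sat[y0 - 1][x1] } else { 0.0 };
    let c = if x0 > 0 { sat[y1][x0 - 1] } else { 0.0 };
    let a = if x0 > 0 && y0 > 0 { sat[y0 - 1][x0 - 1] } else { 0.0 };
    d - b - c + a
}
```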

Distance Transform

Euclidean distance from each pixel to the nearest seed pixel, computed via Jump Flooding Algorithm (JFA).

#![allow(unused)]
fn main() {
use vx_vision::kernels::distance::{DistanceTransform, DistanceConfig};

let dt = DistanceTransform::new(&ctx)?;
let mut cfg = DistanceConfig::default();
cfg.threshold = 0.5;

let distances = dt.compute(&ctx, &binary_input, &cfg)?;
// distances: R32Float texture with per-pixel Euclidean distance
}

Connected Components

Labels connected regions in a binary image using iterative min-label propagation.

#![allow(unused)]
fn main() {
use vx_vision::kernels::connected::{ConnectedComponents, CCLConfig};

let ccl = ConnectedComponents::new(&ctx)?;
let mut cfg = CCLConfig::default();
cfg.threshold = 0.5;

let result = ccl.label(&ctx, &binary_input, &cfg)?;
println!("{} components in {} iterations", result.n_components, result.iterations);
// result.labels: R32Float texture with integer label per pixel
}

Not pipeline-encodable — iterative convergence requires CPU readback between passes.

3D Reconstruction

All kernels require the reconstruction feature flag: cargo build --features reconstruction

SGM Stereo

Semi-Global Matching stereo produces dense disparity maps from rectified stereo pairs. Uses Census transform with 8-path cost aggregation and sub-pixel refinement.

#![allow(unused)]
fn main() {
use vx_vision::kernels::sgm::{SGMStereo, SGMStereoConfig};

let sgm = SGMStereo::new(&ctx)?;
let config = SGMStereoConfig::new(128); // 128 disparity levels
let output = ctx.texture_output_r32float(w, h)?;

sgm.compute_disparity(&ctx, &left, &right, &output, &config)?;
// output: R32Float texture with per-pixel disparity values
}

Config: num_disparities (search range), p1 (penalty for ±1 disparity change), p2 (penalty for larger changes), census_radius_x, census_radius_y.

Depth Filter

Depth-aware bilateral filter and median filter for cleaning up noisy depth/disparity maps. Preserves depth edges while smoothing.

#![allow(unused)]
fn main() {
use vx_vision::kernels::depth_filter::{DepthFilter, DepthFilterConfig, DepthMedianConfig};

let filter = DepthFilter::new(&ctx)?;

// Bilateral: edge-preserving smoothing
let config = DepthFilterConfig::new(3, 5.0, 0.05);
filter.apply_bilateral(&ctx, &input, &output, &config)?;

// Median: salt-and-pepper noise removal
let med_config = DepthMedianConfig::default(); // 3×3
filter.apply_median(&ctx, &input, &output, &med_config)?;
}

Supports pipeline encoding via filter.encode_bilateral() and filter.encode_median().

Depth Inpaint

Fills holes in depth maps using iterative nearest-neighbor propagation (jump-flooding pattern).

#![allow(unused)]
fn main() {
use vx_vision::kernels::depth_inpaint::{DepthInpaint, DepthInpaintConfig};

let inpaint = DepthInpaint::new(&ctx)?;
let mut config = DepthInpaintConfig::default();
config.max_iterations = 6; // doubles search radius each iteration

inpaint.apply(&ctx, &input, &output, &config)?;
}

Depth-to-Point-Cloud

GPU-accelerated unprojection of depth maps to 3D point clouds. One thread per pixel, with atomic compaction to skip invalid pixels.

#![allow(unused)]
fn main() {
use vx_vision::kernels::depth_to_cloud::{DepthToCloud, DepthToCloudConfig};
use vx_vision::types_3d::{CameraIntrinsics, DepthMap};

let d2c = DepthToCloud::new(&ctx)?;
let intrinsics = CameraIntrinsics::new(500.0, 500.0, 320.0, 240.0, 640, 480);
let depth_map = DepthMap::new(depth_texture, intrinsics, 0.1, 10.0)?;

let config = DepthToCloudConfig::new(0.1, 10.0);
let cloud = d2c.compute(&ctx, &depth_map, Some(&color_texture), &config)?;
// cloud: PointCloud with XYZ + optional RGB per point
}

Config: min_depth, max_depth, depth_scale, max_points.

Depth Colorize

Maps depth values to RGBA colors using a colormap for visualization.

#![allow(unused)]
fn main() {
use vx_vision::kernels::depth_colorize::{DepthColorize, DepthColorizeConfig};

let colorize = DepthColorize::new(&ctx)?;
let config = DepthColorizeConfig::new(0.5, 5.0); // min/max depth range

colorize.apply(&ctx, &depth_r32, &rgba_output, &config)?;
}

Supports Turbo (default), Jet, and Inferno colormaps. Pipeline encoding via colorize.encode().

Normal Estimation

Estimates surface normals for point clouds. Two modes: organized (fast, from depth maps) and unorganized (brute-force k-NN PCA).

#![allow(unused)]
fn main() {
use vx_vision::kernels::normal_estimation::{NormalEstimator, NormalEstimatorConfig};

let estimator = NormalEstimator::new(&ctx)?;

// Organized: from depth map (fast — cross product of adjacent pixels)
estimator.compute_from_depth(&ctx, &depth_tex, &normal_out, fx, fy, cx, cy)?;

// Unorganized: from arbitrary point cloud
let mut config = NormalEstimatorConfig::default();
config.radius = 0.1;
let normals = estimator.compute(&ctx, &point_cloud, &config)?;
// normals: Vec<[f32; 3]>
}

Outlier Filter

Statistical outlier removal. Computes mean distance to k-nearest neighbors per point, then rejects points beyond mean + std_ratio × stddev.

#![allow(unused)]
fn main() {
use vx_vision::kernels::outlier_filter::{OutlierFilter, OutlierFilterConfig};

let filter = OutlierFilter::new(&ctx)?;
let config = OutlierFilterConfig::default(); // k=10, std_ratio=2.0

let filtered = filter.filter(&ctx, &cloud, &config)?;
}
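The rejection rule itself is simple statistics. A CPU sketch over precomputed mean k-NN distances (one per point; computing those distances is the expensive GPU part):

```rust
/// Keep indices whose mean k-NN distance lies within
/// mean + std_ratio * stddev of the global distribution.
fn inlier_indices(mean_knn: &[f32], std_ratio: f32) -> Vec<usize> {
    let n = mean_knn.len() as f32;
    let mean = mean_knn.iter().sum::<f32>() / n;
    let var = mean_knn.iter().map(|d| (d - mean).powi(2)).sum::<f32>() / n;
    let cutoff = mean + std_ratio * var.sqrt();
    mean_knn
        .iter()
        .enumerate()
        .filter(|(_, &d)| d <= cutoff)
        .map(|(i, _)| i)
        .collect()
}
```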

Voxel Downsample

Reduces point cloud density by averaging points within each voxel cell. Uses GPU hash table with atomic accumulation.

#![allow(unused)]
fn main() {
use vx_vision::kernels::voxel_downsample::{VoxelDownsample, VoxelDownsampleConfig};

let ds = VoxelDownsample::new(&ctx)?;
let config = VoxelDownsampleConfig::new(0.05); // 5cm voxel size

let downsampled = ds.downsample(&ctx, &cloud, &config)?;
}
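The averaging step can be sketched on the CPU with a hash map keyed by voxel index (the GPU version uses a hash table with atomic accumulation, as noted above):

```rust
use std::collections::HashMap;

/// Average all points that fall into the same voxel cell of size `voxel`.
fn voxel_downsample(points: &[[f32; 3]], voxel: f32) -> Vec<[f32; 3]> {
    // Per cell: (coordinate sum, point count).
    let mut cells: HashMap<(i64, i64, i64), ([f32; 3], u32)> = HashMap::new();
    for p in points {
        let key = (
            (p[0] / voxel).floor() as i64,
            (p[1] / voxel).floor() as i64,
            (p[2] / voxel).floor() as i64,
        );
        let e = cells.entry(key).or_insert(([0.0; 3], 0));
        for i in 0..3 {
            e.0[i] += p[i];
        }
        e.1 += 1;
    }
    cells
        .values()
        .map(|(s, n)| [s[0] / *n as f32, s[1] / *n as f32, s[2] / *n as f32])
        .collect()
}
```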

TSDF Volume Fusion

Truncated Signed Distance Function — the core of real-time 3D reconstruction. Integrates sequential depth frames into a volumetric representation and raycasts synthetic views.

#![allow(unused)]
fn main() {
use vx_vision::kernels::tsdf::{TSDFVolume, TSDFConfig};
use vx_vision::types_3d::{CameraExtrinsics, CameraIntrinsics};

let mut config = TSDFConfig::default();
config.resolution = [256, 256, 256];
config.voxel_size = 0.005; // 5mm
let tsdf = TSDFVolume::new(&ctx, config)?;

// Integrate a depth frame
tsdf.integrate(&ctx, &depth_map, &camera_pose)?;

// Raycast synthetic view
let (depth_out, normal_out) = tsdf.raycast(&ctx, &pose, &intrinsics)?;

// Extract surface points
let cloud = tsdf.extract_cloud();
}

Config: resolution, voxel_size, truncation_distance, max_weight, origin.

Marching Cubes

Extracts a triangle mesh from a TSDF volume at the zero-crossing surface. CPU-side with full 256-entry lookup table, reading directly from GPU shared memory (zero-copy UMA).

#![allow(unused)]
fn main() {
use vx_vision::kernels::marching_cubes::{MarchingCubes, MarchingCubesConfig};

let config = MarchingCubesConfig::default(); // iso_level = 0.0
let mut mesh = MarchingCubes::extract(tsdf.volume(), &config);

mesh.compute_normals();
mesh.weld_vertices(0.001);
}

Triangulation

GPU-accelerated midpoint triangulation from two-view 2D-2D correspondences.

#![allow(unused)]
fn main() {
use vx_vision::kernels::triangulate::{Triangulator, Match2D};

let tri = Triangulator::new(&ctx)?;
let matches = vec![
    Match2D { u1: 320.0, v1: 240.0, u2: 280.0, v2: 240.0 },
];

let cloud = tri.triangulate(&ctx, &matches, &intrinsics1, &intrinsics2, &pose1, &pose2)?;
}
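Midpoint triangulation reduces to a closest-point problem between two back-projected rays: find where the rays pass nearest each other and take the midpoint of that segment. A CPU sketch of the math (illustrative only, not the VX API), assuming the rays are not parallel:

```rust
/// Midpoint of the shortest segment between rays o1 + s*d1 and o2 + t*d2.
/// Illustrative CPU version; assumes non-parallel rays (denom != 0).
fn midpoint(o1: [f32; 3], d1: [f32; 3], o2: [f32; 3], d2: [f32; 3]) -> [f32; 3] {
    let sub = |a: [f32; 3], b: [f32; 3]| [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
    let dot = |a: [f32; 3], b: [f32; 3]| a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    let w = sub(o1, o2);
    let (a, b, c) = (dot(d1, d1), dot(d1, d2), dot(d2, d2));
    let (d, e) = (dot(d1, w), dot(d2, w));
    let denom = a * c - b * b;              // → 0 when the rays are parallel
    let s = (b * e - c * d) / denom;        // parameter along ray 1
    let t = (a * e - b * d) / denom;        // parameter along ray 2
    let p1 = [o1[0] + s * d1[0], o1[1] + s * d1[1], o1[2] + s * d1[2]];
    let p2 = [o2[0] + t * d2[0], o2[1] + t * d2[1], o2[2] + t * d2[2]];
    [(p1[0] + p2[0]) * 0.5, (p1[1] + p2[1]) * 0.5, (p1[2] + p2[2]) * 0.5]
}
```

For perfectly consistent correspondences the two closest points coincide; with noisy matches the midpoint splits the gap between the rays.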

Mesh Operations (CPU)

#![allow(unused)]
fn main() {
use vx_vision::mesh_ops;

// Decimate to target face count (edge-collapse)
let decimated = mesh_ops::decimate(&mesh, 5000);
}

Mesh types also provide:

  • mesh.compute_normals() — per-vertex normals from face normals
  • mesh.weld_vertices(tolerance) — merge duplicate vertices

Export Formats

#![allow(unused)]
fn main() {
use vx_vision::export;

export::write_ply_ascii("cloud.ply", &point_cloud)?;
export::write_ply_binary("cloud.ply", &point_cloud)?;
export::write_obj("mesh.obj", &mesh)?;
export::write_mesh_ply("mesh.ply", &mesh)?;
}

Core 3D Types

| Type | Description |
|------|-------------|
| Point3D | Position + color + normal |
| Vertex3D | Position + normal + UV |
| PointCloud | Collection of Point3D with bounds(), len(), positions() |
| Mesh | Indexed triangle mesh with compute_normals(), weld_vertices() |
| DepthMap | R32Float texture + intrinsics + depth range |
| CameraIntrinsics | Pinhole model: fx, fy, cx, cy, width, height |
| CameraExtrinsics | Rotation (3x3) + translation, with transform_point(), inverse(), to_gpu_rows() |
| VoxelGrid | TSDF + weights backed by UnifiedBuffer, with voxel_to_world(), reset() |

Visualization

Requires the visualization feature flag (and reconstruction for 3D types): cargo build --features "reconstruction,visualization"

Point Cloud Renderer

Renders point clouds as colored circle splats using Metal render pipelines. Outputs to an offscreen RenderTarget.

#![allow(unused)]
fn main() {
use vx_vision::render_context::{Camera, RenderTarget};
use vx_vision::renderers::point_cloud_renderer::PointCloudRenderer;

let renderer = PointCloudRenderer::new(&ctx)?;
let target = RenderTarget::new(&ctx, 1920, 1080)?;
let camera = Camera {
    position: [0.0, 0.0, 3.0],
    look_at: [0.0, 0.0, 0.0],
    up: [0.0, 1.0, 0.0],
    fov_y: 60.0_f32.to_radians(),
    near: 0.01,
    far: 100.0,
};

renderer.render(&ctx, &cloud, &camera, &target, 5.0)?;

// Read back as RGBA pixels
let pixels = target.read_rgba8();
}

Parameters: point_size controls the rendered diameter of each point in pixels.

Mesh Renderer

Renders triangle meshes with Phong shading (ambient + diffuse). Default light direction is (0.5, 0.7, 1.0).

#![allow(unused)]
fn main() {
use vx_vision::render_context::{Camera, RenderTarget};
use vx_vision::renderers::mesh_renderer::MeshRenderer;

let renderer = MeshRenderer::new(&ctx)?;
let target = RenderTarget::new(&ctx, 1920, 1080)?;
let camera = Camera::default();

renderer.render(&ctx, &mesh, &camera, &target)?;
let pixels = target.read_rgba8();
}

RenderTarget

Offscreen render target with RGBA8 color and Depth32Float attachments.

#![allow(unused)]
fn main() {
use vx_vision::render_context::RenderTarget;

let target = RenderTarget::new(&ctx, width, height)?;

// After rendering:
let rgba_pixels = target.read_rgba8();   // Vec<u8>, 4 bytes per pixel
let color_tex = target.color_texture();  // &Texture for further processing
}

Camera

Camera parameters for 3D rendering. Computes the MVP (model-view-projection) matrix.

#![allow(unused)]
fn main() {
use vx_vision::render_context::Camera;

let camera = Camera {
    position: [2.0, 1.5, 3.0],
    look_at: [0.0, 0.0, 0.0],
    up: [0.0, 1.0, 0.0],
    fov_y: 45.0_f32.to_radians(),
    near: 0.1,
    far: 50.0,
};

let mvp = camera.mvp_matrix(width as f32 / height as f32);
}

Depth Colorize

See 3D Reconstruction → Depth Colorize — available with only the reconstruction feature.

Stereo-to-Mesh Pipeline

This guide walks through the complete 3D reconstruction pipeline available in VX: from stereo images to a triangle mesh, covering every kernel in the reconstruction module.

Reconstruction Pipeline

Step 1: Depth Estimation

Start with a rectified stereo pair and produce a dense disparity map using Semi-Global Matching:

#![allow(unused)]
fn main() {
use vx_vision::kernels::sgm::{SGMStereo, SGMStereoConfig};

let sgm = SGMStereo::new(&ctx)?;
let config = SGMStereoConfig::new(128);
let disparity = ctx.texture_output_r32float(width, height)?;

sgm.compute_disparity(&ctx, &left_image, &right_image, &disparity, &config)?;
}

To convert disparity to depth: depth = focal_length × baseline / disparity.
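As a concrete sketch over a read-back disparity buffer (plain Rust, not a VX kernel), treating zero disparity as invalid:

```rust
/// Convert a disparity map to metric depth: depth = focal * baseline / disparity.
/// Zero (or negative) disparity marks invalid pixels and maps to depth 0.0.
fn disparity_to_depth(disparity: &[f32], focal_px: f32, baseline_m: f32) -> Vec<f32> {
    disparity.iter()
        .map(|&d| if d > 0.0 { focal_px * baseline_m / d } else { 0.0 })
        .collect()
}
```

For example, with a 700 px focal length and a 12 cm baseline, a disparity of 64 px gives 700 × 0.12 / 64 ≈ 1.31 m.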

Step 2: Depth Cleanup

Apply edge-preserving bilateral filtering and hole filling:

#![allow(unused)]
fn main() {
use vx_vision::kernels::depth_filter::{DepthFilter, DepthFilterConfig};
use vx_vision::kernels::depth_inpaint::{DepthInpaint, DepthInpaintConfig};

let filter = DepthFilter::new(&ctx)?;
let inpaint = DepthInpaint::new(&ctx)?;

// Bilateral: smooth while preserving edges
let filtered = ctx.texture_output_r32float(w, h)?;
filter.apply_bilateral(&ctx, &depth, &filtered, &DepthFilterConfig::new(3, 5.0, 0.05))?;

// Fill holes
let filled = ctx.texture_intermediate_r32float(w, h)?;
inpaint.apply(&ctx, &filtered, &filled, &DepthInpaintConfig::default())?;
}

Step 3: Point Cloud Generation

Unproject the depth map to 3D:

#![allow(unused)]
fn main() {
use vx_vision::kernels::depth_to_cloud::{DepthToCloud, DepthToCloudConfig};
use vx_vision::types_3d::{CameraIntrinsics, DepthMap};

let d2c = DepthToCloud::new(&ctx)?;
let intrinsics = CameraIntrinsics::new(fx, fy, cx, cy, width, height);
let depth_map = DepthMap::new(depth_texture, intrinsics, 0.1, 10.0)?;

let cloud = d2c.compute(&ctx, &depth_map, Some(&rgb_texture), &DepthToCloudConfig::new(0.1, 10.0))?;
}

Step 4: Point Cloud Processing

Clean the cloud: estimate normals, remove outliers, downsample:

#![allow(unused)]
fn main() {
use vx_vision::kernels::normal_estimation::{NormalEstimator, NormalEstimatorConfig};
use vx_vision::kernels::outlier_filter::{OutlierFilter, OutlierFilterConfig};
use vx_vision::kernels::voxel_downsample::{VoxelDownsample, VoxelDownsampleConfig};

let estimator = NormalEstimator::new(&ctx)?;
let normals = estimator.compute(&ctx, &cloud, &NormalEstimatorConfig::default())?;

let filter = OutlierFilter::new(&ctx)?;
let clean = filter.filter(&ctx, &cloud, &OutlierFilterConfig::default())?;

let ds = VoxelDownsample::new(&ctx)?;
let downsampled = ds.downsample(&ctx, &clean, &VoxelDownsampleConfig::new(0.01))?;
}

Step 5: TSDF Fusion (Multi-Frame)

For multiple depth frames, fuse them into a volumetric representation:

#![allow(unused)]
fn main() {
use vx_vision::kernels::tsdf::{TSDFVolume, TSDFConfig};

let mut config = TSDFConfig::default();
config.resolution = [256, 256, 256];
config.voxel_size = 0.005;
let tsdf = TSDFVolume::new(&ctx, config)?;

for (depth_frame, camera_pose) in frames.iter() {
    tsdf.integrate(&ctx, depth_frame, camera_pose)?;
}
}

Step 6: Mesh Extraction

Extract a triangle mesh from the TSDF volume using Marching Cubes:

#![allow(unused)]
fn main() {
use vx_vision::kernels::marching_cubes::{MarchingCubes, MarchingCubesConfig};

let mut mesh = MarchingCubes::extract(tsdf.volume(), &MarchingCubesConfig::default());
mesh.compute_normals();
mesh.weld_vertices(0.001);
}

Step 7: Export

#![allow(unused)]
fn main() {
use vx_vision::export;

export::write_obj("reconstruction.obj", &mesh)?;
export::write_ply_ascii("cloud.ply", &cloud)?;
}

Visualization

Render results to offscreen textures for inspection:

#![allow(unused)]
fn main() {
use vx_vision::render_context::{Camera, RenderTarget};
use vx_vision::renderers::mesh_renderer::MeshRenderer;

let renderer = MeshRenderer::new(&ctx)?;
let target = RenderTarget::new(&ctx, 1920, 1080)?;
let camera = Camera { position: [0.0, 0.0, 3.0], ..Camera::default() };

renderer.render(&ctx, &mesh, &camera, &target)?;
let pixels = target.read_rgba8();
}

Feature Flags

| Flag | What it enables |
|------|-----------------|
| reconstruction | All 3D types, depth kernels, point cloud ops, TSDF, meshing |
| visualization | Point cloud and mesh renderers, render targets |
| datasets | TUM, EuRoC, KITTI dataset loaders |
| full | Everything |

Performance Notes

  • SGM stereo is the most expensive kernel — O(width × height × disparities). Use smaller disparity ranges when possible.
  • TSDF integration is fast (~1ms per frame for 128³ volumes) thanks to UMA zero-copy.
  • Marching Cubes runs on CPU but reads directly from GPU shared memory. A 256³ volume takes ~500ms.
  • Point cloud operations (normals, outliers) are O(N²) brute-force for k-NN. For clouds >100K points, consider reducing with voxel downsampling first.

Pipeline & Performance

Pipeline batching

By default, each kernel’s sync method (apply, compute, detect) creates its own command buffer and waits for completion. For multi-stage pipelines, this means N GPU round-trips.

Pipeline batches everything into a single command buffer:

#![allow(unused)]
fn main() {
use vx_vision::Pipeline;

let pipe = Pipeline::begin(&ctx)?;
let cmd = pipe.cmd_buf();

let s1 = blur.encode(&ctx, cmd, &input, &temp1, &blur_cfg)?;
bilateral.encode(cmd, &temp1, &temp2, &bilateral_cfg)?;
morph.encode_dilate(cmd, &temp2, &output, &morph_cfg)?;

let _retained = pipe.commit_and_wait();
}

Encoded state (like s1 above) holds intermediate textures that must outlive the command buffer.

Which kernels support encoding?

| Encodable | Not encodable (multi-pass) |
|-----------|----------------------------|
| Gaussian, Bilateral, Sobel, Canny, Morphology, Threshold, Color, Warp, Integral, Dense Flow, FAST, Harris, NMS, ORB, KLT, Resize, Undistort, DoG (subtract only) | Histogram, Homography, Connected Components, Hough |

Multi-pass kernels require CPU readback between GPU passes, so they can’t be batched.

TexturePool

GPU texture allocation is expensive. Reuse textures across frames:

#![allow(unused)]
fn main() {
use vx_vision::TexturePool;

let mut pool = TexturePool::new();

for frame in frames {
    let temp = pool.acquire_gray8(&ctx, w, h)?;  // reuses cached texture
    blur.apply(&ctx, &frame, &temp, &cfg)?;
    // ... process ...
    pool.release(temp);  // return to pool
}

println!("Hit rate: {:.0}%", pool.hit_rate() * 100.0);
}

The pool keys by (width, height, format). All pool textures have ShaderRead | ShaderWrite flags.

Optimization tips

Reuse kernel structs. Creating a kernel compiles the Metal pipeline. Do it once at startup.

#![allow(unused)]
fn main() {
let blur = GaussianBlur::new(&ctx)?;  // once
for frame in frames {
    blur.apply(&ctx, &frame, &output, &cfg)?;  // reuse
}
}

Avoid unnecessary readbacks. read_gray8() forces GPU sync. If the output feeds another kernel, pass the texture directly.

Downsample first. Run feature detection on half-resolution images when full resolution isn’t needed:

#![allow(unused)]
fn main() {
let levels = pyr.build(&ctx, &input, 3)?;
let corners = fast.detect(&ctx, &levels[0], &cfg)?;  // half-res
}

Batch with Pipeline. One command buffer is faster than five:

#![allow(unused)]
fn main() {
let pipe = Pipeline::begin(&ctx)?;
// encode 5 kernels into pipe.cmd_buf()
pipe.commit_and_wait();
}

Memory model

On Apple Silicon (UMA), CPU and GPU share physical memory. VX uses MTLStorageModeShared — no copies, no uploads, no downloads. waitUntilCompleted() is the only synchronization needed.

GpuGuard<T> in vx-gpu prevents CPU reads of a UnifiedBuffer<T> while the GPU is using it, catching race conditions at runtime.

Benchmarking

Run the built-in criterion benchmarks:

cargo bench -p vx-vision

Benchmarks include:

  • FAST at 752x480 and 1920x1080
  • Full FAST → Harris → NMS → ORB pipeline at both resolutions
  • Pipeline vs individual dispatch comparison (3x Gaussian)

Examples Walkthrough

All examples take an image path as argument:

cargo run --release --example <name> -- path/to/image.png

fast_demo

Detects FAST corners, scores with Harris, suppresses with NMS. Prints corner count and timing.

Demonstrates: Feature detection pipeline, FastDetector, HarrisScorer, NmsSuppressor.

edge_detection_demo

Runs Gaussian blur → Sobel gradients → Canny edges. Reports timing for each stage and pixel statistics.

Demonstrates: Image processing chain, GaussianBlur, SobelFilter, CannyDetector.

threshold_demo

Compares thresholding methods: histogram analysis, Otsu’s automatic threshold, fixed binary, integral image computation, and adaptive threshold. Prints timing and foreground percentages.

Demonstrates: Histogram, Threshold (all modes), IntegralImage.

advanced_cv_demo

Runs five algorithms on one image: bilateral filter, Canny + Hough line detection, Otsu + distance transform, connected components, and template matching (self-patch). Prints detailed results for each.

Demonstrates: Full range of analysis kernels.

feature_matching_demo

Detects ORB features in two images, matches with brute-force Hamming distance, and reports match statistics. Also runs SIFT detection for comparison.

Demonstrates: OrbDescriptor, BruteMatcher, SiftPipeline.

klt_benchmark

Loads a sequence of PNG frames (e.g., from EuRoC dataset), detects FAST corners on the first frame, then tracks them through subsequent frames using KLT optical flow. Reports per-frame timing and track survival rate.

Demonstrates: KltTracker, multi-frame processing, re-detection strategy.

orb_stereo_benchmark

Runs the full stereo pipeline on synthetic or real stereo pairs: FAST detection, Harris scoring, NMS, ORB descriptors, stereo matching with epipolar constraints. Reports 3D point triangulation results.

Demonstrates: StereoMatcher, full detection-to-3D pipeline.

pipeline_pool_demo

Benchmarks three approaches to multi-frame processing: individual dispatches, pipeline batching, and pipeline + TexturePool. Reports timing comparison and pool hit rates.

Demonstrates: Pipeline, TexturePool, performance comparison.


3D Reconstruction Examples

These require the reconstruction feature: cargo run --release --features reconstruction --example <name>

depth_to_cloud_demo

Takes a grayscale image, creates a synthetic stereo pair, runs SGM stereo matching, colorizes the depth map, unprojects to a 3D point cloud, and exports to PLY.

cargo run --release --features reconstruction --example depth_to_cloud_demo -- path/to/image.png

Demonstrates: SGMStereo, DepthColorize, DepthToCloud, PointCloud, PLY export.

point_cloud_processing_demo

Generates a synthetic sphere point cloud with noise and outliers, then demonstrates the full processing pipeline: normal estimation, outlier removal, voxel downsampling, and PLY export. No image input needed.

cargo run --release --features reconstruction --example point_cloud_processing_demo

Demonstrates: NormalEstimator, OutlierFilter, VoxelDownsample, PointCloud.

tsdf_fusion_demo

Creates a TSDF volume, generates synthetic depth frames of a sphere from multiple views, integrates them into the volume, extracts surface points and a triangle mesh via Marching Cubes, and exports to OBJ and PLY. No image input needed.

cargo run --release --features reconstruction --example tsdf_fusion_demo

Demonstrates: TSDFVolume, MarchingCubes, Mesh, OBJ/PLY export, the complete depth→volume→mesh pipeline.

Adding a Kernel

Five steps to add a new GPU kernel to VX.

1. Write the Metal shader

Create vx-vision/shaders/YourKernel.metal:

#include <metal_stdlib>
using namespace metal;

struct YourParams {
    uint width;
    uint height;
    float some_param;
};

kernel void your_kernel(
    texture2d<float, access::read>  input  [[texture(0)]],
    texture2d<float, access::write> output [[texture(1)]],
    constant YourParams& params            [[buffer(0)]],
    uint2 gid                              [[thread_position_in_grid]]
) {
    if (gid.x >= params.width || gid.y >= params.height) return;

    float4 pixel = input.read(gid);
    // ... your computation ...
    output.write(result, gid);
}

The build system auto-discovers .metal files — no build.rs changes needed.

2. Add the parameter struct

In vx-vision/src/types.rs:

#![allow(unused)]
fn main() {
#[repr(C)]
#[derive(Clone, Copy, Debug, Pod, Zeroable)]
pub struct YourParams {
    pub width: u32,
    pub height: u32,
    pub some_param: f32,
}
}

Must match the Metal struct field-by-field. Same types, same order, same padding. See the Architecture page for type mapping.
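One cheap guard against layout drift is to assert the Rust struct's size against the MSL layout (a sketch; the bytemuck derives from step 2 are omitted so it stands alone). Note that MSL `float3` is 16-byte aligned, so vector fields usually need explicit padding on the Rust side or `packed_float3` on the shader side:

```rust
// Mirror of the MSL struct from step 1: uint + uint + float = 12 bytes, no padding.
#[repr(C)]
#[derive(Clone, Copy)]
struct YourParams {
    width: u32,
    height: u32,
    some_param: f32,
}

// Assert the Rust mirror agrees with the MSL layout before shipping the kernel.
// (In real code, a const assertion or unit test; returns bool here for clarity.)
fn layout_matches_msl() -> bool {
    std::mem::size_of::<YourParams>() == 12
}
```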

3. Write the Rust kernel

Create vx-vision/src/kernels/your_kernel.rs:

#![allow(unused)]
fn main() {
use crate::context::Context;
use crate::error::{Error, Result};
use crate::texture::Texture;
use crate::types::YourParams;
// ... Metal imports ...

#[derive(Clone, Debug)]
#[non_exhaustive]
pub struct YourConfig {
    pub some_param: f32,
}

impl Default for YourConfig {
    fn default() -> Self {
        Self { some_param: 1.0 }
    }
}

pub struct YourKernel {
    pipeline: Retained<ProtocolObject<dyn MTLComputePipelineState>>,
}

impl YourKernel {
    pub fn new(ctx: &Context) -> Result<Self> {
        let name = objc2_foundation::ns_string!("your_kernel");
        let func = ctx.library().newFunctionWithName(name)
            .ok_or(Error::ShaderMissing("your_kernel".into()))?;
        let pipeline = ctx.device()
            .newComputePipelineStateWithFunction_error(&func)
            .map_err(|e| Error::PipelineCompile(format!("your_kernel: {e}")))?;
        Ok(Self { pipeline })
    }

    /// Sync method: creates command buffer, dispatches, waits.
    pub fn apply(
        &self, ctx: &Context, input: &Texture, output: &Texture, config: &YourConfig,
    ) -> Result<()> {
        let cmd_buf = ctx.queue().commandBuffer()
            .ok_or(Error::Gpu("failed to create command buffer".into()))?;
        self.encode_pass(&cmd_buf, input, output, config)?;
        cmd_buf.commit();
        cmd_buf.waitUntilCompleted();
        Ok(())
    }

    /// Pipeline encoding: writes into existing command buffer.
    pub fn encode(
        &self, cmd_buf: &ProtocolObject<dyn MTLCommandBuffer>,
        input: &Texture, output: &Texture, config: &YourConfig,
    ) -> Result<()> {
        self.encode_pass(cmd_buf, input, output, config)
    }

    fn encode_pass(/* ... */) -> Result<()> {
        // set pipeline, textures, params, dispatch
    }
}

unsafe impl Send for YourKernel {}
unsafe impl Sync for YourKernel {}
}

4. Register the module

In vx-vision/src/kernels/mod.rs:

#![allow(unused)]
fn main() {
pub mod your_kernel;
}

5. Add tests

In vx-vision/tests/test_kernels.rs, add a test that creates a synthetic image, runs the kernel, and verifies output properties.

Checklist

  • Metal shader compiles (check cargo build output)
  • #[repr(C)] struct matches MSL struct exactly
  • Kernel has both sync method and encode() for pipelining
  • Config struct has Default, #[non_exhaustive], Clone, Debug
  • Send + Sync implemented on kernel struct
  • Module registered in mod.rs
  • Test passes

Design Document

Internal engineering reference. Assumes familiarity with Rust and GPU compute concepts.


1. System Overview

Crate Topology

  • Application layer: examples/, downstream consumers
  • vx-vision: Context · Texture · Pipeline · TexturePool · 28 kernels · 42 Metal shaders · #[repr(C)] types
  • vx-core (published as vx-gpu): UnifiedBuffer<T> · GpuGuard<T> · device management
  • Bindings: objc2-metal · objc2 · objc2-foundation · bytemuck

Two workspace crates. vx-vision imports vx-core as vx_gpu. Both crates depend on objc2-metal for Metal bindings and bytemuck for safe #[repr(C)] transmutes. No other runtime dependencies — no wgpu, no ash, no cross-platform abstraction. Metal-only by design: this eliminates translation layers and enables direct access to Apple Silicon UMA semantics that cross-platform APIs abstract away.


2. Metal–Rust Bridge

objc2-metal

Metal is an Objective-C framework. Its API surface is protocol-based — MTLDevice, MTLCommandBuffer, etc. are protocols, not concrete classes. The runtime returns opaque objects conforming to these protocols.

objc2-metal generates Rust trait definitions from Metal’s protocol headers. Each protocol becomes a Rust trait. Since the concrete type behind the protocol is unknown at compile time, the binding uses type-erased wrappers:

#![allow(unused)]
fn main() {
Retained<ProtocolObject<dyn MTLComputePipelineState>>
}

Three layers here:

  • dyn MTLComputePipelineState — the Metal protocol, expressed as a Rust trait object
  • ProtocolObject<_> — type erasure wrapper for ObjC protocol conformance (analogous to dyn Trait but bridging ObjC’s protocol dispatch, not Rust’s vtable dispatch)
  • Retained<_> — ARC-compatible smart pointer. Holds a +1 reference count. Sends release on drop. This is the bridge between ObjC’s reference counting and Rust’s ownership semantics.

Retained means Rust owns the object. When a kernel struct holds Retained<ProtocolObject<dyn MTLComputePipelineState>>, the pipeline state lives exactly as long as the kernel struct. No manual retain/release, no leak risk, no double-free — Rust’s drop semantics handle it.

Metal Object Roles

| Object | Created by | Role | Cost | Reuse |
|--------|-----------|------|------|-------|
| MTLDevice | MTLCreateSystemDefaultDevice() | GPU handle, factory for everything | One-time | Always |
| MTLLibrary | device.newLibraryWithURL_error() | Container of compiled shader functions | One-time (loads metallib) | Always |
| MTLComputePipelineState | device.newComputePipelineStateWithFunction_error() | Compiled, optimized kernel ready for dispatch | Expensive (shader compilation, register allocation, occupancy calculation) | Always |
| MTLCommandQueue | device.newCommandQueue() | Serial scheduler that submits command buffers to GPU | One-time | Always |
| MTLCommandBuffer | queue.commandBuffer() | Single batch of GPU work | Cheap (pool-allocated internally by Metal) | Never (one-shot) |
| MTLComputeCommandEncoder | cmd_buf.computeCommandEncoder() | Records bind + dispatch commands into a command buffer | Cheap | Never (one-shot) |

The key insight: objects above the line (Device through Queue) are created once and stored for app lifetime. Objects below the line (CommandBuffer, Encoder) are created per-dispatch and discarded after use. This is why kernel structs store pipelines but create command buffers on every call.

Send + Sync

ObjC objects don’t carry Send/Sync in Rust’s type system. Every kernel struct, Context, and Texture needs explicit unsafe impls:

#![allow(unused)]
fn main() {
unsafe impl Send for FastDetector {}
unsafe impl Sync for FastDetector {}
}

Justification: Metal pipeline states are immutable after creation — Apple documents them as thread-safe. MTLDevice and MTLCommandQueue are thread-safe. MTLCommandBuffer is not thread-safe, but it’s never stored in a struct — it’s created and consumed within a single method call.

Without these impls, kernel structs couldn’t be stored in Arc, passed across threads, or used in async contexts.

GPU Command Lifecycle

The lifecycle: Device and Queue persist for the app lifetime; a CmdBuffer is created per batch and an Encoder per kernel. The encoder binds arguments (setPipeline, setBuffers), the dispatch launches the grid, commit() submits, and wait() fences and blocks. Command buffer and encoder live only in per-dispatch scope, created and consumed in one method call.

The Rust code mapping each step:

#![allow(unused)]
fn main() {
// ── App lifetime (Context::new) ──────────────────────────────────
let device  = MTLCreateSystemDefaultDevice();           // GPU handle
let queue   = device.newCommandQueue();                 // submission queue
let library = device.newLibraryWithURL_error(&url);     // compiled shaders

// ── App lifetime (Kernel::new) ───────────────────────────────────
let func     = library.newFunctionWithName(ns_string!("fast_detect"));  // function handle
let pipeline = device.newComputePipelineStateWithFunction_error(&func); // compiled kernel
// ↑ This is the expensive step: Metal compiles the function to GPU ISA,
//   determines register usage, calculates max occupancy. ~1-10ms per kernel.

// ── Per dispatch (Kernel::detect / apply / compute) ──────────────
let cmd_buf = queue.commandBuffer();                    // lightweight, pool-allocated
let encoder = cmd_buf.computeCommandEncoder();          // begins a compute pass

encoder.setComputePipelineState(&pipeline);             // which kernel to run
encoder.setTexture_atIndex(Some(tex.raw()), 0);         // [[texture(0)]] in shader
encoder.setBuffer_offset_atIndex(                       // [[buffer(0)]] in shader
    Some(buf.metal_buffer()), 0, 0);
encoder.setBytes_length_atIndex(                        // [[buffer(2)]] — inline constant data
    NonNull::new_unchecked(params_ptr), size, 2);       // memcpy'd into argument buffer
encoder.dispatchThreads_threadsPerThreadgroup(grid, tg); // launch the grid
encoder.endEncoding();                                   // seal this compute pass

cmd_buf.commit();              // submit to GPU — non-blocking, returns immediately
cmd_buf.waitUntilCompleted();  // block CPU until GPU finishes this command buffer
// After this point: all buffer/texture writes by the GPU are visible to CPU.
}

setBytes copies the struct directly into Metal’s argument buffer (max 4KB), avoiding a MTLBuffer allocation for small constant data. Every kernel uses this for its Params struct. For large data (keypoint arrays, descriptors), setBuffer is required — it binds an existing MTLBuffer by reference.


3. Memory Model

Apple Silicon UMA

Apple Silicon uses Unified Memory Architecture — CPU and GPU share the same physical DRAM. No PCIe bus. No DMA copies. A pointer to shared memory is valid on both processors.

MTLStorageModeShared exposes this: one allocation, one virtual address space, zero-copy access from both CPU and GPU. This is the only storage mode VX uses.

Trade-off: On discrete GPUs (NVIDIA/AMD), StorageModeShared would require PCIe transfers on every access. On Apple Silicon, it’s free — both CPU and GPU access the same cache hierarchy. This is a deliberate platform bet: zero-copy UMA is the core performance advantage that makes the “no cross-platform” design worthwhile.

Synchronization model: After cmd_buf.waitUntilCompleted(), all GPU writes are visible to CPU. No explicit cache flush needed — Metal’s command buffer completion acts as a full memory fence. Before that fence, CPU reads may see stale data (writes can be in GPU caches or reorder buffers).

UnifiedBuffer<T>

vx-core/src/buffer.rs

#![allow(unused)]
fn main() {
pub struct UnifiedBuffer<T: Pod> {
    raw: Retained<ProtocolObject<dyn MTLBuffer>>,  // the Metal buffer (+1 refcount)
    count: usize,                                   // number of T elements
    in_flight: Arc<AtomicBool>,                     // prevents CPU mutation during GPU work
    _marker: PhantomData<T>,                        // carries T without owning one
}
}

The Pod bound (bytemuck::Pod) guarantees the type has no padding traps, no invalid bit patterns, and is safe to interpret from arbitrary bytes. Metal writes raw bytes into the buffer — Pod ensures the CPU can safely interpret them as T after GPU completion. Every #[repr(C)] struct in types.rs derives Pod.

Allocation: newBufferWithLength_options(count * size_of::<T>(), StorageModeShared). Metal zero-initializes shared buffers on allocation. This is relied upon — atomic counter buffers start at zero by default (though kernels explicitly write [0u32] before dispatch for clarity).

CPU access paths:

  • as_slice() -> &[T] — casts MTLBuffer.contents() pointer to &[T]. No copy. The slice directly aliases GPU-visible memory.
  • as_mut_slice() -> &mut [T] — same, but asserts !in_flight. Panics if a GpuGuard exists. This prevents writing to memory the GPU is reading from.
  • write(&[T])copy_from_slice into the buffer. Panics if in-flight.
  • to_vec() -> Vec<T> — copies out. Use after GPU completion for safe owned data.

GpuGuard<T>

vx-core/src/buffer.rs

RAII guard that blocks CPU mutation while the buffer is in-flight on the GPU.

#![allow(unused)]
fn main() {
pub struct GpuGuard<T: Pod> {
    in_flight: Arc<AtomicBool>,  // shared with the UnifiedBuffer
    _marker: PhantomData<T>,
}

// Created: buf.gpu_guard() → sets in_flight = true (Release)
// Dropped: sets in_flight = false (Release)
// Checked: as_mut_slice() loads in_flight (Acquire) → panics if true
}

Atomic ordering:

  • store(true, Release) on guard creation — ensures all CPU writes to the buffer before creating the guard are visible before the flag becomes true. The GPU sees consistent data.
  • load(Acquire) in as_mut_slice() — ensures the CPU doesn’t reorder reads after the check to before it. If the flag is false, the GPU is done and all its writes are visible.
  • store(false, Release) on drop — pairs with the Acquire in as_mut_slice() to establish happens-before.

This is not a mutex — it’s a one-way gate. No contention, no blocking, just a panic if the contract is violated. The real synchronization is waitUntilCompleted() on the command buffer.
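The gate mechanics can be reproduced with std alone. This is a simplified stand-in for GpuGuard (not the vx-gpu types), returning a bool where the real as_mut_slice() panics:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Simplified stand-in for GpuGuard: flips a shared in-flight flag.
struct Gate(Arc<AtomicBool>);

impl Gate {
    fn new(in_flight: Arc<AtomicBool>) -> Self {
        in_flight.store(true, Ordering::Release); // prior CPU writes visible before flag flips
        Gate(in_flight)
    }
}

impl Drop for Gate {
    fn drop(&mut self) {
        self.0.store(false, Ordering::Release); // pairs with the Acquire load below
    }
}

// Stand-in for as_mut_slice(): the real code panics instead of returning false.
fn can_mutate(in_flight: &AtomicBool) -> bool {
    !in_flight.load(Ordering::Acquire)
}

fn demo() {
    let flag = Arc::new(AtomicBool::new(false));
    let guard = Gate::new(flag.clone()); // before commit()
    assert!(!can_mutate(&flag));         // GPU in flight: mutation refused
    drop(guard);                         // after waitUntilCompleted()
    assert!(can_mutate(&flag));          // GPU done: safe again
}
```

demo() walks the contract: mutation is refused while the guard lives and allowed again after it drops.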

Memory Lifecycle

The lifecycle is: Alloc (CPU) → Write (CPU) → Guard (lock) → Dispatch (GPU) → Wait (fence) → Drop Guard (unlock) → Read (CPU). CPU mutation panics anywhere in the window between Guard and Drop Guard.

Invariant: Guard created before commit. Dropped after waitUntilCompleted. Reversing this allows CPU reads of partially-written GPU data — silent corruption, not a crash.


4. Texture Subsystem

vx-vision/src/texture.rs

#![allow(unused)]
fn main() {
pub struct Texture {
    raw: Retained<ProtocolObject<dyn MTLTextureTrait>>,
    width: u32,
    height: u32,
    format: TextureFormat,
}
}

Textures are used over buffers for image data because Metal’s texture hardware provides spatial locality optimization (tiled/swizzled memory layout), hardware bilinear interpolation, and automatic [0,255] → [0.0,1.0] normalization on R8Unorm reads. A buffer has linear memory layout — worse cache behavior for 2D access patterns like convolution kernels.

Formats

| Format | MTLPixelFormat | Bytes/px | Readback | Shader behavior |
|--------|---------------|----------|----------|-----------------|
| R8Unorm | R8Unorm | 1 | read_gray8() → Vec&lt;u8&gt; | image.read(gid).r returns [0.0, 1.0] |
| R32Float | R32Float | 4 | read_r32float() → Vec&lt;f32&gt; | image.read(gid).r returns raw float |
| RGBA8Unorm | RGBA8Unorm | 4 | read_rgba8() → Vec&lt;u8&gt; | .rgba returns 4 normalized channels |

R8Unorm normalization matters: FAST multiplies by 255 to get integer intensity for threshold comparison. Sobel operates directly in [0,1] space. Every shader must account for the format it reads.

Usage Flags

| Role | Flag | Why |
|------|------|-----|
| Input | ShaderRead | GPU reads, CPU uploads via replaceRegion. Cheapest — Metal can optimize read-only layout. |
| Output | ShaderWrite | GPU writes. Cannot be sampled in the same pass. |
| Intermediate | ShaderRead \| ShaderWrite | Read in one pass, written in another. Required for multi-pass kernels (Gaussian H→V, Sobel→magnitude). |

Writing to a ShaderRead-only texture is undefined behavior — no GPU error, just corrupted output. Metal does not validate usage flags at dispatch time.

Upload & Readback

Both use MTLTexture’s replaceRegion (upload) and getBytes (readback). These are synchronous CPU-side memcpy operations. bytesPerRow must match: width * 1 for R8, width * 4 for R32Float and RGBA8.

Performance note: These operations copy data. For input textures, the copy happens once at creation. For output readback, it happens after GPU completion. In a real-time pipeline, prefer keeping data in textures across kernel stages rather than reading back to CPU between stages — use Pipeline for this.

TexturePool

vx-vision/src/pool.rs

#![allow(unused)]
fn main() {
pub struct TexturePool {
    buckets: HashMap<(u32, u32, TextureFormat), Vec<Texture>>,
    max_per_bucket: usize,  // default: 8
}
}

MTLTexture allocation is non-trivial — Metal must find contiguous GPU memory, configure tiling parameters, and set up page tables. In a real-time pipeline processing 30fps video, allocating and freeing intermediate textures per frame wastes time.

The pool caches by (width, height, format). acquire() pops from the matching bucket or allocates fresh. release() pushes back. Pool textures always have ShaderRead | ShaderWrite — they may serve as either input or output in different pipeline stages.

hit_rate() tracks effectiveness. If it’s low, the pipeline is using too many distinct texture sizes.
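A minimal CPU sketch of this bucketing scheme (stand-in Texture type and illustrative hit/miss counters; the real pool manages Metal textures):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum TextureFormat { R8Unorm, R32Float, Rgba8Unorm }

/// Stand-in for a GPU texture; the real pool stores VX Texture values.
struct Texture { width: u32, height: u32, format: TextureFormat }

struct TexturePool {
    buckets: HashMap<(u32, u32, TextureFormat), Vec<Texture>>,
    max_per_bucket: usize,
    hits: u64,
    misses: u64,
}

impl TexturePool {
    fn new() -> Self {
        Self { buckets: HashMap::new(), max_per_bucket: 8, hits: 0, misses: 0 }
    }

    /// Pop from the matching bucket, or "allocate" fresh on a miss.
    fn acquire(&mut self, w: u32, h: u32, f: TextureFormat) -> Texture {
        if let Some(t) = self.buckets.get_mut(&(w, h, f)).and_then(|v| v.pop()) {
            self.hits += 1;
            t
        } else {
            self.misses += 1;
            Texture { width: w, height: h, format: f }
        }
    }

    /// Return a texture to its bucket; drop it if the bucket is full.
    fn release(&mut self, t: Texture) {
        let bucket = self.buckets.entry((t.width, t.height, t.format)).or_default();
        if bucket.len() < self.max_per_bucket {
            bucket.push(t);
        }
    }

    fn hit_rate(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 { 0.0 } else { self.hits as f64 / total as f64 }
    }
}

fn main() {
    let mut pool = TexturePool::new();
    let t = pool.acquire(640, 480, TextureFormat::R8Unorm); // miss: fresh allocation
    pool.release(t);
    let _t = pool.acquire(640, 480, TextureFormat::R8Unorm); // hit: reused
    assert!((pool.hit_rate() - 0.5).abs() < 1e-9);
}
```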


5. Context & Pipeline

Context

vx-vision/src/context.rs

#![allow(unused)]
fn main() {
pub struct Context {
    device: Retained<ProtocolObject<dyn MTLDevice>>,
    queue: Retained<ProtocolObject<dyn MTLCommandQueue>>,
    library: Retained<ProtocolObject<dyn MTLLibrary>>,
}
}

Single entry point. Context::new() does three things:

  1. MTLCreateSystemDefaultDevice() — gets the system GPU
  2. device.newCommandQueue() — creates the submission queue
  3. Loads the embedded metallib (see Build System)

device(), queue(), library() are pub(crate) — kernel code uses them internally, but downstream consumers only interact through Context and kernel structs.

One queue per Context. Metal command queues are serial — command buffers submitted to the same queue execute in order. This simplifies synchronization: no explicit fences between dependent dispatches. For parallel kernel execution, create multiple Context instances (each with its own queue).

Pipeline

vx-vision/src/pipeline.rs

#![allow(unused)]
fn main() {
pub struct Pipeline {
    cmd_buf: Retained<ProtocolObject<dyn MTLCommandBuffer>>,
    retained: Vec<Texture>,
    committed: bool,
}
}

Batches multiple kernel dispatches into a single MTLCommandBuffer. This matters because command buffer creation has overhead — Metal must acquire a buffer from its internal pool, set up completion handlers, and track resources.

Texture retention problem: When a kernel’s encode() creates an intermediate texture, that texture would be dropped when encode() returns. But the GPU hasn’t run yet — the command buffer only records commands, it doesn’t execute them until commit(). If the texture is freed, the GPU reads garbage.

retain(tex) moves the texture into the pipeline’s Vec<Texture>, keeping it alive until commit_and_wait() returns.

Pipeline Batching Flow

Within one Pipeline (single MTLCommandBuffer): begin() → gaussian.encode() (encoder 1, endEncoding) → retain(tex) → fast.encode() (encoder 2, endEncoding) → commit_and_wait(). Multiple encoders share one command buffer; Metal executes the passes sequentially, and the single commit/wait amortizes submission overhead across all passes.

6. Kernel Taxonomy

Every kernel struct follows the same construction pattern:

  1. new(&Context) — compile pipeline(s) from the metallib. Store as Retained<...>.
  2. Sync method (detect / apply / compute) — allocate per-dispatch resources, encode, commit, wait, readback.
  3. encode(...) — record commands without committing. For Pipeline batching.
  4. unsafe impl Send + Sync — pipeline states are immutable, thread-safe.

The difference between kernel categories is what goes in and what comes out.

6.1 Buffer-Output Kernels

Variable-length structured output. The shader decides at runtime how many results to emit.

Mechanism: A device atomic_uint* counter buffer starts at zero. Each thread that produces a result atomically increments the counter and writes to that slot:

uint slot = atomic_fetch_add_explicit(count, 1, memory_order_relaxed);
if (slot < params.max_corners) {
    corners[slot] = result;
}

memory_order_relaxed is sufficient — we only need the atomic increment to be unique, not ordered relative to other atomics. The slot assignment is the synchronization.

An alternative is prefix scan (parallel exclusive scan) for deterministic output ordering, but that requires two passes and shared memory coordination. Atomic append is one pass, simpler, and sufficient — output ordering isn’t needed since downstream kernels (Harris, NMS) operate on the full set regardless of order.
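The append pattern can be simulated on the CPU with std::sync::atomic — a sketch showing why Relaxed suffices and why readback clamps the count to capacity:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// CPU analogue of the shader's atomic append. Relaxed is enough because each
/// fetch_add returns a unique slot; no cross-thread ordering is required.
fn parallel_append(n_threads: usize, max_out: u32) -> (Vec<u32>, u32) {
    let count = Arc::new(AtomicU32::new(0)); // must start at zero, like the GPU counter
    let mut handles = Vec::new();
    for _ in 0..n_threads {
        let count = Arc::clone(&count);
        handles.push(thread::spawn(move || {
            let slot = count.fetch_add(1, Ordering::Relaxed);
            // The shader's bounds check: only write if the slot fits.
            if slot < max_out { Some(slot) } else { None }
        }));
    }
    let mut slots: Vec<u32> = handles.into_iter().filter_map(|h| h.join().unwrap()).collect();
    slots.sort();
    (slots, count.load(Ordering::Relaxed))
}

fn main() {
    let (slots, total) = parallel_append(100, 64);
    assert_eq!(total, 100);      // the counter may exceed capacity...
    assert_eq!(slots.len(), 64); // ...but only max_out slots are written
    assert_eq!(slots, (0..64).collect::<Vec<u32>>()); // slots are unique and dense
}
```

This also mirrors the CPU readback step: the count is clamped to the buffer capacity before slicing, exactly because the counter can over-increment.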

Flow: R8Unorm texture → compute shader (2D grid, 1 thread/pixel, or 1D, 1 thread/element; constant Params& bound alongside) → atomic_fetch_add assigns a slot → Buffer&lt;T&gt; of up to N results plus an atomic_uint count → CPU readback of buf[..count].to_vec().

Annotated walkthrough: FastDetector::detect() (vx-vision/src/kernels/fast.rs)

#![allow(unused)]
fn main() {
pub fn detect(&self, ctx: &Context, texture: &Texture, config: &FastDetectConfig)
    -> Result<FastDetectResult>
{
    let w = texture.width();
    let h = texture.height();

    // Pre-allocate max capacity. The GPU fills [0..actual_count].
    // Over-allocation is cheap on UMA — no copy, just page table entries.
    let corner_buf = UnifiedBuffer::<CornerPoint>::new(ctx.device(), config.max_corners as usize)?;
    let mut count_buf = UnifiedBuffer::<u32>::new(ctx.device(), 1)?;
    count_buf.write(&[0u32]);  // Zero the atomic counter. Forgetting this → stale count from
                                // previous dispatch or random memory, causing buffer overread.

    // Build the params struct. Field order must exactly match FASTParams in FastDetect.metal.
    let params = FASTParams { threshold: config.threshold, max_corners: config.max_corners,
                              width: w, height: h };

    // Guards BEFORE dispatch — marks buffers as in-flight.
    let _corner_guard = corner_buf.gpu_guard();
    let _count_guard  = count_buf.gpu_guard();

    // Create command buffer and encoder.
    let cmd_buf = ctx.queue().commandBuffer()
        .ok_or(Error::Gpu("failed to create command buffer".into()))?;
    let encoder = cmd_buf.computeCommandEncoder()
        .ok_or(Error::Gpu("failed to create compute encoder".into()))?;

    // Bind pipeline, textures, buffers, params.
    // Index arguments (0, 1, 2) must match [[texture(0)]], [[buffer(0)]], [[buffer(1)]], [[buffer(2)]]
    // in the .metal file. A mismatch binds the wrong data — silent corruption, not a crash.
    encoder.setComputePipelineState(&self.pipeline);
    encoder.setTexture_atIndex(Some(texture.raw()), 0);
    encoder.setBuffer_offset_atIndex(Some(corner_buf.metal_buffer()), 0, 0);
    encoder.setBuffer_offset_atIndex(Some(count_buf.metal_buffer()), 0, 1);
    // SAFETY: params outlives the call; setBytes copies it inline into the command stream.
    unsafe {
        encoder.setBytes_length_atIndex(
            NonNull::new_unchecked(&params as *const _ as *mut c_void),
            mem::size_of::<FASTParams>(), 2);
    }

    // 2D dispatch: one thread per pixel. threadExecutionWidth is the SIMD width (32 on Apple Silicon).
    // maxTotalThreadsPerThreadgroup is typically 1024. So threadgroup = 32x32 = 1024 threads.
    // Edge threadgroups are partial — the shader's bounds check handles out-of-range gids.
    let tew = self.pipeline.threadExecutionWidth();
    let max_tg = self.pipeline.maxTotalThreadsPerThreadgroup();
    let grid = MTLSize { width: w as usize, height: h as usize, depth: 1 };
    let tg = MTLSize { width: tew, height: (max_tg / tew).max(1), depth: 1 };
    encoder.dispatchThreads_threadsPerThreadgroup(grid, tg);
    encoder.endEncoding();

    cmd_buf.commit();
    cmd_buf.waitUntilCompleted();  // blocks until GPU finishes

    // Drop guards → unlocks buffers for CPU access.
    drop(_corner_guard);
    drop(_count_guard);

    // Read actual count, clamp to max (shader may have over-incremented the atomic
    // if more corners found than buffer capacity), slice the buffer.
    let n = (count_buf.as_slice()[0] as usize).min(config.max_corners as usize);
    Ok(FastDetectResult { corners: corner_buf.as_slice()[..n].to_vec() })
}
}

Buffer-output kernel catalog:

| Kernel | Struct | Shader | Dispatch | Output type |
|---|---|---|---|---|
| FAST-9 | FastDetector | fast_detect | 2D per-pixel | Vec&lt;CornerPoint&gt; |
| Harris | HarrisScorer | harris_response | 1D per-corner | Vec&lt;CornerPoint&gt; (scored) |
| NMS | NmsSuppressor | nms_suppress | 1D per-corner | Vec&lt;CornerPoint&gt; (filtered) |
| ORB | OrbDescriptor | orb_compute | 1D per-keypoint | Vec&lt;ORBOutput&gt; |
| Matcher | BruteMatcher | hamming_distance + extract_matches | 2D + 1D | Vec&lt;MatchResult&gt; |
| StereoMatch | StereoMatcher | stereo_match | 2D | Vec&lt;StereoMatchResult&gt; |
| Histogram | HistogramComputer | histogram_compute | 2D per-pixel | Vec&lt;u32&gt; (256 bins) |
| Hough | HoughDetector | hough_vote + hough_peaks | 2D + 1D | Vec&lt;HoughLine&gt; |
| Homography | HomographyScorer | score_homography | 1D per-point | Vec&lt;ScoreResult&gt; |

6.2 Texture-to-Texture Kernels

Fixed-size output. One output pixel per input pixel (or per output pixel for resize/warp). No atomic counters, no variable-length output.

Flow: input texture [[texture(0)]] → compute shader (2D grid, 1 thread/pixel: read → transform → write) → output texture [[texture(1)]].

Annotated walkthrough: GaussianBlur::apply() (vx-vision/src/kernels/gaussian.rs)

Separable 2-pass convolution. A 2D Gaussian with radius r requires (2r+1)^2 reads per pixel. Separating into H+V passes reduces to 2*(2r+1) reads — O(r) instead of O(r²).

#![allow(unused)]
fn main() {
pub fn apply(&self, ctx: &Context, input: &Texture, output: &Texture,
             config: &GaussianConfig) -> Result<()>
{
    let w = input.width();
    let h = input.height();

    // Intermediate texture for horizontal pass output / vertical pass input.
    // Must be ShaderRead|ShaderWrite — written by pass 1, read by pass 2.
    // R32Float avoids precision loss during intermediate accumulation.
    let intermediate = Texture::intermediate_r32float(ctx.device(), w, h)?;

    let params = GaussianParams { width: w, height: h, sigma: config.sigma, radius: config.radius };
    let cmd_buf = ctx.queue().commandBuffer()?;

    // Pass 1: horizontal blur. Each thread reads (2*radius+1) horizontal neighbors.
    // Clamp-to-edge boundary: shader does clamp(x + dx, 0, width-1).
    // Weights computed inline: exp(-dx²/(2σ²)). Not precomputed — ALU is faster than
    // the memory load that a weight LUT would require at these kernel sizes.
    {
        let enc = cmd_buf.computeCommandEncoder()
            .ok_or(Error::Gpu("failed to create compute encoder".into()))?;
        enc.setComputePipelineState(&self.h_pipeline);
        enc.setTexture_atIndex(Some(input.raw()), 0);        // source
        enc.setTexture_atIndex(Some(intermediate.raw()), 1);  // dest
        enc.setBytes_length_atIndex(/* params */, /* size */, 0);
        enc.dispatchThreads_threadsPerThreadgroup(grid_2d(w, h), tg_2d(&self.h_pipeline));
        enc.endEncoding();  // seal pass 1
    }

    // Pass 2: vertical blur on the horizontally-blurred intermediate.
    // Metal guarantees pass 2 sees pass 1's writes — encoders within a command buffer
    // execute in order with implicit barriers between them.
    {
        let enc = cmd_buf.computeCommandEncoder()
            .ok_or(Error::Gpu("failed to create compute encoder".into()))?;
        enc.setComputePipelineState(&self.v_pipeline);
        enc.setTexture_atIndex(Some(intermediate.raw()), 0);  // source (pass 1 output)
        enc.setTexture_atIndex(Some(output.raw()), 1);         // final dest
        enc.setBytes_length_atIndex(/* params */, /* size */, 0);
        enc.dispatchThreads_threadsPerThreadgroup(grid_2d(w, h), tg_2d(&self.v_pipeline));
        enc.endEncoding();
    }

    cmd_buf.commit();
    cmd_buf.waitUntilCompleted();
    // intermediate texture dropped here — its only purpose was the H→V handoff.
    Ok(())
}
}

Multi-pass flow:

Input → Pass 1 (H): gaussian_blur_h → intermediate (R32Float, ShaderRead | ShaderWrite) → Pass 2 (V): gaussian_blur_v → Output.
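A CPU reference makes the separability argument concrete. This sketch (not VX code) checks that H-then-V with clamp-to-edge reproduces the full 2D convolution, because exp(-(dx²+dy²)/2σ²) = exp(-dx²/2σ²)·exp(-dy²/2σ²):

```rust
fn weight(d: i32, sigma: f32) -> f32 {
    (-((d * d) as f32) / (2.0 * sigma * sigma)).exp()
}

/// Clamp-to-edge boundary handling, as the shaders do.
fn clamp(v: i32, max: i32) -> usize {
    v.clamp(0, max - 1) as usize
}

/// Naive 2D blur: (2r+1)² reads per pixel.
fn blur_naive(img: &[f32], w: i32, h: i32, r: i32, sigma: f32) -> Vec<f32> {
    let norm: f32 = (-r..=r).map(|d| weight(d, sigma)).sum();
    let mut out = vec![0.0; (w * h) as usize];
    for y in 0..h {
        for x in 0..w {
            let mut acc = 0.0;
            for dy in -r..=r {
                for dx in -r..=r {
                    acc += weight(dx, sigma) * weight(dy, sigma)
                        * img[clamp(y + dy, h) * w as usize + clamp(x + dx, w)];
                }
            }
            out[(y * w + x) as usize] = acc / (norm * norm);
        }
    }
    out
}

/// One separable pass: 2r+1 reads per pixel, along x or y.
fn blur_pass(img: &[f32], w: i32, h: i32, r: i32, sigma: f32, horizontal: bool) -> Vec<f32> {
    let norm: f32 = (-r..=r).map(|d| weight(d, sigma)).sum();
    let mut out = vec![0.0; (w * h) as usize];
    for y in 0..h {
        for x in 0..w {
            let mut acc = 0.0;
            for d in -r..=r {
                let (sx, sy) = if horizontal { (clamp(x + d, w), y as usize) }
                               else          { (x as usize, clamp(y + d, h)) };
                acc += weight(d, sigma) * img[sy * w as usize + sx];
            }
            out[(y * w + x) as usize] = acc / norm;
        }
    }
    out
}

fn main() {
    let (w, h, r, sigma) = (8, 6, 2, 1.5);
    let img: Vec<f32> = (0..w * h).map(|i| ((i * 7) % 13) as f32).collect();
    let naive = blur_naive(&img, w, h, r, sigma);
    let sep = blur_pass(&blur_pass(&img, w, h, r, sigma, true), w, h, r, sigma, false);
    for (a, b) in naive.iter().zip(&sep) {
        assert!((a - b).abs() < 1e-4); // H+V matches the full 2D convolution
    }
}
```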

Texture-to-texture kernel catalog:

| Kernel | Struct | Shader(s) | Passes | Notes |
|---|---|---|---|---|
| Gaussian | GaussianBlur | gaussian_blur_h, gaussian_blur_v | 2 | Separable, O(r) vs O(r²) |
| Sobel | SobelFilter | sobel_3x3, gradient_magnitude | 2 | Outputs: gx, gy, magnitude, direction |
| Canny | CannyDetector | Sobel + canny_hysteresis | 3 | Composes Sobel internally |
| Threshold | ThresholdFilter | threshold_binary / adaptive / otsu | 1–2 | Otsu needs histogram first |
| Color | ColorConverter | rgba_to_gray, gray_to_rgba, rgba_to_hsv, hsv_to_rgba | 1 | Per-pixel, no neighbors |
| Morphology | MorphFilter | morph_erode, morph_dilate | 1–2 | Open = erode+dilate |
| Pyramid | ImagePyramidBuilder | pyramid_downsample | N | 4 levels = 3 downsamples |
| Resize | ResizeFilter | bilinear_resize | 1 | Grid = output dimensions |
| Warp | WarpFilter | warp_affine, warp_perspective | 1 | Inverse transform per pixel |
| Bilateral | BilateralFilter | bilateral_filter | 1 | O(r²) — not separable |
| Dense Flow | DenseFlowComputer | dense_flow | 1 | Horn-Schunck |
| Connected | ConnectedComponents | ccl_* | iterative | Label propagation until convergence |
| Distance | DistanceTransform | jfa_seed, jfa_step, jfa_distance | 2+N | Jump Flooding, O(log n) passes |
| Template | TemplateMatcher | template_match_ncc | 1 | NCC score map |
| Integral | IntegralImage | integral_* | multi | Prefix sum (row then column) |

6.3 Hybrid Kernels

Consume textures, produce buffer output — or both.

KLT Optical Flow (vx-vision/src/kernels/klt.rs): Binds 8 textures (4-level pyramid × 2 frames) at indices 0–7, plus 3 buffers (prev_points, curr_points, status). Dispatch is 1D: one thread per point. Each thread does iterative Lucas-Kanade at each pyramid level (coarse-to-fine), reading texture neighborhoods and writing final position + tracking status to buffers.

DoG (Difference-of-Gaussians): Two-phase. First phase subtracts adjacent scale-space textures (texture→texture). Second phase finds 3D extrema across scale and space, appending DoGKeypoint to a buffer via atomic counter (texture→buffer).

6.4 Utility Kernels

IndirectDispatch (vx-vision/src/kernels/indirect.rs): Solves the “CPU round-trip” problem in FAST→Harris chaining.

FAST produces N corners (unknown until GPU completes). Without indirect dispatch: commit FAST → wait → read count → dispatch Harris with count. Two command buffers, one CPU stall.

With indirect dispatch: prepare_indirect_args reads the atomic counter and computes MTLDispatchThreadgroupsIndirectArguments on the GPU. Harris then dispatches using dispatchThreadgroupsWithIndirectBuffer — the GPU reads the thread count from the args buffer directly. Everything stays in one command buffer, zero CPU round-trips.
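A CPU model of the arithmetic prepare_indirect_args presumably performs on the GPU (the ceil-division is an assumption; the field name mirrors MTLDispatchThreadgroupsIndirectArguments):

```rust
/// Mirrors MTLDispatchThreadgroupsIndirectArguments: three u32 threadgroup counts.
#[derive(Debug, PartialEq)]
struct DispatchThreadgroupsIndirectArguments {
    threadgroups_per_grid: [u32; 3],
}

/// One 1D threadgroup of `tg_width` threads per `tg_width` corners, rounded up —
/// a sketch of what the GPU-side kernel would compute from the atomic counter.
fn prepare_indirect_args(corner_count: u32, tg_width: u32) -> DispatchThreadgroupsIndirectArguments {
    DispatchThreadgroupsIndirectArguments {
        threadgroups_per_grid: [(corner_count + tg_width - 1) / tg_width, 1, 1],
    }
}

fn main() {
    // 1000 corners, SIMD width 32 → 32 threadgroups (the last one partial).
    assert_eq!(prepare_indirect_args(1000, 32).threadgroups_per_grid, [32, 1, 1]);
    assert_eq!(prepare_indirect_args(0, 32).threadgroups_per_grid, [0, 1, 1]);
}
```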

Implementation note: objc2-metal doesn’t bind dispatchThreadgroupsWithIndirectBuffer. The code uses raw msg_send! to call it.

6.5 3D Reconstruction Kernels

Feature-gated: #[cfg(feature = "reconstruction")].

| Category | Kernels | Pattern |
|---|---|---|
| Depth processing | DepthFilter (bilateral + median), DepthInpaint (JFA hole-fill), DepthColorize | texture→texture |
| Stereo | SGMStereo (Semi-Global Matching) | texture→texture (disparity map) |
| Point cloud | DepthToCloud (unprojection), OutlierFilter, VoxelDownsample, NormalEstimation | texture/buffer→buffer |
| Volumetric | TSDFIntegrate, TSDFRaycast, MarchingCubes | buffer→buffer (3D voxel grids) |
| Geometry | Triangulate | buffer→buffer |

Same patterns as core kernels. TSDF uses 3D buffer indexing (res_x * res_y * res_z voxels). MarchingCubes outputs triangle meshes to buffers with atomic vertex/index counters.
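The 3D indexing reduces to one line — a sketch assuming x-major order (whichever convention VX actually uses, the Rust and MSL sides must agree):

```rust
/// Linear index into a flat res_x * res_y * res_z voxel buffer, x fastest.
fn voxel_index(x: u32, y: u32, z: u32, res_x: u32, res_y: u32) -> usize {
    (x + y * res_x + z * res_x * res_y) as usize
}

fn main() {
    let (rx, ry, rz) = (256u32, 256u32, 256u32);
    assert_eq!(voxel_index(0, 0, 0, rx, ry), 0);
    assert_eq!(voxel_index(255, 255, 255, rx, ry), (rx * ry * rz - 1) as usize);
    // Adjacent x voxels are adjacent in memory — friendly to coalesced access.
    assert_eq!(voxel_index(1, 0, 0, rx, ry) - voxel_index(0, 0, 0, rx, ry), 1);
}
```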


7. #[repr(C)] Contract

vx-vision/src/types.rs

Every params struct is passed to the GPU via setBytes_length_atIndex — a raw memcpy into the argument buffer. The GPU interprets the bytes according to the MSL struct layout. If the Rust and MSL layouts differ by even one byte, fields shift — silent data corruption, no error.

Rules

  1. #[repr(C)] — C-compatible field ordering. Without this, Rust may reorder fields for alignment optimization.
  2. Pod + Zeroable (bytemuck) — certifies the type is safe to transmute from raw bytes.
  3. Field order matches MSL exactly. Same names, same order, same types.
  4. Type mapping:
| Rust | MSL | Size | Alignment | Notes |
|---|---|---|---|---|
| u32 | uint | 4 | 4 | |
| i32 | int | 4 | 4 | |
| f32 | float | 4 | 4 | |
| [f32; 2] | float2 | 8 | 8 | |
| [f32; 3] + _pad: f32 | float3 | 16 | 16 | Rust [f32; 3] is only 4-byte aligned |
| [f32; 4] | float4 | 16 | 16 | |
| [u8; 4] | uchar4 | 4 | 4 | |

float3 Alignment

MSL float3 occupies 16 bytes (12 data + 4 padding) and requires 16-byte alignment. Rust’s [f32; 3] is 12 bytes with 4-byte alignment. Without explicit padding, every field after [f32; 3] is offset by 4 bytes from what the GPU expects.

#![allow(unused)]
fn main() {
// Rust — matches MSL float3 layout
#[repr(C)]
pub struct GpuPoint3D {
    pub position: [f32; 3],  // 12 bytes at offset 0
    pub _pad0: f32,          // 4 bytes → total 16, matching float3's footprint
    pub color: [u8; 4],      // 4 bytes at offset 16
    pub _pad1: [u8; 12],     // float3 needs 16-byte alignment → normal must start at 32
    pub normal: [f32; 3],    // 12 bytes at offset 32
    pub _pad2: f32,          // 4 bytes → struct size 48, same as the MSL side
}
}
// MSL — float3 members are 16-byte sized and 16-byte aligned
struct GpuPoint3D {
    float3 position;  // offset 0, 16 bytes
    uchar4 color;     // offset 16, 4 bytes (+12 bytes implicit padding before the next float3)
    float3 normal;    // offset 32, 16 bytes → total 48
};
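The layout rules can be checked with std::mem::offset_of (stable since Rust 1.77). Note that satisfying MSL's float3 alignment for a second float3 member requires padding after the preceding uchar4 as well, so normal lands at offset 32:

```rust
use std::mem::{offset_of, size_of};

// Padded to mirror an MSL struct whose members are float3, uchar4, float3.
#[repr(C)]
struct GpuPoint3D {
    position: [f32; 3], // offset 0
    _pad0: f32,         // float3 footprint = 16
    color: [u8; 4],     // offset 16
    _pad1: [u8; 12],    // push the next float3 to a 16-byte boundary
    normal: [f32; 3],   // offset 32
    _pad2: f32,         // total size 48
}

fn main() {
    assert_eq!(offset_of!(GpuPoint3D, position), 0);
    assert_eq!(offset_of!(GpuPoint3D, color), 16);  // after float3's 16-byte footprint
    assert_eq!(offset_of!(GpuPoint3D, normal), 32); // float3 alignment forces offset 32
    assert_eq!(size_of::<GpuPoint3D>(), 48);        // matches the MSL struct size
}
```

Asserting the offsets in a unit test catches a layout drift the moment a field is added or reordered, rather than as silent GPU corruption.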

8. Dispatch Model

2D Per-Pixel

Image filters and per-pixel detectors. Grid = image dimensions.

#![allow(unused)]
fn main() {
let tew = pipeline.threadExecutionWidth();            // SIMD width: 32 on Apple Silicon
let max_tg = pipeline.maxTotalThreadsPerThreadgroup(); // typically 1024
let tg_h = (max_tg / tew).max(1);                     // 1024/32 = 32

let grid = MTLSize { width: w as usize, height: h as usize, depth: 1 };
let tg   = MTLSize { width: tew, height: tg_h, depth: 1 };  // 32x32 = 1024 threads
}

threadExecutionWidth is queried rather than hardcoded because future Apple Silicon may change the SIMD width — the query returns the hardware truth. The height is derived as max_tg / tew to maximize threadgroup occupancy: a 32x32 threadgroup fills 1024 threads (the hardware max). Smaller threadgroups waste SIMD lanes at threadgroup boundaries.

1D Per-Element

Buffer-output kernels processing variable-length arrays.

#![allow(unused)]
fn main() {
let grid = MTLSize { width: n_elements, height: 1, depth: 1 };
let tg   = MTLSize { width: tew, height: 1, depth: 1 };
}

Why dispatchThreads (Non-Uniform)

dispatchThreadgroups requires grid dimensions to be a multiple of threadgroup size — manual ceil-division, easy to get wrong.

dispatchThreads takes the exact thread count. Metal internally handles partial threadgroups. Every shader still bounds-checks (if gid.x >= width return) because threads in the partial edge threadgroup may exceed the intended grid.
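The ceil-division that dispatchThreadgroups forces onto the caller — and that dispatchThreads makes unnecessary — looks like this:

```rust
/// Manual ceil-division needed when using dispatchThreadgroups;
/// dispatchThreads performs the equivalent internally.
fn threadgroups_per_grid(grid: usize, tg: usize) -> usize {
    (grid + tg - 1) / tg
}

fn main() {
    // A 641-wide image with 32-wide threadgroups: plain integer division
    // silently drops the last pixel column; ceil-division covers it.
    assert_eq!(641 / 32, 20);
    assert_eq!(threadgroups_per_grid(641, 32), 21);
    assert_eq!(threadgroups_per_grid(640, 32), 20);
}
```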

Example: a 640x480 grid with 32x32 threadgroups covers 640/32 = 20 threadgroup columns and 480/32 = 15 rows; when dimensions are not multiples of the threadgroup size, the edge threadgroups are partial, and threads with gid >= dims return early.

Performance: threadgroup sizing and occupancy. Apple Silicon GPUs execute threads in SIMD groups (wavefronts) of 32. A 32x32 threadgroup = 1024 threads = 32 SIMD groups. The GPU schedules multiple threadgroups per compute unit to hide memory latency. Smaller threadgroups reduce this parallelism. Querying maxTotalThreadsPerThreadgroup and filling it completely maximizes occupancy.


9. Build System

vx-vision/build.rs

Shader Compilation Pipeline

shaders/ (42 .metal files: FastDetect.metal, GaussianBlur.metal, ...) → xcrun metal -c -std=metal3.1 -O2 → 42 .air objects → xcrun metallib → vx.metallib in OUT_DIR → include_bytes!() embeds it in the binary.
  1. build.rs globs shaders/*.metal
  2. Each .metal.air (Apple Intermediate Representation):
    xcrun -sdk macosx metal -c -target air64-apple-macos14.0 -std=metal3.1 -O2 file.metal -o file.air
    
  3. All .air → linked vx.metallib:
    xcrun -sdk macosx metallib *.air -o vx.metallib
    
  4. Embedded at compile time:
    #![allow(unused)]
    fn main() {
    static METALLIB_BYTES: &[u8] = include_bytes!(concat!(env!("OUT_DIR"), "/vx.metallib"));
    }
  5. At runtime: written to a temp file → loaded via newLibraryWithURL_error. (Metal's in-memory loader, newLibraryWithData:, takes a dispatch_data_t rather than a plain byte slice, so a temp file plus a file URL is the simpler path from Rust.) The temp file is deleted immediately after loading.

Auto-discovery: cargo:rerun-if-changed=shaders watches the directory. New .metal files are picked up automatically — no build.rs edits needed.

Compiler flags:

  • -target air64-apple-macos14.0 — AIR bytecode for macOS 14+
  • -std=metal3.1 — MSL 3.1 (mesh shaders, ray tracing intrinsics available)
  • -O2 — full optimization (loop unrolling, dead code elimination, register allocation)
  • -Wno-unused-variable — suppress warnings from template code
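A dry-run sketch of the two xcrun invocations (argument lists only; the real build.rs spawns std::process::Command with these arguments and checks the exit status):

```rust
use std::path::Path;

/// Arguments for `xcrun` to compile one .metal file to .air.
fn metal_compile_args(metal: &Path, air: &Path) -> Vec<String> {
    vec![
        "-sdk".into(), "macosx".into(), "metal".into(),
        "-c".into(),
        "-target".into(), "air64-apple-macos14.0".into(),
        "-std=metal3.1".into(),
        "-O2".into(),
        metal.display().to_string(),
        "-o".into(), air.display().to_string(),
    ]
}

/// Arguments for `xcrun` to link all .air objects into one metallib.
fn metallib_link_args(airs: &[String], out: &Path) -> Vec<String> {
    let mut args: Vec<String> = vec!["-sdk".into(), "macosx".into(), "metallib".into()];
    args.extend(airs.iter().cloned());
    args.push("-o".into());
    args.push(out.display().to_string());
    args
}

fn main() {
    let compile = metal_compile_args(Path::new("shaders/FastDetect.metal"),
                                     Path::new("out/FastDetect.air"));
    assert!(compile.contains(&"-std=metal3.1".to_string()));
    assert!(compile.contains(&"-O2".to_string()));
    let link = metallib_link_args(&["out/FastDetect.air".to_string()],
                                  Path::new("out/vx.metallib"));
    assert!(link.contains(&"metallib".to_string()));
    println!("xcrun {}", compile.join(" "));
}
```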

10. Error Handling

vx-vision/src/error.rs

#![allow(unused)]
fn main() {
#[non_exhaustive]
pub enum Error {
    DeviceNotFound,              // MTLCreateSystemDefaultDevice() returned nil
    ShaderMissing(String),       // newFunctionWithName() returned nil — typo in kernel name
    PipelineCompile(String),     // newComputePipelineStateWithFunction_error failed — MSL error
    BufferAlloc { bytes: usize },// newBufferWithLength returned nil — out of GPU memory
    TextureSizeMismatch,         // dimension mismatch between input/output textures
    InvalidConfig(String),       // bad parameter (negative radius, zero dimensions, etc.)
    Gpu(String),                 // runtime: commandBuffer/encoder creation failed, execution error
}
}

From<String> maps string errors to Error::Gpu. This bridges vx-core (which returns Result<T, String>) into vx-vision’s typed error enum.

#[non_exhaustive] allows adding variants without breaking downstream match statements.
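A trimmed sketch of the bridge (one variant only) showing how ? converts the String error:

```rust
#[derive(Debug)]
#[non_exhaustive]
enum Error {
    Gpu(String),
    // ...other variants elided for the sketch
}

// Bridges vx-core's Result<T, String> into the typed enum, enabling `?`.
impl From<String> for Error {
    fn from(msg: String) -> Self {
        Error::Gpu(msg)
    }
}

/// Stand-in for a vx-core call that fails with a String error.
fn core_call() -> Result<u32, String> {
    Err("command buffer creation failed".to_string())
}

fn vision_call() -> Result<u32, Error> {
    let v = core_call()?; // the String auto-converts to Error::Gpu here
    Ok(v)
}

fn main() {
    match vision_call() {
        Err(Error::Gpu(msg)) => assert!(msg.contains("command buffer")),
        other => panic!("expected Error::Gpu, got {:?}", other),
    }
}
```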


11. Common Pitfalls

| Symptom | Cause | Fix |
|---|---|---|
| Stale data on CPU readback | Guard dropped before waitUntilCompleted, or never created | Guard before commit, drop after wait |
| Garbage pixel output | Format mismatch (R32Float where R8Unorm expected) | Match shader’s texture2d&lt;float&gt; access to texture format |
| Silent write failure | Input texture used as output (ShaderRead only) | Use intermediate_* or output_* |
| Shifted struct fields | [f32; 3] without _pad: f32 in repr(C) struct | Pad every float3 to 16 bytes |
| All-zero results | Atomic counter not zeroed before dispatch | count_buf.write(&[0u32]) |
| Use-after-free crash | EncodedBuffers dropped before GPU completes | Keep encoded state alive until after wait |
| “Missing shader function” | ns_string!() doesn’t match kernel void name | Exact string match between Rust and MSL |
| Threadgroup too large | Hardcoded threadgroup exceeds hardware max | Query maxTotalThreadsPerThreadgroup() |
| Incorrect buffer binding | Index in setBuffer_atIndex mismatches [[buffer(N)]] | Match Rust indices to MSL attribute indices |

Appendix A: Shader Inventory

42 Metal shader files in vx-vision/shaders/:

| File | Kernel Function(s) | Category |
|---|---|---|
| FastDetect.metal | fast_detect | Feature detection |
| HarrisResponse.metal | harris_response | Feature detection |
| NMS.metal | nms_suppress | Feature detection |
| ORBDescriptor.metal | orb_compute | Feature description |
| BruteMatcher.metal | hamming_distance, extract_matches | Feature matching |
| StereoMatch.metal | stereo_match | Stereo |
| KLTTracker.metal | klt_track_forward | Optical flow |
| DenseFlow.metal | dense_flow | Optical flow |
| GaussianBlur.metal | gaussian_blur_h, gaussian_blur_v | Filtering |
| Sobel.metal | sobel_3x3, gradient_magnitude | Edge detection |
| Canny.metal | canny_hysteresis | Edge detection |
| Threshold.metal | threshold_binary, threshold_adaptive, threshold_otsu | Segmentation |
| Morphology.metal | morph_erode, morph_dilate | Morphology |
| BilateralFilter.metal | bilateral_filter | Filtering |
| ColorConvert.metal | rgba_to_gray, gray_to_rgba, rgba_to_hsv, hsv_to_rgba | Color |
| Pyramid.metal | pyramid_downsample | Scale space |
| Resize.metal | bilinear_resize | Geometry |
| Warp.metal | warp_affine, warp_perspective | Geometry |
| IntegralImage.metal | integral_* | Analysis |
| Histogram.metal | histogram_compute | Analysis |
| HoughLines.metal | hough_vote, hough_peaks | Line detection |
| TemplateMatch.metal | template_match_ncc | Template matching |
| ConnectedComponents.metal | ccl_* | Labeling |
| DistanceTransform.metal | jfa_seed, jfa_step, jfa_distance | Distance |
| Homography.metal | score_homography | Geometry |
| IndirectArgs.metal | prepare_indirect_args | Utility |
| DoG.metal | dog_subtract, dog_extrema | Scale space |
| undistort.metal | undistort | Calibration |
| SGMStereo.metal | sgm_* | Stereo |
| DepthFilter.metal | depth_bilateral, depth_median | Depth |
| DepthInpaint.metal | depth_inpaint_* | Depth |
| DepthColorize.metal | depth_colorize | Visualization |
| DepthToCloud.metal | depth_to_cloud | 3D reconstruction |
| NormalEstimation.metal | normal_estimation_* | 3D reconstruction |
| OutlierFilter.metal | outlier_filter_* | 3D reconstruction |
| VoxelDownsample.metal | voxel_downsample_* | 3D reconstruction |
| TSDFIntegrate.metal | tsdf_integrate | Volumetric |
| TSDFRaycast.metal | tsdf_raycast | Volumetric |
| MarchingCubes.metal | marching_cubes_* | Mesh extraction |
| Triangulate.metal | triangulate_* | Geometry |
| PointCloudRender.metal | point_cloud_* | Visualization |
| MeshRender.metal | mesh_* | Visualization |