
Design Document

Internal engineering reference. Assumes familiarity with Rust and GPU compute concepts.


1. System Overview

Crate Topology

Application Layer — examples/, downstream consumers
  → vx-vision — Context · Texture · Pipeline · TexturePool · 28 kernels · 42 Metal shaders · #[repr(C)] types
  → vx-core (published as vx-gpu) — UnifiedBuffer<T> · GpuGuard<T> · device management
  → objc2 · objc2-foundation · objc2-metal · bytemuck

Two workspace crates. vx-vision imports vx-core as vx_gpu. Both crates depend on objc2-metal for Metal bindings and bytemuck for safe #[repr(C)] transmutes. No other runtime dependencies — no wgpu, no ash, no cross-platform abstraction. Metal-only by design: this eliminates translation layers and enables direct access to Apple Silicon UMA semantics that cross-platform APIs abstract away.


2. Metal–Rust Bridge

objc2-metal

Metal is an Objective-C framework. Its API surface is protocol-based — MTLDevice, MTLCommandBuffer, etc. are protocols, not concrete classes. The runtime returns opaque objects conforming to these protocols.

objc2-metal generates Rust trait definitions from Metal’s protocol headers. Each protocol becomes a Rust trait. Since the concrete type behind the protocol is unknown at compile time, the binding uses type-erased wrappers:

Retained<ProtocolObject<dyn MTLComputePipelineState>>

Three layers here:

  • dyn MTLComputePipelineState — the Metal protocol, expressed as a Rust trait object
  • ProtocolObject<_> — type erasure wrapper for ObjC protocol conformance (analogous to dyn Trait but bridging ObjC’s protocol dispatch, not Rust’s vtable dispatch)
  • Retained<_> — ARC-compatible smart pointer. Holds a +1 reference count. Sends release on drop. This is the bridge between ObjC’s reference counting and Rust’s ownership semantics.

Retained means Rust owns the object. When a kernel struct holds Retained<ProtocolObject<dyn MTLComputePipelineState>>, the pipeline state lives exactly as long as the kernel struct. No manual retain/release, no leak risk, no double-free — Rust’s drop semantics handle it.

Metal Object Roles

| Object | Created by | Role | Cost | Reuse |
|---|---|---|---|---|
| MTLDevice | MTLCreateSystemDefaultDevice() | GPU handle, factory for everything | One-time | Always |
| MTLLibrary | device.newLibraryWithURL_error() | Container of compiled shader functions | One-time (loads metallib) | Always |
| MTLComputePipelineState | device.newComputePipelineStateWithFunction_error() | Compiled, optimized kernel ready for dispatch | Expensive (shader compilation, register allocation, occupancy calculation) | Always |
| MTLCommandQueue | device.newCommandQueue() | Serial scheduler that submits command buffers to GPU | One-time | Always |
| MTLCommandBuffer | queue.commandBuffer() | Single batch of GPU work | Cheap (pool-allocated internally by Metal) | Never (one-shot) |
| MTLComputeCommandEncoder | cmd_buf.computeCommandEncoder() | Records bind + dispatch commands into a command buffer | Cheap | Never (one-shot) |

The key insight: objects above the line (Device through Queue) are created once and stored for app lifetime. Objects below the line (CommandBuffer, Encoder) are created per-dispatch and discarded after use. This is why kernel structs store pipelines but create command buffers on every call.

Send + Sync

ObjC objects don’t carry Send/Sync in Rust’s type system. Every kernel struct, Context, and Texture needs explicit unsafe impls:

unsafe impl Send for FastDetector {}
unsafe impl Sync for FastDetector {}

Justification: Metal pipeline states are immutable after creation — Apple documents them as thread-safe. MTLDevice and MTLCommandQueue are thread-safe. MTLCommandBuffer is not thread-safe, but it’s never stored in a struct — it’s created and consumed within a single method call.

Without these impls, kernel structs couldn’t be stored in Arc, passed across threads, or used in async contexts.

GPU Command Lifecycle

Device (app lifetime) → Queue (app lifetime) → CmdBuffer (per batch) → Encoder (per kernel) → setPipeline / setBuffers (bind args) → Dispatch (launch grid) → commit() → wait() (fence + block). The CmdBuffer and Encoder live in a per-dispatch scope — created and consumed in one method call.

The Rust code mapping each step:

// ── App lifetime (Context::new) ──────────────────────────────────
let device  = MTLCreateSystemDefaultDevice();           // GPU handle
let queue   = device.newCommandQueue();                 // submission queue
let library = device.newLibraryWithURL_error(&url);     // compiled shaders

// ── App lifetime (Kernel::new) ───────────────────────────────────
let func     = library.newFunctionWithName(ns_string!("fast_detect"));  // function handle
let pipeline = device.newComputePipelineStateWithFunction_error(&func); // compiled kernel
// ↑ This is the expensive step: Metal compiles the function to GPU ISA,
//   determines register usage, calculates max occupancy. ~1-10ms per kernel.

// ── Per dispatch (Kernel::detect / apply / compute) ──────────────
let cmd_buf = queue.commandBuffer();                    // lightweight, pool-allocated
let encoder = cmd_buf.computeCommandEncoder();          // begins a compute pass

encoder.setComputePipelineState(&pipeline);             // which kernel to run
encoder.setTexture_atIndex(Some(tex.raw()), 0);         // [[texture(0)]] in shader
encoder.setBuffer_offset_atIndex(                       // [[buffer(0)]] in shader
    Some(buf.metal_buffer()), 0, 0);
encoder.setBytes_length_atIndex(                        // [[buffer(2)]] — inline constant data
    NonNull::new_unchecked(params_ptr), size, 2);       // memcpy'd into argument buffer
encoder.dispatchThreads_threadsPerThreadgroup(grid, tg); // launch the grid
encoder.endEncoding();                                   // seal this compute pass

cmd_buf.commit();              // submit to GPU — non-blocking, returns immediately
cmd_buf.waitUntilCompleted();  // block CPU until GPU finishes this command buffer
// After this point: all buffer/texture writes by the GPU are visible to CPU.

setBytes copies the struct directly into Metal’s argument buffer (max 4KB), avoiding a MTLBuffer allocation for small constant data. Every kernel uses this for its Params struct. For large data (keypoint arrays, descriptors), setBuffer is required — it binds an existing MTLBuffer by reference.


3. Memory Model

Apple Silicon UMA

Apple Silicon uses Unified Memory Architecture — CPU and GPU share the same physical DRAM. No PCIe bus. No DMA copies. A pointer to shared memory is valid on both processors.

MTLStorageModeShared exposes this: one allocation, one virtual address space, zero-copy access from both CPU and GPU. This is the only storage mode vx-rs uses.

Trade-off: On discrete GPUs (NVIDIA/AMD), StorageModeShared would require PCIe transfers on every access. On Apple Silicon, it’s free — both CPU and GPU access the same cache hierarchy. This is a deliberate platform bet: zero-copy UMA is the core performance advantage that makes the “no cross-platform” design worthwhile.

Synchronization model: After cmd_buf.waitUntilCompleted(), all GPU writes are visible to CPU. No explicit cache flush needed — Metal’s command buffer completion acts as a full memory fence. Before that fence, CPU reads may see stale data (writes can be in GPU caches or reorder buffers).

UnifiedBuffer<T>

vx-core/src/buffer.rs

pub struct UnifiedBuffer<T: Pod> {
    raw: Retained<ProtocolObject<dyn MTLBuffer>>,  // the Metal buffer (+1 refcount)
    count: usize,                                   // number of T elements
    in_flight: Arc<AtomicBool>,                     // prevents CPU mutation during GPU work
    _marker: PhantomData<T>,                        // carries T without owning one
}

The Pod bound (bytemuck::Pod) guarantees the type has no padding traps, no invalid bit patterns, and is safe to interpret from arbitrary bytes. Metal writes raw bytes into the buffer — Pod ensures the CPU can safely interpret them as T after GPU completion. Every #[repr(C)] struct in types.rs derives Pod.

Allocation: newBufferWithLength_options(count * size_of::<T>(), StorageModeShared). Metal zero-initializes shared buffers on allocation. This is relied upon — atomic counter buffers start at zero by default (though kernels explicitly write [0u32] before dispatch for clarity).

CPU access paths:

  • as_slice() -> &[T] — casts MTLBuffer.contents() pointer to &[T]. No copy. The slice directly aliases GPU-visible memory.
  • as_mut_slice() -> &mut [T] — same, but asserts !in_flight. Panics if a GpuGuard exists. This prevents writing to memory the GPU is reading from.
  • write(&[T]) — copy_from_slice into the buffer. Panics if in-flight.
  • to_vec() -> Vec<T> — copies out. Use after GPU completion for safe owned data.
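A CPU-only sketch of the cast behind as_slice(): the CornerPoint layout here is hypothetical, and raw pointer casts stand in for bytemuck, but the soundness argument is the same — every bit pattern must be a valid value of T.

```rust
use std::{mem, slice};

// Hypothetical GPU result type: #[repr(C)], no padding, every bit pattern
// valid — exactly what the Pod bound certifies.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
struct CornerPoint {
    x: u32,
    y: u32,
    score: f32,
    _pad: f32,
}

// What as_slice() does under the hood: reinterpret the buffer's raw
// contents() pointer as &[T]. Sound only for Pod types.
fn view_as_corners(bytes: &[u8]) -> &[CornerPoint] {
    assert_eq!(bytes.len() % mem::size_of::<CornerPoint>(), 0);
    assert_eq!(bytes.as_ptr() as usize % mem::align_of::<CornerPoint>(), 0);
    unsafe {
        slice::from_raw_parts(
            bytes.as_ptr() as *const CornerPoint,
            bytes.len() / mem::size_of::<CornerPoint>(),
        )
    }
}

fn main() {
    // Stand-in for a StorageModeShared allocation the GPU has written into.
    let backing = [
        CornerPoint { x: 10, y: 20, score: 0.5, _pad: 0.0 },
        CornerPoint { x: 30, y: 40, score: 0.9, _pad: 0.0 },
    ];
    // The byte view is what MTLBuffer.contents() exposes.
    let bytes: &[u8] = unsafe {
        slice::from_raw_parts(backing.as_ptr() as *const u8, mem::size_of_val(&backing))
    };

    let corners = view_as_corners(bytes);
    assert_eq!(corners.len(), 2);
    assert_eq!(corners[1].x, 30);
}
```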

GpuGuard<T>

vx-core/src/buffer.rs

RAII guard that blocks CPU mutation while the buffer is in-flight on the GPU.

pub struct GpuGuard<T: Pod> {
    in_flight: Arc<AtomicBool>,  // shared with the UnifiedBuffer
    _marker: PhantomData<T>,
}

// Created: buf.gpu_guard() → sets in_flight = true (Release)
// Dropped: sets in_flight = false (Release)
// Checked: as_mut_slice() loads in_flight (Acquire) → panics if true

Atomic ordering:

  • store(true, Release) on guard creation — ensures all CPU writes to the buffer before creating the guard are visible before the flag becomes true. The GPU sees consistent data.
  • load(Acquire) in as_mut_slice() — ensures the CPU doesn’t reorder reads after the check to before it. If the flag is false, the GPU is done and all its writes are visible.
  • store(false, Release) on drop — pairs with the Acquire in as_mut_slice() to establish happens-before.

This is not a mutex — it’s a one-way gate. No contention, no blocking, just a panic if the contract is violated. The real synchronization is waitUntilCompleted() on the command buffer.
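The flag protocol can be sketched without any Metal involvement. A minimal model (the panic is modeled as a Result so it can be tested; names follow the struct above):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// CPU-only model of the guard's one-way gate.
struct GpuGuard {
    in_flight: Arc<AtomicBool>,
}

impl GpuGuard {
    fn new(flag: &Arc<AtomicBool>) -> Self {
        // Release: all prior CPU writes to the buffer become visible
        // before the flag reads true.
        flag.store(true, Ordering::Release);
        GpuGuard { in_flight: Arc::clone(flag) }
    }
}

impl Drop for GpuGuard {
    fn drop(&mut self) {
        // Release on drop pairs with the Acquire check below.
        self.in_flight.store(false, Ordering::Release);
    }
}

// Models as_mut_slice()'s check; the real code panics instead.
fn try_mut_access(flag: &Arc<AtomicBool>) -> Result<(), &'static str> {
    if flag.load(Ordering::Acquire) {
        return Err("buffer is in-flight on the GPU");
    }
    Ok(())
}

fn main() {
    let in_flight = Arc::new(AtomicBool::new(false));
    {
        let _guard = GpuGuard::new(&in_flight);
        assert!(try_mut_access(&in_flight).is_err()); // locked while guarded
    } // _guard dropped → flag cleared
    assert!(try_mut_access(&in_flight).is_ok());
}
```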

Memory Lifecycle

Alloc (CPU) → Write (CPU) → Guard (lock) → Dispatch (GPU) → Wait (fence) → Drop Guard (unlock) → Read (CPU). CPU mutation panics anywhere in the Guard → Drop Guard window.

Invariant: Guard created before commit. Dropped after waitUntilCompleted. Reversing this allows CPU reads of partially-written GPU data — silent corruption, not a crash.


4. Texture Subsystem

vx-vision/src/texture.rs

pub struct Texture {
    raw: Retained<ProtocolObject<dyn MTLTextureTrait>>,
    width: u32,
    height: u32,
    format: TextureFormat,
}

Textures are used over buffers for image data because Metal’s texture hardware provides spatial locality optimization (tiled/swizzled memory layout), hardware bilinear interpolation, and automatic [0,255] → [0.0,1.0] normalization on R8Unorm reads. A buffer has linear memory layout — worse cache behavior for 2D access patterns like convolution kernels.

Formats

| Format | MTLPixelFormat | Bytes/px | Readback | Shader behavior |
|---|---|---|---|---|
| R8Unorm | R8Unorm | 1 | read_gray8() → Vec<u8> | image.read(gid).r returns [0.0, 1.0] |
| R32Float | R32Float | 4 | read_r32float() → Vec<f32> | image.read(gid).r returns the raw float |
| RGBA8Unorm | RGBA8Unorm | 4 | read_rgba8() → Vec<u8> | .rgba returns 4 normalized channels |

R8Unorm normalization matters: FAST multiplies by 255 to get integer intensity for threshold comparison. Sobel operates directly in [0,1] space. Every shader must account for the format it reads.
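The two conventions side by side as a CPU sketch (helper names are illustrative, not part of the API):

```rust
// R8Unorm reads arrive normalized; FAST rescales back to integer intensity,
// Sobel stays in [0, 1] space.
fn r8_read(stored: u8) -> f32 {
    stored as f32 / 255.0 // what image.read(gid).r returns for R8Unorm
}

fn fast_intensity(normalized: f32) -> u32 {
    (normalized * 255.0).round() as u32 // recover the integer pixel value
}

fn main() {
    assert_eq!(fast_intensity(r8_read(128)), 128); // round-trips exactly
    assert!(r8_read(255) == 1.0 && r8_read(0) == 0.0);
}
```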

Usage Flags

| Role | Flag | Why |
|---|---|---|
| Input | ShaderRead | GPU reads, CPU uploads via replaceRegion. Cheapest — Metal can optimize read-only layout. |
| Output | ShaderWrite | GPU writes. Cannot be sampled in the same pass. |
| Intermediate | ShaderRead \| ShaderWrite | Read in one pass, written in another. Required for multi-pass kernels (Gaussian H→V, Sobel→magnitude). |

Writing to a ShaderRead-only texture is undefined behavior — no GPU error, just corrupted output. Metal does not validate usage flags at dispatch time.

Upload & Readback

Both use MTLTexture’s replaceRegion (upload) and getBytes (readback). These are synchronous CPU-side memcpy operations. bytesPerRow must match: width * 1 for R8, width * 4 for R32Float and RGBA8.
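The bytesPerRow rule as code — a minimal sketch, with the format enum assumed from the formats table above:

```rust
// Per-format byte widths; variants mirror the Formats table.
#[derive(Clone, Copy)]
enum TextureFormat {
    R8Unorm,
    R32Float,
    Rgba8Unorm,
}

fn bytes_per_pixel(format: TextureFormat) -> usize {
    match format {
        TextureFormat::R8Unorm => 1,
        TextureFormat::R32Float | TextureFormat::Rgba8Unorm => 4,
    }
}

// The value passed as bytesPerRow to replaceRegion / getBytes.
fn bytes_per_row(format: TextureFormat, width: usize) -> usize {
    width * bytes_per_pixel(format)
}

fn main() {
    assert_eq!(bytes_per_row(TextureFormat::R8Unorm, 640), 640);
    assert_eq!(bytes_per_row(TextureFormat::R32Float, 640), 2560);
}
```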

Performance note: These operations copy data. For input textures, the copy happens once at creation. For output readback, it happens after GPU completion. In a real-time pipeline, prefer keeping data in textures across kernel stages rather than reading back to CPU between stages — use Pipeline for this.

TexturePool

vx-vision/src/pool.rs

pub struct TexturePool {
    buckets: HashMap<(u32, u32, TextureFormat), Vec<Texture>>,
    max_per_bucket: usize,  // default: 8
}

MTLTexture allocation is non-trivial — Metal must find contiguous GPU memory, configure tiling parameters, and set up page tables. In a real-time pipeline processing 30fps video, allocating and freeing intermediate textures per frame wastes time.

The pool caches by (width, height, format). acquire() pops from the matching bucket or allocates fresh. release() pushes back. Pool textures always have ShaderRead | ShaderWrite — they may serve as either input or output in different pipeline stages.

hit_rate() tracks effectiveness. If it’s low, the pipeline is using too many distinct texture sizes.
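A CPU-only sketch of the bucket logic (the real pool also keys on TextureFormat and allocates through the Metal device; both are omitted here, and the hit counters are an assumed implementation detail):

```rust
use std::collections::HashMap;

// Stand-in for the real Texture type.
#[derive(Clone)]
struct Texture { width: u32, height: u32 }

#[derive(Default)]
struct TexturePool {
    buckets: HashMap<(u32, u32), Vec<Texture>>,
    max_per_bucket: usize,
    hits: u64,
    requests: u64,
}

impl TexturePool {
    fn acquire(&mut self, w: u32, h: u32) -> Texture {
        self.requests += 1;
        if let Some(tex) = self.buckets.get_mut(&(w, h)).and_then(|b| b.pop()) {
            self.hits += 1;
            tex // reuse a cached texture — no Metal allocation
        } else {
            Texture { width: w, height: h } // allocate fresh
        }
    }

    fn release(&mut self, tex: Texture) {
        let bucket = self.buckets.entry((tex.width, tex.height)).or_default();
        if bucket.len() < self.max_per_bucket {
            bucket.push(tex); // cache for reuse; over-full buckets just drop
        }
    }

    fn hit_rate(&self) -> f64 {
        if self.requests == 0 { 0.0 } else { self.hits as f64 / self.requests as f64 }
    }
}

fn main() {
    let mut pool = TexturePool { max_per_bucket: 8, ..Default::default() };
    let t = pool.acquire(640, 480);  // miss → fresh allocation
    pool.release(t);
    let _t = pool.acquire(640, 480); // hit → reused
    assert_eq!(pool.hit_rate(), 0.5);
}
```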


5. Context & Pipeline

Context

vx-vision/src/context.rs

pub struct Context {
    device: Retained<ProtocolObject<dyn MTLDevice>>,
    queue: Retained<ProtocolObject<dyn MTLCommandQueue>>,
    library: Retained<ProtocolObject<dyn MTLLibrary>>,
}

Single entry point. Context::new() does three things:

  1. MTLCreateSystemDefaultDevice() — gets the system GPU
  2. device.newCommandQueue() — creates the submission queue
  3. Loads the embedded metallib (see Build System)

device(), queue(), library() are pub(crate) — kernel code uses them internally, but downstream consumers only interact through Context and kernel structs.

One queue per Context. Metal command queues are serial — command buffers submitted to the same queue execute in order. This simplifies synchronization: no explicit fences between dependent dispatches. For parallel kernel execution, create multiple Context instances (each with its own queue).

Pipeline

vx-vision/src/pipeline.rs

pub struct Pipeline {
    cmd_buf: Retained<ProtocolObject<dyn MTLCommandBuffer>>,
    retained: Vec<Texture>,
    committed: bool,
}

Batches multiple kernel dispatches into a single MTLCommandBuffer. This matters because command buffer creation has overhead — Metal must acquire a buffer from its internal pool, set up completion handlers, and track resources.

Texture retention problem: When a kernel’s encode() creates an intermediate texture, that texture would be dropped when encode() returns. But the GPU hasn’t run yet — the command buffer only records commands, it doesn’t execute them until commit(). If the texture is freed, the GPU reads garbage.

retain(tex) moves the texture into the pipeline’s Vec<Texture>, keeping it alive until commit_and_wait() returns.

Pipeline Batching Flow

Pipeline (single MTLCommandBuffer): begin() → gaussian.encode() (encoder 1 → endEncoding) → retain(tex) → fast.encode() (encoder 2 → endEncoding) → commit_and_wait(). Multiple encoders within one command buffer; Metal executes the passes sequentially. A single commit/wait amortizes submission overhead across all passes.

6. Kernel Taxonomy

Every kernel struct follows the same construction pattern:

  1. new(&Context) — compile pipeline(s) from the metallib. Store as Retained<...>.
  2. Sync method (detect / apply / compute) — allocate per-dispatch resources, encode, commit, wait, readback.
  3. encode(...) — record commands without committing. For Pipeline batching.
  4. unsafe impl Send + Sync — pipeline states are immutable, thread-safe.

The difference between kernel categories is what goes in and what comes out.

6.1 Buffer-Output Kernels

Variable-length structured output. The shader decides at runtime how many results to emit.

Mechanism: A device atomic_uint* counter buffer starts at zero. Each thread that produces a result atomically increments the counter and writes to that slot:

uint slot = atomic_fetch_add_explicit(count, 1, memory_order_relaxed);
if (slot < params.max_corners) {
    corners[slot] = result;
}

memory_order_relaxed is sufficient — we only need the atomic increment to be unique, not ordered relative to other atomics. The slot assignment is the synchronization.

An alternative is prefix scan (parallel exclusive scan) for deterministic output ordering, but that requires two passes and shared memory coordination. Atomic append is one pass, simpler, and sufficient — output ordering isn’t needed since downstream kernels (Harris, NMS) operate on the full set regardless of order.
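The same append pattern can be exercised on the CPU with std atomics. A sketch with hypothetical names, mirroring the counter clamp that detect() performs on readback:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// CPU analogue of the shader's atomic-append: each worker claims a unique
// slot with fetch_add; writers that overflow capacity are dropped, and
// readback clamps the counter.
fn simulate_append(workers: u32, capacity: usize) -> (usize, Vec<u32>) {
    let count = Arc::new(AtomicU32::new(0)); // the atomic_uint* counter buffer
    let slots = Arc::new((0..capacity).map(|_| AtomicU32::new(0)).collect::<Vec<_>>());

    let handles: Vec<_> = (0..workers)
        .map(|i| {
            let count = Arc::clone(&count);
            let slots = Arc::clone(&slots);
            thread::spawn(move || {
                // Relaxed is enough: only uniqueness of the slot matters.
                let slot = count.fetch_add(1, Ordering::Relaxed) as usize;
                if slot < capacity {
                    slots[slot].store(100 + i, Ordering::Relaxed); // write result
                } // overflowing writers are dropped; the counter still advances
            })
        })
        .collect();
    handles.into_iter().for_each(|h| h.join().unwrap());

    // Clamp on readback — the counter may exceed capacity.
    let n = (count.load(Ordering::Relaxed) as usize).min(capacity);
    let out = slots[..n].iter().map(|s| s.load(Ordering::Relaxed)).collect();
    (n, out)
}

fn main() {
    let (n, results) = simulate_append(16, 8);
    assert_eq!(n, 8);                           // counter clamped to capacity
    assert!(results.iter().all(|&v| v >= 100)); // every kept slot was written
}
```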

Texture (R8Unorm) + constant Params& → Compute Shader (2D grid: 1 thread/pixel, or 1D: 1 thread/element; atomic_fetch_add → slot) → Buffer<T>[N] + atomic_uint count → CPU readback: buf[..count].to_vec()

Annotated walkthrough: FastDetector::detect() (vx-vision/src/kernels/fast.rs)

pub fn detect(&self, ctx: &Context, texture: &Texture, config: &FastDetectConfig)
    -> Result<FastDetectResult>
{
    let w = texture.width();
    let h = texture.height();

    // Pre-allocate max capacity. The GPU fills [0..actual_count].
    // Over-allocation is cheap on UMA — no copy, just page table entries.
    let corner_buf = UnifiedBuffer::<CornerPoint>::new(ctx.device(), config.max_corners as usize)?;
    let mut count_buf = UnifiedBuffer::<u32>::new(ctx.device(), 1)?;
    count_buf.write(&[0u32]);  // Zero the atomic counter. Forgetting this → stale count from
                                // previous dispatch or random memory, causing buffer overread.

    // Build the params struct. Field order must exactly match FASTParams in FastDetect.metal.
    let params = FASTParams { threshold: config.threshold, max_corners: config.max_corners,
                              width: w, height: h };

    // Guards BEFORE dispatch — marks buffers as in-flight.
    let _corner_guard = corner_buf.gpu_guard();
    let _count_guard  = count_buf.gpu_guard();

    // Create command buffer and encoder.
    let cmd_buf = ctx.queue().commandBuffer()
        .ok_or(Error::Gpu("failed to create command buffer".into()))?;
    let encoder = cmd_buf.computeCommandEncoder()
        .ok_or(Error::Gpu("failed to create compute encoder".into()))?;

    // Bind pipeline, textures, buffers, params.
    // Index arguments (0, 1, 2) must match [[texture(0)]], [[buffer(0)]], [[buffer(1)]], [[buffer(2)]]
    // in the .metal file. A mismatch binds the wrong data — silent corruption, not a crash.
    encoder.setComputePipelineState(&self.pipeline);
    encoder.setTexture_atIndex(Some(texture.raw()), 0);
    encoder.setBuffer_offset_atIndex(Some(corner_buf.metal_buffer()), 0, 0);
    encoder.setBuffer_offset_atIndex(Some(count_buf.metal_buffer()), 0, 1);
    encoder.setBytes_length_atIndex(
        NonNull::new_unchecked(&params as *const _ as *mut c_void),
        mem::size_of::<FASTParams>(), 2);

    // 2D dispatch: one thread per pixel. threadExecutionWidth is the SIMD width (32 on Apple Silicon).
    // maxTotalThreadsPerThreadgroup is typically 1024. So threadgroup = 32x32 = 1024 threads.
    // Edge threadgroups are partial — the shader's bounds check handles out-of-range gids.
    let tew = self.pipeline.threadExecutionWidth();
    let max_tg = self.pipeline.maxTotalThreadsPerThreadgroup();
    let grid = MTLSize { width: w as usize, height: h as usize, depth: 1 };
    let tg = MTLSize { width: tew, height: (max_tg / tew).max(1), depth: 1 };
    encoder.dispatchThreads_threadsPerThreadgroup(grid, tg);
    encoder.endEncoding();

    cmd_buf.commit();
    cmd_buf.waitUntilCompleted();  // blocks until GPU finishes

    // Drop guards → unlocks buffers for CPU access.
    drop(_corner_guard);
    drop(_count_guard);

    // Read actual count, clamp to max (shader may have over-incremented the atomic
    // if more corners found than buffer capacity), slice the buffer.
    let n = (count_buf.as_slice()[0] as usize).min(config.max_corners as usize);
    Ok(FastDetectResult { corners: corner_buf.as_slice()[..n].to_vec() })
}

Buffer-output kernel catalog:

| Kernel | Struct | Shader | Dispatch | Output type |
|---|---|---|---|---|
| FAST-9 | FastDetector | fast_detect | 2D per-pixel | Vec<CornerPoint> |
| Harris | HarrisScorer | harris_response | 1D per-corner | Vec<CornerPoint> (scored) |
| NMS | NmsSuppressor | nms_suppress | 1D per-corner | Vec<CornerPoint> (filtered) |
| ORB | OrbDescriptor | orb_compute | 1D per-keypoint | Vec<ORBOutput> |
| Matcher | BruteMatcher | hamming_distance + extract_matches | 2D + 1D | Vec<MatchResult> |
| StereoMatch | StereoMatcher | stereo_match | 2D | Vec<StereoMatchResult> |
| Histogram | HistogramComputer | histogram_compute | 2D per-pixel | Vec<u32> (256 bins) |
| Hough | HoughDetector | hough_vote + hough_peaks | 2D + 1D | Vec<HoughLine> |
| Homography | HomographyScorer | score_homography | 1D per-point | Vec<ScoreResult> |

6.2 Texture-to-Texture Kernels

Fixed-size output. One output pixel per input pixel (or per output pixel for resize/warp). No atomic counters, no variable-length output.

Input Texture [[texture(0)]] → Compute Shader (2D grid, 1 thread/pixel: read → transform → write) → Output Texture [[texture(1)]]

Annotated walkthrough: GaussianBlur::apply() (vx-vision/src/kernels/gaussian.rs)

Separable 2-pass convolution. A 2D Gaussian with radius r requires (2r+1)^2 reads per pixel. Separating into H+V passes reduces to 2*(2r+1) reads — O(r) instead of O(r²).

pub fn apply(&self, ctx: &Context, input: &Texture, output: &Texture,
             config: &GaussianConfig) -> Result<()>
{
    let w = input.width();
    let h = input.height();

    // Intermediate texture for horizontal pass output / vertical pass input.
    // Must be ShaderRead|ShaderWrite — written by pass 1, read by pass 2.
    // R32Float avoids precision loss during intermediate accumulation.
    let intermediate = Texture::intermediate_r32float(ctx.device(), w, h)?;

    let params = GaussianParams { width: w, height: h, sigma: config.sigma, radius: config.radius };
    let cmd_buf = ctx.queue().commandBuffer()?;

    // Pass 1: horizontal blur. Each thread reads (2*radius+1) horizontal neighbors.
    // Clamp-to-edge boundary: shader does clamp(x + dx, 0, width-1).
    // Weights computed inline: exp(-dx²/(2σ²)). Not precomputed — ALU is faster than
    // the memory load that a weight LUT would require at these kernel sizes.
    {
        let enc = cmd_buf.computeCommandEncoder()?;
        enc.setComputePipelineState(&self.h_pipeline);
        enc.setTexture_atIndex(Some(input.raw()), 0);        // source
        enc.setTexture_atIndex(Some(intermediate.raw()), 1);  // dest
        enc.setBytes_length_atIndex(/* params */, /* size */, 0);
        enc.dispatchThreads_threadsPerThreadgroup(grid_2d(w, h), tg_2d(&self.h_pipeline));
        enc.endEncoding();  // seal pass 1
    }

    // Pass 2: vertical blur on the horizontally-blurred intermediate.
    // Metal guarantees pass 2 sees pass 1's writes — encoders within a command buffer
    // execute in order with implicit barriers between them.
    {
        let enc = cmd_buf.computeCommandEncoder()?;
        enc.setComputePipelineState(&self.v_pipeline);
        enc.setTexture_atIndex(Some(intermediate.raw()), 0);  // source (pass 1 output)
        enc.setTexture_atIndex(Some(output.raw()), 1);         // final dest
        enc.setBytes_length_atIndex(/* params */, /* size */, 0);
        enc.dispatchThreads_threadsPerThreadgroup(grid_2d(w, h), tg_2d(&self.v_pipeline));
        enc.endEncoding();
    }

    cmd_buf.commit();
    cmd_buf.waitUntilCompleted();
    // intermediate texture dropped here — its only purpose was the H→V handoff.
    Ok(())
}
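The separability claim is easy to sanity-check on the CPU: a full 2D Gaussian pass and an H-then-V pass must agree exactly (clamp-to-edge included, since clamping in x is independent of y). A toy sketch on a single-channel image:

```rust
// 1D Gaussian weights, normalized to sum to 1.
fn weights(radius: i32, sigma: f32) -> Vec<f32> {
    let mut w: Vec<f32> = (-radius..=radius)
        .map(|d| (-((d * d) as f32) / (2.0 * sigma * sigma)).exp())
        .collect();
    let sum: f32 = w.iter().sum();
    w.iter_mut().for_each(|x| *x /= sum);
    w
}

// One horizontal pass with clamp-to-edge, as in the shader.
fn blur_rows(img: &[Vec<f32>], w: &[f32], r: i32) -> Vec<Vec<f32>> {
    let wid = img[0].len() as i32;
    img.iter()
        .map(|row| {
            (0..wid)
                .map(|x| {
                    (-r..=r)
                        .map(|d| w[(d + r) as usize] * row[(x + d).clamp(0, wid - 1) as usize])
                        .sum()
                })
                .collect()
        })
        .collect()
}

fn transpose(img: &[Vec<f32>]) -> Vec<Vec<f32>> {
    (0..img[0].len())
        .map(|x| img.iter().map(|row| row[x]).collect())
        .collect()
}

fn main() {
    let (r, sigma) = (1, 1.0);
    let w = weights(r, sigma);
    let img = vec![
        vec![0.0, 0.2, 0.4, 0.6],
        vec![0.1, 0.3, 0.5, 0.7],
        vec![0.9, 0.8, 0.7, 0.6],
    ];

    // Separable: H pass, then V pass (rows of the transpose) — 2*(2r+1) reads.
    let sep = transpose(&blur_rows(&transpose(&blur_rows(&img, &w, r)), &w, r));

    // Reference: full 2D kernel — (2r+1)^2 reads per pixel.
    let (hgt, wid) = (img.len() as i32, img[0].len() as i32);
    for y in 0..hgt {
        for x in 0..wid {
            let full: f32 = (-r..=r)
                .flat_map(|dy| (-r..=r).map(move |dx| (dy, dx)))
                .map(|(dy, dx)| {
                    w[(dy + r) as usize] * w[(dx + r) as usize]
                        * img[(y + dy).clamp(0, hgt - 1) as usize]
                             [(x + dx).clamp(0, wid - 1) as usize]
                })
                .sum();
            assert!((full - sep[y as usize][x as usize]).abs() < 1e-5);
        }
    }
}
```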

Multi-pass flow:

Input → Pass 1 (H): gaussian_blur_h → Intermediate (R32Float, Read|Write) → Pass 2 (V): gaussian_blur_v → Output

Texture-to-texture kernel catalog:

| Kernel | Struct | Shader(s) | Passes | Notes |
|---|---|---|---|---|
| Gaussian | GaussianBlur | gaussian_blur_h, gaussian_blur_v | 2 | Separable, O(r) vs O(r²) |
| Sobel | SobelFilter | sobel_3x3, gradient_magnitude | 2 | Outputs: gx, gy, magnitude, direction |
| Canny | CannyDetector | Sobel + canny_hysteresis | 3 | Composes Sobel internally |
| Threshold | ThresholdFilter | threshold_binary / adaptive / otsu | 1–2 | Otsu needs histogram first |
| Color | ColorConverter | rgba_to_gray, gray_to_rgba, rgba_to_hsv, hsv_to_rgba | 1 | Per-pixel, no neighbors |
| Morphology | MorphFilter | morph_erode, morph_dilate | 1–2 | Open = erode+dilate |
| Pyramid | ImagePyramidBuilder | pyramid_downsample | N | 4 levels = 3 downsamples |
| Resize | ResizeFilter | bilinear_resize | 1 | Grid = output dimensions |
| Warp | WarpFilter | warp_affine, warp_perspective | 1 | Inverse transform per pixel |
| Bilateral | BilateralFilter | bilateral_filter | 1 | O(r²) — not separable |
| Dense Flow | DenseFlowComputer | dense_flow | 1 | Horn-Schunck |
| Connected | ConnectedComponents | ccl_* | iterative | Label propagation until convergence |
| Distance | DistanceTransform | jfa_seed, jfa_step, jfa_distance | 2+N | Jump Flooding, O(log n) passes |
| Template | TemplateMatcher | template_match_ncc | 1 | NCC score map |
| Integral | IntegralImage | integral_* | multi | Prefix sum (row then column) |

6.3 Hybrid Kernels

Consume textures, produce buffer output — or both.

KLT Optical Flow (vx-vision/src/kernels/klt.rs): Binds 8 textures (4-level pyramid × 2 frames) at indices 0–7, plus 3 buffers (prev_points, curr_points, status). Dispatch is 1D: one thread per point. Each thread does iterative Lucas-Kanade at each pyramid level (coarse-to-fine), reading texture neighborhoods and writing final position + tracking status to buffers.

DoG (Difference-of-Gaussians): Two-phase. First phase subtracts adjacent scale-space textures (texture→texture). Second phase finds 3D extrema across scale and space, appending DoGKeypoint to a buffer via atomic counter (texture→buffer).

6.4 Utility Kernels

IndirectDispatch (vx-vision/src/kernels/indirect.rs): Solves the “CPU round-trip” problem in FAST→Harris chaining.

FAST produces N corners (unknown until GPU completes). Without indirect dispatch: commit FAST → wait → read count → dispatch Harris with count. Two command buffers, one CPU stall.

With indirect dispatch: prepare_indirect_args reads the atomic counter and computes MTLDispatchThreadgroupsIndirectArguments on the GPU. Harris then dispatches using dispatchThreadgroupsWithIndirectBuffer — the GPU reads the thread count from the args buffer directly. Everything stays in one command buffer, zero CPU round-trips.

Implementation note: objc2-metal doesn’t bind dispatchThreadgroupsWithIndirectBuffer. The code uses raw msg_send! to call it.

6.5 3D Reconstruction Kernels

Feature-gated: #[cfg(feature = "reconstruction")].

CategoryKernelsPattern
Depth processingDepthFilter (bilateral + median), DepthInpaint (JFA hole-fill), DepthColorizetexture→texture
StereoSGMStereo (Semi-Global Matching)texture→texture (disparity map)
Point cloudDepthToCloud (unprojection), OutlierFilter, VoxelDownsample, NormalEstimationtexture/buffer→buffer
VolumetricTSDFIntegrate, TSDFRaycast, MarchingCubesbuffer→buffer (3D voxel grids)
GeometryTriangulatebuffer→buffer

Same patterns as core kernels. TSDF uses 3D buffer indexing (res_x * res_y * res_z voxels). MarchingCubes outputs triangle meshes to buffers with atomic vertex/index counters.


7. #[repr(C)] Contract

vx-vision/src/types.rs

Every params struct is passed to the GPU via setBytes_length_atIndex — a raw memcpy into the argument buffer. The GPU interprets the bytes according to the MSL struct layout. If the Rust and MSL layouts differ by even one byte, fields shift — silent data corruption, no error.

Rules

  1. #[repr(C)] — C-compatible field ordering. Without this, Rust may reorder fields for alignment optimization.
  2. Pod + Zeroable (bytemuck) — certifies the type is safe to transmute from raw bytes.
  3. Field order matches MSL exactly. Same names, same order, same types.
  4. Type mapping:
| Rust | MSL | Size | Alignment | Notes |
|---|---|---|---|---|
| u32 | uint | 4 | 4 | |
| i32 | int | 4 | 4 | |
| f32 | float | 4 | 4 | |
| [f32; 2] | float2 | 8 | 8 | |
| [f32; 3] + _pad: f32 | float3 | 16 | 16 | Rust [f32; 3] is only 4-byte aligned |
| [f32; 4] | float4 | 16 | 16 | |
| [u8; 4] | uchar4 | 4 | 4 | |

float3 Alignment

MSL float3 occupies 16 bytes (12 data + 4 padding) and requires 16-byte alignment. Rust’s [f32; 3] is 12 bytes with 4-byte alignment. Without explicit padding, every field after [f32; 3] is offset by 4 bytes from what the GPU expects.
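Layout drift is cheap to catch with assertions. A sketch using a hypothetical BlurParams (all 4-byte scalars, so Rust and MSL agree trivially) alongside the [f32; 3] trap itself:

```rust
use std::mem::{align_of, size_of};

// Hypothetical params struct: four 4-byte scalars, so the Rust #[repr(C)]
// layout and the MSL layout coincide with no padding.
#[repr(C)]
#[derive(Clone, Copy)]
struct BlurParams {
    width: u32,  // uint
    height: u32, // uint
    sigma: f32,  // float
    radius: i32, // int
}

fn main() {
    // A mismatch here would mean silent GPU corruption; an assert is cheaper.
    assert_eq!(size_of::<BlurParams>(), 16);
    assert_eq!(align_of::<BlurParams>(), 4);

    // The float3 trap in numbers: Rust's [f32; 3] is 12 bytes, 4-byte aligned,
    // while MSL float3 is 16 bytes, 16-byte aligned.
    assert_eq!(size_of::<[f32; 3]>(), 12);
    assert_eq!(align_of::<[f32; 3]>(), 4);
}
```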

// Rust — matches the MSL float3 layout below. Note that float3's 16-byte
// alignment pushes `normal` to offset 32, so 12 bytes of padding must
// follow `color` as well.
#[repr(C)]
pub struct GpuPoint3D {
    pub position: [f32; 3],  // 12 bytes at offset 0
    pub _pad0: f32,          // → 16, matching float3's footprint
    pub color: [u8; 4],      // offset 16
    pub _pad1: [f32; 3],     // 12 bytes: aligns `normal` to the next 16-byte boundary
    pub normal: [f32; 3],    // 12 bytes at offset 32
    pub _pad2: f32,          // → 48, matching the MSL struct's total size
}

// MSL — the GPU's view of the same bytes
struct GpuPoint3D {
    float3 position;  // offset 0  (16 bytes including padding)
    uchar4 color;     // offset 16 (4 bytes, then 12 bytes implicit padding)
    float3 normal;    // offset 32 (16 bytes including padding)
};

8. Dispatch Model

2D Per-Pixel

Image filters and per-pixel detectors. Grid = image dimensions.

let tew = pipeline.threadExecutionWidth();            // SIMD width: 32 on Apple Silicon
let max_tg = pipeline.maxTotalThreadsPerThreadgroup(); // typically 1024
let tg_h = (max_tg / tew).max(1);                     // 1024/32 = 32

let grid = MTLSize { width: w as usize, height: h as usize, depth: 1 };
let tg   = MTLSize { width: tew, height: tg_h, depth: 1 };  // 32x32 = 1024 threads

threadExecutionWidth is queried rather than hardcoded because future Apple Silicon may change the SIMD width — the query returns the hardware truth. The height is derived as max_tg / tew to maximize threadgroup occupancy: a 32x32 threadgroup fills 1024 threads (the hardware max). Smaller threadgroups waste SIMD lanes at threadgroup boundaries.
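The arithmetic as a testable sketch (the Apple Silicon values quoted above are passed in rather than queried from hardware):

```rust
// Derive a 2D threadgroup shape from the pipeline's reported limits:
// width = SIMD width, height = whatever fills the threadgroup maximum.
fn threadgroup_2d(tew: usize, max_tg: usize) -> (usize, usize) {
    (tew, (max_tg / tew).max(1))
}

fn main() {
    let (w, h) = threadgroup_2d(32, 1024);
    assert_eq!((w, h), (32, 32)); // 32x32 = 1024 threads, full occupancy

    // dispatchThreads with a 640x480 grid: Metal forms ceil-division many
    // threadgroups and masks the excess threads in partial edge groups.
    let groups = |dim: usize, tg: usize| (dim + tg - 1) / tg;
    assert_eq!((groups(640, 32), groups(480, 32)), (20, 15));
}
```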

1D Per-Element

Buffer-output kernels processing variable-length arrays.

let grid = MTLSize { width: n_elements, height: 1, depth: 1 };
let tg   = MTLSize { width: tew, height: 1, depth: 1 };

Why dispatchThreads (Non-Uniform)

dispatchThreadgroups requires grid dimensions to be a multiple of threadgroup size — manual ceil-division, easy to get wrong.

dispatchThreads takes the exact thread count. Metal internally handles partial threadgroups. Every shader still bounds-checks (if gid.x >= width return) because threads in the partial edge threadgroup may exceed the intended grid.

Example: a 640x480 grid with 32x32 threadgroups yields 640 / 32 = 20 threadgroup columns and 480 / 32 = 15 threadgroup rows. Any edge thread with gid >= dims takes the shader's early return.

Performance: threadgroup sizing and occupancy. Apple Silicon GPUs execute threads in SIMD groups (wavefronts) of 32. A 32x32 threadgroup = 1024 threads = 32 SIMD groups. The GPU schedules multiple threadgroups per compute unit to hide memory latency. Smaller threadgroups reduce this parallelism. Querying maxTotalThreadsPerThreadgroup and filling it completely maximizes occupancy.


9. Build System

vx-vision/build.rs

Shader Compilation Pipeline

shaders/ (FastDetect.metal, GaussianBlur.metal, ... 42 files) → xcrun metal -c -std=metal3.1 -O2 → .air files (42 objects) → xcrun metallib → vx.metallib in OUT_DIR → include_bytes!() → embedded in binary
  1. build.rs globs shaders/*.metal
  2. Each .metal → .air (Apple Intermediate Representation):
    xcrun -sdk macosx metal -c -target air64-apple-macos14.0 -std=metal3.1 -O2 file.metal -o file.air
    
  3. All .air → linked vx.metallib:
    xcrun -sdk macosx metallib *.air -o vx.metallib
    
  4. Embedded at compile time:
    static METALLIB_BYTES: &[u8] = include_bytes!(concat!(env!("OUT_DIR"), "/vx.metallib"));
  5. At runtime: written to temp file → loaded via newLibraryWithURL_error. Metal has no “load from bytes” API — it requires a file URL. The temp file is deleted immediately after loading.

Auto-discovery: cargo:rerun-if-changed=shaders watches the directory. New .metal files are picked up automatically — no build.rs edits needed.

Compiler flags:

  • -target air64-apple-macos14.0 — AIR bytecode for macOS 14+
  • -std=metal3.1 — MSL 3.1 (mesh shaders, ray tracing intrinsics available)
  • -O2 — full optimization (loop unrolling, dead code elimination, register allocation)
  • -Wno-unused-variable — suppress warnings from template code

10. Error Handling

vx-vision/src/error.rs

#![allow(unused)]
fn main() {
#[non_exhaustive]
pub enum Error {
    DeviceNotFound,              // MTLCreateSystemDefaultDevice() returned nil
    ShaderMissing(String),       // newFunctionWithName() returned nil — typo in kernel name
    PipelineCompile(String),     // newComputePipelineStateWithFunction_error failed — MSL error
    BufferAlloc { bytes: usize },// newBufferWithLength returned nil — out of GPU memory
    TextureSizeMismatch,         // dimension mismatch between input/output textures
    InvalidConfig(String),       // bad parameter (negative radius, zero dimensions, etc.)
    Gpu(String),                 // runtime: commandBuffer/encoder creation failed, execution error
}
}

From<String> maps string errors to Error::Gpu. This bridges vx-core (which returns Result<T, String>) into vx-vision’s typed error enum.

#[non_exhaustive] allows adding variants without breaking downstream match statements.
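Both mechanisms can be seen in a trimmed stand-in for the enum above (only two variants kept; the helper from_core is hypothetical):

```rust
/// Trimmed stand-in for vx-vision's Error, showing the String bridge and
/// the wildcard arm #[non_exhaustive] forces on downstream crates.
#[non_exhaustive]
#[derive(Debug)]
pub enum Error {
    DeviceNotFound,
    Gpu(String),
}

impl From<String> for Error {
    fn from(msg: String) -> Self {
        Error::Gpu(msg) // vx-core's Result<T, String> funnels in here
    }
}

/// `?` applies the From impl automatically when crossing the crate boundary.
fn from_core(r: Result<u32, String>) -> Result<u32, Error> {
    Ok(r?)
}

fn main() {
    let err = from_core(Err("encoder creation failed".into())).unwrap_err();
    // Downstream crates must keep a wildcard arm: #[non_exhaustive] means
    // new variants may appear without a semver-breaking change.
    let msg = match err {
        Error::Gpu(m) => m,
        _ => "other".to_string(),
    };
    assert_eq!(msg, "encoder creation failed");
}
```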


11. Common Pitfalls

| Symptom | Cause | Fix |
| --- | --- | --- |
| Stale data on CPU readback | Guard dropped before waitUntilCompleted, or never created | Guard before commit, drop after wait |
| Garbage pixel output | Format mismatch (R32Float where R8Unorm expected) | Match shader’s texture2d&lt;float&gt; access to texture format |
| Silent write failure | Input texture used as output (ShaderRead only) | Use intermediate_* or output_* |
| Shifted struct fields | [f32; 3] without _pad: f32 in repr(C) struct | Pad every float3 to 16 bytes |
| All-zero results | Atomic counter not zeroed before dispatch | count_buf.write(&[0u32]) |
| Use-after-free crash | EncodedBuffers dropped before GPU completes | Keep encoded state alive until after wait |
| “Missing shader function” | ns_string!() doesn’t match kernel void name | Exact string match between Rust and MSL |
| Threadgroup too large | Hardcoded threadgroup exceeds hardware max | Query maxTotalThreadsPerThreadgroup() |
| Incorrect buffer binding | Index in setBuffer_atIndex mismatches [[buffer(N)]] | Match Rust indices to MSL attribute indices |
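The "shifted struct fields" pitfall is checkable at compile time: MSL float3 occupies 16 bytes, so a bare [f32; 3] on the Rust side shifts every later field by 4 bytes. A layout sketch using only std::mem (the crate itself relies on bytemuck; the struct names here are made up):

```rust
use std::mem::{offset_of, size_of};

#[repr(C)]
#[allow(dead_code)]
struct BadParams {
    color: [f32; 3], // 12 bytes -- does not match MSL float3's 16
    radius: f32,
}

#[repr(C)]
#[allow(dead_code)]
struct GoodParams {
    color: [f32; 3],
    _pad: f32, // pads the float3 slot out to 16 bytes
    radius: f32,
}

fn main() {
    // MSL reads `radius` at offset 16; the unpadded struct puts it at 12.
    assert_eq!(offset_of!(BadParams, radius), 12);
    assert_eq!(offset_of!(GoodParams, radius), 16);
    assert_eq!(size_of::<GoodParams>(), 20);
}
```

In the real crate a static assertion of this shape next to each #[repr(C)] params struct catches the shift before it ever reaches the GPU.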

Appendix A: Shader Inventory

42 Metal shader files in vx-vision/shaders/:

| File | Kernel Function(s) | Category |
| --- | --- | --- |
| FastDetect.metal | fast_detect | Feature detection |
| HarrisResponse.metal | harris_response | Feature detection |
| NMS.metal | nms_suppress | Feature detection |
| ORBDescriptor.metal | orb_compute | Feature description |
| BruteMatcher.metal | hamming_distance, extract_matches | Feature matching |
| StereoMatch.metal | stereo_match | Stereo |
| KLTTracker.metal | klt_track_forward | Optical flow |
| DenseFlow.metal | dense_flow | Optical flow |
| GaussianBlur.metal | gaussian_blur_h, gaussian_blur_v | Filtering |
| Sobel.metal | sobel_3x3, gradient_magnitude | Edge detection |
| Canny.metal | canny_hysteresis | Edge detection |
| Threshold.metal | threshold_binary, threshold_adaptive, threshold_otsu | Segmentation |
| Morphology.metal | morph_erode, morph_dilate | Morphology |
| BilateralFilter.metal | bilateral_filter | Filtering |
| ColorConvert.metal | rgba_to_gray, gray_to_rgba, rgba_to_hsv, hsv_to_rgba | Color |
| Pyramid.metal | pyramid_downsample | Scale space |
| Resize.metal | bilinear_resize | Geometry |
| Warp.metal | warp_affine, warp_perspective | Geometry |
| IntegralImage.metal | integral_* | Analysis |
| Histogram.metal | histogram_compute | Analysis |
| HoughLines.metal | hough_vote, hough_peaks | Line detection |
| TemplateMatch.metal | template_match_ncc | Template matching |
| ConnectedComponents.metal | ccl_* | Labeling |
| DistanceTransform.metal | jfa_seed, jfa_step, jfa_distance | Distance |
| Homography.metal | score_homography | Geometry |
| IndirectArgs.metal | prepare_indirect_args | Utility |
| DoG.metal | dog_subtract, dog_extrema | Scale space |
| undistort.metal | undistort | Calibration |
| SGMStereo.metal | sgm_* | Stereo |
| DepthFilter.metal | depth_bilateral, depth_median | Depth |
| DepthInpaint.metal | depth_inpaint_* | Depth |
| DepthColorize.metal | depth_colorize | Visualization |
| DepthToCloud.metal | depth_to_cloud | 3D reconstruction |
| NormalEstimation.metal | normal_estimation_* | 3D reconstruction |
| OutlierFilter.metal | outlier_filter_* | 3D reconstruction |
| VoxelDownsample.metal | voxel_downsample_* | 3D reconstruction |
| TSDFIntegrate.metal | tsdf_integrate | Volumetric |
| TSDFRaycast.metal | tsdf_raycast | Volumetric |
| MarchingCubes.metal | marching_cubes_* | Mesh extraction |
| Triangulate.metal | triangulate_* | Geometry |
| PointCloudRender.metal | point_cloud_* | Visualization |
| MeshRender.metal | mesh_* | Visualization |