
Design Document

Internal engineering reference. Assumes familiarity with Rust and GPU compute concepts.


1. System Overview

Crate Topology

Application Layer — examples/, downstream consumers
  → vx-vision — Context · Texture · Pipeline · TexturePool · 28 kernels · 42 Metal shaders · #[repr(C)] types
  → vx-core (published as vx-gpu) — UnifiedBuffer<T> · GpuGuard<T> · device management
  → objc2 · objc2-foundation · objc2-metal · bytemuck

Two workspace crates. vx-vision imports vx-core as vx_gpu. Both crates depend on objc2-metal for Metal bindings and bytemuck for safe #[repr(C)] transmutes. No other runtime dependencies — no wgpu, no ash, no cross-platform abstraction. Metal-only by design: this eliminates translation layers and enables direct access to Apple Silicon UMA semantics that cross-platform APIs abstract away.


2. Metal–Rust Bridge

objc2-metal

Metal is an Objective-C framework. Its API surface is protocol-based — MTLDevice, MTLCommandBuffer, etc. are protocols, not concrete classes. The runtime returns opaque objects conforming to these protocols.

objc2-metal generates Rust trait definitions from Metal’s protocol headers. Each protocol becomes a Rust trait. Since the concrete type behind the protocol is unknown at compile time, the binding uses type-erased wrappers:

Retained<ProtocolObject<dyn MTLComputePipelineState>>

Three layers here:

  • dyn MTLComputePipelineState — the Metal protocol, expressed as a Rust trait object
  • ProtocolObject<_> — type erasure wrapper for ObjC protocol conformance (analogous to dyn Trait but bridging ObjC’s protocol dispatch, not Rust’s vtable dispatch)
  • Retained<_> — ARC-compatible smart pointer. Holds a +1 reference count. Sends release on drop. This is the bridge between ObjC’s reference counting and Rust’s ownership semantics.

Retained means Rust owns the object. When a kernel struct holds Retained<ProtocolObject<dyn MTLComputePipelineState>>, the pipeline state lives exactly as long as the kernel struct. No manual retain/release, no leak risk, no double-free — Rust’s drop semantics handle it.

Metal Object Roles

| Object | Created by | Role | Cost | Reuse |
|---|---|---|---|---|
| MTLDevice | MTLCreateSystemDefaultDevice() | GPU handle, factory for everything | One-time | Always |
| MTLLibrary | device.newLibraryWithURL_error() | Container of compiled shader functions | One-time (loads metallib) | Always |
| MTLComputePipelineState | device.newComputePipelineStateWithFunction_error() | Compiled, optimized kernel ready for dispatch | Expensive (shader compilation, register allocation, occupancy calculation) | Always |
| MTLCommandQueue | device.newCommandQueue() | Serial scheduler that submits command buffers to GPU | One-time | Always |
| MTLCommandBuffer | queue.commandBuffer() | Single batch of GPU work | Cheap (pool-allocated internally by Metal) | Never (one-shot) |
| MTLComputeCommandEncoder | cmd_buf.computeCommandEncoder() | Records bind + dispatch commands into a command buffer | Cheap | Never (one-shot) |

The key insight: objects above the line (Device through Queue) are created once and stored for app lifetime. Objects below the line (CommandBuffer, Encoder) are created per-dispatch and discarded after use. This is why kernel structs store pipelines but create command buffers on every call.

Send + Sync

ObjC objects don’t carry Send/Sync in Rust’s type system. Every kernel struct, Context, and Texture needs explicit unsafe impls:

unsafe impl Send for FastDetector {}
unsafe impl Sync for FastDetector {}

Justification: Metal pipeline states are immutable after creation — Apple documents them as thread-safe. MTLDevice and MTLCommandQueue are thread-safe. MTLCommandBuffer is not thread-safe, but it’s never stored in a struct — it’s created and consumed within a single method call.

Without these impls, kernel structs couldn’t be stored in Arc, passed across threads, or used in async contexts.

GPU Command Lifecycle

Device (app lifetime) → Queue (app lifetime) → CmdBuffer (per batch) → Encoder (per kernel) → setPipeline / setBuffers (bind args) → Dispatch (launch grid) → commit() → wait() (fence + block). The CmdBuffer and Encoder live in a per-dispatch scope — created and consumed in one method call.

The Rust code mapping each step:

// ── App lifetime (Context::new) ──────────────────────────────────
let device  = MTLCreateSystemDefaultDevice();           // GPU handle
let queue   = device.newCommandQueue();                 // submission queue
let library = device.newLibraryWithURL_error(&url);     // compiled shaders

// ── App lifetime (Kernel::new) ───────────────────────────────────
let func     = library.newFunctionWithName(ns_string!("fast_detect"));  // function handle
let pipeline = device.newComputePipelineStateWithFunction_error(&func); // compiled kernel
// ↑ This is the expensive step: Metal compiles the function to GPU ISA,
//   determines register usage, calculates max occupancy. ~1-10ms per kernel.

// ── Per dispatch (Kernel::detect / apply / compute) ──────────────
let cmd_buf = queue.commandBuffer();                    // lightweight, pool-allocated
let encoder = cmd_buf.computeCommandEncoder();          // begins a compute pass

encoder.setComputePipelineState(&pipeline);             // which kernel to run
encoder.setTexture_atIndex(Some(tex.raw()), 0);         // [[texture(0)]] in shader
encoder.setBuffer_offset_atIndex(                       // [[buffer(0)]] in shader
    Some(buf.metal_buffer()), 0, 0);
encoder.setBytes_length_atIndex(                        // [[buffer(2)]] — inline constant data
    NonNull::new_unchecked(params_ptr), size, 2);       // memcpy'd into argument buffer
encoder.dispatchThreads_threadsPerThreadgroup(grid, tg); // launch the grid
encoder.endEncoding();                                   // seal this compute pass

cmd_buf.commit();              // submit to GPU — non-blocking, returns immediately
cmd_buf.waitUntilCompleted();  // block CPU until GPU finishes this command buffer
// After this point: all buffer/texture writes by the GPU are visible to CPU.

setBytes copies the struct directly into Metal’s argument buffer (max 4KB), avoiding a MTLBuffer allocation for small constant data. Every kernel uses this for its Params struct. For large data (keypoint arrays, descriptors), setBuffer is required — it binds an existing MTLBuffer by reference.


3. Memory Model

Apple Silicon UMA

Apple Silicon uses Unified Memory Architecture — CPU and GPU share the same physical DRAM. No PCIe bus. No DMA copies. A pointer to shared memory is valid on both processors.

MTLStorageModeShared exposes this: one allocation, one virtual address space, zero-copy access from both CPU and GPU. This is the only storage mode vx-rs uses.

Trade-off: On discrete GPUs (NVIDIA/AMD), StorageModeShared would require PCIe transfers on every access. On Apple Silicon, it’s free — both CPU and GPU access the same cache hierarchy. This is a deliberate platform bet: zero-copy UMA is the core performance advantage that makes the “no cross-platform” design worthwhile.

Synchronization model: After cmd_buf.waitUntilCompleted(), all GPU writes are visible to CPU. No explicit cache flush needed — Metal’s command buffer completion acts as a full memory fence. Before that fence, CPU reads may see stale data (writes can be in GPU caches or reorder buffers).

UnifiedBuffer<T>

vx-core/src/buffer.rs

pub struct UnifiedBuffer<T: Pod> {
    raw: Retained<ProtocolObject<dyn MTLBuffer>>,  // the Metal buffer (+1 refcount)
    count: usize,                                   // number of T elements
    in_flight: Arc<AtomicBool>,                     // prevents CPU mutation during GPU work
    _marker: PhantomData<T>,                        // carries T without owning one
}

The Pod bound (bytemuck::Pod) guarantees the type has no padding traps, no invalid bit patterns, and is safe to interpret from arbitrary bytes. Metal writes raw bytes into the buffer — Pod ensures the CPU can safely interpret them as T after GPU completion. Every #[repr(C)] struct in types.rs derives Pod.

Allocation: newBufferWithLength_options(count * size_of::<T>(), StorageModeShared). Metal zero-initializes shared buffers on allocation. This is relied upon — atomic counter buffers start at zero by default (though kernels explicitly write [0u32] before dispatch for clarity).

CPU access paths:

  • as_slice() -> &[T] — casts MTLBuffer.contents() pointer to &[T]. No copy. The slice directly aliases GPU-visible memory.
  • as_mut_slice() -> &mut [T] — same, but asserts !in_flight. Panics if a GpuGuard exists. This prevents writing to memory the GPU is reading from.
  • write(&[T]) — copy_from_slice into the buffer. Panics if in-flight.
  • to_vec() -> Vec<T> — copies out. Use after GPU completion for safe owned data.
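A CPU-only sketch of the cast behind as_slice(): the CornerPoint layout here is hypothetical, and raw pointer casts stand in for bytemuck, but the soundness argument is the same — every bit pattern must be a valid value of T.

```rust
use std::{mem, slice};

// Hypothetical GPU result type: #[repr(C)], no padding, every bit pattern
// valid — exactly what the Pod bound certifies.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
struct CornerPoint {
    x: u32,
    y: u32,
    score: f32,
    _pad: f32,
}

// What as_slice() does under the hood: reinterpret the buffer's raw
// contents() pointer as &[T]. Sound only for Pod types.
fn view_as_corners(bytes: &[u8]) -> &[CornerPoint] {
    assert_eq!(bytes.len() % mem::size_of::<CornerPoint>(), 0);
    assert_eq!(bytes.as_ptr() as usize % mem::align_of::<CornerPoint>(), 0);
    unsafe {
        slice::from_raw_parts(
            bytes.as_ptr() as *const CornerPoint,
            bytes.len() / mem::size_of::<CornerPoint>(),
        )
    }
}

fn main() {
    // Stand-in for a StorageModeShared allocation the GPU has written into.
    let backing = [
        CornerPoint { x: 10, y: 20, score: 0.5, _pad: 0.0 },
        CornerPoint { x: 30, y: 40, score: 0.9, _pad: 0.0 },
    ];
    // The byte view is what MTLBuffer.contents() exposes.
    let bytes: &[u8] = unsafe {
        slice::from_raw_parts(backing.as_ptr() as *const u8, mem::size_of_val(&backing))
    };

    let corners = view_as_corners(bytes);
    assert_eq!(corners.len(), 2);
    assert_eq!(corners[1].x, 30);
}
```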

GpuGuard<T>

vx-core/src/buffer.rs

RAII guard that blocks CPU mutation while the buffer is in-flight on the GPU.

pub struct GpuGuard<T: Pod> {
    in_flight: Arc<AtomicBool>,  // shared with the UnifiedBuffer
    _marker: PhantomData<T>,
}

// Created: buf.gpu_guard() → sets in_flight = true (Release)
// Dropped: sets in_flight = false (Release)
// Checked: as_mut_slice() loads in_flight (Acquire) → panics if true

Atomic ordering:

  • store(true, Release) on guard creation — ensures all CPU writes to the buffer before creating the guard are visible before the flag becomes true. The GPU sees consistent data.
  • load(Acquire) in as_mut_slice() — ensures the CPU doesn’t reorder reads after the check to before it. If the flag is false, the GPU is done and all its writes are visible.
  • store(false, Release) on drop — pairs with the Acquire in as_mut_slice() to establish happens-before.

This is not a mutex — it’s a one-way gate. No contention, no blocking, just a panic if the contract is violated. The real synchronization is waitUntilCompleted() on the command buffer.
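The flag protocol can be sketched without any Metal involvement. A minimal model (the panic is modeled as a Result so it can be tested; names follow the struct above):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// CPU-only model of the guard's one-way gate.
struct GpuGuard {
    in_flight: Arc<AtomicBool>,
}

impl GpuGuard {
    fn new(flag: &Arc<AtomicBool>) -> Self {
        // Release: all prior CPU writes to the buffer become visible
        // before the flag reads true.
        flag.store(true, Ordering::Release);
        GpuGuard { in_flight: Arc::clone(flag) }
    }
}

impl Drop for GpuGuard {
    fn drop(&mut self) {
        // Release on drop pairs with the Acquire check below.
        self.in_flight.store(false, Ordering::Release);
    }
}

// Models as_mut_slice()'s check; the real code panics instead.
fn try_mut_access(flag: &Arc<AtomicBool>) -> Result<(), &'static str> {
    if flag.load(Ordering::Acquire) {
        return Err("buffer is in-flight on the GPU");
    }
    Ok(())
}

fn main() {
    let in_flight = Arc::new(AtomicBool::new(false));
    {
        let _guard = GpuGuard::new(&in_flight);
        assert!(try_mut_access(&in_flight).is_err()); // locked while guarded
    } // _guard dropped → flag cleared
    assert!(try_mut_access(&in_flight).is_ok());
}
```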

Memory Lifecycle

Alloc (CPU) → Write (CPU) → Guard (lock) → Dispatch (GPU) → Wait (fence) → Drop Guard (unlock) → Read (CPU). CPU mutation panics anywhere in the Guard → Drop Guard window.

Invariant: Guard created before commit. Dropped after waitUntilCompleted. Reversing this allows CPU reads of partially-written GPU data — silent corruption, not a crash.


4. Texture Subsystem

vx-vision/src/texture.rs

pub struct Texture {
    raw: Retained<ProtocolObject<dyn MTLTextureTrait>>,
    width: u32,
    height: u32,
    format: TextureFormat,
}

Textures are used over buffers for image data because Metal’s texture hardware provides spatial locality optimization (tiled/swizzled memory layout), hardware bilinear interpolation, and automatic [0,255] → [0.0,1.0] normalization on R8Unorm reads. A buffer has linear memory layout — worse cache behavior for 2D access patterns like convolution kernels.

Formats

| Format | MTLPixelFormat | Bytes/px | Readback | Shader behavior |
|---|---|---|---|---|
| R8Unorm | R8Unorm | 1 | read_gray8() → Vec<u8> | image.read(gid).r returns [0.0, 1.0] |
| R32Float | R32Float | 4 | read_r32float() → Vec<f32> | image.read(gid).r returns the raw float |
| RGBA8Unorm | RGBA8Unorm | 4 | read_rgba8() → Vec<u8> | .rgba returns 4 normalized channels |

R8Unorm normalization matters: FAST multiplies by 255 to get integer intensity for threshold comparison. Sobel operates directly in [0,1] space. Every shader must account for the format it reads.
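The two conventions side by side as a CPU sketch (helper names are illustrative, not part of the API):

```rust
// R8Unorm reads arrive normalized; FAST rescales back to integer intensity,
// Sobel stays in [0, 1] space.
fn r8_read(stored: u8) -> f32 {
    stored as f32 / 255.0 // what image.read(gid).r returns for R8Unorm
}

fn fast_intensity(normalized: f32) -> u32 {
    (normalized * 255.0).round() as u32 // recover the integer pixel value
}

fn main() {
    assert_eq!(fast_intensity(r8_read(128)), 128); // round-trips exactly
    assert!(r8_read(255) == 1.0 && r8_read(0) == 0.0);
}
```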

Usage Flags

| Role | Flag | Why |
|---|---|---|
| Input | ShaderRead | GPU reads, CPU uploads via replaceRegion. Cheapest — Metal can optimize read-only layout. |
| Output | ShaderWrite | GPU writes. Cannot be sampled in the same pass. |
| Intermediate | ShaderRead \| ShaderWrite | Read in one pass, written in another. Required for multi-pass kernels (Gaussian H→V, Sobel→magnitude). |

Writing to a ShaderRead-only texture is undefined behavior — no GPU error, just corrupted output. Metal does not validate usage flags at dispatch time.

Upload & Readback

Both use MTLTexture’s replaceRegion (upload) and getBytes (readback). These are synchronous CPU-side memcpy operations. bytesPerRow must match: width * 1 for R8, width * 4 for R32Float and RGBA8.
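The bytesPerRow rule as code — a minimal sketch, with the format enum assumed from the formats table above:

```rust
// Per-format byte widths; variants mirror the Formats table.
#[derive(Clone, Copy)]
enum TextureFormat {
    R8Unorm,
    R32Float,
    Rgba8Unorm,
}

fn bytes_per_pixel(format: TextureFormat) -> usize {
    match format {
        TextureFormat::R8Unorm => 1,
        TextureFormat::R32Float | TextureFormat::Rgba8Unorm => 4,
    }
}

// The value passed as bytesPerRow to replaceRegion / getBytes.
fn bytes_per_row(format: TextureFormat, width: usize) -> usize {
    width * bytes_per_pixel(format)
}

fn main() {
    assert_eq!(bytes_per_row(TextureFormat::R8Unorm, 640), 640);
    assert_eq!(bytes_per_row(TextureFormat::R32Float, 640), 2560);
}
```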

Performance note: These operations copy data. For input textures, the copy happens once at creation. For output readback, it happens after GPU completion. In a real-time pipeline, prefer keeping data in textures across kernel stages rather than reading back to CPU between stages — use Pipeline for this.

TexturePool

vx-vision/src/pool.rs

pub struct TexturePool {
    buckets: HashMap<(u32, u32, TextureFormat), Vec<Texture>>,
    max_per_bucket: usize,  // default: 8
}

MTLTexture allocation is non-trivial — Metal must find contiguous GPU memory, configure tiling parameters, and set up page tables. In a real-time pipeline processing 30fps video, allocating and freeing intermediate textures per frame wastes time.

The pool caches by (width, height, format). acquire() pops from the matching bucket or allocates fresh. release() pushes back. Pool textures always have ShaderRead | ShaderWrite — they may serve as either input or output in different pipeline stages.

hit_rate() tracks effectiveness. If it’s low, the pipeline is using too many distinct texture sizes.
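A CPU-only sketch of the bucket logic (the real pool also keys on TextureFormat and allocates through the Metal device; both are omitted here, and the hit counters are an assumed implementation detail):

```rust
use std::collections::HashMap;

// Stand-in for the real Texture type.
#[derive(Clone)]
struct Texture { width: u32, height: u32 }

#[derive(Default)]
struct TexturePool {
    buckets: HashMap<(u32, u32), Vec<Texture>>,
    max_per_bucket: usize,
    hits: u64,
    requests: u64,
}

impl TexturePool {
    fn acquire(&mut self, w: u32, h: u32) -> Texture {
        self.requests += 1;
        if let Some(tex) = self.buckets.get_mut(&(w, h)).and_then(|b| b.pop()) {
            self.hits += 1;
            tex // reuse a cached texture — no Metal allocation
        } else {
            Texture { width: w, height: h } // allocate fresh
        }
    }

    fn release(&mut self, tex: Texture) {
        let bucket = self.buckets.entry((tex.width, tex.height)).or_default();
        if bucket.len() < self.max_per_bucket {
            bucket.push(tex); // cache for reuse; over-full buckets just drop
        }
    }

    fn hit_rate(&self) -> f64 {
        if self.requests == 0 { 0.0 } else { self.hits as f64 / self.requests as f64 }
    }
}

fn main() {
    let mut pool = TexturePool { max_per_bucket: 8, ..Default::default() };
    let t = pool.acquire(640, 480);  // miss → fresh allocation
    pool.release(t);
    let _t = pool.acquire(640, 480); // hit → reused
    assert_eq!(pool.hit_rate(), 0.5);
}
```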


5. Context & Pipeline

Context

vx-vision/src/context.rs

pub struct Context {
    device: Retained<ProtocolObject<dyn MTLDevice>>,
    queue: Retained<ProtocolObject<dyn MTLCommandQueue>>,
    library: Retained<ProtocolObject<dyn MTLLibrary>>,
}

Single entry point. Context::new() does three things:

  1. MTLCreateSystemDefaultDevice() — gets the system GPU
  2. device.newCommandQueue() — creates the submission queue
  3. Loads the embedded metallib (see Build System)

device(), queue(), library() are pub(crate) — kernel code uses them internally, but downstream consumers only interact through Context and kernel structs.

One queue per Context. Metal command queues are serial — command buffers submitted to the same queue execute in order. This simplifies synchronization: no explicit fences between dependent dispatches. For parallel kernel execution, create multiple Context instances (each with its own queue).

Pipeline

vx-vision/src/pipeline.rs

pub struct Pipeline {
    cmd_buf: Retained<ProtocolObject<dyn MTLCommandBuffer>>,
    retained: Vec<Texture>,
    committed: bool,
}

Batches multiple kernel dispatches into a single MTLCommandBuffer. This matters because command buffer creation has overhead — Metal must acquire a buffer from its internal pool, set up completion handlers, and track resources.

Texture retention problem: When a kernel’s encode() creates an intermediate texture, that texture would be dropped when encode() returns. But the GPU hasn’t run yet — the command buffer only records commands, it doesn’t execute them until commit(). If the texture is freed, the GPU reads garbage.

retain(tex) moves the texture into the pipeline’s Vec<Texture>, keeping it alive until commit_and_wait() returns.

Pipeline Batching Flow

Pipeline (single MTLCommandBuffer): begin() → gaussian.encode() (encoder 1 → endEncoding) → retain(tex) → fast.encode() (encoder 2 → endEncoding) → commit_and_wait(). Multiple encoders within one command buffer; Metal executes the passes sequentially. A single commit/wait amortizes submission overhead across all passes.

6. Kernel Taxonomy

Every kernel struct follows the same construction pattern:

  1. new(&Context) — compile pipeline(s) from the metallib. Store as Retained<...>.
  2. Sync method (detect / apply / compute) — allocate per-dispatch resources, encode, commit, wait, readback.
  3. encode(...) — record commands without committing. For Pipeline batching.
  4. unsafe impl Send + Sync — pipeline states are immutable, thread-safe.

The difference between kernel categories is what goes in and what comes out.

6.1 Buffer-Output Kernels

Variable-length structured output. The shader decides at runtime how many results to emit.

Mechanism: A device atomic_uint* counter buffer starts at zero. Each thread that produces a result atomically increments the counter and writes to that slot:

uint slot = atomic_fetch_add_explicit(count, 1, memory_order_relaxed);
if (slot < params.max_corners) {
    corners[slot] = result;
}

memory_order_relaxed is sufficient — we only need the atomic increment to be unique, not ordered relative to other atomics. The slot assignment is the synchronization.

An alternative is prefix scan (parallel exclusive scan) for deterministic output ordering, but that requires two passes and shared memory coordination. Atomic append is one pass, simpler, and sufficient — output ordering isn’t needed since downstream kernels (Harris, NMS) operate on the full set regardless of order.
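The same append pattern can be exercised on the CPU with std atomics. A sketch with hypothetical names, mirroring the counter clamp that detect() performs on readback:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// CPU analogue of the shader's atomic-append: each worker claims a unique
// slot with fetch_add; writers that overflow capacity are dropped, and
// readback clamps the counter.
fn simulate_append(workers: u32, capacity: usize) -> (usize, Vec<u32>) {
    let count = Arc::new(AtomicU32::new(0)); // the atomic_uint* counter buffer
    let slots = Arc::new((0..capacity).map(|_| AtomicU32::new(0)).collect::<Vec<_>>());

    let handles: Vec<_> = (0..workers)
        .map(|i| {
            let count = Arc::clone(&count);
            let slots = Arc::clone(&slots);
            thread::spawn(move || {
                // Relaxed is enough: only uniqueness of the slot matters.
                let slot = count.fetch_add(1, Ordering::Relaxed) as usize;
                if slot < capacity {
                    slots[slot].store(100 + i, Ordering::Relaxed); // write result
                } // overflowing writers are dropped; the counter still advances
            })
        })
        .collect();
    handles.into_iter().for_each(|h| h.join().unwrap());

    // Clamp on readback — the counter may exceed capacity.
    let n = (count.load(Ordering::Relaxed) as usize).min(capacity);
    let out = slots[..n].iter().map(|s| s.load(Ordering::Relaxed)).collect();
    (n, out)
}

fn main() {
    let (n, results) = simulate_append(16, 8);
    assert_eq!(n, 8);                           // counter clamped to capacity
    assert!(results.iter().all(|&v| v >= 100)); // every kept slot was written
}
```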

Texture (R8Unorm) + constant Params& → Compute Shader (2D grid: 1 thread/pixel, or 1D: 1 thread/element; atomic_fetch_add → slot) → Buffer<T>[N] + atomic_uint count → CPU readback: buf[..count].to_vec()

Annotated walkthrough: FastDetector::detect() (vx-vision/src/kernels/fast.rs)

pub fn detect(&self, ctx: &Context, texture: &Texture, config: &FastDetectConfig)
    -> Result<FastDetectResult>
{
    let w = texture.width();
    let h = texture.height();

    // Pre-allocate max capacity. The GPU fills [0..actual_count].
    // Over-allocation is cheap on UMA — no copy, just page table entries.
    let corner_buf = UnifiedBuffer::<CornerPoint>::new(ctx.device(), config.max_corners as usize)?;
    let mut count_buf = UnifiedBuffer::<u32>::new(ctx.device(), 1)?;
    count_buf.write(&[0u32]);  // Zero the atomic counter. Forgetting this → stale count from
                                // previous dispatch or random memory, causing buffer overread.

    // Build the params struct. Field order must exactly match FASTParams in FastDetect.metal.
    let params = FASTParams { threshold: config.threshold, max_corners: config.max_corners,
                              width: w, height: h };

    // Guards BEFORE dispatch — marks buffers as in-flight.
    let _corner_guard = corner_buf.gpu_guard();
    let _count_guard  = count_buf.gpu_guard();

    // Create command buffer and encoder.
    let cmd_buf = ctx.queue().commandBuffer()
        .ok_or(Error::Gpu("failed to create command buffer".into()))?;
    let encoder = cmd_buf.computeCommandEncoder()
        .ok_or(Error::Gpu("failed to create compute encoder".into()))?;

    // Bind pipeline, textures, buffers, params.
    // Index arguments (0, 1, 2) must match [[texture(0)]], [[buffer(0)]], [[buffer(1)]], [[buffer(2)]]
    // in the .metal file. A mismatch binds the wrong data — silent corruption, not a crash.
    encoder.setComputePipelineState(&self.pipeline);
    encoder.setTexture_atIndex(Some(texture.raw()), 0);
    encoder.setBuffer_offset_atIndex(Some(corner_buf.metal_buffer()), 0, 0);
    encoder.setBuffer_offset_atIndex(Some(count_buf.metal_buffer()), 0, 1);
    encoder.setBytes_length_atIndex(
        NonNull::new_unchecked(&params as *const _ as *mut c_void),
        mem::size_of::<FASTParams>(), 2);

    // 2D dispatch: one thread per pixel. threadExecutionWidth is the SIMD width (32 on Apple Silicon).
    // maxTotalThreadsPerThreadgroup is typically 1024. So threadgroup = 32x32 = 1024 threads.
    // Edge threadgroups are partial — the shader's bounds check handles out-of-range gids.
    let tew = self.pipeline.threadExecutionWidth();
    let max_tg = self.pipeline.maxTotalThreadsPerThreadgroup();
    let grid = MTLSize { width: w as usize, height: h as usize, depth: 1 };
    let tg = MTLSize { width: tew, height: (max_tg / tew).max(1), depth: 1 };
    encoder.dispatchThreads_threadsPerThreadgroup(grid, tg);
    encoder.endEncoding();

    cmd_buf.commit();
    cmd_buf.waitUntilCompleted();  // blocks until GPU finishes

    // Drop guards → unlocks buffers for CPU access.
    drop(_corner_guard);
    drop(_count_guard);

    // Read actual count, clamp to max (shader may have over-incremented the atomic
    // if more corners found than buffer capacity), slice the buffer.
    let n = (count_buf.as_slice()[0] as usize).min(config.max_corners as usize);
    Ok(FastDetectResult { corners: corner_buf.as_slice()[..n].to_vec() })
}

Buffer-output kernel catalog:

| Kernel | Struct | Shader | Dispatch | Output type |
|---|---|---|---|---|
| FAST-9 | FastDetector | fast_detect | 2D per-pixel | Vec<CornerPoint> |
| Harris | HarrisScorer | harris_response | 1D per-corner | Vec<CornerPoint> (scored) |
| NMS | NmsSuppressor | nms_suppress | 1D per-corner | Vec<CornerPoint> (filtered) |
| ORB | OrbDescriptor | orb_compute | 1D per-keypoint | Vec<ORBOutput> |
| Matcher | BruteMatcher | hamming_distance + extract_matches | 2D + 1D | Vec<MatchResult> |
| StereoMatch | StereoMatcher | stereo_match | 2D | Vec<StereoMatchResult> |
| Histogram | HistogramComputer | histogram_compute | 2D per-pixel | Vec<u32> (256 bins) |
| Hough | HoughDetector | hough_vote + hough_peaks | 2D + 1D | Vec<HoughLine> |
| Homography | HomographyScorer | score_homography | 1D per-point | Vec<ScoreResult> |

6.2 Texture-to-Texture Kernels

Fixed-size output. One output pixel per input pixel (or per output pixel for resize/warp). No atomic counters, no variable-length output.

Input Texture [[texture(0)]] → Compute Shader (2D grid, 1 thread/pixel: read → transform → write) → Output Texture [[texture(1)]]

Annotated walkthrough: GaussianBlur::apply() (vx-vision/src/kernels/gaussian.rs)

Separable 2-pass convolution. A 2D Gaussian with radius r requires (2r+1)^2 reads per pixel. Separating into H+V passes reduces to 2*(2r+1) reads — O(r) instead of O(r²).

pub fn apply(&self, ctx: &Context, input: &Texture, output: &Texture,
             config: &GaussianConfig) -> Result<()>
{
    let w = input.width();
    let h = input.height();

    // Intermediate texture for horizontal pass output / vertical pass input.
    // Must be ShaderRead|ShaderWrite — written by pass 1, read by pass 2.
    // R32Float avoids precision loss during intermediate accumulation.
    let intermediate = Texture::intermediate_r32float(ctx.device(), w, h)?;

    let params = GaussianParams { width: w, height: h, sigma: config.sigma, radius: config.radius };
    let cmd_buf = ctx.queue().commandBuffer()?;

    // Pass 1: horizontal blur. Each thread reads (2*radius+1) horizontal neighbors.
    // Clamp-to-edge boundary: shader does clamp(x + dx, 0, width-1).
    // Weights computed inline: exp(-dx²/(2σ²)). Not precomputed — ALU is faster than
    // the memory load that a weight LUT would require at these kernel sizes.
    {
        let enc = cmd_buf.computeCommandEncoder()?;
        enc.setComputePipelineState(&self.h_pipeline);
        enc.setTexture_atIndex(Some(input.raw()), 0);        // source
        enc.setTexture_atIndex(Some(intermediate.raw()), 1);  // dest
        enc.setBytes_length_atIndex(/* params */, /* size */, 0);
        enc.dispatchThreads_threadsPerThreadgroup(grid_2d(w, h), tg_2d(&self.h_pipeline));
        enc.endEncoding();  // seal pass 1
    }

    // Pass 2: vertical blur on the horizontally-blurred intermediate.
    // Metal guarantees pass 2 sees pass 1's writes — encoders within a command buffer
    // execute in order with implicit barriers between them.
    {
        let enc = cmd_buf.computeCommandEncoder()?;
        enc.setComputePipelineState(&self.v_pipeline);
        enc.setTexture_atIndex(Some(intermediate.raw()), 0);  // source (pass 1 output)
        enc.setTexture_atIndex(Some(output.raw()), 1);         // final dest
        enc.setBytes_length_atIndex(/* params */, /* size */, 0);
        enc.dispatchThreads_threadsPerThreadgroup(grid_2d(w, h), tg_2d(&self.v_pipeline));
        enc.endEncoding();
    }

    cmd_buf.commit();
    cmd_buf.waitUntilCompleted();
    // intermediate texture dropped here — its only purpose was the H→V handoff.
    Ok(())
}
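The separability claim is easy to sanity-check on the CPU: a full 2D Gaussian pass and an H-then-V pass must agree exactly (clamp-to-edge included, since clamping in x is independent of y). A toy sketch on a single-channel image:

```rust
// 1D Gaussian weights, normalized to sum to 1.
fn weights(radius: i32, sigma: f32) -> Vec<f32> {
    let mut w: Vec<f32> = (-radius..=radius)
        .map(|d| (-((d * d) as f32) / (2.0 * sigma * sigma)).exp())
        .collect();
    let sum: f32 = w.iter().sum();
    w.iter_mut().for_each(|x| *x /= sum);
    w
}

// One horizontal pass with clamp-to-edge, as in the shader.
fn blur_rows(img: &[Vec<f32>], w: &[f32], r: i32) -> Vec<Vec<f32>> {
    let wid = img[0].len() as i32;
    img.iter()
        .map(|row| {
            (0..wid)
                .map(|x| {
                    (-r..=r)
                        .map(|d| w[(d + r) as usize] * row[(x + d).clamp(0, wid - 1) as usize])
                        .sum()
                })
                .collect()
        })
        .collect()
}

fn transpose(img: &[Vec<f32>]) -> Vec<Vec<f32>> {
    (0..img[0].len())
        .map(|x| img.iter().map(|row| row[x]).collect())
        .collect()
}

fn main() {
    let (r, sigma) = (1, 1.0);
    let w = weights(r, sigma);
    let img = vec![
        vec![0.0, 0.2, 0.4, 0.6],
        vec![0.1, 0.3, 0.5, 0.7],
        vec![0.9, 0.8, 0.7, 0.6],
    ];

    // Separable: H pass, then V pass (rows of the transpose) — 2*(2r+1) reads.
    let sep = transpose(&blur_rows(&transpose(&blur_rows(&img, &w, r)), &w, r));

    // Reference: full 2D kernel — (2r+1)^2 reads per pixel.
    let (hgt, wid) = (img.len() as i32, img[0].len() as i32);
    for y in 0..hgt {
        for x in 0..wid {
            let full: f32 = (-r..=r)
                .flat_map(|dy| (-r..=r).map(move |dx| (dy, dx)))
                .map(|(dy, dx)| {
                    w[(dy + r) as usize] * w[(dx + r) as usize]
                        * img[(y + dy).clamp(0, hgt - 1) as usize]
                             [(x + dx).clamp(0, wid - 1) as usize]
                })
                .sum();
            assert!((full - sep[y as usize][x as usize]).abs() < 1e-5);
        }
    }
}
```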

Multi-pass flow:

Input → Pass 1 (H): gaussian_blur_h → Intermediate (R32Float, Read|Write) → Pass 2 (V): gaussian_blur_v → Output

Texture-to-texture kernel catalog:

| Kernel | Struct | Shader(s) | Passes | Notes |
|---|---|---|---|---|
| Gaussian | GaussianBlur | gaussian_blur_h, gaussian_blur_v | 2 | Separable, O(r) vs O(r²) |
| Sobel | SobelFilter | sobel_3x3, gradient_magnitude | 2 | Outputs: gx, gy, magnitude, direction |
| Canny | CannyDetector | Sobel + canny_hysteresis | 3 | Composes Sobel internally |
| Threshold | ThresholdFilter | threshold_binary / adaptive / otsu | 1–2 | Otsu needs histogram first |
| Color | ColorConverter | rgba_to_gray, gray_to_rgba, rgba_to_hsv, hsv_to_rgba | 1 | Per-pixel, no neighbors |
| Morphology | MorphFilter | morph_erode, morph_dilate | 1–2 | Open = erode+dilate |
| Pyramid | ImagePyramidBuilder | pyramid_downsample | N | 4 levels = 3 downsamples |
| Resize | ResizeFilter | bilinear_resize | 1 | Grid = output dimensions |
| Warp | WarpFilter | warp_affine, warp_perspective | 1 | Inverse transform per pixel |
| Bilateral | BilateralFilter | bilateral_filter | 1 | O(r²) — not separable |
| Dense Flow | DenseFlowComputer | dense_flow | 1 | Horn-Schunck |
| Connected | ConnectedComponents | ccl_* | iterative | Label propagation until convergence |
| Distance | DistanceTransform | jfa_seed, jfa_step, jfa_distance | 2+N | Jump Flooding, O(log n) passes |
| Template | TemplateMatcher | template_match_ncc | 1 | NCC score map |
| Integral | IntegralImage | integral_* | multi | Prefix sum (row then column) |

6.3 Hybrid Kernels

Consume textures, produce buffer output — or both.

KLT Optical Flow (vx-vision/src/kernels/klt.rs): Binds 8 textures (4-level pyramid × 2 frames) at indices 0–7, plus 3 buffers (prev_points, curr_points, status). Dispatch is 1D: one thread per point. Each thread does iterative Lucas-Kanade at each pyramid level (coarse-to-fine), reading texture neighborhoods and writing final position + tracking status to buffers.

DoG (Difference-of-Gaussians): Two-phase. First phase subtracts adjacent scale-space textures (texture→texture). Second phase finds 3D extrema across scale and space, appending DoGKeypoint to a buffer via atomic counter (texture→buffer).

6.4 Utility Kernels

IndirectDispatch (vx-vision/src/kernels/indirect.rs): Solves the “CPU round-trip” problem in FAST→Harris chaining.

FAST produces N corners (unknown until GPU completes). Without indirect dispatch: commit FAST → wait → read count → dispatch Harris with count. Two command buffers, one CPU stall.

With indirect dispatch: prepare_indirect_args reads the atomic counter and computes MTLDispatchThreadgroupsIndirectArguments on the GPU. Harris then dispatches using dispatchThreadgroupsWithIndirectBuffer — the GPU reads the thread count from the args buffer directly. Everything stays in one command buffer, zero CPU round-trips.

Implementation note: objc2-metal doesn’t bind dispatchThreadgroupsWithIndirectBuffer. The code uses raw msg_send! to call it.

6.5 3D Reconstruction Kernels

Feature-gated: #[cfg(feature = "reconstruction")].

CategoryKernelsPattern
Depth processingDepthFilter (bilateral + median), DepthInpaint (JFA hole-fill), DepthColorizetexture→texture
StereoSGMStereo (Semi-Global Matching)texture→texture (disparity map)
Point cloudDepthToCloud (unprojection), OutlierFilter, VoxelDownsample, NormalEstimationtexture/buffer→buffer
VolumetricTSDFIntegrate, TSDFRaycast, MarchingCubesbuffer→buffer (3D voxel grids)
GeometryTriangulatebuffer→buffer

Same patterns as core kernels. TSDF uses 3D buffer indexing (res_x * res_y * res_z voxels). MarchingCubes outputs triangle meshes to buffers with atomic vertex/index counters.


7. #[repr(C)] Contract

vx-vision/src/types.rs

Every params struct is passed to the GPU via setBytes_length_atIndex — a raw memcpy into the argument buffer. The GPU interprets the bytes according to the MSL struct layout. If the Rust and MSL layouts differ by even one byte, fields shift — silent data corruption, no error.

Rules

  1. #[repr(C)] — C-compatible field ordering. Without this, Rust may reorder fields for alignment optimization.
  2. Pod + Zeroable (bytemuck) — certifies the type is safe to transmute from raw bytes.
  3. Field order matches MSL exactly. Same names, same order, same types.
  4. Type mapping:
| Rust | MSL | Size | Alignment | Notes |
|---|---|---|---|---|
| u32 | uint | 4 | 4 | |
| i32 | int | 4 | 4 | |
| f32 | float | 4 | 4 | |
| [f32; 2] | float2 | 8 | 8 | |
| [f32; 3] + _pad: f32 | float3 | 16 | 16 | Rust [f32; 3] is only 4-byte aligned |
| [f32; 4] | float4 | 16 | 16 | |
| [u8; 4] | uchar4 | 4 | 4 | |

float3 Alignment

MSL float3 occupies 16 bytes (12 data + 4 padding) and requires 16-byte alignment. Rust’s [f32; 3] is 12 bytes with 4-byte alignment. Without explicit padding, every field after [f32; 3] is offset by 4 bytes from what the GPU expects.
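Layout drift is cheap to catch with assertions. A sketch using a hypothetical BlurParams (all 4-byte scalars, so Rust and MSL agree trivially) alongside the [f32; 3] trap itself:

```rust
use std::mem::{align_of, size_of};

// Hypothetical params struct: four 4-byte scalars, so the Rust #[repr(C)]
// layout and the MSL layout coincide with no padding.
#[repr(C)]
#[derive(Clone, Copy)]
struct BlurParams {
    width: u32,  // uint
    height: u32, // uint
    sigma: f32,  // float
    radius: i32, // int
}

fn main() {
    // A mismatch here would mean silent GPU corruption; an assert is cheaper.
    assert_eq!(size_of::<BlurParams>(), 16);
    assert_eq!(align_of::<BlurParams>(), 4);

    // The float3 trap in numbers: Rust's [f32; 3] is 12 bytes, 4-byte aligned,
    // while MSL float3 is 16 bytes, 16-byte aligned.
    assert_eq!(size_of::<[f32; 3]>(), 12);
    assert_eq!(align_of::<[f32; 3]>(), 4);
}
```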

// Rust — matches the MSL float3 layout below. Note that float3's 16-byte
// alignment pushes `normal` to offset 32, so 12 bytes of padding must
// follow `color` as well.
#[repr(C)]
pub struct GpuPoint3D {
    pub position: [f32; 3],  // 12 bytes at offset 0
    pub _pad0: f32,          // → 16, matching float3's footprint
    pub color: [u8; 4],      // offset 16
    pub _pad1: [f32; 3],     // 12 bytes: aligns `normal` to the next 16-byte boundary
    pub normal: [f32; 3],    // 12 bytes at offset 32
    pub _pad2: f32,          // → 48, matching the MSL struct's total size
}

// MSL — the GPU's view of the same bytes
struct GpuPoint3D {
    float3 position;  // offset 0  (16 bytes including padding)
    uchar4 color;     // offset 16 (4 bytes, then 12 bytes implicit padding)
    float3 normal;    // offset 32 (16 bytes including padding)
};

8. Dispatch Model

2D Per-Pixel

Image filters and per-pixel detectors. Grid = image dimensions.

let tew = pipeline.threadExecutionWidth();            // SIMD width: 32 on Apple Silicon
let max_tg = pipeline.maxTotalThreadsPerThreadgroup(); // typically 1024
let tg_h = (max_tg / tew).max(1);                     // 1024/32 = 32

let grid = MTLSize { width: w as usize, height: h as usize, depth: 1 };
let tg   = MTLSize { width: tew, height: tg_h, depth: 1 };  // 32x32 = 1024 threads

threadExecutionWidth is queried rather than hardcoded because future Apple Silicon may change the SIMD width — the query returns the hardware truth. The height is derived as max_tg / tew to maximize threadgroup occupancy: a 32x32 threadgroup fills 1024 threads (the hardware max). Smaller threadgroups waste SIMD lanes at threadgroup boundaries.
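The arithmetic as a testable sketch (the Apple Silicon values quoted above are passed in rather than queried from hardware):

```rust
// Derive a 2D threadgroup shape from the pipeline's reported limits:
// width = SIMD width, height = whatever fills the threadgroup maximum.
fn threadgroup_2d(tew: usize, max_tg: usize) -> (usize, usize) {
    (tew, (max_tg / tew).max(1))
}

fn main() {
    let (w, h) = threadgroup_2d(32, 1024);
    assert_eq!((w, h), (32, 32)); // 32x32 = 1024 threads, full occupancy

    // dispatchThreads with a 640x480 grid: Metal forms ceil-division many
    // threadgroups and masks the excess threads in partial edge groups.
    let groups = |dim: usize, tg: usize| (dim + tg - 1) / tg;
    assert_eq!((groups(640, 32), groups(480, 32)), (20, 15));
}
```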

1D Per-Element

Buffer-output kernels processing variable-length arrays.

let grid = MTLSize { width: n_elements, height: 1, depth: 1 };
let tg   = MTLSize { width: tew, height: 1, depth: 1 };

Why dispatchThreads (Non-Uniform)

dispatchThreadgroups requires grid dimensions to be a multiple of threadgroup size — manual ceil-division, easy to get wrong.

dispatchThreads takes the exact thread count. Metal internally handles partial threadgroups. Every shader still bounds-checks (if gid.x >= width return) because threads in the partial edge threadgroup may exceed the intended grid.

Example: a 640x480 grid with 32x32 threadgroups yields 640 / 32 = 20 threadgroup columns and 480 / 32 = 15 threadgroup rows. Any edge thread with gid >= dims takes the shader's early return.

Performance: threadgroup sizing and occupancy. Apple Silicon GPUs execute threads in SIMD groups (wavefronts) of 32. A 32x32 threadgroup = 1024 threads = 32 SIMD groups. The GPU schedules multiple threadgroups per compute unit to hide memory latency. Smaller threadgroups reduce this parallelism. Querying maxTotalThreadsPerThreadgroup and filling it completely maximizes occupancy.


9. Build System

vx-vision/build.rs

Shader Compilation Pipeline

shaders/ (FastDetect.metal, GaussianBlur.metal, ... 42 files) → xcrun metal -c -std=metal3.1 -O2 → .air files (42 objects) → xcrun metallib → vx.metallib in OUT_DIR → include_bytes!() → embedded in binary
  1. build.rs globs shaders/*.metal
  2. Each .metal → .air (Apple Intermediate Representation):
    xcrun -sdk macosx metal -c -target air64-apple-macos14.0 -std=metal3.1 -O2 file.metal -o file.air
    
  3. All .air → linked vx.metallib:
    xcrun -sdk macosx metallib *.air -o vx.metallib
    
  4. Embedded at compile time:
    static METALLIB_BYTES: &[u8] = include_bytes!(concat!(env!("OUT_DIR"), "/vx.metallib"));
  5. At runtime: written to temp file → loaded via newLibraryWithURL_error. Metal has no “load from bytes” API — it requires a file URL. The temp file is deleted immediately after loading.

Auto-discovery: cargo:rerun-if-changed=shaders watches the directory. New .metal files are picked up automatically — no build.rs edits needed.

Compiler flags:

  • -target air64-apple-macos14.0 — AIR bytecode for macOS 14+
  • -std=metal3.1 — MSL 3.1 (mesh shaders, ray tracing intrinsics available)
  • -O2 — full optimization (loop unrolling, dead code elimination, register allocation)
  • -Wno-unused-variable — suppress warnings from template code

10. Error Handling

vx-vision/src/error.rs

#![allow(unused)]
fn main() {
#[non_exhaustive]
pub enum Error {
    DeviceNotFound,              // MTLCreateSystemDefaultDevice() returned nil
    ShaderMissing(String),       // newFunctionWithName() returned nil — typo in kernel name
    PipelineCompile(String),     // newComputePipelineStateWithFunction_error failed — MSL error
    BufferAlloc { bytes: usize },// newBufferWithLength returned nil — out of GPU memory
    TextureSizeMismatch,         // dimension mismatch between input/output textures
    InvalidConfig(String),       // bad parameter (negative radius, zero dimensions, etc.)
    Gpu(String),                 // runtime: commandBuffer/encoder creation failed, execution error
}
}

From<String> maps string errors to Error::Gpu. This bridges vx-core (which returns Result<T, String>) into vx-vision’s typed error enum.

#[non_exhaustive] allows adding variants without breaking downstream match statements.
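Both mechanisms can be seen in a trimmed stand-in for the enum above (only two variants kept; the helper from_core is hypothetical):

```rust
/// Trimmed stand-in for vx-vision's Error, showing the String bridge and
/// the wildcard arm #[non_exhaustive] forces on downstream crates.
#[non_exhaustive]
#[derive(Debug)]
pub enum Error {
    DeviceNotFound,
    Gpu(String),
}

impl From<String> for Error {
    fn from(msg: String) -> Self {
        Error::Gpu(msg) // vx-core's Result<T, String> funnels in here
    }
}

/// `?` applies the From impl automatically when crossing the crate boundary.
fn from_core(r: Result<u32, String>) -> Result<u32, Error> {
    Ok(r?)
}

fn main() {
    let err = from_core(Err("encoder creation failed".into())).unwrap_err();
    // Downstream crates must keep a wildcard arm: #[non_exhaustive] means
    // new variants may appear without a semver-breaking change.
    let msg = match err {
        Error::Gpu(m) => m,
        _ => "other".to_string(),
    };
    assert_eq!(msg, "encoder creation failed");
}
```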


11. Common Pitfalls

| Symptom | Cause | Fix |
| --- | --- | --- |
| Stale data on CPU readback | Guard dropped before waitUntilCompleted, or never created | Guard before commit, drop after wait |
| Garbage pixel output | Format mismatch (R32Float where R8Unorm expected) | Match shader’s texture2d&lt;float&gt; access to texture format |
| Silent write failure | Input texture used as output (ShaderRead only) | Use intermediate_* or output_* |
| Shifted struct fields | [f32; 3] without _pad: f32 in repr(C) struct | Pad every float3 to 16 bytes |
| All-zero results | Atomic counter not zeroed before dispatch | count_buf.write(&[0u32]) |
| Use-after-free crash | EncodedBuffers dropped before GPU completes | Keep encoded state alive until after wait |
| “Missing shader function” | ns_string!() doesn’t match kernel void name | Exact string match between Rust and MSL |
| Threadgroup too large | Hardcoded threadgroup exceeds hardware max | Query maxTotalThreadsPerThreadgroup() |
| Incorrect buffer binding | Index in setBuffer_atIndex mismatches [[buffer(N)]] | Match Rust indices to MSL attribute indices |
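The "shifted struct fields" pitfall is checkable at compile time: MSL float3 occupies 16 bytes, so a bare [f32; 3] on the Rust side shifts every later field by 4 bytes. A layout sketch using only std::mem (the crate itself relies on bytemuck; the struct names here are made up):

```rust
use std::mem::{offset_of, size_of};

#[repr(C)]
#[allow(dead_code)]
struct BadParams {
    color: [f32; 3], // 12 bytes -- does not match MSL float3's 16
    radius: f32,
}

#[repr(C)]
#[allow(dead_code)]
struct GoodParams {
    color: [f32; 3],
    _pad: f32, // pads the float3 slot out to 16 bytes
    radius: f32,
}

fn main() {
    // MSL reads `radius` at offset 16; the unpadded struct puts it at 12.
    assert_eq!(offset_of!(BadParams, radius), 12);
    assert_eq!(offset_of!(GoodParams, radius), 16);
    assert_eq!(size_of::<GoodParams>(), 20);
}
```

In the real crate a static assertion of this shape next to each #[repr(C)] params struct catches the shift before it ever reaches the GPU.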

Appendix A: Shader Inventory

42 Metal shader files in vx-vision/shaders/:

| File | Kernel Function(s) | Category |
| --- | --- | --- |
| FastDetect.metal | fast_detect | Feature detection |
| HarrisResponse.metal | harris_response | Feature detection |
| NMS.metal | nms_suppress | Feature detection |
| ORBDescriptor.metal | orb_compute | Feature description |
| BruteMatcher.metal | hamming_distance, extract_matches | Feature matching |
| StereoMatch.metal | stereo_match | Stereo |
| KLTTracker.metal | klt_track_forward | Optical flow |
| DenseFlow.metal | dense_flow | Optical flow |
| GaussianBlur.metal | gaussian_blur_h, gaussian_blur_v | Filtering |
| Sobel.metal | sobel_3x3, gradient_magnitude | Edge detection |
| Canny.metal | canny_hysteresis | Edge detection |
| Threshold.metal | threshold_binary, threshold_adaptive, threshold_otsu | Segmentation |
| Morphology.metal | morph_erode, morph_dilate | Morphology |
| BilateralFilter.metal | bilateral_filter | Filtering |
| ColorConvert.metal | rgba_to_gray, gray_to_rgba, rgba_to_hsv, hsv_to_rgba | Color |
| Pyramid.metal | pyramid_downsample | Scale space |
| Resize.metal | bilinear_resize | Geometry |
| Warp.metal | warp_affine, warp_perspective | Geometry |
| IntegralImage.metal | integral_* | Analysis |
| Histogram.metal | histogram_compute | Analysis |
| HoughLines.metal | hough_vote, hough_peaks | Line detection |
| TemplateMatch.metal | template_match_ncc | Template matching |
| ConnectedComponents.metal | ccl_* | Labeling |
| DistanceTransform.metal | jfa_seed, jfa_step, jfa_distance | Distance |
| Homography.metal | score_homography | Geometry |
| IndirectArgs.metal | prepare_indirect_args | Utility |
| DoG.metal | dog_subtract, dog_extrema | Scale space |
| undistort.metal | undistort | Calibration |
| SGMStereo.metal | sgm_* | Stereo |
| DepthFilter.metal | depth_bilateral, depth_median | Depth |
| DepthInpaint.metal | depth_inpaint_* | Depth |
| DepthColorize.metal | depth_colorize | Visualization |
| DepthToCloud.metal | depth_to_cloud | 3D reconstruction |
| NormalEstimation.metal | normal_estimation_* | 3D reconstruction |
| OutlierFilter.metal | outlier_filter_* | 3D reconstruction |
| VoxelDownsample.metal | voxel_downsample_* | 3D reconstruction |
| TSDFIntegrate.metal | tsdf_integrate | Volumetric |
| TSDFRaycast.metal | tsdf_raycast | Volumetric |
| MarchingCubes.metal | marching_cubes_* | Mesh extraction |
| Triangulate.metal | triangulate_* | Geometry |
| PointCloudRender.metal | point_cloud_* | Visualization |
| MeshRender.metal | mesh_* | Visualization |