Design Document
Internal engineering reference. Assumes familiarity with Rust and GPU compute concepts.
1. System Overview
Crate Topology
Two workspace crates. vx-vision imports vx-core as vx_gpu. Both crates depend on objc2-metal for Metal bindings and bytemuck for safe #[repr(C)] transmutes. No other runtime dependencies — no wgpu, no ash, no cross-platform abstraction. Metal-only by design: this eliminates translation layers and enables direct access to Apple Silicon UMA semantics that cross-platform APIs abstract away.
2. Metal–Rust Bridge
objc2-metal
Metal is an Objective-C framework. Its API surface is protocol-based — MTLDevice, MTLCommandBuffer, etc. are protocols, not concrete classes. The runtime returns opaque objects conforming to these protocols.
objc2-metal generates Rust trait definitions from Metal’s protocol headers. Each protocol becomes a Rust trait. Since the concrete type behind the protocol is unknown at compile time, the binding uses type-erased wrappers:
```rust
Retained<ProtocolObject<dyn MTLComputePipelineState>>
```
Three layers here:
- `dyn MTLComputePipelineState` — the Metal protocol, expressed as a Rust trait object
- `ProtocolObject<_>` — type-erasure wrapper for ObjC protocol conformance (analogous to `dyn Trait`, but bridging ObjC's protocol dispatch rather than Rust's vtable dispatch)
- `Retained<_>` — ARC-compatible smart pointer. Holds a +1 reference count. Sends `release` on drop. This is the bridge between ObjC's reference counting and Rust's ownership semantics.
Retained means Rust owns the object. When a kernel struct holds Retained<ProtocolObject<dyn MTLComputePipelineState>>, the pipeline state lives exactly as long as the kernel struct. No manual retain/release, no leak risk, no double-free — Rust’s drop semantics handle it.
Metal Object Roles
| Object | Created by | Role | Cost | Reuse |
|---|---|---|---|---|
MTLDevice | MTLCreateSystemDefaultDevice() | GPU handle, factory for everything | One-time | Always |
MTLLibrary | device.newLibraryWithURL_error() | Container of compiled shader functions | One-time (loads metallib) | Always |
MTLComputePipelineState | device.newComputePipelineStateWithFunction_error() | Compiled, optimized kernel ready for dispatch | Expensive (shader compilation, register allocation, occupancy calculation) | Always |
MTLCommandQueue | device.newCommandQueue() | Serial scheduler that submits command buffers to GPU | One-time | Always |
MTLCommandBuffer | queue.commandBuffer() | Single batch of GPU work | Cheap (pool-allocated internally by Metal) | Never (one-shot) |
MTLComputeCommandEncoder | cmd_buf.computeCommandEncoder() | Records bind + dispatch commands into a command buffer | Cheap | Never (one-shot) |
The key insight: objects above the line (Device through Queue) are created once and stored for app lifetime. Objects below the line (CommandBuffer, Encoder) are created per-dispatch and discarded after use. This is why kernel structs store pipelines but create command buffers on every call.
Send + Sync
ObjC objects don’t carry Send/Sync in Rust’s type system. Every kernel struct, Context, and Texture needs explicit unsafe impls:
```rust
unsafe impl Send for FastDetector {}
unsafe impl Sync for FastDetector {}
```
Justification: Metal pipeline states are immutable after creation — Apple documents them as thread-safe. MTLDevice and MTLCommandQueue are thread-safe. MTLCommandBuffer is not thread-safe, but it’s never stored in a struct — it’s created and consumed within a single method call.
Without these impls, kernel structs couldn’t be stored in Arc, passed across threads, or used in async contexts.
GPU Command Lifecycle
The Rust code mapping each step:
```rust
// ── App lifetime (Context::new) ──────────────────────────────────
let device = MTLCreateSystemDefaultDevice();            // GPU handle
let queue = device.newCommandQueue();                   // submission queue
let library = device.newLibraryWithURL_error(&url);     // compiled shaders

// ── App lifetime (Kernel::new) ───────────────────────────────────
let func = library.newFunctionWithName(ns_string!("fast_detect"));      // function handle
let pipeline = device.newComputePipelineStateWithFunction_error(&func); // compiled kernel
// ↑ This is the expensive step: Metal compiles the function to GPU ISA,
//   determines register usage, calculates max occupancy. ~1-10ms per kernel.

// ── Per dispatch (Kernel::detect / apply / compute) ──────────────
let cmd_buf = queue.commandBuffer();                     // lightweight, pool-allocated
let encoder = cmd_buf.computeCommandEncoder();           // begins a compute pass
encoder.setComputePipelineState(&pipeline);              // which kernel to run
encoder.setTexture_atIndex(Some(tex.raw()), 0);          // [[texture(0)]] in shader
encoder.setBuffer_offset_atIndex(                        // [[buffer(0)]] in shader
    Some(buf.metal_buffer()), 0, 0);
encoder.setBytes_length_atIndex(                         // [[buffer(2)]] — inline constant data
    NonNull::new_unchecked(params_ptr), size, 2);        // memcpy'd into argument buffer
encoder.dispatchThreads_threadsPerThreadgroup(grid, tg); // launch the grid
encoder.endEncoding();                                   // seal this compute pass
cmd_buf.commit();                                        // submit to GPU — non-blocking, returns immediately
cmd_buf.waitUntilCompleted();                            // block CPU until GPU finishes this command buffer
// After this point: all buffer/texture writes by the GPU are visible to CPU.
```
setBytes copies the struct directly into Metal’s argument buffer (max 4KB), avoiding a MTLBuffer allocation for small constant data. Every kernel uses this for its Params struct. For large data (keypoint arrays, descriptors), setBuffer is required — it binds an existing MTLBuffer by reference.
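The 4 KB ceiling is easy to check on the CPU side. A minimal sketch (the `FastParams` struct and `fits_in_set_bytes` helper are hypothetical, not the crate's actual definitions) that asserts a params struct is small enough for `setBytes` before falling back to `setBuffer`:

```rust
use std::mem::size_of;

// Hypothetical mirror of a params struct; illustrative field set only.
#[repr(C)]
#[derive(Clone, Copy)]
struct FastParams {
    threshold: u32,
    max_corners: u32,
    width: u32,
    height: u32,
}

// setBytes copies at most 4 KB into the argument buffer; anything
// larger must live in a real MTLBuffer and be bound via setBuffer.
const SET_BYTES_LIMIT: usize = 4096;

fn fits_in_set_bytes<T>() -> bool {
    size_of::<T>() <= SET_BYTES_LIMIT
}

fn main() {
    assert!(fits_in_set_bytes::<FastParams>());  // 16 bytes — fine for setBytes
    assert!(!fits_in_set_bytes::<[u8; 8192]>()); // too large — needs setBuffer
}
```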
3. Memory Model
Apple Silicon UMA
Apple Silicon uses Unified Memory Architecture — CPU and GPU share the same physical DRAM. No PCIe bus. No DMA copies. A pointer to shared memory is valid on both processors.
MTLStorageModeShared exposes this: one allocation, one virtual address space, zero-copy access from both CPU and GPU. This is the only storage mode vx-rs uses.
Trade-off: On discrete GPUs (NVIDIA/AMD), StorageModeShared would require PCIe transfers on every access. On Apple Silicon, it’s free — both CPU and GPU access the same cache hierarchy. This is a deliberate platform bet: zero-copy UMA is the core performance advantage that makes the “no cross-platform” design worthwhile.
Synchronization model: After cmd_buf.waitUntilCompleted(), all GPU writes are visible to CPU. No explicit cache flush needed — Metal’s command buffer completion acts as a full memory fence. Before that fence, CPU reads may see stale data (writes can be in GPU caches or reorder buffers).
UnifiedBuffer<T>
vx-core/src/buffer.rs
```rust
pub struct UnifiedBuffer<T: Pod> {
    raw: Retained<ProtocolObject<dyn MTLBuffer>>, // the Metal buffer (+1 refcount)
    count: usize,                                 // number of T elements
    in_flight: Arc<AtomicBool>,                   // prevents CPU mutation during GPU work
    _marker: PhantomData<T>,                      // carries T without owning one
}
```
The Pod bound (bytemuck::Pod) guarantees the type has no padding traps, no invalid bit patterns, and is safe to interpret from arbitrary bytes. Metal writes raw bytes into the buffer — Pod ensures the CPU can safely interpret them as T after GPU completion. Every #[repr(C)] struct in types.rs derives Pod.
Allocation: newBufferWithLength_options(count * size_of::<T>(), StorageModeShared). Metal zero-initializes shared buffers on allocation. This is relied upon — atomic counter buffers start at zero by default (though kernels explicitly write [0u32] before dispatch for clarity).
CPU access paths:
- `as_slice() -> &[T]` — casts the `MTLBuffer.contents()` pointer to `&[T]`. No copy; the slice directly aliases GPU-visible memory.
- `as_mut_slice() -> &mut [T]` — same, but asserts `!in_flight`. Panics if a `GpuGuard` exists. This prevents writing to memory the GPU is reading from.
- `write(&[T])` — `copy_from_slice` into the buffer. Panics if in-flight.
- `to_vec() -> Vec<T>` — copies out. Use after GPU completion for safe owned data.
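The zero-copy shape of `as_slice()` can be sketched without Metal. In this CPU-only stand-in (the `FakeUnifiedBuffer` type is hypothetical), a `Vec<u32>` plays the role of the `MTLStorageModeShared` allocation and the slice is built from the raw pointer, exactly as the real method does with `contents()`:

```rust
use std::slice;

// Stand-in for UnifiedBuffer: the Vec is the "shared" allocation.
struct FakeUnifiedBuffer {
    storage: Vec<u32>,
}

impl FakeUnifiedBuffer {
    // Mirrors as_slice(): cast the contents() pointer to &[T], no copy.
    fn as_slice(&self) -> &[u32] {
        // In the real crate this pointer comes from MTLBuffer.contents().
        unsafe { slice::from_raw_parts(self.storage.as_ptr(), self.storage.len()) }
    }
}

fn main() {
    let buf = FakeUnifiedBuffer { storage: vec![7, 8, 9] };
    let view = buf.as_slice();
    // Same address: the slice aliases the storage rather than copying it.
    assert_eq!(view.as_ptr(), buf.storage.as_ptr());
    assert_eq!(view, &[7u32, 8, 9][..]);
}
```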
GpuGuard<T>
vx-core/src/buffer.rs
RAII guard that blocks CPU mutation while the buffer is in-flight on the GPU.
```rust
pub struct GpuGuard<T: Pod> {
    in_flight: Arc<AtomicBool>, // shared with the UnifiedBuffer
    _marker: PhantomData<T>,
}
// Created: buf.gpu_guard() → sets in_flight = true (Release)
// Dropped: sets in_flight = false (Release)
// Checked: as_mut_slice() loads in_flight (Acquire) → panics if true
```
Atomic ordering:
- `store(true, Release)` on guard creation — ensures all CPU writes to the buffer before creating the guard are visible before the flag becomes true. The GPU sees consistent data.
- `load(Acquire)` in `as_mut_slice()` — ensures the CPU doesn't reorder reads after the check to before it. If the flag is false, the GPU is done and all its writes are visible.
- `store(false, Release)` on drop — pairs with the Acquire in `as_mut_slice()` to establish happens-before.
This is not a mutex — it’s a one-way gate. No contention, no blocking, just a panic if the contract is violated. The real synchronization is waitUntilCompleted() on the command buffer.
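The flag protocol runs entirely on the CPU, so it can be exercised without a GPU. A minimal sketch (the `Guard` and `try_mutate` names are hypothetical; the real guard panics where this returns `Err`):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Minimal stand-in for GpuGuard: same flag protocol, no Metal involved.
struct Guard {
    in_flight: Arc<AtomicBool>,
}

impl Guard {
    fn new(flag: Arc<AtomicBool>) -> Self {
        flag.store(true, Ordering::Release); // buffer is now "on the GPU"
        Guard { in_flight: flag }
    }
}

impl Drop for Guard {
    fn drop(&mut self) {
        self.in_flight.store(false, Ordering::Release); // GPU work done
    }
}

fn try_mutate(flag: &AtomicBool) -> Result<(), &'static str> {
    // as_mut_slice()'s check: Acquire pairs with the guard's Release stores.
    if flag.load(Ordering::Acquire) {
        return Err("buffer is in-flight on the GPU");
    }
    Ok(())
}

fn main() {
    let flag = Arc::new(AtomicBool::new(false));
    assert!(try_mutate(&flag).is_ok()); // no guard → mutation allowed
    let guard = Guard::new(Arc::clone(&flag));
    assert!(try_mutate(&flag).is_err()); // in-flight → contract violation
    drop(guard); // real code drops only after waitUntilCompleted()
    assert!(try_mutate(&flag).is_ok());
}
```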
Memory Lifecycle
Invariant: Guard created before commit. Dropped after waitUntilCompleted. Reversing this allows CPU reads of partially-written GPU data — silent corruption, not a crash.
4. Texture Subsystem
vx-vision/src/texture.rs
```rust
pub struct Texture {
    raw: Retained<ProtocolObject<dyn MTLTextureTrait>>,
    width: u32,
    height: u32,
    format: TextureFormat,
}
```
Textures are used over buffers for image data because Metal’s texture hardware provides spatial locality optimization (tiled/swizzled memory layout), hardware bilinear interpolation, and automatic [0,255] → [0.0,1.0] normalization on R8Unorm reads. A buffer has linear memory layout — worse cache behavior for 2D access patterns like convolution kernels.
Formats
| Format | MTLPixelFormat | Bytes/px | Readback | Shader behavior |
|---|---|---|---|---|
R8Unorm | R8Unorm | 1 | read_gray8() → Vec<u8> | image.read(gid).r returns [0.0, 1.0] |
R32Float | R32Float | 4 | read_r32float() → Vec<f32> | image.read(gid).r returns raw float |
RGBA8Unorm | RGBA8Unorm | 4 | read_rgba8() → Vec<u8> | .rgba returns 4 normalized channels |
R8Unorm normalization matters: FAST multiplies by 255 to get integer intensity for threshold comparison. Sobel operates directly in [0,1] space. Every shader must account for the format it reads.
Usage Flags
| Role | Flag | Why |
|---|---|---|
| Input | ShaderRead | GPU reads, CPU uploads via replaceRegion. Cheapest — Metal can optimize read-only layout. |
| Output | ShaderWrite | GPU writes. Cannot be sampled in the same pass. |
| Intermediate | ShaderRead | ShaderWrite | Read in one pass, written in another. Required for multi-pass kernels (Gaussian H→V, Sobel→magnitude). |
Writing to a ShaderRead-only texture is undefined behavior — no GPU error, just corrupted output. Metal does not validate usage flags at dispatch time.
Upload & Readback
Both use MTLTexture’s replaceRegion (upload) and getBytes (readback). These are synchronous CPU-side memcpy operations. bytesPerRow must match: width * 1 for R8, width * 4 for R32Float and RGBA8.
Performance note: These operations copy data. For input textures, the copy happens once at creation. For output readback, it happens after GPU completion. In a real-time pipeline, prefer keeping data in textures across kernel stages rather than reading back to CPU between stages — use Pipeline for this.
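The `bytesPerRow` rule from the format table reduces to a small lookup. A sketch (enum and function names are illustrative, not the crate's API):

```rust
// Bytes-per-row for replaceRegion (upload) and getBytes (readback),
// following the format table above.
#[derive(Clone, Copy, PartialEq)]
enum TextureFormat {
    R8Unorm,
    R32Float,
    Rgba8Unorm,
}

fn bytes_per_pixel(format: TextureFormat) -> u32 {
    match format {
        TextureFormat::R8Unorm => 1,                            // one u8 channel
        TextureFormat::R32Float | TextureFormat::Rgba8Unorm => 4, // f32 or 4×u8
    }
}

fn bytes_per_row(format: TextureFormat, width: u32) -> u32 {
    width * bytes_per_pixel(format)
}

fn main() {
    assert_eq!(bytes_per_row(TextureFormat::R8Unorm, 640), 640);
    assert_eq!(bytes_per_row(TextureFormat::R32Float, 640), 2560);
    assert_eq!(bytes_per_row(TextureFormat::Rgba8Unorm, 640), 2560);
}
```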
TexturePool
vx-vision/src/pool.rs
```rust
pub struct TexturePool {
    buckets: HashMap<(u32, u32, TextureFormat), Vec<Texture>>,
    max_per_bucket: usize, // default: 8
}
```
MTLTexture allocation is non-trivial — Metal must find contiguous GPU memory, configure tiling parameters, and set up page tables. In a real-time pipeline processing 30fps video, allocating and freeing intermediate textures per frame wastes time.
The pool caches by (width, height, format). acquire() pops from the matching bucket or allocates fresh. release() pushes back. Pool textures always have ShaderRead | ShaderWrite — they may serve as either input or output in different pipeline stages.
hit_rate() tracks effectiveness. If it’s low, the pipeline is using too many distinct texture sizes.
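The bucketing and hit-rate accounting can be modeled without Metal. In this sketch a `String` stands in for the texture object, and the `Pool` type is a hypothetical reduction of `TexturePool`, not its actual implementation:

```rust
use std::collections::HashMap;

type FakeTexture = String;               // stand-in for a Metal texture
type Key = (u32, u32, &'static str);     // (width, height, format)

struct Pool {
    buckets: HashMap<Key, Vec<FakeTexture>>,
    max_per_bucket: usize,
    hits: u64,
    misses: u64,
}

impl Pool {
    fn new() -> Self {
        Pool { buckets: HashMap::new(), max_per_bucket: 8, hits: 0, misses: 0 }
    }

    // Pop from the matching bucket, or "allocate" fresh on a miss.
    fn acquire(&mut self, key: Key) -> FakeTexture {
        match self.buckets.get_mut(&key).and_then(|b| b.pop()) {
            Some(tex) => { self.hits += 1; tex }
            None => { self.misses += 1; format!("tex {}x{} {}", key.0, key.1, key.2) }
        }
    }

    // Push back for reuse, capped per bucket; overflow is simply dropped.
    fn release(&mut self, key: Key, tex: FakeTexture) {
        let bucket = self.buckets.entry(key).or_default();
        if bucket.len() < self.max_per_bucket {
            bucket.push(tex);
        }
    }

    fn hit_rate(&self) -> f64 {
        self.hits as f64 / (self.hits + self.misses).max(1) as f64
    }
}

fn main() {
    let mut pool = Pool::new();
    let key = (1920, 1080, "R8Unorm");
    let t = pool.acquire(key);  // miss: fresh allocation
    pool.release(key, t);
    let _t = pool.acquire(key); // hit: reused from the bucket
    assert_eq!(pool.hit_rate(), 0.5);
}
```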
5. Context & Pipeline
Context
vx-vision/src/context.rs
```rust
pub struct Context {
    device: Retained<ProtocolObject<dyn MTLDevice>>,
    queue: Retained<ProtocolObject<dyn MTLCommandQueue>>,
    library: Retained<ProtocolObject<dyn MTLLibrary>>,
}
```
Single entry point. Context::new() does three things:
- `MTLCreateSystemDefaultDevice()` — gets the system GPU
- `device.newCommandQueue()` — creates the submission queue
- Loads the embedded metallib (see Build System)
device(), queue(), library() are pub(crate) — kernel code uses them internally, but downstream consumers only interact through Context and kernel structs.
One queue per Context. Metal command queues are serial — command buffers submitted to the same queue execute in order. This simplifies synchronization: no explicit fences between dependent dispatches. For parallel kernel execution, create multiple Context instances (each with its own queue).
Pipeline
vx-vision/src/pipeline.rs
```rust
pub struct Pipeline {
    cmd_buf: Retained<ProtocolObject<dyn MTLCommandBuffer>>,
    retained: Vec<Texture>,
    committed: bool,
}
```
Batches multiple kernel dispatches into a single MTLCommandBuffer. This matters because command buffer creation has overhead — Metal must acquire a buffer from its internal pool, set up completion handlers, and track resources.
Texture retention problem: When a kernel’s encode() creates an intermediate texture, that texture would be dropped when encode() returns. But the GPU hasn’t run yet — the command buffer only records commands, it doesn’t execute them until commit(). If the texture is freed, the GPU reads garbage.
retain(tex) moves the texture into the pipeline’s Vec<Texture>, keeping it alive until commit_and_wait() returns.
Pipeline Batching Flow
6. Kernel Taxonomy
Every kernel struct follows the same construction pattern:
- `new(&Context)` — compile pipeline(s) from the metallib. Store as `Retained<...>`.
- Sync method (`detect` / `apply` / `compute`) — allocate per-dispatch resources, encode, commit, wait, readback.
- `encode(...)` — record commands without committing. For `Pipeline` batching.
- `unsafe impl Send + Sync` — pipeline states are immutable, thread-safe.
The difference between kernel categories is what goes in and what comes out.
6.1 Buffer-Output Kernels
Variable-length structured output. The shader decides at runtime how many results to emit.
Mechanism: A device atomic_uint* counter buffer starts at zero. Each thread that produces a result atomically increments the counter and writes to that slot:
```metal
uint slot = atomic_fetch_add_explicit(count, 1, memory_order_relaxed);
if (slot < params.max_corners) {
    corners[slot] = result;
}
```
memory_order_relaxed is sufficient — we only need the atomic increment to be unique, not ordered relative to other atomics. The slot assignment is the synchronization.
An alternative is prefix scan (parallel exclusive scan) for deterministic output ordering, but that requires two passes and shared memory coordination. Atomic append is one pass, simpler, and sufficient — output ordering isn’t needed since downstream kernels (Harris, NMS) operate on the full set regardless of order.
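The append contract has a faithful CPU analogue: `fetch_add` hands each producer a unique slot, and readback clamps the counter to capacity. A sketch using OS threads in place of GPU threads (the `atomic_append` helper is hypothetical):

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// CPU analogue of the shader's atomic append: each "GPU thread" claims a
// unique slot with fetch_add and writes its result only if the slot fits.
fn atomic_append(n_producers: u32, capacity: usize) -> (usize, Vec<u32>) {
    let count = Arc::new(AtomicU32::new(0)); // the zeroed counter buffer
    let slots: Arc<Vec<AtomicU32>> =
        Arc::new((0..capacity).map(|_| AtomicU32::new(0)).collect());

    let handles: Vec<_> = (0..n_producers)
        .map(|id| {
            let (count, slots) = (Arc::clone(&count), Arc::clone(&slots));
            thread::spawn(move || {
                // Relaxed is enough: only uniqueness of the slot matters.
                let slot = count.fetch_add(1, Ordering::Relaxed) as usize;
                if slot < slots.len() {
                    slots[slot].store(id + 1, Ordering::Relaxed); // "corner" payload
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    // The counter may exceed capacity; readback clamps, as detect() does.
    let n = (count.load(Ordering::Relaxed) as usize).min(capacity);
    let written = slots.iter().map(|s| s.load(Ordering::Relaxed)).collect();
    (n, written)
}

fn main() {
    let (n, slots) = atomic_append(16, 8);
    assert_eq!(n, 8);                       // 16 producers, capacity 8 → clamped
    assert!(slots.iter().all(|&s| s != 0)); // every slot was written exactly once
}
```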
Annotated walkthrough: FastDetector::detect() (vx-vision/src/kernels/fast.rs)
```rust
pub fn detect(&self, ctx: &Context, texture: &Texture, config: &FastDetectConfig)
    -> Result<FastDetectResult>
{
    let w = texture.width();
    let h = texture.height();

    // Pre-allocate max capacity. The GPU fills [0..actual_count].
    // Over-allocation is cheap on UMA — no copy, just page table entries.
    let corner_buf = UnifiedBuffer::<CornerPoint>::new(ctx.device(), config.max_corners as usize)?;
    let mut count_buf = UnifiedBuffer::<u32>::new(ctx.device(), 1)?;
    count_buf.write(&[0u32]); // Zero the atomic counter. Forgetting this → stale count from
                              // previous dispatch or random memory, causing buffer overread.

    // Build the params struct. Field order must exactly match FASTParams in FastDetect.metal.
    let params = FASTParams { threshold: config.threshold, max_corners: config.max_corners,
                              width: w, height: h };

    // Guards BEFORE dispatch — marks buffers as in-flight.
    let _corner_guard = corner_buf.gpu_guard();
    let _count_guard = count_buf.gpu_guard();

    // Create command buffer and encoder.
    let cmd_buf = ctx.queue().commandBuffer()
        .ok_or(Error::Gpu("failed to create command buffer".into()))?;
    let encoder = cmd_buf.computeCommandEncoder()
        .ok_or(Error::Gpu("failed to create compute encoder".into()))?;

    // Bind pipeline, textures, buffers, params.
    // Index arguments (0, 1, 2) must match [[texture(0)]], [[buffer(0)]], [[buffer(1)]], [[buffer(2)]]
    // in the .metal file. A mismatch binds the wrong data — silent corruption, not a crash.
    encoder.setComputePipelineState(&self.pipeline);
    encoder.setTexture_atIndex(Some(texture.raw()), 0);
    encoder.setBuffer_offset_atIndex(Some(corner_buf.metal_buffer()), 0, 0);
    encoder.setBuffer_offset_atIndex(Some(count_buf.metal_buffer()), 0, 1);
    unsafe {
        encoder.setBytes_length_atIndex(
            NonNull::new_unchecked(&params as *const _ as *mut c_void),
            mem::size_of::<FASTParams>(), 2);
    }

    // 2D dispatch: one thread per pixel. threadExecutionWidth is the SIMD width (32 on Apple Silicon).
    // maxTotalThreadsPerThreadgroup is typically 1024. So threadgroup = 32x32 = 1024 threads.
    // Edge threadgroups are partial — the shader's bounds check handles out-of-range gids.
    let tew = self.pipeline.threadExecutionWidth();
    let max_tg = self.pipeline.maxTotalThreadsPerThreadgroup();
    let grid = MTLSize { width: w as usize, height: h as usize, depth: 1 };
    let tg = MTLSize { width: tew, height: (max_tg / tew).max(1), depth: 1 };
    encoder.dispatchThreads_threadsPerThreadgroup(grid, tg);
    encoder.endEncoding();

    cmd_buf.commit();
    cmd_buf.waitUntilCompleted(); // blocks until GPU finishes

    // Drop guards → unlocks buffers for CPU access.
    drop(_corner_guard);
    drop(_count_guard);

    // Read actual count, clamp to max (shader may have over-incremented the atomic
    // if more corners found than buffer capacity), slice the buffer.
    let n = (count_buf.as_slice()[0] as usize).min(config.max_corners as usize);
    Ok(FastDetectResult { corners: corner_buf.as_slice()[..n].to_vec() })
}
```
Buffer-output kernel catalog:
| Kernel | Struct | Shader | Dispatch | Output type |
|---|---|---|---|---|
| FAST-9 | FastDetector | fast_detect | 2D per-pixel | Vec<CornerPoint> |
| Harris | HarrisScorer | harris_response | 1D per-corner | Vec<CornerPoint> (scored) |
| NMS | NmsSuppressor | nms_suppress | 1D per-corner | Vec<CornerPoint> (filtered) |
| ORB | OrbDescriptor | orb_compute | 1D per-keypoint | Vec<ORBOutput> |
| Matcher | BruteMatcher | hamming_distance + extract_matches | 2D + 1D | Vec<MatchResult> |
| StereoMatch | StereoMatcher | stereo_match | 2D | Vec<StereoMatchResult> |
| Histogram | HistogramComputer | histogram_compute | 2D per-pixel | Vec<u32> (256 bins) |
| Hough | HoughDetector | hough_vote + hough_peaks | 2D + 1D | Vec<HoughLine> |
| Homography | HomographyScorer | score_homography | 1D per-point | Vec<ScoreResult> |
6.2 Texture-to-Texture Kernels
Fixed-size output. One output pixel per input pixel (or per output pixel for resize/warp). No atomic counters, no variable-length output.
Annotated walkthrough: GaussianBlur::apply() (vx-vision/src/kernels/gaussian.rs)
Separable 2-pass convolution. A 2D Gaussian with radius r requires (2r+1)^2 reads per pixel. Separating into H+V passes reduces to 2*(2r+1) reads — O(r) instead of O(r²).
```rust
pub fn apply(&self, ctx: &Context, input: &Texture, output: &Texture,
             config: &GaussianConfig) -> Result<()>
{
    let w = input.width();
    let h = input.height();

    // Intermediate texture for horizontal pass output / vertical pass input.
    // Must be ShaderRead|ShaderWrite — written by pass 1, read by pass 2.
    // R32Float avoids precision loss during intermediate accumulation.
    let intermediate = Texture::intermediate_r32float(ctx.device(), w, h)?;
    let params = GaussianParams { width: w, height: h, sigma: config.sigma, radius: config.radius };

    let cmd_buf = ctx.queue().commandBuffer()
        .ok_or(Error::Gpu("failed to create command buffer".into()))?;

    // Pass 1: horizontal blur. Each thread reads (2*radius+1) horizontal neighbors.
    // Clamp-to-edge boundary: shader does clamp(x + dx, 0, width-1).
    // Weights computed inline: exp(-dx²/(2σ²)). Not precomputed — ALU is faster than
    // the memory load that a weight LUT would require at these kernel sizes.
    {
        let enc = cmd_buf.computeCommandEncoder()
            .ok_or(Error::Gpu("failed to create compute encoder".into()))?;
        enc.setComputePipelineState(&self.h_pipeline);
        enc.setTexture_atIndex(Some(input.raw()), 0);        // source
        enc.setTexture_atIndex(Some(intermediate.raw()), 1); // dest
        enc.setBytes_length_atIndex(/* params */, /* size */, 0);
        enc.dispatchThreads_threadsPerThreadgroup(grid_2d(w, h), tg_2d(&self.h_pipeline));
        enc.endEncoding(); // seal pass 1
    }

    // Pass 2: vertical blur on the horizontally-blurred intermediate.
    // Metal guarantees pass 2 sees pass 1's writes — encoders within a command buffer
    // execute in order with implicit barriers between them.
    {
        let enc = cmd_buf.computeCommandEncoder()
            .ok_or(Error::Gpu("failed to create compute encoder".into()))?;
        enc.setComputePipelineState(&self.v_pipeline);
        enc.setTexture_atIndex(Some(intermediate.raw()), 0); // source (pass 1 output)
        enc.setTexture_atIndex(Some(output.raw()), 1);       // final dest
        enc.setBytes_length_atIndex(/* params */, /* size */, 0);
        enc.dispatchThreads_threadsPerThreadgroup(grid_2d(w, h), tg_2d(&self.v_pipeline));
        enc.endEncoding();
    }

    cmd_buf.commit();
    cmd_buf.waitUntilCompleted();
    // intermediate texture dropped here — its only purpose was the H→V handoff.
    Ok(())
}
```
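The two-pass result is exact, not an approximation: a normalized 2D Gaussian kernel is the outer product of two normalized 1D kernels, so blurring rows then columns reads fewer texels but computes the same weights. A CPU sketch verifying the identity (the `gaussian_1d` and `separability_error` helpers are illustrative, not the shader's code):

```rust
// 1D Gaussian weights exp(-dx²/(2σ²)), normalized to sum to 1 —
// the same weights the H and V passes compute inline.
fn gaussian_1d(sigma: f32, radius: i32) -> Vec<f32> {
    let mut w: Vec<f32> = (-radius..=radius)
        .map(|dx| (-((dx * dx) as f32) / (2.0 * sigma * sigma)).exp())
        .collect();
    let sum: f32 = w.iter().sum();
    for x in w.iter_mut() {
        *x /= sum;
    }
    w
}

// Max difference between the direct 2D kernel and the 1D outer product.
fn separability_error(sigma: f32, radius: i32) -> f32 {
    let k1 = gaussian_1d(sigma, radius);
    let n = k1.len();
    let mut k2: Vec<f32> = Vec::with_capacity(n * n);
    for dy in -radius..=radius {
        for dx in -radius..=radius {
            k2.push((-((dx * dx + dy * dy) as f32) / (2.0 * sigma * sigma)).exp());
        }
    }
    let sum: f32 = k2.iter().sum();
    let mut max_err = 0.0f32;
    for i in 0..n {
        for j in 0..n {
            max_err = max_err.max((k2[i * n + j] / sum - k1[i] * k1[j]).abs());
        }
    }
    max_err
}

fn main() {
    // Exact up to float rounding — the H then V passes lose nothing.
    assert!(separability_error(2.0, 4) < 1e-6);

    // Read-count arithmetic from the text: O(r²) vs O(r) per pixel.
    let r = 4;
    assert_eq!((2 * r + 1) * (2 * r + 1), 81); // single-pass 2D reads
    assert_eq!(2 * (2 * r + 1), 18);           // two separable passes
}
```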
Multi-pass flow:
Texture-to-texture kernel catalog:
| Kernel | Struct | Shader(s) | Passes | Notes |
|---|---|---|---|---|
| Gaussian | GaussianBlur | gaussian_blur_h, gaussian_blur_v | 2 | Separable, O(r) vs O(r²) |
| Sobel | SobelFilter | sobel_3x3, gradient_magnitude | 2 | Outputs: gx, gy, magnitude, direction |
| Canny | CannyDetector | Sobel + canny_hysteresis | 3 | Composes Sobel internally |
| Threshold | ThresholdFilter | threshold_binary / adaptive / otsu | 1–2 | Otsu needs histogram first |
| Color | ColorConverter | rgba_to_gray, gray_to_rgba, rgba_to_hsv, hsv_to_rgba | 1 | Per-pixel, no neighbors |
| Morphology | MorphFilter | morph_erode, morph_dilate | 1–2 | Open = erode+dilate |
| Pyramid | ImagePyramidBuilder | pyramid_downsample | N | 4 levels = 3 downsamples |
| Resize | ResizeFilter | bilinear_resize | 1 | Grid = output dimensions |
| Warp | WarpFilter | warp_affine, warp_perspective | 1 | Inverse transform per pixel |
| Bilateral | BilateralFilter | bilateral_filter | 1 | O(r²) — not separable |
| Dense Flow | DenseFlowComputer | dense_flow | 1 | Horn-Schunck |
| Connected | ConnectedComponents | ccl_* | iterative | Label propagation until convergence |
| Distance | DistanceTransform | jfa_seed, jfa_step, jfa_distance | 2+N | Jump Flooding, O(log n) passes |
| Template | TemplateMatcher | template_match_ncc | 1 | NCC score map |
| Integral | IntegralImage | integral_* | multi | Prefix sum (row then column) |
6.3 Hybrid Kernels
Consume textures, produce buffer output — or both.
KLT Optical Flow (vx-vision/src/kernels/klt.rs): Binds 8 textures (4-level pyramid × 2 frames) at indices 0–7, plus 3 buffers (prev_points, curr_points, status). Dispatch is 1D: one thread per point. Each thread does iterative Lucas-Kanade at each pyramid level (coarse-to-fine), reading texture neighborhoods and writing final position + tracking status to buffers.
DoG (Difference-of-Gaussians): Two-phase. First phase subtracts adjacent scale-space textures (texture→texture). Second phase finds 3D extrema across scale and space, appending DoGKeypoint to a buffer via atomic counter (texture→buffer).
6.4 Utility Kernels
IndirectDispatch (vx-vision/src/kernels/indirect.rs): Solves the “CPU round-trip” problem in FAST→Harris chaining.
FAST produces N corners (unknown until GPU completes). Without indirect dispatch: commit FAST → wait → read count → dispatch Harris with count. Two command buffers, one CPU stall.
With indirect dispatch: prepare_indirect_args reads the atomic counter and computes MTLDispatchThreadgroupsIndirectArguments on the GPU. Harris then dispatches using dispatchThreadgroupsWithIndirectBuffer — the GPU reads the thread count from the args buffer directly. Everything stays in one command buffer, zero CPU round-trips.
Implementation note: objc2-metal doesn’t bind dispatchThreadgroupsWithIndirectBuffer. The code uses raw msg_send! to call it.
6.5 3D Reconstruction Kernels
Feature-gated: #[cfg(feature = "reconstruction")].
| Category | Kernels | Pattern |
|---|---|---|
| Depth processing | DepthFilter (bilateral + median), DepthInpaint (JFA hole-fill), DepthColorize | texture→texture |
| Stereo | SGMStereo (Semi-Global Matching) | texture→texture (disparity map) |
| Point cloud | DepthToCloud (unprojection), OutlierFilter, VoxelDownsample, NormalEstimation | texture/buffer→buffer |
| Volumetric | TSDFIntegrate, TSDFRaycast, MarchingCubes | buffer→buffer (3D voxel grids) |
| Geometry | Triangulate | buffer→buffer |
Same patterns as core kernels. TSDF uses 3D buffer indexing (res_x * res_y * res_z voxels). MarchingCubes outputs triangle meshes to buffers with atomic vertex/index counters.
7. #[repr(C)] Contract
vx-vision/src/types.rs
Every params struct is passed to the GPU via setBytes_length_atIndex — a raw memcpy into the argument buffer. The GPU interprets the bytes according to the MSL struct layout. If the Rust and MSL layouts differ by even one byte, fields shift — silent data corruption, no error.
Rules
- `#[repr(C)]` — C-compatible field ordering. Without this, Rust may reorder fields for alignment optimization.
- `Pod + Zeroable` (bytemuck) — certifies the type is safe to transmute from raw bytes.
- Field order matches MSL exactly. Same names, same order, same types.
- Type mapping:
| Rust | MSL | Size | Alignment | Notes |
|---|---|---|---|---|
u32 | uint | 4 | 4 | |
i32 | int | 4 | 4 | |
f32 | float | 4 | 4 | |
[f32; 2] | float2 | 8 | 8 | |
[f32; 3] + _pad: f32 | float3 | 16 | 16 | Rust [f32;3] is only 4-byte aligned |
[f32; 4] | float4 | 16 | 16 | |
[u8; 4] | uchar4 | 4 | 4 |
float3 Alignment
MSL float3 occupies 16 bytes (12 data + 4 padding) and requires 16-byte alignment. Rust’s [f32; 3] is 12 bytes with 4-byte alignment. Without explicit padding, every field after [f32; 3] is offset by 4 bytes from what the GPU expects.
```rust
// Rust — matches the MSL layout below
#[repr(C)]
pub struct GpuPoint3D {
    pub position: [f32; 3], // 12 bytes at offset 0
    pub _pad0: f32,         // 4 bytes → 16, matching float3's footprint
    pub color: [u8; 4],     // offset 16
    pub _pad1: [f32; 3],    // 12 bytes — MSL aligns the next float3 to 16
    pub normal: [f32; 3],   // 12 bytes at offset 32
    pub _pad2: f32,         // 4 bytes → total 48, matching the MSL struct size
}
```

```metal
struct GpuPoint3D {
    float3 position; // offset 0, occupies 16 bytes (12 data + 4 padding)
    uchar4 color;    // offset 16
    float3 normal;   // offset 32 — float3 is 16-byte aligned, so 12 padding
                     // bytes precede it; struct size rounds up to 48
};
```
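Layout drift is cheap to catch with offset assertions. A sketch (this copy of `GpuPoint3D` is hypothetical, not the crate's `types.rs`; offsets are computed via pointer arithmetic so no nightly features are needed):

```rust
use std::mem::size_of;

// Corrected-layout sketch: normal must land at offset 32 because MSL
// aligns float3 members to 16 bytes.
#[repr(C)]
#[derive(Default)]
struct GpuPoint3D {
    position: [f32; 3],
    _pad0: f32,
    color: [u8; 4],
    _pad1: [f32; 3],
    normal: [f32; 3],
    _pad2: f32,
}

fn main() {
    let p = GpuPoint3D::default();
    let base = &p as *const _ as usize;
    // These are the offsets the MSL struct hands to the GPU.
    assert_eq!(&p.position as *const _ as usize - base, 0);
    assert_eq!(&p.color as *const _ as usize - base, 16);
    assert_eq!(&p.normal as *const _ as usize - base, 32);
    assert_eq!(size_of::<GpuPoint3D>(), 48);
}
```

Dropping either `_pad` field shifts every later offset and silently corrupts the GPU's view of the struct, which is exactly what these assertions would catch at test time.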
8. Dispatch Model
2D Per-Pixel
Image filters and per-pixel detectors. Grid = image dimensions.
```rust
let tew = pipeline.threadExecutionWidth();               // SIMD width: 32 on Apple Silicon
let max_tg = pipeline.maxTotalThreadsPerThreadgroup();   // typically 1024
let tg_h = (max_tg / tew).max(1);                        // 1024/32 = 32
let grid = MTLSize { width: w as usize, height: h as usize, depth: 1 };
let tg = MTLSize { width: tew, height: tg_h, depth: 1 }; // 32x32 = 1024 threads
```
threadExecutionWidth is queried rather than hardcoded because future Apple Silicon may change the SIMD width — the query returns the hardware truth. The height is derived as max_tg / tew to maximize threadgroup occupancy: a 32x32 threadgroup fills 1024 threads (the hardware max). Smaller threadgroups waste SIMD lanes at threadgroup boundaries.
1D Per-Element
Buffer-output kernels processing variable-length arrays.
```rust
let grid = MTLSize { width: n_elements, height: 1, depth: 1 };
let tg = MTLSize { width: tew, height: 1, depth: 1 };
```
Why dispatchThreads (Non-Uniform)
dispatchThreadgroups requires grid dimensions to be a multiple of threadgroup size — manual ceil-division, easy to get wrong.
dispatchThreads takes the exact thread count. Metal internally handles partial threadgroups. Every shader still bounds-checks (if gid.x >= width return) because threads in the partial edge threadgroup may exceed the intended grid.
Performance: threadgroup sizing and occupancy. Apple Silicon GPUs execute threads in SIMD groups (wavefronts) of 32. A 32x32 threadgroup = 1024 threads = 32 SIMD groups. The GPU schedules multiple threadgroups per compute unit to hide memory latency. Smaller threadgroups reduce this parallelism. Querying maxTotalThreadsPerThreadgroup and filling it completely maximizes occupancy.
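The sizing logic above, plus the ceil division that `dispatchThreadgroups` would force by hand, fits in a few lines. A sketch (helper names are illustrative):

```rust
// Threadgroup dimensions derived from queried hardware limits.
fn threadgroup_dims(tew: usize, max_tg: usize) -> (usize, usize) {
    (tew, (max_tg / tew).max(1))
}

// The ceil division dispatchThreads performs internally; with
// dispatchThreadgroups you would compute this yourself per axis.
fn threadgroups_for(grid: usize, tg: usize) -> usize {
    (grid + tg - 1) / tg
}

fn main() {
    // Typical Apple Silicon values: SIMD width 32, 1024 threads max.
    assert_eq!(threadgroup_dims(32, 1024), (32, 32));

    // A 1920x1080 grid with 32x32 threadgroups: the last row of
    // threadgroups is partial, which is why shaders bounds-check gid.
    assert_eq!(threadgroups_for(1920, 32), 60); // exact fit
    assert_eq!(threadgroups_for(1080, 32), 34); // 1080 = 33*32 + 24 → partial
}
```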
9. Build System
vx-vision/build.rs
Shader Compilation Pipeline
- `build.rs` globs `shaders/*.metal`.
- Each `.metal` → `.air` (Apple Intermediate Representation): `xcrun -sdk macosx metal -c -target air64-apple-macos14.0 -std=metal3.1 -O2 file.metal -o file.air`
- All `.air` → linked `vx.metallib`: `xcrun -sdk macosx metallib *.air -o vx.metallib`
- Embedded at compile time:
  ```rust
  static METALLIB_BYTES: &[u8] = include_bytes!(concat!(env!("OUT_DIR"), "/vx.metallib"));
  ```
- At runtime: written to a temp file → loaded via `newLibraryWithURL_error`, which takes a file URL rather than raw bytes. The temp file is deleted immediately after loading.
Auto-discovery: cargo:rerun-if-changed=shaders watches the directory. New .metal files are picked up automatically — no build.rs edits needed.
Compiler flags:
- `-target air64-apple-macos14.0` — AIR bytecode for macOS 14+
- `-std=metal3.1` — MSL 3.1 (mesh shaders, ray tracing intrinsics available)
- `-O2` — full optimization (loop unrolling, dead code elimination, register allocation)
- `-Wno-unused-variable` — suppress warnings from template code
10. Error Handling
vx-vision/src/error.rs
```rust
#[non_exhaustive]
pub enum Error {
    DeviceNotFound,               // MTLCreateSystemDefaultDevice() returned nil
    ShaderMissing(String),        // newFunctionWithName() returned nil — typo in kernel name
    PipelineCompile(String),      // newComputePipelineStateWithFunction_error failed — MSL error
    BufferAlloc { bytes: usize }, // newBufferWithLength returned nil — out of GPU memory
    TextureSizeMismatch,          // dimension mismatch between input/output textures
    InvalidConfig(String),        // bad parameter (negative radius, zero dimensions, etc.)
    Gpu(String),                  // runtime: commandBuffer/encoder creation failed, execution error
}
```
From<String> maps string errors to Error::Gpu. This bridges vx-core (which returns Result<T, String>) into vx-vision’s typed error enum.
#[non_exhaustive] allows adding variants without breaking downstream match statements.
11. Common Pitfalls
| Symptom | Cause | Fix |
|---|---|---|
| Stale data on CPU readback | Guard dropped before waitUntilCompleted, or never created | Guard before commit, drop after wait |
| Garbage pixel output | Format mismatch (R32Float where R8Unorm expected) | Match shader’s texture2d<float> access to texture format |
| Silent write failure | Input texture used as output (ShaderRead only) | Use intermediate_* or output_* |
| Shifted struct fields | [f32; 3] without _pad: f32 in repr(C) struct | Pad every float3 to 16 bytes |
| All-zero results | Atomic counter not zeroed before dispatch | count_buf.write(&[0u32]) |
| Use-after-free crash | EncodedBuffers dropped before GPU completes | Keep encoded state alive until after wait |
| “Missing shader function” | ns_string!() doesn’t match kernel void name | Exact string match between Rust and MSL |
| Threadgroup too large | Hardcoded threadgroup exceeds hardware max | Query maxTotalThreadsPerThreadgroup() |
| Incorrect buffer binding | Index in setBuffer_atIndex mismatches [[buffer(N)]] | Match Rust indices to MSL attribute indices |
Appendix A: Shader Inventory
42 Metal shader files in vx-vision/shaders/:
| File | Kernel Function(s) | Category |
|---|---|---|
FastDetect.metal | fast_detect | Feature detection |
HarrisResponse.metal | harris_response | Feature detection |
NMS.metal | nms_suppress | Feature detection |
ORBDescriptor.metal | orb_compute | Feature description |
BruteMatcher.metal | hamming_distance, extract_matches | Feature matching |
StereoMatch.metal | stereo_match | Stereo |
KLTTracker.metal | klt_track_forward | Optical flow |
DenseFlow.metal | dense_flow | Optical flow |
GaussianBlur.metal | gaussian_blur_h, gaussian_blur_v | Filtering |
Sobel.metal | sobel_3x3, gradient_magnitude | Edge detection |
Canny.metal | canny_hysteresis | Edge detection |
Threshold.metal | threshold_binary, threshold_adaptive, threshold_otsu | Segmentation |
Morphology.metal | morph_erode, morph_dilate | Morphology |
BilateralFilter.metal | bilateral_filter | Filtering |
ColorConvert.metal | rgba_to_gray, gray_to_rgba, rgba_to_hsv, hsv_to_rgba | Color |
Pyramid.metal | pyramid_downsample | Scale space |
Resize.metal | bilinear_resize | Geometry |
Warp.metal | warp_affine, warp_perspective | Geometry |
IntegralImage.metal | integral_* | Analysis |
Histogram.metal | histogram_compute | Analysis |
HoughLines.metal | hough_vote, hough_peaks | Line detection |
TemplateMatch.metal | template_match_ncc | Template matching |
ConnectedComponents.metal | ccl_* | Labeling |
DistanceTransform.metal | jfa_seed, jfa_step, jfa_distance | Distance |
Homography.metal | score_homography | Geometry |
IndirectArgs.metal | prepare_indirect_args | Utility |
DoG.metal | dog_subtract, dog_extrema | Scale space |
undistort.metal | undistort | Calibration |
SGMStereo.metal | sgm_* | Stereo |
DepthFilter.metal | depth_bilateral, depth_median | Depth |
DepthInpaint.metal | depth_inpaint_* | Depth |
DepthColorize.metal | depth_colorize | Visualization |
DepthToCloud.metal | depth_to_cloud | 3D reconstruction |
NormalEstimation.metal | normal_estimation_* | 3D reconstruction |
OutlierFilter.metal | outlier_filter_* | 3D reconstruction |
VoxelDownsample.metal | voxel_downsample_* | 3D reconstruction |
TSDFIntegrate.metal | tsdf_integrate | Volumetric |
TSDFRaycast.metal | tsdf_raycast | Volumetric |
MarchingCubes.metal | marching_cubes_* | Mesh extraction |
Triangulate.metal | triangulate_* | Geometry |
PointCloudRender.metal | point_cloud_* | Visualization |
MeshRender.metal | mesh_* | Visualization |