VX
GPU-accelerated computer vision for Rust on Apple Silicon.
VX talks directly to the Metal GPU through compute shaders, using Apple Silicon’s Unified Memory Architecture to eliminate the CPU-GPU copy overhead that plagues traditional vision libraries.
What it does
28 GPU kernels covering classical computer vision: feature detection (FAST, Harris, ORB, SIFT), image processing (Gaussian, bilateral, Canny, morphology, thresholding), geometry (pyramids, warping, homography), motion (KLT tracking, dense flow), stereo matching, and analysis (Hough lines, template matching, distance transforms, connected components).
Why it exists
OpenCV and similar libraries treat the GPU as a separate device. Data gets copied from CPU memory to GPU memory and back, repeatedly. On Apple Silicon this is wasteful — the CPU and GPU share the same physical memory. VX skips the copies entirely.
The library uses Rust bindings to Metal via objc2-metal, giving type-safe GPU access with Rust’s ownership model enforcing buffer safety at compile time. Metal Shading Language (MSL) kernels run the pixel-level computation on the GPU while Rust handles orchestration and the public API.
Quick taste
#![allow(unused)]
fn main() {
use vx_vision::Context;
use vx_vision::kernels::fast::{FastDetector, FastDetectConfig};
let ctx = Context::new()?;
let texture = ctx.texture_gray8(&pixels, width, height)?;
let fast = FastDetector::new(&ctx)?;
let result = fast.detect(&ctx, &texture, &FastDetectConfig::default())?;
println!("Found {} corners", result.corners.len());
}
No unsafe in user code. No Metal imports. No GPU boilerplate.
Getting Started
Requirements
- macOS on Apple Silicon (M1/M2/M3/M4) or any Mac with a Metal-capable GPU
- Rust stable toolchain
- Xcode command line tools:
xcode-select --install
Installation
[dependencies]
vx-vision = "0.1"
Core concepts
Context
Context initializes the Metal device, command queue, and shader library. Create one at startup and pass references to kernels.
#![allow(unused)]
fn main() {
let ctx = vx_vision::Context::new()?;
}
Texture
Texture wraps a Metal texture with a known format. Three formats are supported:
| Format | Create from data | Create empty | Read back |
|---|---|---|---|
| R8Unorm (grayscale) | ctx.texture_gray8(&pixels, w, h) | ctx.texture_output_gray8(w, h) | tex.read_gray8() |
| R32Float | ctx.texture_r32float(&data, w, h) | ctx.texture_output_r32float(w, h) | tex.read_r32float() |
| RGBA8Unorm (color) | ctx.texture_rgba8(&pixels, w, h) | ctx.texture_output_rgba8(w, h) | tex.read_rgba8() |
On Apple Silicon, textures live in unified memory — no hidden copies between CPU and GPU.
Kernels
Each GPU operation is a struct. The pattern is always:
- Create the kernel (compiles the Metal pipeline — do this once)
- Call the kernel method (dispatches GPU work)
- Read results back
#![allow(unused)]
fn main() {
let blur = GaussianBlur::new(&ctx)?;
let output = ctx.texture_output_gray8(w, h)?;
blur.apply(&ctx, &input, &output, &GaussianConfig::default())?;
let result = output.read_gray8();
}
First program
Load an image, blur it, detect edges:
use vx_vision::Context;
use vx_vision::kernels::gaussian::{GaussianBlur, GaussianConfig};
use vx_vision::kernels::sobel::SobelFilter;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let ctx = Context::new()?;
let img = image::open("input.png")?.to_luma8();
let (w, h) = img.dimensions();
let texture = ctx.texture_gray8(img.as_raw(), w, h)?;
let blur = GaussianBlur::new(&ctx)?;
let sobel = SobelFilter::new(&ctx)?;
let blurred = ctx.texture_output_gray8(w, h)?;
blur.apply(&ctx, &texture, &blurred, &GaussianConfig::default())?;
let result = sobel.compute(&ctx, &blurred)?;
let edges = result.magnitude.read_r32float();
let max_val = edges.iter().cloned().fold(f32::EPSILON, f32::max); // avoid divide-by-zero on a flat image
let output: Vec<u8> = edges.iter()
.map(|&v| ((v / max_val) * 255.0) as u8)
.collect();
image::save_buffer("edges.png", &output, w, h, image::ColorType::L8)?;
Ok(())
}
Add image = "0.25" to your Cargo.toml dependencies.
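For reference, the combined dependency section (versions taken from the snippets above):

```toml
[dependencies]
vx-vision = "0.1"
image = "0.25"
```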
Running examples
cargo run --example fast_demo -- path/to/image.png
cargo run --example edge_detection_demo -- path/to/image.png
cargo run --example threshold_demo -- path/to/image.png
cargo run --example advanced_cv_demo -- path/to/image.png
cargo run --example feature_matching_demo -- path/to/image.png
cargo run --example pipeline_pool_demo -- path/to/image.png
Running tests
cargo test # everything
cargo test -p vx-vision # kernel tests only
cargo test -p vx-vision -- gaussian # specific test
Architecture
Three-layer stack
Memory layer (vx-gpu)
The vx-core/ directory, published as the vx-gpu crate. Manages shared GPU/CPU buffers.
- UnifiedBuffer<T> — Type-safe wrapper around MTLBuffer with StorageModeShared. Provides write(), as_slice(), as_mut_slice(). All element types must implement bytemuck::Pod + Zeroable.
- GpuGuard<T> — RAII guard that prevents CPU mutation while a buffer is in-flight on the GPU. Create before commit(), drop after waitUntilCompleted().
- Device helpers — default_device(), new_queue(), load_library_from_bytes()
Kernel layer (vx-vision)
The vx-vision/ directory. Contains Rust bindings for each Metal shader.
- Context — Holds the Metal device, command queue, and compiled shader library. Entry point for everything.
- Texture — GPU texture with tracked dimensions and format. Provides readback methods and zero-copy wrapping of external Metal textures.
- Pipeline — Batches multiple kernel dispatches into a single Metal command buffer.
- TexturePool — Recycles textures by (width, height, format) to avoid repeated allocation.
- Kernel structs — One per algorithm (e.g., FastDetector, GaussianBlur, CannyDetector). Each holds compiled MTLComputePipelineState objects, constructed once and reused.
Shader-to-kernel contract
Each algorithm has two sides:
| Component | Location | Naming |
|---|---|---|
| Metal shader | vx-vision/shaders/PascalCase.metal | kernel function: snake_case |
| Rust binding | vx-vision/src/kernels/snake_case.rs | struct: PascalCase |
Example: FastDetect.metal defines kernel void fast_detect(...), and fast.rs defines FastDetector which compiles that function into a pipeline at construction.
Parameter structs
GPU parameter structs live in vx-vision/src/types.rs with #[repr(C)] layout. They must match the MSL struct field-by-field:
| Rust | Metal |
|---|---|
| u32 | uint |
| i32 | int |
| f32 | float |
| [f32; 2] | float2 |
| [f32; 3] + _pad: f32 | float3 (16-byte aligned) |
| [f32; 4] | float4 |
Any mismatch causes silent data corruption.
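The float3 row is the one that usually bites. A hypothetical parameter struct (TintParams and its fields are illustrative, not part of the VX API) shows the padding the table prescribes, with a size assertion to catch drift:

```rust
// Hypothetical params struct mirroring an MSL struct that ends in a float3.
// MSL lays out: uint, uint, float, <4 pad bytes>, float3 at offset 16,
// <4 pad bytes> -> 32 bytes total. The Rust mirror must pad explicitly.
#[repr(C)]
struct TintParams {
    width: u32,       // uint
    height: u32,      // uint
    strength: f32,    // float
    _pad0: f32,       // aligns the float3 below to a 16-byte boundary
    color: [f32; 3],  // float3 (12 bytes of data...)
    _pad1: f32,       // ...but float3 occupies 16 bytes in MSL
}

fn main() {
    // Without the two pad fields this would be 24 bytes and the GPU would
    // read garbage past `strength` -- the "silent data corruption" above.
    assert_eq!(std::mem::size_of::<TintParams>(), 32);
}
```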
Build system
vx-vision/build.rs auto-discovers all .metal files in vx-vision/shaders/, compiles each to .air via xcrun metal, links into vx.metallib via xcrun metallib, and embeds it via include_bytes!. Adding a new .metal file triggers automatic recompilation.
Thread dispatch patterns
- 2D per-pixel (image filters): grid = (width, height, 1), threadgroup computed from threadExecutionWidth() and maxTotalThreadsPerThreadgroup()
- 1D per-element (feature operations): grid = (n, 1, 1), threadgroup = (threadExecutionWidth, 1, 1)
- Always uses dispatchThreads:threadsPerThreadgroup: (non-uniform dispatch)
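The 2D sizing can be sketched on the CPU. The heuristic shown (one row of SIMD-width lanes, stacked to fill the threadgroup budget) is a common Metal pattern and an assumption here; the real values come from the compiled pipeline state at runtime:

```rust
// Sketch of the assumed 2D threadgroup sizing: width = threadExecutionWidth,
// height = maxTotalThreadsPerThreadgroup / threadExecutionWidth.
fn threadgroup_2d(thread_execution_width: usize, max_total_threads: usize) -> (usize, usize, usize) {
    let w = thread_execution_width;
    let h = max_total_threads / thread_execution_width;
    (w, h, 1)
}

fn main() {
    // Typical Apple GPU values: SIMD width 32, 1024-thread threadgroups.
    assert_eq!(threadgroup_2d(32, 1024), (32, 32, 1));
    // 1D per-element dispatch uses (threadExecutionWidth, 1, 1) directly.
}
```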
Thread safety
All kernel structs, Context, and Texture implement Send + Sync. Metal pipeline state objects are immutable after creation. MTLCommandQueue is thread-safe, but each thread should create its own command buffers.
Memory model
On Apple Silicon (UMA), CPU and GPU share physical memory. VX uses MTLStorageModeShared for all buffers:
- No copies — data written by CPU is immediately visible to GPU and vice versa
- Synchronization — waitUntilCompleted() on the command buffer is sufficient
- Safety — GpuGuard<T> prevents CPU mutation while GPU is reading
Context & Texture
Context
Entry point for all GPU operations. Holds the Metal device, command queue, and compiled shader library.
#![allow(unused)]
fn main() {
use vx_vision::Context;
let ctx = Context::new()?;
}
Texture creation
#![allow(unused)]
fn main() {
// From pixel data (ShaderRead)
let gray = ctx.texture_gray8(&pixels, w, h)?;
let float = ctx.texture_r32float(&data, w, h)?;
let color = ctx.texture_rgba8(&pixels, w, h)?;
// Empty output (ShaderWrite)
let out = ctx.texture_output_gray8(w, h)?;
let out = ctx.texture_output_r32float(w, h)?;
let out = ctx.texture_output_rgba8(w, h)?;
// Pipeline intermediates (ShaderRead | ShaderWrite)
let tmp = ctx.texture_intermediate_gray8(w, h)?;
let tmp = ctx.texture_intermediate_r32float(w, h)?;
}
Use output_* when a texture is only written to by a kernel. Use intermediate_* when a texture is written by one kernel and read by the next in a pipeline chain.
Texture
Wraps a Metal texture with tracked dimensions and format.
Readback
#![allow(unused)]
fn main() {
let pixels: Vec<u8> = tex.read_gray8(); // R8Unorm
let data: Vec<f32> = tex.read_r32float(); // R32Float
let pixels: Vec<u8> = tex.read_rgba8(); // RGBA8Unorm (4 bytes/pixel)
}
Call readback only after the GPU command buffer has completed. Reading while the GPU is still writing produces undefined results.
Properties
#![allow(unused)]
fn main() {
let w = tex.width(); // u32
let h = tex.height(); // u32
let f = tex.format(); // TextureFormat enum
}
External textures
For AVFoundation or Core Video integration, wrap an existing Metal texture without copying:
#![allow(unused)]
fn main() {
use vx_vision::{Texture, TextureFormat};
let tex = Texture::from_metal_texture(metal_tex, w, h, TextureFormat::RGBA8Unorm);
}
Pipeline
Batches multiple kernel dispatches into a single Metal command buffer.
#![allow(unused)]
fn main() {
use vx_vision::Pipeline;
let pipe = Pipeline::begin(&ctx)?;
let cmd = pipe.cmd_buf();
let _s1 = blur.encode(&ctx, cmd, &input, &temp, &cfg)?;
sobel.encode(&ctx, cmd, &temp)?;
let _retained = pipe.commit_and_wait();
}
Intermediate textures and encoded state must outlive the command buffer. The commit_and_wait() return value holds retained textures.
For CPU/GPU overlap:
#![allow(unused)]
fn main() {
let mut pipe = Pipeline::begin(&ctx)?;
blur.encode(&ctx, pipe.cmd_buf(), &input, &output, &cfg)?;
pipe.commit(); // non-blocking
// ... CPU work ...
pipe.wait(); // block until GPU done
}
TexturePool
Recycles GPU textures by (width, height, format) to avoid repeated allocation.
#![allow(unused)]
fn main() {
use vx_vision::TexturePool;
let mut pool = TexturePool::new();
let tex = pool.acquire_gray8(&ctx, 1920, 1080)?;
// ... use tex ...
pool.release(tex);
// Second acquire reuses the cached texture
let tex = pool.acquire_gray8(&ctx, 1920, 1080)?;
}
All pool textures have ShaderRead | ShaderWrite usage flags.
#![allow(unused)]
fn main() {
let pool = TexturePool::with_capacity(4); // max 4 per bucket
pool.hit_rate(); // cache efficiency
pool.cached_count(); // total cached
pool.clear(); // free all
}
Error handling
All fallible operations return Result<T, vx_vision::Error>. Error variants:
| Variant | Meaning |
|---|---|
| DeviceNotFound | No Metal GPU available |
| ShaderMissing(String) | Named shader function not in metallib |
| PipelineCompile(String) | Metal failed to compile a pipeline |
| BufferAlloc { bytes } | GPU buffer allocation failed |
| TextureSizeMismatch | Texture dimensions don’t match |
| InvalidConfig(String) | Parameter out of range |
| Gpu(String) | Runtime GPU error |
Feature Detection
FAST-9 Corner Detector
Detects corners using the FAST-9 algorithm. Tests 16 pixels on a Bresenham circle — a pixel is a corner if 9 contiguous pixels are all brighter or darker than the center by a threshold.
#![allow(unused)]
fn main() {
use vx_vision::kernels::fast::{FastDetector, FastDetectConfig};
let fast = FastDetector::new(&ctx)?;
let mut cfg = FastDetectConfig::default(); // threshold: 20, max_corners: 2048
cfg.threshold = 30;
let result = fast.detect(&ctx, &input, &cfg)?;
for corner in &result.corners {
println!("({}, {}) score={}", corner.position[0], corner.position[1], corner.response);
}
}
Also supports pipeline encoding via fast.encode().
Harris Corner Response
Computes the Harris response R = det(M) - k * trace(M)^2 for each keypoint. Use after FAST to rank corners by quality.
#![allow(unused)]
fn main() {
use vx_vision::kernels::harris::{HarrisScorer, HarrisConfig};
let harris = HarrisScorer::new(&ctx)?;
let scored = harris.compute(&ctx, &input, &corners, &HarrisConfig::default())?;
// scored: Vec<CornerPoint> with updated response values
}
Config: k (sensitivity, default 0.04), patch_radius (neighborhood size, default 3).
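The response formula can be sketched on the CPU. This is illustrative only — the GPU kernel accumulates the structure matrix M from Sobel gradients over the patch_radius neighborhood; here the sums are passed in directly:

```rust
// Harris response for a 2x2 structure matrix M = [[sxx, sxy], [sxy, syy]],
// where sxx/sxy/syy are sums of Ix*Ix, Ix*Iy, Iy*Iy over the patch.
fn harris_response(sxx: f32, sxy: f32, syy: f32, k: f32) -> f32 {
    let det = sxx * syy - sxy * sxy;
    let trace = sxx + syy;
    det - k * trace * trace // R = det(M) - k * trace(M)^2
}

fn main() {
    // Corner: two strong, balanced eigenvalues -> positive response.
    assert!(harris_response(100.0, 0.0, 100.0, 0.04) > 0.0);
    // Edge: one dominant eigenvalue -> negative response.
    assert!(harris_response(100.0, 0.0, 0.0, 0.04) < 0.0);
}
```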
Non-Maximum Suppression
Filters keypoints so no two are within min_distance of each other. Keeps the highest-response point in each neighborhood.
#![allow(unused)]
fn main() {
use vx_vision::kernels::nms::{NmsSuppressor, NmsConfig};
let nms = NmsSuppressor::new(&ctx)?;
let filtered = nms.run(&ctx, &corners, &NmsConfig::default())?;
}
Config: min_distance (default 10.0 pixels).
ORB Descriptors
Computes 256-bit binary descriptors for keypoints using oriented BRIEF test pairs.
#![allow(unused)]
fn main() {
use vx_vision::kernels::orb::{OrbDescriptor, OrbConfig};
let orb = OrbDescriptor::new(&ctx)?;
let result = orb.compute(&ctx, &input, &keypoints, &pattern, &OrbConfig::default())?;
// result.descriptors: Vec<ORBOutput> (256-bit descriptors as 8 x u32)
// result.orientations: Vec<f32>
}
The pattern is 1024 i32 values (256 test pairs, each with 4 offsets: dx1, dy1, dx2, dy2).
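A sketch of building a pattern buffer of the right shape. The offsets here come from a toy linear congruential generator, not the learned ORB test pattern — substitute a real pattern in practice:

```rust
// Build 256 BRIEF test pairs (dx1, dy1, dx2, dy2) = 1024 i32 values, with
// offsets sampled in [-15, 15] from a simple LCG. Illustrative only.
fn make_pattern() -> Vec<i32> {
    let mut state: u32 = 0x1234_5678;
    let mut next = move || {
        state = state.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
        ((state >> 16) % 31) as i32 - 15 // offset in [-15, 15]
    };
    let mut pattern = Vec::with_capacity(1024);
    for _ in 0..256 {
        // one test pair: (dx1, dy1, dx2, dy2)
        pattern.push(next());
        pattern.push(next());
        pattern.push(next());
        pattern.push(next());
    }
    pattern
}

fn main() {
    let pattern = make_pattern();
    assert_eq!(pattern.len(), 1024);
    assert!(pattern.iter().all(|&v| (-15..=15).contains(&v)));
}
```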
DoG Keypoint Detector
Difference-of-Gaussians scale-space extrema detection.
#![allow(unused)]
fn main() {
use vx_vision::kernels::dog::{DoGDetector, DoGConfig};
use vx_vision::kernels::gaussian::GaussianBlur;
let blur = GaussianBlur::new(&ctx)?;
let dog = DoGDetector::new(&ctx)?;
let mut cfg = DoGConfig::default();
cfg.n_levels = 5;
let keypoints = dog.detect(&ctx, &blur, &input, &cfg)?;
}
Each keypoint has position, scale, and response. Full pipelining isn’t practical due to the iterative blur-subtract-extrema pattern, but encode_subtract() exposes the subtraction step for custom pipelines.
SIFT Pipeline
Full SIFT-like pipeline: multi-octave pyramid, DoG detection, orientation assignment, 128-dimensional descriptors.
#![allow(unused)]
fn main() {
use vx_vision::kernels::sift::{SiftPipeline, SiftConfig};
let sift = SiftPipeline::new(&ctx)?;
let features = sift.detect_and_describe(&ctx, &input, &SiftConfig::default())?;
for f in &features {
println!("({}, {}) scale={:.2} orient={:.2}", f.x, f.y, f.scale, f.orientation);
// f.descriptor: [f32; 128]
}
}
Matching between two feature sets:
#![allow(unused)]
fn main() {
let matches = SiftPipeline::match_features(&features_a, &features_b, 0.75);
}
Typical detection pipeline
A common pattern chains FAST → Harris → NMS → ORB:
#![allow(unused)]
fn main() {
let corners = fast.detect(&ctx, &texture, &fast_cfg)?;
let scored = harris.compute(&ctx, &texture, &corners.corners, &harris_cfg)?;
let best = nms.run(&ctx, &scored, &nms_cfg)?;
let descs = orb.compute(&ctx, &texture, &best, &pattern, &orb_cfg)?;
}
For single-submission batching, each of these kernels provides an encode() method that writes into a shared command buffer via Pipeline.
Image Processing
Gaussian Blur
Separable two-pass blur (horizontal then vertical).
#![allow(unused)]
fn main() {
use vx_vision::kernels::gaussian::{GaussianBlur, GaussianConfig};
let blur = GaussianBlur::new(&ctx)?;
let output = ctx.texture_output_gray8(w, h)?;
let mut cfg = GaussianConfig::default(); // sigma: 1.0, radius: 3
cfg.sigma = 2.0;
blur.apply(&ctx, &input, &output, &cfg)?;
}
Config: sigma (standard deviation), radius (kernel half-width, full kernel = 2*radius + 1).
Pipeline encoding returns a GaussianEncodedState that holds the intermediate texture:
#![allow(unused)]
fn main() {
let state = blur.encode(&ctx, cmd_buf, &input, &output, &cfg)?;
// state must outlive the command buffer
}
Bilateral Filter
Edge-preserving smoothing. Smooths flat regions while keeping edges sharp.
#![allow(unused)]
fn main() {
use vx_vision::kernels::bilateral::{BilateralFilter, BilateralConfig};
let bilateral = BilateralFilter::new(&ctx)?;
let output = ctx.texture_output_gray8(w, h)?;
bilateral.apply(&ctx, &input, &output, &BilateralConfig::new(5, 10.0, 0.1))?;
}
Config: radius, sigma_spatial, sigma_range. Larger sigma_range allows more intensity variation.
Sobel Edge Detection
Computes gradient magnitude and direction.
#![allow(unused)]
fn main() {
use vx_vision::kernels::sobel::SobelFilter;
let sobel = SobelFilter::new(&ctx)?;
let result = sobel.compute(&ctx, &input)?;
// result.magnitude: R32Float texture
// result.direction: R32Float texture (radians)
// result.grad_x, result.grad_y: R32Float gradient components
}
Canny Edge Detection
Multi-stage: Gaussian blur → Sobel → non-maximum suppression → hysteresis thresholding.
#![allow(unused)]
fn main() {
use vx_vision::kernels::canny::{CannyDetector, CannyConfig};
let canny = CannyDetector::new(&ctx)?;
let mut cfg = CannyConfig::default();
cfg.low_threshold = 0.04;
cfg.high_threshold = 0.12;
let edges = canny.detect(&ctx, &input, &cfg)?;
// edges: R32Float texture (1.0 = edge, 0.0 = non-edge)
}
Config: low_threshold, high_threshold (hysteresis), blur_sigma, blur_radius.
Supports pipeline encoding via canny.encode().
Morphology
Binary operations with a rectangular structuring element.
#![allow(unused)]
fn main() {
use vx_vision::kernels::morphology::{Morphology, MorphConfig};
let morph = Morphology::new(&ctx)?;
let cfg = MorphConfig::default(); // radius_x: 1, radius_y: 1 (3x3 kernel)
let output = ctx.texture_output_gray8(w, h)?;
morph.erode(&ctx, &input, &output, &cfg)?;
morph.dilate(&ctx, &input, &output, &cfg)?;
morph.open(&ctx, &input, &output, &cfg)?; // erode then dilate
morph.close(&ctx, &input, &output, &cfg)?; // dilate then erode
}
All four operations support pipeline encoding: encode_erode, encode_dilate, encode_open, encode_close.
Threshold
Binary, adaptive, and automatic (Otsu) thresholding.
#![allow(unused)]
fn main() {
use vx_vision::kernels::threshold::{Threshold, AdaptiveThresholdConfig};
let thresh = Threshold::new(&ctx)?;
let output = ctx.texture_output_gray8(w, h)?;
// Fixed binary (normalized 0.0-1.0 threshold)
thresh.binary(&ctx, &input, &output, 0.5, false)?;
// Otsu's method (auto-selects threshold, returns it)
let value = thresh.otsu(&ctx, &input, &output)?;
// Adaptive (requires integral image)
let cfg = AdaptiveThresholdConfig::new(15, 0.03, false);
thresh.adaptive_auto(&ctx, &input, &output, &cfg)?;
}
Pipeline encoding: encode_binary(), encode_adaptive().
Histogram
Compute 256-bin histogram and equalize contrast.
#![allow(unused)]
fn main() {
use vx_vision::kernels::histogram::Histogram;
let hist = Histogram::new(&ctx)?;
let bins: [u32; 256] = hist.compute(&ctx, &input)?;
let output = ctx.texture_output_gray8(w, h)?;
hist.equalize(&ctx, &input, &output)?;
}
Not pipeline-encodable — requires CPU readback of bin counts.
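What equalization does, sketched on the CPU: map each intensity through the normalized cumulative histogram, flattening the distribution. (A sketch of the concept, not the GPU kernel's exact rounding.)

```rust
// Histogram equalization: p -> 255 * CDF(p) / N.
fn equalize(pixels: &[u8]) -> Vec<u8> {
    let mut bins = [0u32; 256];
    for &p in pixels {
        bins[p as usize] += 1;
    }
    // Cumulative histogram.
    let mut cdf = [0u32; 256];
    let mut acc = 0u32;
    for i in 0..256 {
        acc += bins[i];
        cdf[i] = acc;
    }
    let n = pixels.len() as f32;
    pixels
        .iter()
        .map(|&p| ((cdf[p as usize] as f32 / n) * 255.0) as u8)
        .collect()
}

fn main() {
    // Two dark and two bright pixels spread across the range.
    assert_eq!(equalize(&[0, 0, 255, 255]), vec![127, 127, 255, 255]);
}
```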
Color Conversion
Convert between RGBA, grayscale, and HSV.
#![allow(unused)]
fn main() {
use vx_vision::kernels::color::ColorConvert;
let color = ColorConvert::new(&ctx)?;
color.rgba_to_gray(&ctx, &rgba, &gray)?;
color.gray_to_rgba(&ctx, &gray, &rgba)?;
color.rgba_to_hsv(&ctx, &rgba, &hsv)?;
color.hsv_to_rgba(&ctx, &hsv, &rgba)?;
}
All four conversions support pipeline encoding.
Geometry & Transforms
Image Pyramid
Builds successive half-resolution levels in a single GPU submission.
#![allow(unused)]
fn main() {
use vx_vision::kernels::pyramid::PyramidBuilder;
let pyr = PyramidBuilder::new(&ctx)?;
let levels = pyr.build(&ctx, &input, 4)?;
// levels[0] = half, levels[1] = quarter, levels[2] = eighth, levels[3] = sixteenth
let half = pyr.downsample(&ctx, &input)?; // single level
}
Resize
Bilinear interpolation resize to arbitrary dimensions.
#![allow(unused)]
fn main() {
use vx_vision::kernels::resize::ImageResize;
let resizer = ImageResize::new(&ctx)?;
let output = resizer.apply(&ctx, &input, new_w, new_h)?;
}
Warp
Affine and perspective warping.
#![allow(unused)]
fn main() {
use vx_vision::kernels::warp::ImageWarp;
let warp = ImageWarp::new(&ctx)?;
// Affine: 2x3 matrix as [f32; 6]
let output = ctx.texture_output_gray8(out_w, out_h)?;
warp.affine(&ctx, &input, &output, &matrix_2x3)?;
// Perspective: 3x3 matrix as [f32; 9]
warp.perspective(&ctx, &input, &output, &matrix_3x3)?;
}
Both support pipeline encoding via encode_affine() and encode_perspective().
Lens Undistortion
Corrects radial and tangential lens distortion using camera intrinsics.
#![allow(unused)]
fn main() {
use vx_vision::kernels::undistort::Undistorter;
let undistort = Undistorter::new(&ctx)?;
let output = undistort.apply(&ctx, &input, &camera_params)?;
}
Homography Estimation
RANSAC-based homography from point correspondences. GPU-accelerated scoring with CPU-side model selection.
#![allow(unused)]
fn main() {
use vx_vision::kernels::homography::{HomographyEstimator, RansacConfig};
let estimator = HomographyEstimator::new(&ctx)?;
let mut cfg = RansacConfig::default();
cfg.max_iterations = 1000;
cfg.inlier_threshold = 3.0;
let result = estimator.estimate(&ctx, &point_pairs, &cfg)?;
// result.homography: [f32; 9]
// result.n_inliers: u32
// result.inlier_mask: Vec<bool>
}
Not pipeline-encodable — RANSAC iterates with CPU readback between GPU scoring passes.
Motion & Stereo
KLT Optical Flow
Sparse Kanade-Lucas-Tomasi tracker. Tracks keypoints across two frames using iterative Lucas-Kanade with image pyramids.
#![allow(unused)]
fn main() {
use vx_vision::kernels::klt::{KltTracker, KltConfig};
let klt = KltTracker::new(&ctx)?;
let mut cfg = KltConfig::default();
cfg.max_iterations = 30;
cfg.win_radius = 7;
cfg.max_level = 3;
let tracked = klt.track(&ctx, &prev_frame, &curr_frame, &keypoints, &cfg)?;
// tracked: Vec<KltResult> with new positions, status, and error
}
Config: max_iterations, epsilon (convergence threshold), win_radius, max_level (pyramid levels), min_eigenvalue.
Dense Optical Flow
Horn-Schunck per-pixel flow estimation using iterative Jacobi relaxation.
#![allow(unused)]
fn main() {
use vx_vision::kernels::dense_flow::{DenseFlow, DenseFlowConfig};
let flow = DenseFlow::new(&ctx)?;
let mut cfg = DenseFlowConfig::default();
cfg.alpha = 0.012;
cfg.iterations = 50;
let result = flow.compute(&ctx, &frame0, &frame1, &cfg)?;
// result.flow_u: R32Float texture (horizontal displacement)
// result.flow_v: R32Float texture (vertical displacement)
}
Supports pipeline encoding via flow.encode().
Stereo Matching
Matches ORB features between rectified stereo image pairs using Hamming distance, epipolar constraints, and disparity bounds. Triangulates 3D positions from disparities.
#![allow(unused)]
fn main() {
use vx_vision::kernels::stereomatch::{StereoMatcher, StereoConfig};
let stereo = StereoMatcher::new(&ctx)?;
let mut cfg = StereoConfig::default();
cfg.max_disparity = 64.0;
cfg.baseline = 0.12; // meters between cameras
cfg.fx = 500.0; // focal length in pixels
let result = stereo.run(
&ctx,
&left_kpts, &right_kpts,
&left_descs, &right_descs,
&cfg,
)?;
for m in &result.matches {
println!("3D: ({:.2}, {:.2}, {:.2})", m.point_3d[0], m.point_3d[1], m.point_3d[2]);
}
}
Config: max_epipolar, min_disparity, max_disparity, max_hamming, ratio_thresh, fx, fy, cx, cy, baseline.
Brute-Force Descriptor Matching
Matches ORB binary descriptors using Hamming distance with Lowe’s ratio test.
#![allow(unused)]
fn main() {
use vx_vision::kernels::matcher::{BruteMatcher, MatchConfig};
let matcher = BruteMatcher::new(&ctx)?;
let mut cfg = MatchConfig::default();
cfg.max_hamming = 64;
cfg.ratio_thresh = 0.75;
let matches = matcher.match_descriptors(&ctx, &query_desc, &train_desc, &cfg)?;
for m in &matches {
println!("query[{}] → train[{}] dist={}", m.query_idx, m.train_idx, m.distance);
}
}
Descriptors are flat &[u32] arrays where every 8 consecutive values form one 256-bit ORB descriptor.
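A CPU sketch of that layout and the Hamming metric the matcher uses — descriptor i occupies words `[i*8 .. i*8+8]`, and distance is the popcount of the XOR:

```rust
// Hamming distance between descriptor i in one flat array and descriptor j
// in another, where each descriptor is 8 consecutive u32 words (256 bits).
fn hamming(descs_a: &[u32], i: usize, descs_b: &[u32], j: usize) -> u32 {
    let a = &descs_a[i * 8..i * 8 + 8];
    let b = &descs_b[j * 8..j * 8 + 8];
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

fn main() {
    // Two single-descriptor arrays differing in exactly one bit.
    let q = [0u32; 8];
    let mut t = [0u32; 8];
    t[0] = 1;
    assert_eq!(hamming(&q, 0, &t, 0), 1);
}
```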
Analysis
Template Matching
Normalized cross-correlation (NCC). Finds the best match location for a small template within a larger image.
#![allow(unused)]
fn main() {
use vx_vision::kernels::template_match::TemplateMatcher;
let tm = TemplateMatcher::new(&ctx)?;
let result = tm.match_template(&ctx, &image, &template)?;
println!("Best at ({}, {}) score={:.4}", result.best_x, result.best_y, result.best_score);
}
The template must have non-zero variance (not a uniform color) for NCC to produce meaningful results.
Hough Line Detection
Detects lines via the Hough transform. Works best on binary edge images (e.g., output of Canny).
#![allow(unused)]
fn main() {
use vx_vision::kernels::hough::{HoughLines, HoughConfig};
let hough = HoughLines::new(&ctx)?;
let mut cfg = HoughConfig::default();
cfg.vote_threshold = 50;
cfg.max_lines = 100;
let lines = hough.detect(&ctx, &edge_image, &cfg)?;
for line in &lines {
println!("rho={:.1} theta={:.1}° votes={}", line.rho, line.theta.to_degrees(), line.votes);
}
}
Config: n_theta, edge_threshold, vote_threshold, max_lines, nms_radius.
Not pipeline-encodable — requires CPU readback of the accumulator between voting and peak-finding.
Integral Image
Summed area table for O(1) region-sum queries.
#![allow(unused)]
fn main() {
use vx_vision::kernels::integral::IntegralImage;
let integral = IntegralImage::new(&ctx)?;
let sat = integral.compute(&ctx, &input)?;
// sat: R32Float texture
}
Supports pipeline encoding via integral.encode(). Used internally by adaptive thresholding.
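A CPU sketch of the O(1) query the table enables, assuming the usual inclusive convention sat(x, y) = sum of all pixels at or above-left of (x, y):

```rust
// Sum over the rectangle (x0, y0)..(x1, y1) inclusive, from a row-major SAT.
fn region_sum(sat: &[f32], w: usize, x0: usize, y0: usize, x1: usize, y1: usize) -> f32 {
    let at = |x: isize, y: isize| -> f32 {
        if x < 0 || y < 0 { 0.0 } else { sat[y as usize * w + x as usize] }
    };
    at(x1 as isize, y1 as isize)
        - at(x0 as isize - 1, y1 as isize)
        - at(x1 as isize, y0 as isize - 1)
        + at(x0 as isize - 1, y0 as isize - 1)
}

fn main() {
    // SAT of a 4x4 all-ones image: sat(x, y) = (x + 1) * (y + 1).
    let sat: Vec<f32> = (0..16).map(|i| ((i % 4 + 1) * (i / 4 + 1)) as f32).collect();
    // The 2x2 region with corners (1, 1) and (2, 2) sums to 4.
    assert_eq!(region_sum(&sat, 4, 1, 1, 2, 2), 4.0);
}
```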
Distance Transform
Euclidean distance from each pixel to the nearest seed pixel, computed via Jump Flooding Algorithm (JFA).
#![allow(unused)]
fn main() {
use vx_vision::kernels::distance::{DistanceTransform, DistanceConfig};
let dt = DistanceTransform::new(&ctx)?;
let mut cfg = DistanceConfig::default();
cfg.threshold = 0.5;
let distances = dt.compute(&ctx, &binary_input, &cfg)?;
// distances: R32Float texture with per-pixel Euclidean distance
}
Connected Components
Labels connected regions in a binary image using iterative min-label propagation.
#![allow(unused)]
fn main() {
use vx_vision::kernels::connected::{ConnectedComponents, CCLConfig};
let ccl = ConnectedComponents::new(&ctx)?;
let mut cfg = CCLConfig::default();
cfg.threshold = 0.5;
let result = ccl.label(&ctx, &binary_input, &cfg)?;
println!("{} components in {} iterations", result.n_components, result.iterations);
// result.labels: R32Float texture with integer label per pixel
}
Not pipeline-encodable — iterative convergence requires CPU readback between passes.
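The propagation scheme, sketched on the CPU for a tiny image. The GPU version performs the same min-with-neighbors passes, one dispatch per iteration, in parallel over all pixels:

```rust
// Iterative min-label propagation (4-connectivity). Each foreground pixel
// starts labeled with its own index; every pass it takes the minimum label
// among itself and its neighbors, until a pass changes nothing.
fn label(img: &[u8], w: usize, h: usize) -> (Vec<i32>, usize) {
    let mut labels: Vec<i32> = img
        .iter()
        .enumerate()
        .map(|(i, &p)| if p > 0 { i as i32 } else { -1 })
        .collect();
    let mut iterations = 0;
    loop {
        iterations += 1;
        let mut changed = false;
        for y in 0..h {
            for x in 0..w {
                let i = y * w + x;
                if labels[i] < 0 { continue; } // background
                let mut m = labels[i];
                if x > 0 && labels[i - 1] >= 0 { m = m.min(labels[i - 1]); }
                if x + 1 < w && labels[i + 1] >= 0 { m = m.min(labels[i + 1]); }
                if y > 0 && labels[i - w] >= 0 { m = m.min(labels[i - w]); }
                if y + 1 < h && labels[i + w] >= 0 { m = m.min(labels[i + w]); }
                if m < labels[i] { labels[i] = m; changed = true; }
            }
        }
        if !changed { break; }
    }
    (labels, iterations)
}

fn main() {
    // Two separate blobs on a 4x2 image.
    let img = [1, 1, 0, 1,
               1, 0, 0, 1];
    let (labels, _iters) = label(&img, 4, 2);
    assert_eq!(labels[0], labels[1]); // left blob shares one label
    assert_eq!(labels[3], labels[7]); // right blob shares one label
    assert_ne!(labels[0], labels[3]); // the blobs stay distinct
}
```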
Pipeline & Performance
Pipeline batching
By default, each kernel’s sync method (apply, compute, detect) creates its own command buffer and waits for completion. For multi-stage pipelines, this means N GPU round-trips.
Pipeline batches everything into a single command buffer:
#![allow(unused)]
fn main() {
use vx_vision::Pipeline;
let pipe = Pipeline::begin(&ctx)?;
let cmd = pipe.cmd_buf();
let s1 = blur.encode(&ctx, cmd, &input, &temp1, &blur_cfg)?;
bilateral.encode(&ctx, cmd, &temp1, &temp2, &bilateral_cfg)?;
morph.encode_dilate(&ctx, cmd, &temp2, &output, &morph_cfg)?;
let _retained = pipe.commit_and_wait();
}
Encoded state (like s1 above) holds intermediate textures that must outlive the command buffer.
Which kernels support encoding?
| Encodable | Not encodable (multi-pass) |
|---|---|
| Gaussian, Bilateral, Sobel, Canny, Morphology, Threshold, Color, Warp, Integral, Dense Flow, FAST, Harris, NMS, ORB, KLT, Resize, Undistort, DoG (subtract only) | Histogram, Homography, Connected Components, Hough |
Multi-pass kernels require CPU readback between GPU passes, so they can’t be batched.
TexturePool
GPU texture allocation is expensive. Reuse textures across frames:
#![allow(unused)]
fn main() {
use vx_vision::TexturePool;
let mut pool = TexturePool::new();
for frame in frames {
let temp = pool.acquire_gray8(&ctx, w, h)?; // reuses cached texture
blur.apply(&ctx, &frame, &temp, &cfg)?;
// ... process ...
pool.release(temp); // return to pool
}
println!("Hit rate: {:.0}%", pool.hit_rate() * 100.0);
}
The pool keys by (width, height, format). All pool textures have ShaderRead | ShaderWrite flags.
Optimization tips
Reuse kernel structs. Creating a kernel compiles the Metal pipeline. Do it once at startup.
#![allow(unused)]
fn main() {
let blur = GaussianBlur::new(&ctx)?; // once
for frame in frames {
blur.apply(&ctx, &frame, &output, &cfg)?; // reuse
}
}
Avoid unnecessary readbacks. read_gray8() forces GPU sync. If the output feeds another kernel, pass the texture directly.
Downsample first. Run feature detection on half-resolution images when full resolution isn’t needed:
#![allow(unused)]
fn main() {
let levels = pyr.build(&ctx, &input, 3)?;
let corners = fast.detect(&ctx, &levels[0], &cfg)?; // half-res
}
Batch with Pipeline. One command buffer is faster than five:
#![allow(unused)]
fn main() {
let pipe = Pipeline::begin(&ctx)?;
// encode 5 kernels into pipe.cmd_buf()
pipe.commit_and_wait();
}
Memory model
On Apple Silicon (UMA), CPU and GPU share physical memory. VX uses MTLStorageModeShared — no copies, no uploads, no downloads. waitUntilCompleted() is the only synchronization needed.
GpuGuard<T> in vx-gpu prevents CPU mutation of a UnifiedBuffer<T> while the GPU is using it, catching race conditions at runtime.
Benchmarking
Run the built-in criterion benchmarks:
cargo bench -p vx-vision
Benchmarks include:
- FAST at 752x480 and 1920x1080
- Full FAST → Harris → NMS → ORB pipeline at both resolutions
- Pipeline vs individual dispatch comparison (3x Gaussian)
Examples Walkthrough
All examples take an image path as argument:
cargo run --release --example <name> -- path/to/image.png
fast_demo
Detects FAST corners, scores with Harris, suppresses with NMS. Prints corner count and timing.
Demonstrates: Feature detection pipeline, FastDetector, HarrisScorer, NmsSuppressor.
edge_detection_demo
Runs Gaussian blur → Sobel gradients → Canny edges. Reports timing for each stage and pixel statistics.
Demonstrates: Image processing chain, GaussianBlur, SobelFilter, CannyDetector.
threshold_demo
Compares thresholding methods: histogram analysis, Otsu’s automatic threshold, fixed binary, integral image computation, and adaptive threshold. Prints timing and foreground percentages.
Demonstrates: Histogram, Threshold (all modes), IntegralImage.
advanced_cv_demo
Runs five algorithms on one image: bilateral filter, Canny + Hough line detection, Otsu + distance transform, connected components, and template matching (self-patch). Prints detailed results for each.
Demonstrates: Full range of analysis kernels.
feature_matching_demo
Detects ORB features in two images, matches with brute-force Hamming distance, and reports match statistics. Also runs SIFT detection for comparison.
Demonstrates: OrbDescriptor, BruteMatcher, SiftPipeline.
klt_benchmark
Loads a sequence of PNG frames (e.g., from EuRoC dataset), detects FAST corners on the first frame, then tracks them through subsequent frames using KLT optical flow. Reports per-frame timing and track survival rate.
Demonstrates: KltTracker, multi-frame processing, re-detection strategy.
orb_stereo_benchmark
Runs the full stereo pipeline on synthetic or real stereo pairs: FAST detection, Harris scoring, NMS, ORB descriptors, stereo matching with epipolar constraints. Reports 3D point triangulation results.
Demonstrates: StereoMatcher, full detection-to-3D pipeline.
pipeline_pool_demo
Benchmarks three approaches to multi-frame processing: individual dispatches, pipeline batching, and pipeline + TexturePool. Reports timing comparison and pool hit rates.
Demonstrates: Pipeline, TexturePool, performance comparison.
Adding a Kernel
Five steps to add a new GPU kernel to VX.
1. Write the Metal shader
Create vx-vision/shaders/YourKernel.metal:
#include <metal_stdlib>
using namespace metal;
struct YourParams {
uint width;
uint height;
float some_param;
};
kernel void your_kernel(
texture2d<float, access::read> input [[texture(0)]],
texture2d<float, access::write> output [[texture(1)]],
constant YourParams& params [[buffer(0)]],
uint2 gid [[thread_position_in_grid]]
) {
if (gid.x >= params.width || gid.y >= params.height) return;
float4 pixel = input.read(gid);
// ... your computation ...
output.write(result, gid);
}
The build system auto-discovers .metal files — no build.rs changes needed.
2. Add the parameter struct
In vx-vision/src/types.rs:
#![allow(unused)]
fn main() {
#[repr(C)]
#[derive(Clone, Copy, Debug, Pod, Zeroable)]
pub struct YourParams {
pub width: u32,
pub height: u32,
pub some_param: f32,
}
}
Must match the Metal struct field-by-field. Same types, same order, same padding. See the Architecture page for type mapping.
3. Write the Rust kernel
Create vx-vision/src/kernels/your_kernel.rs:
#![allow(unused)]
fn main() {
use crate::context::Context;
use crate::error::{Error, Result};
use crate::texture::Texture;
use crate::types::YourParams;
// ... Metal imports ...
#[derive(Clone, Debug)]
#[non_exhaustive]
pub struct YourConfig {
pub some_param: f32,
}
impl Default for YourConfig {
fn default() -> Self {
Self { some_param: 1.0 }
}
}
pub struct YourKernel {
pipeline: Retained<ProtocolObject<dyn MTLComputePipelineState>>,
}
impl YourKernel {
pub fn new(ctx: &Context) -> Result<Self> {
let name = objc2_foundation::ns_string!("your_kernel");
let func = ctx.library().newFunctionWithName(name)
.ok_or(Error::ShaderMissing("your_kernel".into()))?;
let pipeline = ctx.device()
.newComputePipelineStateWithFunction_error(&func)
.map_err(|e| Error::PipelineCompile(format!("your_kernel: {e}")))?;
Ok(Self { pipeline })
}
/// Sync method: creates command buffer, dispatches, waits.
pub fn apply(
&self, ctx: &Context, input: &Texture, output: &Texture, config: &YourConfig,
) -> Result<()> {
let cmd_buf = ctx.queue().commandBuffer()
.ok_or(Error::Gpu("failed to create command buffer".into()))?;
self.encode_pass(&cmd_buf, input, output, config)?;
cmd_buf.commit();
cmd_buf.waitUntilCompleted();
Ok(())
}
/// Pipeline encoding: writes into existing command buffer.
pub fn encode(
&self, cmd_buf: &ProtocolObject<dyn MTLCommandBuffer>,
input: &Texture, output: &Texture, config: &YourConfig,
) -> Result<()> {
self.encode_pass(cmd_buf, input, output, config)
}
fn encode_pass(/* ... */) -> Result<()> {
// set pipeline, textures, params, dispatch
}
}
unsafe impl Send for YourKernel {}
unsafe impl Sync for YourKernel {}
}
4. Register the module
In vx-vision/src/kernels/mod.rs:
#![allow(unused)]
fn main() {
pub mod your_kernel;
}
5. Add tests
In vx-vision/tests/test_kernels.rs, add a test that creates a synthetic image, runs the kernel, and verifies output properties.
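A sketch of that shape: build a synthetic image, compute the expected result with a trivially correct CPU version, then compare the kernel output against it. The GPU calls are left as comments because they depend on the kernel from steps 1–3; the invert operation here is hypothetical.

```rust
// Synthetic test input: a checkerboard of `cell`-sized squares.
fn checkerboard(w: u32, h: u32, cell: u32) -> Vec<u8> {
    (0..w * h)
        .map(|i| {
            let (x, y) = (i % w, i / w);
            if ((x / cell) + (y / cell)) % 2 == 0 { 255 } else { 0 }
        })
        .collect()
}

// Trivially correct CPU reference for a hypothetical invert kernel.
fn cpu_reference(pixels: &[u8]) -> Vec<u8> {
    pixels.iter().map(|&p| 255 - p).collect()
}

fn main() {
    let pixels = checkerboard(64, 64, 8);
    let expected = cpu_reference(&pixels);
    // GPU path, following steps 1-3:
    // let ctx = vx_vision::Context::new()?;
    // let input = ctx.texture_gray8(&pixels, 64, 64)?;
    // let output = ctx.texture_output_gray8(64, 64)?;
    // YourKernel::new(&ctx)?.apply(&ctx, &input, &output, &YourConfig::default())?;
    // assert_eq!(output.read_gray8(), expected);
    assert_eq!(expected.len(), 64 * 64);
    assert_eq!(expected[0], 0); // top-left cell was 255, inverted to 0
}
```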
Checklist
- Metal shader compiles (check cargo build output)
- #[repr(C)] struct matches MSL struct exactly
- Kernel has both sync method and encode() for pipelining
- Config struct has Default, #[non_exhaustive], Clone, Debug
- Send + Sync implemented on kernel struct
- Module registered in mod.rs
- Test passes