Engineering: Optimizing WebGPU Inference Kernels

Using raw WebGPU compute shaders to run neural network layers in under 10 ms, directly in the browser.

The Rise of Browser-Native Tensor Cores

WebGPU exposes compute shaders inside the browser sandbox, including in Chromium-based browsers. Shaders are written in WGSL, which the browser validates and translates for the native backend under the hood: Vulkan, Metal, or Direct3D.

WebGPU Compute Matrix Multiply

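// Raw WGSL shader bound to pipeline (assumes square N x N row-major f32 matrices)
const N: u32 = 256u; // illustrative size, an assumption rather than a measured configuration
@group(0) @binding(0) var<storage, read> matrixA: array<f32>;
@group(0) @binding(1) var<storage, read> matrixB: array<f32>;
@group(0) @binding(2) var<storage, read_write> matrixC: array<f32>;

@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
  // Each invocation computes one element of C; invocations run in parallel
  let row = global_id.y;
  let col = global_id.x;
  if (row >= N || col >= N) { return; }
  var sum = 0.0;
  for (var k = 0u; k < N; k++) {
    sum += matrixA[row * N + k] * matrixB[k * N + col];
  }
  matrixC[row * N + col] = sum;
}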
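
To check a kernel like the one above, it helps to have a CPU reference and a helper for sizing the dispatch grid. The following TypeScript is a minimal sketch under assumptions the post does not state: square, row-major matrices, and the hypothetical helper names `workgroupCounts` and `matmulReference`. The default tile of 16 matches the `@workgroup_size(16, 16)` attribute on the shader.

```typescript
// Hypothetical helpers for illustration; the names are not from the original post.

/** Workgroups needed to cover a rows x cols output with tile x tile workgroups. */
function workgroupCounts(rows: number, cols: number, tile: number = 16): [number, number] {
  // x covers columns, y covers rows; round up so partial tiles are still dispatched.
  return [Math.ceil(cols / tile), Math.ceil(rows / tile)];
}

/** CPU reference: C = A * B for square n x n row-major matrices. */
function matmulReference(a: Float32Array, b: Float32Array, n: number): Float32Array {
  const c = new Float32Array(n * n);
  for (let row = 0; row < n; row++) {
    for (let col = 0; col < n; col++) {
      // Accumulates in double precision, then rounds to f32 on store,
      // so GPU results should be compared with a tolerance, not exactly.
      let sum = 0;
      for (let k = 0; k < n; k++) {
        sum += a[row * n + k] * b[k * n + col];
      }
      c[row * n + col] = sum;
    }
  }
  return c;
}
```

The two counts would be passed to `dispatchWorkgroups(x, y)` on the compute pass encoder. Because the grid rounds up, some invocations fall outside the matrix, so the shader needs a bounds check on `global_id`.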

We compare matrix-multiplication performance using 16-bit floats (FP16) against 32-bit floats (FP32). Combined with dynamic memory packing schemes, FP16 delivers a 2.1x speedup in sequence execution over FP32.
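
The post does not spell out its packing scheme. One common layout, shown here purely as an illustration, stores two FP16 values in each 32-bit word, with the first value in the low 16 bits; this is the layout WGSL's `pack2x16float` and `unpack2x16float` builtins use. The sketch below packs on the host side in TypeScript. The names `toHalfBits`, `pack2x16`, and `packHalfArray` are hypothetical, and conversion uses round-to-nearest-even with overflow mapped to infinity.

```typescript
// Hypothetical host-side FP16 packing; not the authors' scheme.
const scratchF32 = new Float32Array(1);
const scratchU32 = new Uint32Array(scratchF32.buffer);

/** Convert a number to IEEE 754 binary16 bits (round-to-nearest-even). */
function toHalfBits(value: number): number {
  scratchF32[0] = value;
  const bits = scratchU32[0];
  const sign = (bits >>> 16) & 0x8000;
  const exp = (bits >>> 23) & 0xff;
  let mant = bits & 0x7fffff;

  if (exp === 0xff) {
    // Infinity or NaN; keep NaN as a quiet NaN.
    return sign | 0x7c00 | (mant !== 0 ? 0x200 : 0);
  }
  const e = exp - 127 + 15; // rebias exponent from f32 to f16
  if (e >= 0x1f) return sign | 0x7c00; // too large: infinity
  if (e <= 0) {
    if (e < -10) return sign; // too small even for a subnormal: signed zero
    mant |= 0x800000; // restore the implicit leading 1
    const shift = 14 - e; // 14..24
    let sub = mant >>> shift;
    const rem = mant & ((1 << shift) - 1);
    const halfway = 1 << (shift - 1);
    if (rem > halfway || (rem === halfway && (sub & 1) === 1)) sub++;
    return sign | sub;
  }
  let half = (e << 10) | (mant >>> 13);
  const rem = mant & 0x1fff;
  // A carry out of the mantissa bumps the exponent, which is the correct result.
  if (rem > 0x1000 || (rem === 0x1000 && (half & 1) === 1)) half++;
  return sign | half;
}

/** Pack two values into one u32: lo in bits 0..15, hi in bits 16..31. */
function pack2x16(lo: number, hi: number): number {
  return ((toHalfBits(hi) << 16) | toHalfBits(lo)) >>> 0;
}

/** Pack an f32 array into half as many 32-bit words (odd lengths are zero-padded). */
function packHalfArray(values: Float32Array): Uint32Array {
  const out = new Uint32Array(Math.ceil(values.length / 2));
  for (let i = 0; i < out.length; i++) {
    const lo = values[2 * i];
    const hi = 2 * i + 1 < values.length ? values[2 * i + 1] : 0;
    out[i] = pack2x16(lo, hi);
  }
  return out;
}
```

A buffer packed this way is half the size of its FP32 equivalent, which reduces the memory a kernel has to read. Whether that accounts for the measured speedup depends on the kernel and hardware.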