Engineering: Optimizing WebGPU Inference Kernels

Using hand-written WebGPU compute shaders to run individual neural-network layers in under 10 ms, directly in the browser.

The Rise of Browser-Native Tensor Cores

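WebGPU exposes compute shaders inside the sandbox of Chromium-based browsers. Shaders are written in WGSL, which the browser translates into the native shader format of the underlying Vulkan, Metal, or Direct3D backend, so WGSL code runs on the same GPU pipelines as native applications without leaving the sandbox.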

WebGPU Compute Matrix Multiply
    // Raw WGSL shader bound to the compute pipeline (f32 elements shown)
    @group(0) @binding(0) var<storage, read> matrixA: array<f32>;
    @group(0) @binding(1) var<storage, read> matrixB: array<f32>;
    @group(0) @binding(2) var<storage, read_write> matrixC: array<f32>;

    @compute @workgroup_size(16, 16)
    fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
        // Each invocation computes one element of matrixC, in parallel with the others
    }
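The kernel body above is left as a placeholder. As a minimal sketch of one way to fill it in, the TypeScript below embeds a complete WGSL kernel and a CPU emulation of the same indexing that can be used to check GPU output. It assumes square, row-major n x n matrices and an extra uniform at binding 3 that carries n. Neither assumption comes from the original shader, and the names `MATMUL_WGSL`, `Dims`, `workgroupCounts`, and `matmulReference` are hypothetical.

```typescript
// Tile edge; must match @workgroup_size(16, 16) in the shader.
const TILE = 16;

// Assumed kernel: square row-major matrices, with n supplied through a
// uniform at binding 3 (an addition for this sketch, not in the original).
const MATMUL_WGSL = /* wgsl */ `
struct Dims { n: u32 }

@group(0) @binding(0) var<storage, read> matrixA: array<f32>;
@group(0) @binding(1) var<storage, read> matrixB: array<f32>;
@group(0) @binding(2) var<storage, read_write> matrixC: array<f32>;
@group(0) @binding(3) var<uniform> dims: Dims;

@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
  let row = global_id.y;
  let col = global_id.x;
  // Edge workgroups overshoot when n is not a multiple of 16.
  if (row >= dims.n || col >= dims.n) { return; }
  var acc = 0.0;
  for (var k = 0u; k < dims.n; k = k + 1u) {
    acc = acc + matrixA[row * dims.n + k] * matrixB[k * dims.n + col];
  }
  matrixC[row * dims.n + col] = acc;
}
`;

// Workgroups to dispatch per axis so that every output element is covered.
function workgroupCounts(n: number): [number, number] {
  const groups = Math.ceil(n / TILE);
  return [groups, groups];
}

// CPU emulation of the dispatch: one inner iteration per GPU invocation,
// using the same bounds guard and row-major indexing as the shader.
function matmulReference(a: Float32Array, b: Float32Array, n: number): Float32Array {
  const c = new Float32Array(n * n);
  const [gx, gy] = workgroupCounts(n);
  for (let y = 0; y < gy * TILE; y++) {
    for (let x = 0; x < gx * TILE; x++) {
      if (y >= n || x >= n) continue;
      let acc = 0; // accumulated in double precision here, f32 on the GPU
      for (let k = 0; k < n; k++) acc += a[y * n + k] * b[k * n + x];
      c[y * n + x] = acc;
    }
  }
  return c;
}
```

On the host side, the two values from `workgroupCounts(n)` would be passed to `dispatchWorkgroups` on the compute pass. Because the reference accumulates in double precision, comparisons against GPU results should allow a small tolerance rather than expect exact equality.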

We evaluated matrix-multiplication performance with 16-bit floats (FP16) against 32-bit floats (FP32). Combined with dynamic memory packing schemes, FP16 delivered a 2.1x speedup in sequence execution.
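The post does not spell out the packing scheme, so the following is only an illustrative sketch of one common approach: converting each FP32 value to FP16 and storing two halves per 32-bit word, which halves the bytes a kernel has to read. The layout mirrors WGSL's `pack2x16float` builtin (first value in the low 16 bits), so a shader could decode each word with `unpack2x16float`. The helper names `roundShift`, `f32ToF16Bits`, `packHalf2x16`, and `packF16` are hypothetical.

```typescript
const f32Scratch = new Float32Array(1);
const u32Scratch = new Uint32Array(f32Scratch.buffer);

// Shift right by `shift` bits, rounding to nearest with ties to even.
function roundShift(m: number, shift: number): number {
  const q = m >>> shift;
  const rem = m & ((1 << shift) - 1);
  const halfway = 1 << (shift - 1);
  return q + (rem > halfway || (rem === halfway && (q & 1) === 1) ? 1 : 0);
}

// Convert a number to IEEE 754 binary16 bits via its binary32 encoding.
function f32ToF16Bits(value: number): number {
  f32Scratch[0] = value;
  const x = u32Scratch[0];
  const sign = (x >>> 16) & 0x8000;
  const exp = (x >>> 23) & 0xff; // biased binary32 exponent
  const mant = x & 0x7fffff;
  if (exp === 0xff) return sign | 0x7c00 | (mant ? 0x200 : 0); // Inf or NaN
  const e = exp - 127 + 15; // rebias for binary16
  if (e >= 0x1f) return sign | 0x7c00; // too large: overflow to Inf
  if (e < -10) return sign; // too small: flush to signed zero
  if (e <= 0) {
    // Subnormal half: restore the implicit leading 1, then shift into place.
    return sign | roundShift(mant | 0x800000, 14 - e);
  }
  // Normal half: a rounding carry may bump the exponent, which is correct.
  return sign | ((e << 10) + roundShift(mant, 13));
}

// Pack two values into one u32, first value in the low 16 bits.
function packHalf2x16(lo: number, hi: number): number {
  return ((f32ToF16Bits(hi) << 16) | f32ToF16Bits(lo)) >>> 0;
}

// Pack an FP32 array into half as many bytes; odd lengths are padded with 0.
function packF16(values: Float32Array): Uint32Array {
  const out = new Uint32Array(Math.ceil(values.length / 2));
  for (let i = 0; i < values.length; i += 2) {
    const hi = i + 1 < values.length ? values[i + 1] : 0;
    out[i >> 1] = packHalf2x16(values[i], hi);
  }
  return out;
}
```

FP16 keeps only 10 mantissa bits and overflows to infinity above 65504, so weights and activations may need scaling before packing. Where the device exposes the `shader-f16` feature, WGSL's native `f16` type (enabled with `enable f16;`) is an alternative to manual packing.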