Running LLMs in the Browser with WebGPU
How on-device inference in the browser works with transformers.js and WebGPU, and when it actually makes sense.

Running a language model inside the browser is no longer a party trick. It is increasingly a legitimate architectural choice — and in some cases, the right one. The combination of WebGPU, aggressive quantization, and maturing libraries has quietly crossed a threshold where browser inference is fast enough to be useful.
Why Now?
Several things converged at roughly the same time.
WebGPU shipped in Chrome 113 (May 2023) and has since landed in Firefox and Edge. It gives JavaScript real access to GPU compute — not the limited, draw-call-oriented API of WebGL, but a proper compute-shader model closer to Metal or Vulkan. That matters because transformer inference is dominated by matrix multiplications that GPUs handle orders of magnitude faster than a CPU.
Quantization shrank the models. INT8 and INT4 quantization cut model weights to roughly one-quarter and one-eighth of their original float32 size, respectively, with manageable quality loss. A model that was 6 GB at full precision can drop to under 1 GB at INT4, putting it within range of a browser download.
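As a rough check on those numbers, weight size is parameter count times bits per weight, divided by eight. The sketch below assumes every weight is stored at the same bit width and ignores quantization scales, tokenizer files, and config, so real downloads come out somewhat larger. The 1.5B parameter count is a hypothetical figure chosen because it lands on 6 GB at float32.

```javascript
// Bits used to store one weight in each format.
const BITS_PER_WEIGHT = { fp32: 32, fp16: 16, int8: 8, int4: 4 };

// Approximate size of the weights alone, in bytes.
function estimateWeightBytes(paramCount, format) {
  if (!Object.hasOwn(BITS_PER_WEIGHT, format)) {
    throw new Error(`Unknown format: ${format}`);
  }
  return (paramCount * BITS_PER_WEIGHT[format]) / 8;
}

function toGigabytes(bytes) {
  return bytes / 1e9;
}

// Hypothetical 1.5B-parameter model.
const params = 1.5e9;
for (const format of Object.keys(BITS_PER_WEIGHT)) {
  const gb = toGigabytes(estimateWeightBytes(params, format));
  console.log(`${format}: ${gb.toFixed(2)} GB`);
}
// fp32: 6.00 GB
// fp16: 3.00 GB
// int8: 1.50 GB
// int4: 0.75 GB
```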
transformers.js v3 (from Hugging Face) brought a JavaScript-native pipeline API that mirrors the Python transformers library. It runs on WebGPU when you request it (device: "webgpu") and ships a WASM backend for browsers where WebGPU is unavailable. You do not write any shader code. You just call a pipeline.
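If you would rather decide on the backend yourself before loading anything, a feature check along these lines is one option. This is a minimal sketch: `pickDevice` is a hypothetical helper name, and the "wasm" device string is an assumption about what your transformers.js version accepts. `navigator.gpu.requestAdapter()` is the standard WebGPU entry point and resolves to null when no suitable adapter exists.

```javascript
// Returns "webgpu" when a GPU adapter can be obtained, otherwise "wasm".
// `nav` defaults to the browser's navigator; it is a parameter so the
// function can also be exercised outside a browser.
async function pickDevice(nav = globalThis.navigator) {
  if (!nav || !nav.gpu) return "wasm";
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    // Some environments expose navigator.gpu but reject the request.
    return "wasm";
  }
}

// Possible usage (not executed here):
//   const device = await pickDevice();
//   const generator = await pipeline("text-generation", modelId, { device });
```

An adapter being available does not guarantee that every model runs correctly on it, so treat the result as a first guess rather than a promise.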
Loading a Model in the Browser
Here is a realistic example using the transformers.js pipeline API:
    import { pipeline, env } from "@huggingface/transformers";

    // Keep ONNX Runtime's WASM execution on the calling thread (no proxy worker).
    // This setting does not select a device; the `device` option below does.
    env.backends.onnx.wasm.proxy = false;

    async function runInference() {
      // Downloads and caches the model on first run (~300 MB for this checkpoint)
      const generator = await pipeline(
        "text-generation",
        "Xenova/Qwen2.5-0.5B-Instruct",
        { device: "webgpu", dtype: "q4" }
      );

      const messages = [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: "Summarise what WebGPU is in two sentences." },
      ];

      // Greedy decoding, capped at 128 new tokens
      const result = await generator(messages, {
        max_new_tokens: 128,
        do_sample: false,
      });

      // generated_text holds the whole conversation; the last entry is the reply
      console.log(result[0].generated_text.at(-1).content);
    }

    runInference();

The first call to pipeline() fetches weights from the Hugging Face Hub and stores them in the browser's cache. Subsequent page loads skip the download because the browser serves the weights from disk, although the model still has to be loaded and initialised on each visit.
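Cached weights count against the origin's storage quota, so it can be worth checking for room before starting a large download. The sketch below uses the standard `navigator.storage.estimate()` call; `hasRoomFor` is a hypothetical helper, and the byte count you pass in is your own estimate of the checkpoint size.

```javascript
// Reports whether the origin's remaining quota can hold `requiredBytes`.
// Returns null when the StorageManager API is unavailable, so the caller
// can decide whether to proceed anyway.
async function hasRoomFor(requiredBytes, storage = globalThis.navigator?.storage) {
  if (!storage || typeof storage.estimate !== "function") return null;
  const { usage = 0, quota = 0 } = await storage.estimate();
  return quota - usage >= requiredBytes;
}

// Possible usage (not executed here), assuming a ~300 MB checkpoint:
//   const ok = await hasRoomFor(300 * 1024 * 1024);
//   if (ok === false) showMessage("Not enough browser storage for the model.");
```

Browsers report quota as an estimate rather than an exact figure, so leave some headroom rather than cutting it close.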
Warning
Model weights range from 300 MB to 4 GB depending on the checkpoint and quantization level. Users on slow or metered connections will wait, sometimes for several minutes, on first load. Always show a visible download progress indicator, and verify that the weights are actually being persisted (transformers.js caches them through the browser's Cache API by default) so users only pay that cost once.
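One way to drive a progress indicator is the `progress_callback` option that `pipeline()` accepts. A model is made up of several files, so the sketch below folds per-file events into one overall percentage. The event shape used here (`status`, `file`, `loaded`, `total`) is an assumption about what the library emits, and `createProgressTracker` is a hypothetical helper; log a few real events before relying on it.

```javascript
// Builds a callback that aggregates per-file download events into a single
// percentage and passes it to `onUpdate`.
function createProgressTracker(onUpdate) {
  const files = new Map(); // file name -> { loaded, total }

  return function handleEvent(event) {
    // Ignore lifecycle events that carry no byte counts.
    if (event.status !== "progress" || !event.total) return;

    files.set(event.file, { loaded: event.loaded, total: event.total });

    let loaded = 0;
    let total = 0;
    for (const f of files.values()) {
      loaded += f.loaded;
      total += f.total;
    }
    onUpdate(Math.round((loaded / total) * 100));
  };
}

// Possible usage (not executed here):
//   const progress_callback = createProgressTracker((pct) => { bar.value = pct; });
//   const generator = await pipeline("text-generation", modelId, { progress_callback });
```

Because the total grows as new files are announced, the percentage can step backwards when a large file starts downloading. If that looks odd in your UI, clamp the displayed value so it never decreases.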
The Tradeoffs
Browser inference has real advantages, but it also has real constraints worth being honest about.
Privacy is the strongest argument in its favour. Data never leaves the device. For document Q&A, note summarisation, or anything involving sensitive personal content, keeping inference local eliminates an entire class of data-handling concerns.
Performance is more complicated. On a modern MacBook with WebGPU, a 0.5B–1.5B parameter quantized model can generate 20–50 tokens per second — fast enough for interactive use. On older hardware, or when WASM fallback kicks in, throughput can drop to single digits. Server inference on a dedicated GPU is still dramatically faster for large models.
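Figures like these vary enough between machines that it is worth measuring on your own targets. The helper below is a minimal sketch: `measureThroughput` is a hypothetical name, and `generate` stands for any async function you write that runs one generation and resolves to the number of new tokens it produced.

```javascript
// Times one generation call and converts the result to tokens per second.
// `now` is injectable so the function can be tested with a fake clock.
async function measureThroughput(generate, now = () => performance.now()) {
  const start = now();
  const tokenCount = await generate();
  const elapsedMs = now() - start;
  const tokensPerSecond = elapsedMs > 0 ? tokenCount / (elapsedMs / 1000) : 0;
  return { tokenCount, elapsedMs, tokensPerSecond };
}

// Possible usage (not executed here); countNewTokens is a hypothetical helper:
//   const stats = await measureThroughput(async () => {
//     const result = await generator(messages, { max_new_tokens: 128 });
//     return countNewTokens(result);
//   });
//   console.log(`${stats.tokensPerSecond.toFixed(1)} tokens/s`);
```

The first run after a page load usually includes one-off setup work, so a second or third run is likely to give a more representative number.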
Compatibility is a genuine gap. WebGPU support is incomplete in Safari (Metal backend is present but has known issues as of mid-2026) and absent in older browsers. The WASM fallback works everywhere but is significantly slower. You need to test both paths.
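Since a browser can advertise WebGPU and still fail when the model loads, one defensive pattern is to try backends in order and keep the first that works. This is a sketch under assumptions: `loadWithFallback` and `createPipeline` are hypothetical names, and `createPipeline` is a function you supply that wraps the real `pipeline()` call for a given device.

```javascript
// Tries each device in order and returns the first pipeline that loads.
// Throws an AggregateError carrying every failure if none of them work.
async function loadWithFallback(createPipeline, devices = ["webgpu", "wasm"]) {
  const errors = [];
  for (const device of devices) {
    try {
      const instance = await createPipeline(device);
      return { device, instance };
    } catch (error) {
      errors.push(error);
    }
  }
  throw new AggregateError(
    errors,
    `No backend could be initialised (tried: ${devices.join(", ")})`
  );
}

// Possible usage (not executed here):
//   const { device, instance: generator } = await loadWithFallback((device) =>
//     pipeline("text-generation", modelId, { device, dtype: "q4" })
//   );
//   console.log(`Running on ${device}`);
```

Passing a single-element list such as `["wasm"]` forces one path, which is a convenient way to exercise the slower backend on a machine that does support WebGPU.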
Quality is bounded by model size. Quantized 0.5B–7B models are impressively capable for their size, but they are not GPT-4-class. For tasks that require deep reasoning, broad world knowledge, or high reliability, a server-side API call to a large model is still the right answer.
When to Reach for Browser Inference
Browser inference fits best in a narrow but valuable set of situations:
- Privacy-sensitive tasks where sending data to a server is legally or ethically unacceptable — local document Q&A, personal note summarisation, medical or financial draft assistance.
- Offline-first apps where a network connection cannot be assumed — field tools, note-taking apps, anything that needs to work on a plane.
- Demos and prototypes where you want zero backend infrastructure and are willing to accept the download cost and quality constraints.
It is not the right choice for production at scale, for tasks that need a large capable model, or for users on low-end hardware where you cannot guarantee a usable experience.
Takeaway
WebGPU and transformers.js have made browser inference real — not theoretical. The constraints (model size, WebGPU coverage, quality ceiling) are genuine, and they narrow the use cases considerably. But for privacy-sensitive or offline-first applications, running a small quantized model entirely on the client is now a serious option worth evaluating, not just a demo trick.

Written by
Rhythm Bhiwani
Engineer and relentless builder, happiest reverse-engineering hard problems until they click.


