DeepSeek V4 Flash Is Running Everywhere — AMD, NVIDIA, and Even a Raspberry Pi
Two weeks ago, DeepSeek dropped DeepSeek V4 Flash on HuggingFace: 284 billion parameters, 13 billion active, a 1M token context window, MIT license. The benchmark numbers were impressive — 91.6 on LiveCodeBench in Max mode, competitive with GPT-5 on coding tasks. But the real story isn't what's on the spec sheet. It's what happened after the weights went public.
In the span of a single week, the open-source community ported V4 Flash to three fundamentally different hardware platforms — and the implications for the AI hardware landscape are bigger than any benchmark score.
AMD MI300X: The One That Broke HN
On June 2, Fergus Finn published a detailed walkthrough of bringing DeepSeek V4 Flash up on AMD MI300X GPUs, and it shot to the top of Hacker News with 119 points. The significance is hard to overstate: MI300X is AMD's flagship datacenter GPU, the company's direct competitor to NVIDIA's H100. Getting a frontier-class model running on it without NVIDIA's CUDA stack isn't just a hobbyist flex — it's a proof point that the AI hardware market might actually be contestable.
The blog post walks through the full pipeline: vLLM setup, FP8 quantization, tensor parallelism across two MI300X cards, and hitting ~15 tokens per second at batch size 1. That's not blazing fast, but it's real. It works. And that changes the conversation for any enterprise sitting on AMD hardware wondering if they're locked out of the open-weight model revolution.
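For readers who want a feel for what that pipeline looks like at launch time, here is a minimal sketch that assembles a `vllm serve` command with the settings named above (FP8 quantization, tensor parallelism of 2). It is not taken from Finn's post: the model ID is a hypothetical placeholder, and the helper `build_vllm_serve_command` is an illustrative name, not part of vLLM. The flags themselves (`--tensor-parallel-size`, `--quantization`, `--max-model-len`) are standard vLLM engine arguments.

```python
import shlex

# Hypothetical Hugging Face repo ID, used only for illustration.
MODEL_ID = "deepseek-ai/DeepSeek-V4-Flash"


def build_vllm_serve_command(model_id, tensor_parallel_size=2,
                             quantization="fp8", max_model_len=None):
    """Return an argv list for `vllm serve` with the given settings."""
    argv = [
        "vllm", "serve", model_id,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--quantization", quantization,
    ]
    # Only cap the context length when the caller asks for it.
    if max_model_len is not None:
        argv += ["--max-model-len", str(max_model_len)]
    return argv


command = build_vllm_serve_command(MODEL_ID)
print(shlex.join(command))
```

Building the command as a list keeps it safe to hand to `subprocess.run` later. Whether these exact settings reproduce the ~15 tokens per second figure depends on the ROCm and vLLM versions in use, which this sketch does not pin.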
The HN discussion that followed was equal parts technical deep-dive and market analysis. Multiple commenters pointed out that DeepSeek's MoE architecture, where only 13B of 284B parameters are active per token, is a natural fit for AMD's hardware profile. MI300X has 192GB of HBM3 memory (versus H100's 80GB), so the FP8 weights, roughly 284GB at one byte per parameter, fit across the two cards in Finn's setup. Holding the same weights on 80GB cards would take at least four, and wider parallelism means more inter-GPU communication eating into throughput.
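The memory argument is easy to check with back-of-the-envelope arithmetic. The sketch below counts weights only, assuming one byte per parameter at FP8 and decimal gigabytes; it ignores KV cache, activations, and runtime overhead, so real deployments need headroom beyond these minimums. The function names are illustrative.

```python
import math

TOTAL_PARAMS = 284e9        # total parameter count quoted in the article
BYTES_PER_PARAM_FP8 = 1     # FP8 stores one byte per parameter


def weight_footprint_gb(total_params, bytes_per_param):
    """Size of the model weights in decimal gigabytes."""
    return total_params * bytes_per_param / 1e9


def min_cards_for_weights(weights_gb, card_memory_gb):
    """Smallest number of cards whose combined memory holds the weights."""
    return math.ceil(weights_gb / card_memory_gb)


weights_gb = weight_footprint_gb(TOTAL_PARAMS, BYTES_PER_PARAM_FP8)
mi300x_cards = min_cards_for_weights(weights_gb, 192)  # MI300X: 192GB
h100_cards = min_cards_for_weights(weights_gb, 80)     # H100: 80GB

print(f"weights: {weights_gb:.0f} GB")
print(f"MI300X cards needed: {mi300x_cards}, H100 cards needed: {h100_cards}")
```

Under these assumptions the weights need two MI300X cards or four H100s, which matches the two-card tensor-parallel setup described in the walkthrough.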
NVIDIA DGX Spark: The Official Path
The same day, an NVIDIA forum post confirmed V4 Flash running on 2× DGX Spark systems with tensor parallelism, handling 200K context out of the box. This is the "works on NVIDIA" story, which is less surprising — of course it does, CUDA is the default — but the confirmation that it runs on a pair of consumer-adjacent systems (DGX Spark is NVIDIA's $3,000 developer kit) still matters. You don't need a datacenter.
Raspberry Pi 5: The One Nobody Expected
A day earlier, on June 1, Daniel Somerfield posted a video of DeepSeek V4 Flash running on a Raspberry Pi 5 with 8GB of RAM. This isn't a production deployment (the inference speed is measured in seconds per token, not tokens per second), but it proves a point that should terrify anyone betting on API lock-in as a business model.
A 284B parameter model that can technically run on a sub-$100 single-board computer is a model that will run anywhere. Your laptop. Your cloud VM. Your AMD cluster. Your NVIDIA DGX. Your Raspberry Pi collecting dust in a drawer. The hardware portability of open-weight models is not just a feature. It's a moat.
Why Hardware Portability Is the Real Moat
For the last 18 months, the AI industry has been having the wrong conversation about "moats." OpenAI and Anthropic argue that API exclusivity — you can only access GPT-5.5 and Claude Opus 4.8 through their servers — is a durable competitive advantage. Google bets on vertical integration: Gemini Ultra 2 runs on Google's TPUs, served through Google's API, with Google's rate limits.
But the hardware portability story suggests a different kind of moat. When a model runs on AMD, NVIDIA, and ARM — when it can be deployed on-prem, in your VPC, or on a Pi — the moat isn't in the API endpoint. It's in the deployment optionality. Enterprises that care about data sovereignty, latency, or infrastructure flexibility can't even consider GPT-5.5 for sensitive workloads — it's API-or-nothing. But V4 Flash can run in their datacenter, on their hardware, under their control.
This is the same dynamic that made Linux eat the server market. Proprietary UNIX had better support contracts and cleaner integration, but Linux ran everywhere. When "everywhere" matters more than "polished," the portable option wins. DeepSeek's MIT license is the GPL of the LLM era.
The Million-Token Context Multiplier
Hardware portability would be noteworthy on any open model. But V4 Flash pairs it with a 1 million token context window — something that, until May, required a Gemini Ultra 2 or Grok 4.3 API key. Both are excellent models, and both are completely locked to their provider's infrastructure. You cannot run Grok 4.3 on-prem. You cannot fine-tune Gemini Ultra 2. You cannot even inspect their attention patterns.
DeepSeek's "Hybrid Attention" mechanism is what makes this possible. By combining Compressed Sparse Attention and Heavily Compressed Attention, it shrinks the KV cache at 1M tokens to roughly 10% of what a dense model would require. For long-context serving, that's the difference between "fits in 80GB" and "needs a cluster," and together with the sparse 13B-active MoE design it's part of why a model with 284B total parameters can run on consumer hardware.
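To see why a 10x reduction matters at this scale, consider a rough KV cache estimate. The sketch below uses the standard dense-attention formula (keys plus values, per layer, per KV head, per token). The layer count, head count, and head dimension are hypothetical values chosen for illustration, not DeepSeek's published configuration, and the 10% ratio is simply the article's figure applied as a flat multiplier rather than a model of how the compression works.

```python
def dense_kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                         seq_len, bytes_per_value):
    """KV cache size for standard dense attention.

    The factor of 2 covers one key tensor and one value tensor per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical dense-transformer shape (assumed, not from DeepSeek).
NUM_LAYERS = 60
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2          # 16-bit cache entries
SEQ_LEN = 1_000_000          # the 1M token context window

COMPRESSION_RATIO = 0.10     # the article's "roughly 10%" figure

dense_gb = dense_kv_cache_bytes(
    NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN, BYTES_PER_VALUE
) / 1e9
compressed_gb = dense_gb * COMPRESSION_RATIO

print(f"dense KV cache: {dense_gb:.1f} GB")
print(f"compressed KV cache: {compressed_gb:.1f} GB")
```

With these assumed dimensions, the dense cache lands near 246GB for a single 1M-token sequence, while the compressed one is under 25GB. The exact numbers shift with the real architecture, but the order-of-magnitude gap is the point.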
For comparison, Llama 4 405B, the other major open-weight contender at 8.7 on LMRank, maxes out at 128K context. It's a great model, but the gap between 128K and 1M tokens isn't incremental. It's the difference between summarizing a document and ingesting an entire codebase, legal discovery corpus, or biomedical literature review in a single pass.
The Open-Weight Flywheel
There's a pattern emerging that should worry every closed-model provider:
- Release open weights under a permissive license — DeepSeek (MIT), MiniMax M3 (Apache 2.0), Mistral (Apache 2.0)
- Community ports to diverse hardware — AMD, Apple Silicon, Raspberry Pi, consumer GPUs
- Enterprise adoption accelerates — no vendor lock-in, no data egress, no per-token tax
- Feedback loop tightens — more deployments → more bug reports → faster improvements → better model
This flywheel is already spinning. DeepSeek V4 Flash went from HuggingFace upload to running on three hardware platforms in under two weeks. Qwen3.6 35B A3B (8.5 score, 35B total / 3B active) achieves similar portability — you can run it on a MacBook. MiniMax M3 (8.5 score, 1M context, open weights) just joined the party with Together AI publishing inference optimization guides within days of release.
Contrast this with the closed-model experience. When OpenAI releases GPT-5.5, you get an API endpoint. When DeepSeek releases V4 Flash, you get weights, a README, and within two weeks the community has it running on hardware the original team never tested. The velocity difference is not small. It's structural.
What This Means for the LMRank Leaderboard
DeepSeek V4 Flash currently scores 8.5 on LMRank, tied with DeepSeek R1, Grok 4.3, Qwen3.6 35B A3B, and MiniMax M3. Its big sibling DeepSeek V4 Pro sits at 9.0 — the highest-scoring open-weight model on the board, trailing only Claude Opus 4.8, Opus 4.7, Opus 4.5, GPT-5.5, GPT-5, and Gemini Ultra 2.
But raw scores don't capture the full picture. V4 Flash's LMRank Value Score — which weights capability against cost and deployment flexibility — would likely place it at or near the top of the board. When you factor in zero API costs for self-hosted deployments, MIT licensing, and hardware portability across AMD, NVIDIA, and ARM, the value proposition is hard to beat.
The million-token club now has an open-weight member alongside the closed ones: Gemini Ultra 2 (9.2, closed, $10/M tokens), Grok 4.3 (8.5, closed, $1.25/M tokens), and DeepSeek V4 Flash (8.5, open, self-hosted or ~$0.50/M tokens via API). MiniMax M3 (8.5, open weights) also offers 1M context. For the first time, the cheapest of the options priced here is also one you can run on your own hardware.
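The per-million-token prices quoted above translate into concrete bills quickly. The sketch below treats each figure as a flat price per million tokens, which is an assumption: real pricing usually separates input and output tokens, and the DeepSeek figure is approximate. The 50M-token daily workload is a hypothetical example.

```python
# Prices per million tokens as quoted in the article (USD).
PRICE_PER_MILLION_TOKENS_USD = {
    "Gemini Ultra 2": 10.00,
    "Grok 4.3": 1.25,
    "DeepSeek V4 Flash (API)": 0.50,  # article says "~$0.50"
}


def cost_usd(tokens, price_per_million):
    """Cost of processing `tokens` at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million


def cost_table(tokens):
    """Cost of the same workload on every model in the price list."""
    return {
        name: cost_usd(tokens, price)
        for name, price in PRICE_PER_MILLION_TOKENS_USD.items()
    }


daily_tokens = 50_000_000  # hypothetical workload: fifty 1M-token passes
costs = cost_table(daily_tokens)
cheapest = min(costs, key=costs.get)

for name, cost in costs.items():
    print(f"{name}: ${cost:,.2f} per day")
print(f"cheapest: {cheapest}")
```

Self-hosting replaces the per-token price with hardware and power costs, which this flat-rate comparison does not attempt to model.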
Sources: Fergus Finn: V4 Flash on AMD MI300X · NVIDIA Forums: DGX Spark Deployment · Daniel Somerfield: Pi 5 Demo