A from-scratch nano MoE trained on 18B tokens — and why the early signal matters
Most AI narratives today boil down to one thing: who can buy the most compute? But a small independent lab in Austria is taking the opposite bet—that disciplined architecture and high-signal data can rival brute-force scale—and the early results challenge conventional assumptions about what’s possible with minimal resources.
Noeum.ai recently released Noeum-1-Nano, a nano-scale Mixture-of-Experts model (0.6B total parameters / ~0.2B active) trained entirely from scratch on 18 billion tokens—roughly 20–667× less training data than standard models in its class. The notable detail isn’t just the size—it’s the methodology: the team reports benchmarks with its optional “thinking mode” disabled to keep comparisons fair, and the results still show above-average performance for a nano-class model, including a #1 ranking on MRPC (semantic equivalence) among comparable models.
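To make the total-versus-active distinction concrete, here is a minimal sketch of standard top-k Mixture-of-Experts routing: only the k experts selected for each token actually run, which is why active parameters can be a fraction of total parameters. The expert count, layer sizes, and router design below are arbitrary placeholders for illustration, not Noeum-1-Nano's actual configuration.

```python
# Minimal top-k MoE sketch (illustrative only; not Noeum-1-Nano's architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparse feed-forward block: each token is routed to only k of n_experts."""
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # learned gating scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (n_tokens, d_model)
        scores = self.router(x)                        # (n_tokens, n_experts)
        top_w, top_i = scores.topk(self.k, dim=-1)     # keep only k experts per token
        top_w = F.softmax(top_w, dim=-1)               # normalize the kept scores
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = top_i[:, slot] == e             # tokens whose slot routes to expert e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
tokens = torch.randn(16, 512)
print(moe(tokens).shape)  # torch.Size([16, 512]); only 2 of 8 experts ran per token
```

The practical point is the ratio: all experts contribute to the parameter count you store, but per token you only pay compute for the routed few, which is how a 0.6B-parameter model can run with roughly 0.2B active.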
The investor-relevant takeaway is that this is a proof of method, not a promise. Training weights from scratch, sustaining strong reasoning behavior under a tight token budget, and being explicit about evaluation posture are how you de-risk a bigger scaling plan.
One concrete example: the model supports a dedicated System-2 style “think mode” designed for multi-step verification and self-correction. In demonstrations, that mode correctly solves basic multi-step reasoning (e.g., distance = speed × time) and fact-checking style prompts where standard generation can fail—behavior that small models typically struggle to sustain reliably.
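To make "multi-step verification and self-correction" concrete, here is a toy propose-then-verify loop for the distance example. It illustrates the check-your-own-work pattern the article describes; it is not a look inside Noeum's think mode, and the function name and tolerance are invented for the sketch.

```python
# Toy propose-then-verify loop for the distance = speed * time example.
# Purely illustrative; not Noeum-1-Nano's think mode.
def solve_distance(speed_kmh: float, time_h: float) -> float:
    proposed = speed_kmh * time_h            # Step 1: apply the governing formula

    # Step 2: self-check by inverting the relation (time = distance / speed)
    recovered_time = proposed / speed_kmh
    if abs(recovered_time - time_h) > 1e-9:
        raise ValueError("self-check failed; revise the answer")

    return proposed

print(solve_distance(60, 2.5))  # 150.0 km: 60 km/h for 2.5 hours
```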
Where this gets interesting is the roadmap. Noeum.ai’s plan is not “outspend the incumbents.” It’s: iterate cheaply at the nano scale, validate what truly improves reasoning per token, then scale only the proven recipes. The next step is a realistically sized model with multimodality and multilingual support, trained on 1–3T tokens, with research directions focused on long-context efficiency and self-correcting reasoning pipelines.
What I would watch next:
- A reproducibility package (eval configs, scripts, baselines, reruns)
- An intermediate-scale checkpoint that preserves the efficiency gains under harder conditions
- A clear product wedge (e.g., on-prem/edge deployments, sovereign/industrial settings) that turns “lab progress” into durable distribution
For investors and compute partners focused on efficiency over brute-force scale, Noeum.ai represents a validated thesis at an inflection point—where the next milestone is less about ambition and more about converting a proven nano-scale recipe into scalable advantage.
Benchmark tables and model details are available via the public model card and the lab’s website.


