Inside s1: An o1-Style Reasoning Model that Cost Under $50 to Train

Inspiration behind s1

  • Right at the start of Niklas’s PhD, o1 was released by OpenAI.
  • There were two really interesting advancements with o1:
    • Gains in reasoning performance
    • Those gains improved further with more compute at test time

Test-Time Scaling

  • Two approaches:

    • Scale in parallel
    • Scale sequentially
  • Parallel test-time scaling means running the same model on the same question multiple times and aggregating the results with some ensemble decision, like majority voting (see the sketch after this list).

  • In sequential scaling, you have the model generate a single chain of reasoning, then iteratively refine it to do better. In theory this should be stronger, since the model can catch and correct its earlier mistakes rather than independently “making the same mistake” in every sample, as can happen in parallel scaling.

  • These two methods are NOT mutually exclusive.

    • In the OpenAI o1 blog post, even though o1 uses sequential test-time scaling, they showed that you can apply parallel scaling on top of it: generate multiple chains in parallel and aggregate them via majority voting.
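
Both modes, and their combination, can be expressed in a few lines. The sketch below assumes only a generic `generate(prompt) -> completion` callable standing in for a model call; the refinement prompt is illustrative, not the exact prompt used by o1 or s1.

```python
from collections import Counter
from typing import Callable

# Stand-in for a single model call: prompt in, sampled completion out.
# Any chat/completion API or local model can be plugged in here.
Generate = Callable[[str], str]


def sequential_scale(generate: Generate, question: str, rounds: int = 3) -> str:
    """Sequential test-time scaling: produce one chain of reasoning,
    then repeatedly ask the model to review and refine its own answer."""
    answer = generate(question)
    for _ in range(rounds - 1):
        answer = generate(
            f"Question: {question}\n"
            f"Previous answer: {answer}\n"
            "Check the reasoning for mistakes and give an improved final answer."
        )
    return answer


def parallel_scale(generate: Generate, question: str, n: int = 8) -> str:
    """Parallel test-time scaling: sample n independent answers to the
    same question and aggregate them by majority vote."""
    answers = [generate(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


def combined_scale(generate: Generate, question: str, n: int = 8, rounds: int = 3) -> str:
    """Combine the two: run several sequential chains in parallel, then
    majority-vote over their final answers (as in the o1 blog post)."""
    finals = [sequential_scale(generate, question, rounds) for _ in range(n)]
    return Counter(finals).most_common(1)[0][0]
```

In practice the final answers would be normalized (e.g. extracting just the boxed result) before voting, so that equivalent answers count as the same vote.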

Alternative Methods for Aggregation in Parallel Test-Time Scaling

  • Rather than aggregating via voting, is it possible to aggregate by having another model process the parallel generations, e.g. summarize them?
    • Maybe if a model could learn from each parallel generation, it would do better? (no work on this yet)
  • Approaches like Best-of-N, REBASE, or tree-based search methods
    • A second reward model assigns a score to the model’s intermediate or final outputs; take the one (or few) with the highest score in order to reach a better answer (see the sketch after this list).
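
A minimal Best-of-N sketch under the same assumptions: hypothetical `generate` and `reward` callables, where `reward` plays the role of the second (reward) model. REBASE and tree-based methods go further by scoring intermediate steps and reallocating samples toward promising branches, which is not shown here.

```python
from typing import Callable

Generate = Callable[[str], str]        # one sampled completion per call
Reward = Callable[[str, str], float]   # scores a (question, candidate) pair


def best_of_n(generate: Generate, reward: Reward, question: str, n: int = 16) -> str:
    """Best-of-N: sample n candidate answers, score each with a separate
    reward model, and keep the highest-scoring one."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda c: reward(question, c))
```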

s1 vs DeepSeek R1

  • s1 and R1 are attempts to replicate OpenAI’s o1.
  • In R1, they tried to go the whole nine yards and replicate everything.
  • In s1, they wanted to find the minimal approach that gets better reasoning and test-time scaling.
  • The final training run took 26 minutes on 16 H100s (2 nodes, 16 H100s in total). On an online platform like Prime Intellect, 16 H100s cost about $40 per hour, so the 26-minute run comes out to under $20 in compute. Very cheap overall.
  • Anyone should be able to replicate results since the datasets are public.

Performance

  • Better than o1-preview on some math benchmarks, like AIME and MATH500

Resources

Papers and Blog Posts

LLM Benchmarks and Evals