Inside s1: An o1-Style Reasoning Model that Cost Under $50 to Train

Inspiration behind s1

  • Right at the start of Niklas’s PhD, o1 was released by OpenAI.
  • There were two really interesting advancements with o1:
    • Gains in reasoning performance
    • Those gains improved further with more compute at test time

Test-Time Scaling

  • Two approaches:

    • Scale in parallel
    • Scale sequentially
  • Parallel test-time scaling means running the same model on the same question multiple times and aggregating the results with some ensemble decision, like majority voting (see the sketch after this list).

  • In sequential scaling, you have the model generate a single chain of reasoning, then iteratively refine it to do better. In theory this should be stronger, since the model can catch and correct its earlier mistakes rather than independently “making the same mistake” in every sample, as can happen in parallel scaling.

  • These two methods are NOT mutually exclusive.

    • In the OpenAI o1 blog post, even though o1 uses sequential test-time scaling, they showed that you can apply parallel scaling on top of it: generate multiple chains in parallel and aggregate them via majority voting.
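
Both modes, and their combination, can be expressed in a few lines. The sketch below assumes only a generic `generate(prompt) -> completion` callable standing in for a model call; the refinement prompt is illustrative, not the exact prompt used by o1 or s1.

```python
from collections import Counter
from typing import Callable

# Stand-in for a single model call: prompt in, sampled completion out.
# Any chat/completion API or local model can be plugged in here.
Generate = Callable[[str], str]


def sequential_scale(generate: Generate, question: str, rounds: int = 3) -> str:
    """Sequential test-time scaling: produce one chain of reasoning,
    then repeatedly ask the model to review and refine its own answer."""
    answer = generate(question)
    for _ in range(rounds - 1):
        answer = generate(
            f"Question: {question}\n"
            f"Previous answer: {answer}\n"
            "Check the reasoning for mistakes and give an improved final answer."
        )
    return answer


def parallel_scale(generate: Generate, question: str, n: int = 8) -> str:
    """Parallel test-time scaling: sample n independent answers to the
    same question and aggregate them by majority vote."""
    answers = [generate(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


def combined_scale(generate: Generate, question: str, n: int = 8, rounds: int = 3) -> str:
    """Combine the two: run several sequential chains in parallel, then
    majority-vote over their final answers (as in the o1 blog post)."""
    finals = [sequential_scale(generate, question, rounds) for _ in range(n)]
    return Counter(finals).most_common(1)[0][0]
```

In practice the final answers would be normalized (e.g. extracting just the boxed result) before voting, so that equivalent answers count as the same vote.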

Alternative Methods for Aggregation in Parallel Test-Time Scaling

  • Rather than aggregating via voting, is it possible to aggregate by having another model process the parallel generations, e.g. summarize them?
    • Maybe if a model could learn from each parallel generation, it would do better? (no work on this yet)
  • Approaches like Best-of-N, REBASE, or tree-based search methods
    • A second reward model assigns a score to the model’s intermediate or final outputs; take the one (or few) with the highest score in order to reach a better answer (see the sketch after this list).
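
A minimal Best-of-N sketch under the same assumptions: hypothetical `generate` and `reward` callables, where `reward` plays the role of the second (reward) model. REBASE and tree-based methods go further by scoring intermediate steps and reallocating samples toward promising branches, which is not shown here.

```python
from typing import Callable

Generate = Callable[[str], str]        # one sampled completion per call
Reward = Callable[[str, str], float]   # scores a (question, candidate) pair


def best_of_n(generate: Generate, reward: Reward, question: str, n: int = 16) -> str:
    """Best-of-N: sample n candidate answers, score each with a separate
    reward model, and keep the highest-scoring one."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda c: reward(question, c))
```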

s1 vs DeepSeek R1

  • s1 and R1 are attempts to replicate OpenAI’s o1.
  • In R1, they tried to go the whole nine yards and replicate everything.
  • In s1, they wanted to find the minimal approach that gets better reasoning and test-time scaling.
  • The final training run took 26 minutes on 16 H100s (2 nodes, 16 H100s in total). On an online platform like Prime Intellect, 16 H100s cost about $40 per hour, so the 26-minute run comes out to under $20 in compute. Very cheap overall.
  • Anyone should be able to replicate results since the datasets are public.

Performance

  • Better than o1-preview on some math benchmarks, like AIME and MATH500

Resources

Papers and Blog Posts

LLM Benchmarks and Evals