Thinking Past the Answer:
Evaluating Harmful Overthinking in Large Reasoning Models

Simone Caldarella¹,* Davide Talon³ Rahaf Aljundi² Elisa Ricci¹,³ Massimiliano Mancini¹
¹University of Trento ²Toyota Motor Europe ³Fondazione Bruno Kessler
Performance averaged over LRMs. Actual Length is the model's default behavior, No-CoT disables intermediate reasoning, and Instruct Model is the instruction-tuned model before reasoning training. Optimal Length stops at the first correct prefix. The gap between Actual Length and Optimal Length shows that models often reason past correctness, so the additional reasoning is harmful.

Large Reasoning Models (LRMs) often reach a correct answer before they stop reasoning. This project studies what happens after that point through the lens of overthinking, distinguishing verbose overthinking from harmful overthinking. The former has been widely studied as an efficiency problem; the latter remains underexplored. As the figure above shows, learning to stop at the right time would greatly improve the performance of LRMs.

Abstract

Large Reasoning Models improve performance by producing explicit intermediate reasoning traces with additional test-time compute. However, longer reasoning is not always beneficial. We ask whether a model that has already reached the correct answer continues to refine that answer or instead drifts away from it. To study this, we introduce a prefix-level trajectory evaluation protocol grounded in reasoning sufficiency: the minimum reasoning budget required for a model to first generate the correct answer. This separates verbose overthinking, where additional reasoning is redundant but harmless, from harmful overthinking, where continued reasoning destabilizes an already-correct trajectory. Across multimodal and language-only benchmarks, stopping at the first correct prefix improves accuracy over default reasoning, revealing that current models are limited not only by their ability to reason, but also by their inability to stop at the right time.

Reasoning sufficiency as difficulty

Algorithm.

We evaluate a reasoning trace prefix by prefix. For each partial trace, the model is forced to provide an answer. The first prefix that yields the correct answer defines the empirical sufficient reasoning budget for that model and instance. Any reasoning beyond that point is overthinking; if the final answer remains correct, it is verbose, and if the final answer becomes incorrect, it is harmful.
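The prefix-level protocol above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `force_answer` is a hypothetical callable that makes the model answer from a partial trace, correctness is checked by exact match, and the labels `unsolved` and `no_overthinking` are names introduced here for the cases the text does not name.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def prefix_correctness(
    prefixes: Sequence[str],
    force_answer: Callable[[str], object],  # hypothetical: forces an answer from a partial trace
    gold: object,
) -> List[bool]:
    """Return one correctness flag per prefix (exact match is an assumption)."""
    return [force_answer(prefix) == gold for prefix in prefixes]


def classify_trajectory(correct: Sequence[bool]) -> Tuple[Optional[int], str]:
    """Return (index of the first correct prefix, trajectory label).

    The last flag is assumed to correspond to the full trace, i.e. the
    model's final answer.
    """
    if not correct:
        raise ValueError("a trajectory needs at least one prefix")
    first = next((i for i, ok in enumerate(correct) if ok), None)
    if first is None:
        return None, "unsolved"          # never reaches the correct answer
    if first == len(correct) - 1:
        return first, "no_overthinking"  # correctness is reached only at the end
    # Reasoning continues past the sufficient budget: overthinking.
    return first, "verbose" if correct[-1] else "harmful"
```

For example, `classify_trajectory([False, True, False])` reports a sufficient budget at index 1 and labels the trajectory `harmful`, since the final answer is wrong after an earlier correct prefix.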

Key idea. Difficulty should be tied to the minimum compute needed to first reach correctness, not to the total length of a model-generated chain of thought.

Main findings

  • Reasoning length is a poor proxy for difficulty. Large Reasoning Models often reach the correct answer early, then continue generating long traces that are not required for correctness.
  • Models frequently reason past correct intermediate states. Stopping at the first correct prefix can substantially outperform default full-length reasoning, showing that additional reasoning can be harmful rather than merely redundant.
  • Free-form generation exposes harmful overthinking more sharply. Without a fixed answer set, unconstrained reasoning is more likely to drift away from an already-correct answer.
  • Reasoning trajectories are non-monotonic. After first reaching correctness, the probability of staying correct drops as models continue reasoning.
  • Efficiency methods reduce verbosity, but not necessarily harmful overthinking. Shorter traces remove wasted computation, yet they do not reliably prevent correctness deviations.
  • Harmful overthinking also appears in language-only reasoning. The phenomenon is not only caused by visual drift; similar instability emerges on math-heavy and knowledge-heavy language benchmarks.
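The quantities behind these findings can be computed from per-prefix correctness flags. The sketch below is illustrative only: `summarize` and `stay_correct_curve` are hypothetical names, every trajectory is assumed to be a non-empty list of booleans whose last entry is the full trace, and the stay-correct curve is one plausible way to measure non-monotonicity, not necessarily the metric used in the paper.

```python
from typing import Dict, List, Sequence


def summarize(trajectories: Sequence[Sequence[bool]]) -> Dict[str, float]:
    """Accuracy at actual length, at optimal length, and the harmful rate."""
    n = len(trajectories)
    # Actual Length: correctness of the final answer after the full trace.
    actual = sum(t[-1] for t in trajectories) / n
    # Optimal Length: an oracle stops at the first correct prefix, so an
    # instance counts as solved if any prefix is correct.
    optimal = sum(any(t) for t in trajectories) / n
    # Harmful overthinking: correct at some prefix, incorrect at the end.
    harmful = sum(any(t) and not t[-1] for t in trajectories) / n
    return {"actual": actual, "optimal": optimal, "harmful": harmful}


def stay_correct_curve(
    trajectories: Sequence[Sequence[bool]], max_offset: int
) -> List[float]:
    """Fraction of trajectories still correct k prefixes after first correctness.

    Only trajectories that reach correctness and extend at least k prefixes
    past that point contribute to offset k.
    """
    curve: List[float] = []
    for k in range(max_offset + 1):
        eligible = [
            t for t in trajectories
            if any(t) and list(t).index(True) + k < len(t)
        ]
        if not eligible:
            break
        still_correct = sum(t[list(t).index(True) + k] for t in eligible)
        curve.append(still_correct / len(eligible))
    return curve
```

Under these definitions, the gap between optimal-length and actual-length accuracy equals the harmful rate, which is why stopping at the first correct prefix can only help.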

Why does reasoning become harmful?

Failure modes figure.

In harmful trajectories, the model first reaches a correct answer and then changes it. We analyze the segment from the last correct prefix to the end of the trace and categorize deviations into visual errors, calculation errors, and logical errors. The dominant causes are logical drift and visual reinterpretation rather than arithmetic mistakes.
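Isolating the segment to analyze is straightforward once per-prefix correctness is known. The sketch below assumes that prefixes are cumulative strings (each prefix extends the previous one and the last is the full trace); `deviation_segment` is a hypothetical helper, and the error categorization itself is not modeled here.

```python
from typing import Optional, Sequence


def deviation_segment(
    prefixes: Sequence[str], correct: Sequence[bool]
) -> Optional[str]:
    """Return the text generated after the last correct prefix.

    Returns None when the trajectory is not harmful, i.e. when it never
    reaches the correct answer or when the final answer is still correct.
    """
    if len(prefixes) != len(correct) or not prefixes:
        raise ValueError("need one correctness flag per prefix")
    if correct[-1] or not any(correct):
        return None
    # Index of the last prefix whose forced answer was correct.
    last_correct = max(i for i, ok in enumerate(correct) if ok)
    # Prefixes are assumed cumulative, so slicing the full trace yields
    # exactly the reasoning produced after that prefix.
    return prefixes[-1][len(prefixes[last_correct]):]
```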

BibTeX

@misc{caldarella2026overthinking,
  title        = {Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models},
  author       = {Caldarella, Simone and Talon, Davide and Ricci, Elisa and Aljundi, Rahaf and Mancini, Massimiliano},
  year         = {2026},
  note         = {Preprint}
}