Long-running Tasks - Scaling Laws and the Role of Verifiers

AI

Posted at 2026/05/30 · 8 min to read

#Long-running Tasks

An engineer from cipepser speaks about scaling laws for long-running LLM tasks and the role of verifiers. As LLM agents become more capable, the interesting question is no longer just: “Can the model solve this prompt?” It is becoming: “Can the model keep working correctly for hours?”

The speaker begins with the METR long-horizon task benchmark. The benchmark shows how recent models have become capable of completing tasks that would take humans hours. In the referenced slide, Claude Opus 4.6 is shown as reaching around 12-hour tasks at a 50% success rate.

metr.org

METR

METR is a research nonprofit that evaluates frontier AI models to help companies and wider society understand AI capabilities and what risks they pose.

That is impressive, but the 50% number matters. A model that can complete a 12-hour task half of the time is very different from a system that can be trusted in production. The core issue is simple: long-running tasks contain many steps. Even if each step is individually reliable, the probability of completing every step without error drops quickly as the number of steps grows. If a task has $s$ steps and each step succeeds independently with probability $p$, the naive probability of completing the whole task is roughly $p^s$. That is brutal.

A 99.9% step success rate sounds excellent, but over a million steps the full-task success probability is about $0.999^{10^6} \approx e^{-1000}$, which is effectively zero. So long-running AI agents need more than a strong base model. They need mechanisms for voting, retrying, verifying, and recovering.
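To make the collapse concrete, here is a small Python sketch of the naive $p^s$ model. It assumes every step fails independently, and the function name is purely illustrative. It works in log space because `0.999 ** 1_000_000` is too small to represent as a float.

```python
import math

def log10_task_success(p: float, s: int) -> float:
    """Return log10 of p**s, the naive chance that all s independent steps succeed."""
    return s * math.log10(p)

for steps in (100, 1_000, 10_000, 1_000_000):
    exponent = log10_task_success(0.999, steps)
    print(f"{steps:>9} steps: success probability ~ 10^{exponent:.2f}")
```

Under this model, a 99.9% step already drops to roughly a 37% task success rate at 1,000 steps, and to below $10^{-434}$ at a million steps.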

The talk introduces three papers that each approach this problem from a different angle.

#1. Solving a Million-Step LLM Task with Zero Errors

The first paper is Solving a Million-Step LLM Task with Zero Errors.

arxiv.org

Solving a Million-Step LLM Task with Zero Errors

LLMs have achieved remarkable breakthroughs in reasoning, insights, and tool use, but chaining these abilities into extended processes at the scale of those routinely executed by humans, organizations, and societies has remained out of reach. The models have a persistent error rate that prevents scale-up: for instance, recent experiments in the Towers of Hanoi benchmark domain showed that the process inevitably becomes derailed after at most a few hundred steps. Thus, although LLM research is often still benchmarked on tasks with relatively few dependent logical steps, there is increasing attention on the ability (or inability) of LLMs to perform long range tasks. This paper describes MAKER, the first system that successfully solves a task with over one million LLM steps with zero errors, and, in principle, scales far beyond this level. The approach relies on an extreme decomposition of a task into subtasks, each of which can be tackled by focused microagents. The high level of modularity resulting from the decomposition allows error correction to be applied at each step through an efficient multi-agent voting scheme. This combination of extreme decomposition and error correction makes scaling possible. Thus, the results suggest that instead of relying on continual improvement of current LLMs, massively decomposed agentic processes (MDAPs) may provide a way to efficiently solve problems at the level of organizations and societies.

The paper studies what happens when a long task is decomposed into many small subtasks. If each subtask has some chance of failure, how can the full task still be completed reliably? The central technique introduced in the talk is First-to-ahead-by-k Voting. Instead of asking the model once per step and immediately accepting the output, the system samples multiple outputs. Once one candidate is ahead of the others by $k$ votes, that candidate is accepted. For example, suppose a step has two possible actions: A and B.

The model outputs A.
Then B.
Then A.
Then A again.

Now A has three votes, while B has one, so A is ahead by two votes for the first time. If $k = 2$, the system chooses A and moves on. The interesting result is that $k$ does not necessarily need to be large. Even for a million-step task, if the per-step success probability is high enough, a single-digit $k$ can be enough to make the overall task highly reliable. The paper also discusses the cost of LLM calls. In the simple case, the cost scales as roughly $s \log s$ in the number of steps $s$. That is encouraging. It suggests that reliability does not always require an impossible explosion in cost.
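A rough way to see why a small $k$ can suffice is a simplified model, which is an assumption here rather than the paper's exact analysis. Suppose each step has only two candidates and each sample is independently correct with probability $p$. The chance that the correct candidate is first to lead by $k$ is then the classic gambler's-ruin expression $p^k / (p^k + (1-p)^k)$. The function names below are illustrative.

```python
import math

def step_success(p: float, k: int) -> float:
    """Chance the correct candidate is first to lead by k votes (two-candidate model)."""
    return p**k / (p**k + (1 - p) ** k)

def min_k(p: float, steps: int, target: float) -> int:
    """Smallest k for which step_success(p, k) ** steps reaches the target."""
    if not 0.5 < p < 1.0:
        raise ValueError("this model needs 0.5 < p < 1")
    k = 1
    # Compare in log space so a million-step product does not underflow.
    while steps * math.log(step_success(p, k)) < math.log(target):
        k += 1
    return k

print(min_k(0.99, 1_000_000, 0.95))
print(min_k(0.90, 1_000_000, 0.95))
```

In this toy model, a per-sample accuracy of 99% needs only $k = 4$ to finish a million steps with at least 95% probability, and 90% accuracy needs $k = 8$.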

The practical lesson is straightforward: for important substeps, do not treat a single model output as final. Sample multiple times, compare the outputs, and only proceed once there is enough agreement.
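A minimal sketch of first-to-ahead-by-k voting might look like the following. The `sample` callable stands in for one LLM call that returns a hashable candidate answer, and `max_samples` is an assumed safety cap. Neither name comes from the paper.

```python
from collections import Counter
from typing import Callable, Hashable

def first_to_ahead_by_k(
    sample: Callable[[], Hashable], k: int, max_samples: int = 100
) -> Hashable:
    """Keep sampling until one candidate leads all others by at least k votes."""
    votes: Counter = Counter()
    for _ in range(max_samples):
        votes[sample()] += 1
        ranked = votes.most_common(2)
        leader, top = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if top - runner_up >= k:
            return leader
    raise RuntimeError("no candidate reached the required lead")

# Replay the sequence A, B, A, A with k = 2; A wins on the fourth sample.
outputs = iter(["A", "B", "A", "A"])
print(first_to_ahead_by_k(lambda: next(outputs), k=2))
```

Raising an error when the cap is hit, rather than returning the current leader, keeps an undecided step from silently flowing downstream.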

However, voting is not a complete solution. If the model makes the same kind of mistake repeatedly, voting can still confidently select the wrong answer. This is especially dangerous in long-running tasks, where one bad step can poison everything that follows.

cipepser points out the importance of detecting “red flags” that indicate unreliable model behavior. Examples include unusually long response time, invalid output format, missing schema keys, suspiciously short responses, or outputs that look truncated. When those signs appear, the system should retry before passing the result downstream.
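As an illustration, red-flag checks of this kind can be plain code that runs before an output is accepted. The sketch below assumes the model is asked for a JSON object. The required keys, thresholds, and function names are hypothetical and would need tuning for a real system.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"action", "argument"}  # hypothetical schema for one step

def red_flags(raw: str, elapsed_s: float,
              max_seconds: float = 30.0, min_chars: int = 10) -> list[str]:
    """Return the names of red flags found in one model response."""
    flags = []
    if elapsed_s > max_seconds:
        flags.append("slow_response")
    if len(raw.strip()) < min_chars:
        flags.append("too_short")
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        flags.append("invalid_format")  # truncated JSON also lands here
        return flags
    if not isinstance(parsed, dict):
        flags.append("invalid_format")
    elif REQUIRED_KEYS - set(parsed):
        flags.append("missing_keys")
    return flags

def call_with_retry(call_model: Callable[[], tuple[str, float]],
                    max_attempts: int = 3) -> str:
    """Retry until a response has no red flags; call_model returns (text, seconds)."""
    for _ in range(max_attempts):
        raw, elapsed_s = call_model()
        if not red_flags(raw, elapsed_s):
            return raw
    raise RuntimeError("every attempt raised a red flag")
```

The checks are deliberately cheap, so they can run on every step without adding model calls.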

This is not glamorous. It is not a new model architecture or a clever prompt trick. But for long-running agents, this kind of reliability engineering matters.

#2. On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks

The second paper is On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks.

arxiv.org

On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks

There has been considerable divergence of opinion on the reasoning abilities of Large Language Models (LLMs). While the initial optimism that reasoning might emerge automatically with scale has been tempered thanks to a slew of counterexamples--ranging from multiplication to simple planning--there persists a wide spread belief that LLMs can self-critique and improve their own solutions in an iterative fashion. This belief seemingly rests on the assumption that verification of correctness should be easier than generation--a rather classical argument from computational complexity--which should be irrelevant to LLMs to the extent that what they are doing is approximate retrieval. In this paper, we set out to systematically investigate the effectiveness of iterative prompting in the context of reasoning and planning. We present a principled empirical study of the performance of GPT-4 in three domains: Game of 24, Graph Coloring, and STRIPS planning. We experiment both with the model critiquing its own answers and with an external correct reasoner verifying proposed solutions. In each case, we analyze whether the content of criticisms actually affects bottom line performance, and whether we can ablate elements of the augmented system without losing performance. We observe significant performance collapse with self-critique and significant performance gains with sound external verification. We also note that merely re-prompting with a sound verifier maintains most of the benefits of more involved setups.

This paper asks a deceptively simple question: can LLMs reliably critique their own answers? The answer is: not always.

The paper compares LLM self-critique with feedback from trusted verifiers across several reasoning and planning tasks, including Game of 24, graph coloring, and STRIPS planning. In Game of 24, the model is given four numbers and must create an arithmetic expression that evaluates to 24. After the model proposes an answer, there are two ways to check it. One way is to ask another LLM, or the same LLM, whether the expression is correct.

Another way is to use a Python program to evaluate the expression. The second option is obviously more reliable. Arithmetic can be checked deterministically.
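A deterministic checker for Game of 24 can be quite small. The sketch below is one possible design, not the verifier used in the paper. It parses the expression with Python's `ast` module instead of calling `eval`, allows only `+`, `-`, `*`, `/` on integer literals, and uses `Fraction` so that division stays exact.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _evaluate(node: ast.AST, used: list) -> Fraction:
    """Evaluate a restricted arithmetic tree, recording each integer literal used."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")

def verify_24(expression: str, numbers: list) -> bool:
    """True only if the expression uses exactly the given numbers and equals 24."""
    used: list = []
    try:
        value = _evaluate(ast.parse(expression, mode="eval").body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return value == 24 and Counter(used) == Counter(numbers)

print(verify_24("(8 - 4) * (7 - 1)", [1, 4, 7, 8]))
print(verify_24("8 / (3 - 8 / 3)", [3, 3, 8, 8]))
```

The second call shows why exact fractions matter: the intermediate value is 1/3, which floating-point arithmetic would only approximate.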

The same pattern appears in graph coloring. The task is to assign colors to graph nodes so that adjacent nodes do not share the same color. An LLM can be asked whether the coloring is valid, but a program can simply inspect every edge and check the colors directly.
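The edge-by-edge check is only a few lines of code. This sketch assumes the graph is given as a list of edges and the coloring as a node-to-color dictionary. It returns the offending edges, so the feedback sent back to the model can be specific.

```python
def coloring_violations(edges: list, coloring: dict) -> list:
    """Return edges whose endpoints share a color or are left uncolored."""
    bad = []
    for u, v in edges:
        if u not in coloring or v not in coloring or coloring[u] == coloring[v]:
            bad.append((u, v))
    return bad

triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(coloring_violations(triangle, {"a": "red", "b": "green", "c": "blue"}))
print(coloring_violations(triangle, {"a": "red", "b": "green", "c": "red"}))
```

An empty list means the coloring is valid.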

For STRIPS planning, a trusted verifier can check whether a proposed plan actually reaches the goal from the initial state. The key lesson is that what matters is not whether the feedback is produced by an LLM. What matters is whether the feedback accurately tells the model: correct or incorrect.
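For intuition, a plan checker for a STRIPS-style domain can simulate the plan step by step. The sketch below uses a simplified representation with facts as strings and actions with precondition, add, and delete sets. That representation is an assumption for illustration, not a full planning-language validator.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset

def validate_plan(initial: set, goal: set, plan: list) -> tuple:
    """Simulate the plan; return (ok, message) describing the first failure, if any."""
    state = set(initial)
    for i, action in enumerate(plan):
        missing = action.preconditions - state
        if missing:
            return False, f"step {i} ({action.name}): unmet preconditions {sorted(missing)}"
        state = (state - action.delete_effects) | action.add_effects
    unmet = set(goal) - state
    if unmet:
        return False, f"goal not reached: missing {sorted(unmet)}"
    return True, "plan is valid"

pickup_a = Action("pickup a", frozenset({"clear a", "ontable a", "handempty"}),
                  frozenset({"holding a"}),
                  frozenset({"clear a", "ontable a", "handempty"}))
stack_a_b = Action("stack a b", frozenset({"holding a", "clear b"}),
                   frozenset({"on a b", "clear a", "handempty"}),
                   frozenset({"holding a", "clear b"}))
start = {"clear a", "clear b", "ontable a", "ontable b", "handempty"}
print(validate_plan(start, {"on a b"}, [pickup_a, stack_a_b]))
print(validate_plan(start, {"on a b"}, [stack_a_b]))
```

The failure message names the step and the missing facts, which is exactly the kind of accurate correct-or-incorrect signal the model needs.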

In the experiments discussed in the talk, trusted verifiers improved performance much more consistently than LLM self-critique. In some cases, repeated LLM self-critique even made performance worse. Meanwhile, repeated feedback from a reliable verifier improved results over iterations.

That is a very useful lesson for agent design. A verifier is not just a scoring component. It is a safety rail. It prevents the agent from continuing deeper into a bad trajectory.

In real systems, trusted verifiers can take many forms:

unit tests for generated code
schema validation for structured outputs
SQL execution checks
arithmetic validation
business-rule checks
state-machine constraints
deterministic tools that confirm whether an action succeeded

The point is to avoid asking the model “Are you sure?” when a programmatic check is available.

For long-running tasks, correctness needs to be externalized. The model should generate, but the system should verify.
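One way to picture this split is a loop in which the model proposes and an external check decides. In the sketch below, `generate` stands in for an LLM call that receives the previous verifier feedback, and `verify` is any trusted check that returns `None` on success or an error message on failure. Both names and the round budget are hypothetical.

```python
from typing import Callable, Optional

def generate_until_verified(
    generate: Callable[[Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    max_rounds: int = 5,
) -> str:
    """Re-prompt with verifier feedback until a candidate passes the external check."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = generate(feedback)
        feedback = verify(candidate)  # None means the verifier accepted it
        if feedback is None:
            return candidate
    raise RuntimeError("no verified candidate within the round budget")
```

The loop never asks the model whether its own answer is right. Acceptance depends only on the verifier.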

#3. Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

The third paper is Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers.

arxiv.org

Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

This paper introduces rStar, a self-play mutual reasoning approach that significantly improves reasoning capabilities of small language models (SLMs) without fine-tuning or superior models. rStar decouples reasoning into a self-play mutual generation-discrimination process. First, a target SLM augments the Monte Carlo Tree Search (MCTS) with a rich set of human-like reasoning actions to construct higher quality reasoning trajectories. Next, another SLM, with capabilities similar to the target SLM, acts as a discriminator to verify each trajectory generated by the target SLM. The mutually agreed reasoning trajectories are considered mutual consistent, thus are more likely to be correct. Extensive experiments across five SLMs demonstrate rStar can effectively solve diverse reasoning problems, including GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA. Remarkably, rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral-7B, from 74.53% to 91.13% for LLaMA3-8B-Instruct. Code will be available at https://github.com/zhentingqi/rStar.

This paper introduces a different but related idea: if multiple reasoning paths lead to the same answer, the answer is more trustworthy. The talk uses a simple math example: “How many two-digit primes have digits that sum to 8?”

There are multiple ways to approach the problem. One path is to list two-digit primes and check the digit sum of each one. Another path is to list two-digit numbers whose digits sum to 8, then check which are prime. Another path is to reason directly through possible digit pairs.
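The first two paths are easy to write down as code, which makes the convergence visible. This is just an illustration of the example with hypothetical function names, not part of rStar itself.

```python
def is_prime(n: int) -> bool:
    """Trial division, sufficient for two-digit numbers."""
    return n >= 2 and all(n % d for d in range(2, int(n**0.5) + 1))

def via_primes() -> list:
    """Path 1: list two-digit primes, then keep those whose digits sum to 8."""
    return [n for n in range(10, 100)
            if is_prime(n) and sum(int(c) for c in str(n)) == 8]

def via_digit_sum() -> list:
    """Path 2: build the numbers with digit sum 8 directly, then keep the primes."""
    candidates = [10 * tens + (8 - tens) for tens in range(1, 9)]
    return [n for n in candidates if is_prime(n)]

print(via_primes())
print(via_digit_sum())
```

Both paths arrive at 17, 53, and 71, so the answer to the question is 3.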

If different approaches converge on the same final answer, confidence should go up. If they disagree, that disagreement is useful too. It tells the system that the answer may need another attempt, another verifier, or human review.

This is close to how people check their own work. In math, we often solve a problem one way, then verify it another way. In accounting, we reconcile totals from different directions. In error-correcting systems, redundancy helps detect and repair mistakes.

The same idea applies to LLM agents. Instead of asking one model to produce one answer and then trusting it, we can ask multiple models, or the same model through different reasoning paths, to solve the problem independently. Agreement becomes a signal of reliability. Disagreement becomes a signal for escalation. This is especially useful when a perfect verifier is not available.
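A simple version of this agree-or-escalate rule can be sketched as follows. Each solver is a callable standing in for one model or one reasoning path, and the agreement threshold is an assumed parameter. rStar itself uses a more involved generator-discriminator setup, so treat this only as the general pattern.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

def cross_check(
    solvers: Sequence[Callable[[], str]], min_agreement: float = 1.0
) -> tuple:
    """Run independent solvers; accept the majority answer only if enough agree."""
    answers = [solve() for solve in solvers]
    answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= min_agreement:
        return "accept", answer
    return "escalate", None

print(cross_check([lambda: "3", lambda: "3", lambda: "3"]))
print(cross_check([lambda: "3", lambda: "4", lambda: "3"]))
```

With the default threshold, any disagreement escalates. Lowering `min_agreement` trades caution for throughput.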

For arithmetic or code, verification can often be deterministic. But many real-world tasks are messier. There may not be a simple program that can say whether an answer is correct. In those cases, mutual reasoning gives the system another way to estimate confidence.

It is not a guarantee. Models can still agree on the same wrong answer. But it is a useful reliability signal, especially when combined with voting, retries, and external checks.

#Takeaway

The main takeaway from the talk is that long-running LLM tasks are not solved by model intelligence alone.

They require system design.

A reliable long-running agent needs to decompose work into smaller steps, raise per-step reliability with voting or sampling, detect red flags early, retry suspicious outputs, use trusted verifiers whenever possible, compare independent reasoning paths, and escalate disagreements when needed.

For short tasks, you can sometimes get away with trusting the model.

For long-running tasks, trust needs infrastructure.

The future of AI agents will not only be about better prompts or bigger models. It will also be about verification, recovery, and reliability engineering.

The model may be the engine, but the verifier is the guardrail.
