arXiv:2606.24898 · asi · 13 hr ago · Rituraj Sharma, Tu Vu
Abstract: Looped language models turn hidden states into runtime state: each state is decoded for prediction and fed back into future computation. This creates a basic supervision question: which state variables does cross-entropy actually control? We show that dense per-loop cross-entropy controls the variables exposed by the readout, not every variable active in the recurrent transition. Hidden-state scale gives a concrete failure mode. Scale-invariant readouts such as RMSNorm and LayerNorm hide radial scale from the immediate cross-entropy loss, while pre-norm residual recurrence continues to carry and update that same scale. Thus per-loop loss can make early exits usable without controlling recurrent scale. In 44M and 129M looped transformers without inter-loop normalization, per-loop cross-entropy through RMSNorm readouts still drives final hidden-state norms into the thousands or tens of thousands. Scale-visible readouts and explicit norm penalties keep norms in the tens, and scale-removing recurrence is the complementary architectural fix. The resulting design rule is simple: dense supervision trains exits; recurrent scale control requires either making scale visible to a loss or removing it from the loop. Consistent with this rule, scale-controlled variants achieve lower perplexity at matched inference-depth operating points in our variable-depth benchmarks.
Jun 26, 2026 · 1 view
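The scale-invariance argument in the abstract above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes an RMSNorm readout without a learned gain and with `eps = 0`, followed by a random linear unembedding `W`. The names `rms_norm`, `readout_logits`, and `cross_entropy`, and the toy sizes `d` and `vocab`, are illustrative assumptions. It shows that multiplying the hidden state by 1000 leaves the cross-entropy unchanged, so the loss gives no signal about radial scale.

```python
import math
import random


def rms_norm(h, eps=0.0):
    """Divide h by its root-mean-square (RMSNorm without a learned gain)."""
    rms = math.sqrt(sum(x * x for x in h) / len(h) + eps)
    return [x / rms for x in h]


def readout_logits(h, W):
    """Scale-invariant readout: normalize h, then apply a linear unembedding W."""
    z = rms_norm(h)
    return [sum(w * x for w, x in zip(row, z)) for row in W]


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def l2_norm(h):
    return math.sqrt(sum(x * x for x in h))


random.seed(0)
d, vocab = 8, 5  # toy sizes, chosen only for illustration
h = [random.gauss(0, 1) for _ in range(d)]
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(vocab)]
h_big = [1000.0 * x for x in h]  # same direction, 1000x the radial scale

loss_small = cross_entropy(readout_logits(h, W), target=2)
loss_large = cross_entropy(readout_logits(h_big, W), target=2)

print("hidden norms:", l2_norm(h), l2_norm(h_big))
print("losses:      ", loss_small, loss_large)
```

Practical RMSNorm implementations use a small positive `eps`, so the invariance is approximate rather than exact, but the effect on the loss is negligible once the hidden norm is far above `eps`.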
D. R. Halvorsen M. Okonkwo L. Petrova J. S. Almeida
Abstract: Most practitioners still operate large language model agents the way they operated chatbots: one prompt, one reply, a human in the loop on every turn. We argue this is a transitional posture, not an endpoint. As agents acquire the ability to discover their own work, isolate their changes, and check each other, the unit of engineering shifts from the prompt to the loop: a scheduled, self-driving cycle that the operator designs once and supervises rarely. We present a field playbook for designing such loops. We decompose a loop into five movements, Discovery, Handoff, Verification, Persistence, and Scheduling, and define five levels of operator autonomy from manual prompting to fully autonomous operation. We give a structural account of why an agent that grades its own work tends to inflate its score, and why an independent verifier, the “thing that can say no,” is the single highest-leverage component in the design. We close with a fourteen-step adoption path and a set of failure modes observed in practice. The central claim is simple: you stop prompting the agent, and start building the system that prompts it.
Jun 25, 2026 · 1 view
A Field Study of Designing Loops That Run Themselves
Abstract: Over the past two years a string of “XX Engineering” terms has tracked the pace of model releases. This note examines the newest of them, Loop Engineering, a term independently surfaced in June 2026 by Peter Steinberger, Boris Cherny, and Addy Osmani, and named in writing by Osmani. Unlike prompt, context, or harness engineering, loop engineering does not teach the practitioner to do the work better; it removes the practitioner from the position of doing the work at all. We define the term, place it as a fourth layer above the harness, and decompose a single turn of a loop into five moves (discovery, handoff, verification, persistence, and scheduling) and the six parts that realize them. We give particular attention to the generator/evaluator separation: empirically, an agent asked to grade its own output tends to praise it, and tuning an independent skeptical evaluator is far more tractable than making a generator critical of its own work. We survey three loops running in practice, from one engineer’s morning triage to Stripe’s enterprise-scale pipeline merging over 1,300 machine-written pull requests per week, and we catalog four costs that accrue silently: verification debt, comprehension rot, cognitive surrender, and token blowout. We close with a concrete recipe for building a first loop. The central claim is that loops make generation nearly free and leave judgment as the scarce resource; the same loop, built by two people, can yield opposite outcomes.
Index Terms: Agentic AI, software engineering, autonomous agents, coding agents, generator–evaluator, scheduling, automation.
Jun 25, 2026 · 8 views
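The five moves and the generator/evaluator separation described in the abstract above can be sketched as a single loop turn. This is a hypothetical illustration, not the paper's design: `LoopState`, `run_turn`, and the toy `discover`, `generate`, and `verify` functions are all assumed names, and the verifier here is a trivial stand-in for an independent check such as a test suite.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LoopState:
    """Persistence: what survives from one turn of the loop to the next."""
    accepted: List[str] = field(default_factory=list)
    rejected: List[str] = field(default_factory=list)


def run_turn(
    discover: Callable[[], Optional[str]],
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    state: LoopState,
) -> Optional[bool]:
    """Run one turn; return None when there is no work left to discover."""
    task = discover()               # discovery: the loop finds its own work
    if task is None:
        return None
    candidate = generate(task)      # handoff: the generator produces a change
    ok = verify(task, candidate)    # verification: a separate component can say no
    (state.accepted if ok else state.rejected).append(candidate)  # persistence
    return ok


# Toy stand-ins (assumptions, for illustration only).
queue = ["fix typo", "add test", "rename var"]


def discover() -> Optional[str]:
    return queue.pop(0) if queue else None


def generate(task: str) -> str:
    return f"patch for {task}"


def verify(task: str, candidate: str) -> bool:
    # The verifier never asks the generator for its opinion of its own output.
    return "test" in task


state = LoopState()
for _ in range(5):                  # scheduling: a fixed budget of turns
    if run_turn(discover, generate, verify, state) is None:
        break

print(state)
```

The design point the sketch tries to capture is that `verify` is passed in separately from `generate`, so the accept/reject decision does not depend on the generator grading itself.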
Yimo Lin, Zhen Zhang, Yibin Li
Abstract: While expert-validated "LLM + script" workflows deliver significant value, they remain static: they encode hard-won domain knowledge yet fail to adapt execution based on feedback. Existing agent research predominantly targets greenfield agents and synthetic benchmarks, leaving the migration of active legacy workflows unresolved. To bridge this gap, we present a reversible, Strangler-Fig migration path that refactors legacy workflows into composable, typed, and auditable stages. Central to this framework is a three-tier convertibility taxonomy (A/B/C), implemented as a routing stage within the system harness, which diagnoses a workflow's readiness and routes it accordingly.
Jun 25, 2026 · 7 views
Zijie Dai, Siuhin He, Hui Li, Qihui Zhou, Jiajun Li, Mingcong Song, Guoping Long, Hongjie Si, Xin Yao, Lin Zhang, James Cheng, Xiao Yan
Abstract: Self-evolving agents improve over time by distilling experience from past executions and reusing it in future tasks. Existing systems represent such experience either as natural-language text injected into the agent context or as code exposed as callable tools. However, the choice between these representations is typically made at design time rather than derived from the characteristics of the experience itself, leaving the trade-offs between them poorly understood. We present the first controlled study that isolates text memory and code memory over an identical set of experiences. Our results show that the two forms exhibit complementary trade-offs in construction cost, execution efficiency, and transferability, such that neither representation alone is sufficient. Guided by these findings, we propose Metis, a self-evolving agent system built on a hierarchical dual-representation memory. Metis organizes textual experience into execution plans, environment facts, and common pitfalls, and selectively crystallizes recurring plans into validated callable tools. This design combines the broad applicability of text memory with the execution efficiency of code memory while incurring tool-generation cost only when justified by repeated reuse. We evaluate Metis on AppWorld, a challenging benchmark for interactive agents. The results show that Metis improves task accuracy by up to 20.6% over ReAct while reducing execution cost by up to 22.8%. Compared with representative self-evolving agent systems, Metis consistently achieves a better balance between accuracy, execution efficiency, and memory-construction cost.
Jun 25, 2026 · 2 views
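The idea in the abstract above of paying tool-generation cost only when repeated reuse justifies it can be illustrated with a simple break-even rule. This is an assumed decision rule for illustration, not Metis's actual criterion; the function `should_crystallize` and all of its parameters are hypothetical, and costs are in arbitrary units.

```python
def should_crystallize(
    reuse_count: int,
    text_cost_per_use: float,
    tool_cost_per_use: float,
    tool_build_cost: float,
) -> bool:
    """Return True when the cumulative saving from a tool covers its build cost.

    Hypothetical rule: a plan kept as text costs `text_cost_per_use` each time
    it is followed; the same plan as a callable tool costs `tool_cost_per_use`
    per call plus a one-off `tool_build_cost` to generate and validate.
    """
    saving_per_use = text_cost_per_use - tool_cost_per_use
    if saving_per_use <= 0:
        return False  # a tool that is no cheaper per use never pays off
    return reuse_count * saving_per_use >= tool_build_cost


# A plan seen once stays as text; the same plan reused five times is converted.
print(should_crystallize(1, text_cost_per_use=10, tool_cost_per_use=2, tool_build_cost=40))
print(should_crystallize(5, text_cost_per_use=10, tool_cost_per_use=2, tool_build_cost=40))
```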
arXiv:2606.23983 · asi · 1 day ago · Hidayet Aksu
Abstract: A single forward pass of a capable model is a fast, fluent, and unreliable problem-solver: it is right often enough to be useful and wrong often enough to be dangerous; in language models, such confident errors are known as hallucinations. We present Maestro Order, a model-agnostic orchestration harness that turns unreliable solvers into reliable problem-solving systems by composing them according to four structural primitives (decompose, ensemble, verify, and recurse) and a budget-aware controller that decides where to spend compute. The harness treats any model as a black-box base solver behind a uniform interface, layers a verifier ensemble whose discrimination is measured online, and allocates verification and voting to the stages with the highest marginal reliability per unit cost. We give the architecture, the message and state schema, the controller algorithm, and the engineering that makes it deterministic, observable, and fault-tolerant. We then specify an evaluation methodology (reliability at fixed cost, coverage, calibration, and ablations) and report results from a faithful Monte Carlo simulation of the harness over a parameterized solver/verifier model. The simulation reproduces the predicted laws quantitatively: verification amplifies reliability geometrically (e.g. $0.55\to0.98$ with two gates, $\to0.999$ with four), voting helps only above chance and is limited by shared errors, and a budget-aware controller reaches a target reliability at a small fraction of the cost of voting alone by selecting the cheapest mechanism for each regime. We close with failure modes (verifier gaming, correlated errors, and decomposition error compounding) and concrete guidance: build robust checkers, diversify solvers, and let the controller put compute where the information is.
Jun 25, 2026 · 1 view
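The two quantitative claims in the abstract above (geometric amplification by verification, and voting helping only above chance) can be reproduced with a small closed-form sketch. This is not the paper's simulation: it assumes independent verifier gates, independent voters, and binary right/wrong answers, and the pass rates 0.95 (for correct answers) and 0.15 (for wrong answers) are illustrative values chosen here, not parameters taken from the paper. Under those assumptions each passed gate multiplies the odds of correctness by the same likelihood ratio, which is where the geometric growth comes from.

```python
from math import comb


def posterior_after_gates(prior, pass_if_correct, pass_if_wrong, gates):
    """P(answer is correct | it passed `gates` independent verifier gates)."""
    good = prior * pass_if_correct ** gates
    bad = (1.0 - prior) * pass_if_wrong ** gates
    return good / (good + bad)


def majority_vote_accuracy(p, n):
    """P(majority of n independent voters is correct), for odd n and accuracy p."""
    return sum(
        comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Verification: a 0.55 base solver after 0, 2, and 4 gates (assumed pass rates).
for gates in (0, 2, 4):
    print(gates, "gates:", round(posterior_after_gates(0.55, 0.95, 0.15, gates), 4))

# Voting: below, at, and above chance with five voters.
for p in (0.4, 0.5, 0.6):
    print("p =", p, "-> majority of 5:", round(majority_vote_accuracy(p, 5), 4))
```

With these assumed pass rates the posterior lands close to the figures quoted in the abstract for two and four gates. The independence assumption matters: shared errors between gates or between voters would weaken both effects, which matches the abstract's caveat about correlated errors.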