Observation
Open agent systems can now complete a surprising amount of useful work. The weak point is knowing whether the result is actually trustworthy.
Failure mode
Runs look productive because they emit many actions, logs, and artifacts. That activity can hide low-quality judgment.
Practical response
- Score outputs against explicit task criteria.
- Separate tool success from task success.
- Review the handoff points where the run changes direction.