alphaXiv

May 26–29, 2026 · San Jose, CA

ACM Conference on AI and Agentic Systems

Building the Future of Agentic & AI Systems

ACM CAIS 2026 — The premier venue for rigorous, reproducible research on compound AI architectures, optimization, and deployment.

61 Research Papers

46 System Demos

Architectural Patterns & Composition

19 of 25 on alphaXiv

Glia: A Human-Inspired AI for Automated Systems Design and Optimization

03 Apr 2026

Pouya Hamadanian

Pantea Karimi

Arash Nasr-Esfahany

An AI system named Glia, developed at MIT CSAIL, autonomously designs and optimizes computer system mechanisms using a human-inspired multi-agent architecture. It generates interpretable algorithms and insights, achieving a 1.4x reduction in mean request completion time over prior methods in LLM serving optimization and demonstrating adaptability to varying conditions.

#agentic-frameworks#agents#computer-science

Retrieval-Augmented LLMs for Security Incident Analysis

04 May 2026

Xavier Cadet

Aditya Vikram Singh

Harsh Mamania

A Retrieval-Augmented Generation (RAG) system is introduced for end-to-end security incident analysis, enabling large language models to answer forensic questions and reconstruct attack sequences from diverse security logs. The system achieved 94% recall for malware traffic analysis and 96% recall with 100% precision for Active Directory attack step detection using models like Claude Sonnet 4, significantly outperforming baselines.

#agents#ai-for-cybersecurity#computer-science

Improving Coherence and Persistence in Agentic AI for System Optimization

19 May 2026

MIT

Pantea Karimi

Kimia Noorbakhsh

Mohammad Alizadeh

Researchers at MIT CSAIL developed Engram, an agentic AI architecture that enables sustained, coherent progress in complex system optimization by transferring distilled knowledge between sequential LLM agents. Engram generated solutions outperforming previous LLM-based methods across diverse tasks and discovered novel algorithms, including a dynamic programming approach for multi-cloud multicast that achieved costs below the human state-of-the-art.

Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation

16 May 2026

University of Washington

Allen Institute for AI

Marissa Radensky

Simra Shahid

Raymond Fok

The scientific ideation process often involves blending facets of existing papers to create new ideas. We contribute Scideator, the first human-LLM system for facet-based scientific ideation. Starting from user-provided papers, Scideator extracts key facets -- purposes, mechanisms, and evaluations -- from these and related papers, allowing users to interactively recombine facets to synthesize ideas. Scideator is driven by three design choices: (1) human-in-the-loop facet recombination, in which users select facets from retrieved papers and the system generates ideas by finding analogies across them via the Faceted Idea Generator module; (2) distance-controlled retrieval via the Analogous Paper Facet Finder module, which surfaces papers ranging from the same topic to entirely different areas to provide a spectrum of directions; and (3) facet-based novelty verification via the Idea Novelty Checker module, a retrieve-then-rerank pipeline that helps users to evaluate idea originality using facets. In a user study with computer science researchers, Scideator provided significantly more creativity support than a baseline using the same backbone LLM without our facet-based modules, particularly in idea exploration and expressiveness. Ablations further show that the facets benefit the novelty checker: facet-based retrieve-then-rerank surfaces more relevant papers than standard retrieval and re-ranking, and a facet-grounded novelty classifier outperforms classifiers that reason over unstructured ideas and papers.

#computer-science#artificial-intelligence#human-computer-interaction

Open Agent Specification: Enabling Cross-Framework Comparison of AI Agents

19 May 2026

Oracle

Soufiane Amini

Yassine Benajiba

Cesare Bernardis

Oracle researchers introduced Agent Spec, a declarative and framework-agnostic language designed to standardize the definition of AI agents and their workflows. This specification enables reproducible cross-framework evaluation, revealing measurable differences in accuracy, latency, and execution behavior across various runtimes for identical agent designs.

AI Realtor: Towards Grounded Persuasive Language Generation for Automated Copywriting

23 Oct 2025

Hao Zhu

Carnegie Mellon University

University of Chicago

Jibang Wu

Chenghao Yang

Yi Wu

This paper develops an agentic framework that employs large language models (LLMs) for grounded persuasive language generation in automated copywriting, with real estate marketing as a focal application. Our method is designed to align the generated content with user preferences while highlighting useful factual attributes. This agent consists of three key modules: (1) Grounding Module, mimicking expert human behavior to predict marketable features; (2) Personalization Module, aligning content with user preferences; (3) Marketing Module, ensuring factual accuracy and the inclusion of localized features. We conduct systematic human-subject experiments in the domain of real estate marketing, with a focus group of potential house buyers. The results demonstrate that marketing descriptions generated by our approach are preferred over those written by human experts by a clear margin while maintaining the same level of factual accuracy. Our findings suggest a promising agentic approach to automate large-scale targeted copywriting while ensuring factuality of content generation.

#computer-science#artificial-intelligence#computation-and-language

Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation

22 Oct 2025

Jackson Hassell

Dan Zhang

Hannah Kim

Researchers at Megagon Labs developed a memory-augmented framework that enables large language model agents to adapt without parameter updates by leveraging LLM-generated critiques stored in episodic and semantic memory. This reflective approach demonstrated up to a 24.8% accuracy improvement over retrieval-based baselines across diverse classification tasks.

#agentic-frameworks#agents#computer-science

Composing Policy Gradients and Prompt Optimization for Language Model Programs

11 May 2026

University of Notre Dame

UC Berkeley

Noah Ziems

Dilara Soylu

Lakshya A Agrawal

Researchers developed mmGRPO, an online reinforcement learning framework that successfully adapts Group Relative Policy Optimization to optimize multi-module language model programs. Combining mmGRPO with prompt optimization improved performance by up to 11% over baseline methods across various tasks.

#agentic-frameworks#agents#computer-science

MARVIS: Modality Adaptive Reasoning over VISualizations

02 Jul 2025

University of Freiburg

New York University

Benjamin Feuer

Lennart Purucker

Oussama Elachqar

MARVIS introduces a training-free method that extends Vision-Language Models (VLMs) to predict across diverse data modalities, including vision, audio, biological, and tabular data, by transforming data embeddings into visualizations for VLM interpretation. The approach achieves performance within 2.5% of specialized models and significantly outperforms other generalist foundation models by an average of 16.7%, while also enhancing data privacy by avoiding direct exposure of raw data.

#ai-for-health#computer-science#machine-learning

Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling

08 May 2026

Naoki Otani

Nikita Bhutani

Hannah Kim

Explicit planning is a critical capability for LLM-based agents solving complex data-centric tasks, which require precise tool calling over external data sources. Existing strategies fall into two paradigms based on planning horizon: (1) full-horizon (FH), which generates a complete plan before execution, and (2) single-step horizon (SH), which interleaves each action (tool call) with incremental reasoning and observation. While step-by-step execution is a common default under the assumption that eager execution monitoring is necessary for adaptability, we revisit this assumption for well-defined data-centric tasks. Our controlled empirical study isolates planning horizon as the key architectural feature and systematically analyzes the effects of topological complexity and tool robustness on both paradigms. Our experiments across Knowledge Base Question Answering and Multi-hop QA show that FH planning with lazy replanning achieves accuracy parity with SH across varying depths, breadths, and robustness levels, while using 2-3x fewer tokens. These findings suggest that for well-defined data-centric tasks, eager step-wise monitoring is often unnecessary, and full-horizon planning with on-demand replanning can offer a more efficient default.

#agentic-frameworks#agents#computer-science
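
The two planning horizons contrasted in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the tool names, the `replan` callback, and the use of "planner calls" as a rough stand-in for token cost are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool registry: each tool maps the previous result to a new result.
Tools = Dict[str, Callable[[str], str]]


def run_full_horizon(
    plan: List[str],
    tools: Tools,
    replan: Callable[[List[str]], List[str]],
    max_replans: int = 2,
) -> Tuple[str, int]:
    """Full-horizon (FH): plan once, then replan lazily, only after a tool fails."""
    planner_calls = 1  # the initial complete plan counts as one planner call
    replans = 0
    value = ""
    i = 0
    while i < len(plan):
        try:
            value = tools[plan[i]](value)
            i += 1
        except Exception:
            if replans >= max_replans:
                raise
            replans += 1
            planner_calls += 1
            # Keep the steps that already succeeded; replace only the remainder.
            plan = plan[:i] + replan(plan[i:])
    return value, planner_calls


def run_single_step(
    next_step: Callable[[str], Optional[str]],
    tools: Tools,
    max_steps: int = 10,
) -> Tuple[str, int]:
    """Single-step horizon (SH): consult the planner before every tool call."""
    planner_calls = 0
    value = ""
    for _ in range(max_steps):
        planner_calls += 1
        tool_name = next_step(value)
        if tool_name is None:  # planner decides the task is finished
            break
        value = tools[tool_name](value)
    return value, planner_calls


def _broken(_: str) -> str:
    raise RuntimeError("simulated tool failure")


# Toy three-hop lookup (all names are made up).
tools: Tools = {
    "find_entity": lambda _: "Q1",
    "get_relation": lambda x: x + "->P2",
    "resolve": lambda x: x + "->Q3",
    "resolve_broken": _broken,
}

fh_ok = run_full_horizon(["find_entity", "get_relation", "resolve"], tools, replan=lambda rest: rest)
fh_recovered = run_full_horizon(
    ["find_entity", "get_relation", "resolve_broken"], tools, replan=lambda rest: ["resolve"]
)
scripted = iter(["find_entity", "get_relation", "resolve", None])
sh_ok = run_single_step(lambda _obs: next(scripted), tools)
```

In this toy setup both strategies reach the same answer, while FH consults the planner once (twice when a tool fails) and SH consults it before every step. That only mirrors the shape of the trade-off described in the abstract; it says nothing about the reported token savings.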

Robust Agent Compensation (RAC): Teaching AI Agents to Compensate

05 May 2026

Rania Khalaf

Srinath Perera

Kaviru Hapuarachchi

Frank Leymann

We present Robust Agent Compensation (RAC), a log-based recovery paradigm (providing a safety net) implemented through an architectural extension that can be applied to most agent frameworks to support reliable executions (avoiding unintended side effects). Users can choose to enable RAC without changing their current agent code (e.g., LangGraph agents). The proposed approach can be implemented in most existing agent frameworks via their existing extension points. We present an implementation based on LangChain, demonstrate its viability through τ-bench and REALM-Bench, and show that when solving complex problems, RAC is better by 1.5-8x or more in both latency and token economy than state-of-the-art LLM-based recovery approaches.

#agentic-frameworks#agents#computer-science

A Language for Describing Agentic LLM Contexts

03 May 2026

Noga Peleg Pelc

Gal A. Kaminka

Yoav Goldberg

Researchers at Bar-Ilan University developed ACDL, a formal language for precisely describing the structure and dynamic evolution of LLM agent input contexts, providing standardized visual diagrams. This formalization enhances communication and comparability of agentic systems, with experiments demonstrating that context structure variations can alter agent performance by up to 5 percentage points.

#agentic-frameworks#agents#computer-science

TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples

08 May 2026

Shuren Xia

Qiwei Li

Taqiya Ehsan

We present TraceFix, a verification-first pipeline for Large Language Model (LLM) multi-agent coordination. An agent synthesizes a protocol topology as a structured intermediate representation (IR) from a task description, generates PlusCal coordination logic, and iteratively repairs the protocol using counterexamples from the TLA+ model checker (TLC) until verification succeeds. Verified process bodies are compiled into per-agent system prompts and executed under a runtime monitor that rejects out-of-topology coordination operations. On 48 tasks spanning 16 scenario families, all tasks reach full TLC verification; 62.5% pass on the first attempt and none requires more than four repair iterations. State spaces span six orders of magnitude yet verification completes in under 60 s for every task. A 3,456-run runtime comparison shows that topology-monitored execution achieves the highest task completion (89.4% average, 81.5% full) and that runtimes using the verified protocol degrade at roughly half the rate of prompt-only and chat-only baselines when model capability is reduced. A paired ablation under a fixed runtime shows that TLC-verified protocols cut deadlock/livelock (DL/LL) from 31.1% to 14.1%, with the largest separation under fault injection.

#agentic-frameworks#agents#computer-science

Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

15 May 2026

Igor Bogdanov

Chung-Horng Lung

Thomas Kunz

Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4× worse mean return while using 1.8-2.7× more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.

#agentic-frameworks#agents#ai-for-cybersecurity

FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

15 May 2026

Igor Bogdanov

Chung-Horng Lung

Thomas Kunz

Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7× over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (returns below -100) to as low as ~1%. We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with ~40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.

#agentic-frameworks#agents#ai-for-cybersecurity

Tressoir: Unifying Online, Offline, and HIL Design and Evolution of Multi-Agent Systems via Interpretable Blueprints

19 May 2026

MIT

Amadou Latyr Ngom

Ziniu Wu

Jason Mohoney

We explore a principled approach that jointly designs and evolves the architectures, prompts, tools, and knowledge of multi-agent systems, whether online, offline, or with human guidance. We first propose Interpretable Blueprints (IBs), which pair an online-interpretable system ontology (describing architectures, invariants, domain knowledge, etc.) with offline-materialized components proven to be high-quality or cost-effective. Second, we propose a supervising interpreter that co-interprets the IB and the task to construct a specialized agentic system on the fly, without assuming any pre-existing implementation, thereby enabling maximal adaptation to the task. IBs are also the primary online communication mechanism between agents. Offline learning is a subset of this approach; learning IBs encode learning strategies that let the interpreter orchestrate metrics collection and IB improvement. Human guidance is enabled at every layer, whether by co-editing IBs or by steering online or offline interpretation in ways that the system learns from over time. To instantiate this vision, we develop Tressoir, an IB-centric framework that unifies online, offline, and human-guided evolution under a single mechanism. Tressoir is tailored for long-running, complex projects with tasks that build on each other and require continual learning during or in between executions. Its generality further allows it to bootstrap itself, where its own features are now self-generated with high-level human guidance. We also evaluate Tressoir on shorter-term benchmarks. On SWEBench-Pro’s Qutebrowser subset, Tressoir with Claude 4.6 Opus reaches 75.9% vs. 57.0% for SWE-Agent; on ScreenSpot-Pro, it lifts Gemini 3 Flash from a 69.1% baseline to 83.1%; and on Bird-Critic Flash, Tressoir with Gemini 3 Flash tools scores 56.0%, exceeding SQL-ACT with Claude 4.6 Opus at 52.0%.

Expansion-Contraction: A Multi-Agent Graph Traversal Pattern for Compound AI Systems

19 May 2026

Amazon

Continental AG

Aiham Taleb

Zainab Afolabi

Joao Sousa

Compound AI systems that coordinate multiple specialized agents offer a promising path for complex reasoning tasks, yet principled architectural patterns for multi-agent coordination over structured data remain under-explored. We introduce Expansion-Contraction, a multi-agent graph traversal pattern in which an expansion phase walks a domain graph outward from a query origin, dynamically spawning ephemeral specialist agents at each node, and a contraction phase aggregates their findings inward to produce a verdict. Agent topology emerges isomorphically from the data graph rather than being hand-designed, and each agent operates on a small local context—avoiding the context-window saturation that degrades single-agent approaches on large graphs. We instantiate the pattern for supply chain root cause analysis, integrating domain-specific tools with temporal lead-time propagation. Across eight datasets (three real-world, five synthetic with controlled depth and width), Expansion-Contraction achieves 98.2% accuracy on a production supply chain (624 cases) and 100% on public benchmarks, outperforming single-agent baselines by 14+ percentage points while degrading gracefully as graph complexity increases. A deterministic depth-priority disambiguation heuristic, motivated by our failure analysis, further improves Dataset A accuracy to 99.5% (621/624, 95% CI [98.6%, 99.9%]). To assess transfer, we evaluate the pattern on a second domain—microservice dependency tracing over a 17-service DAG (100 scenarios)—where Expansion-Contraction reaches 88% overall accuracy and 85% on NLP-complex cases (vs. 55% for the next-best baseline). Investigation caching reduces token usage by up to 93.9%, concurrent path analysis yields up to 1.43× speedup, and a production deployment demonstrates the pattern’s viability for enterprise-scale agentic systems.
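
The two-phase traversal described in the abstract above can be sketched roughly as follows. This is an illustrative toy, not the authors' system: the graph schema, the `inspect` and `aggregate` callbacks, and the supply-chain node names are all hypothetical, and a plain function call stands in for each ephemeral specialist agent.

```python
from typing import Callable, Dict, List, Optional, Set

Graph = Dict[str, List[str]]  # node -> upstream dependencies (hypothetical schema)


def expand_contract(
    graph: Graph,
    origin: str,
    inspect: Callable[[str], bool],
    aggregate: Callable[[str, bool, List[Optional[str]]], Optional[str]],
    _seen: Optional[Set[str]] = None,
) -> Optional[str]:
    """Expansion: walk outward from `origin`, making one local `inspect` call per node.
    Contraction: merge child verdicts with the local finding while unwinding."""
    seen = _seen if _seen is not None else set()
    seen.add(origin)

    # Local work: the stand-in "agent" only ever looks at its own node.
    local_finding = inspect(origin)

    child_verdicts: List[Optional[str]] = []
    for upstream in graph.get(origin, []):
        if upstream not in seen:  # avoid revisiting shared dependencies
            child_verdicts.append(expand_contract(graph, upstream, inspect, aggregate, seen))

    # Contraction step: findings flow back toward the query origin.
    return aggregate(origin, local_finding, child_verdicts)


def deepest_cause(node: str, anomalous: bool, child_verdicts: List[Optional[str]]) -> Optional[str]:
    """Toy aggregation rule: prefer a cause reported from further upstream."""
    for verdict in child_verdicts:
        if verdict is not None:
            return verdict
    return node if anomalous else None


# Made-up supply chain: an order depends on assembly, which depends on two suppliers.
graph: Graph = {
    "order": ["assembly"],
    "assembly": ["supplier_a", "supplier_b"],
    "supplier_a": [],
    "supplier_b": ["raw_material"],
    "raw_material": [],
}
delayed = {"order", "assembly", "supplier_b", "raw_material"}

verdict = expand_contract(graph, "order", lambda n: n in delayed, deepest_cause)
print(verdict)
```

Each call touches only its own node and the verdicts returned by its children, which is the property the abstract credits with keeping per-agent context small. The aggregation rule shown is a placeholder chosen for this toy and should not be read as the paper's disambiguation heuristic.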

Vista: Verifier-in-the-Loop Agentic Reinforcement Learning for Quantum Program Synthesis

19 May 2026

University of Liverpool

Aalto University

Cong Yu

Tuo Shi

Valter Uotila

The Vista framework introduces a verifier-in-the-loop agentic reinforcement learning approach for quantum program synthesis, addressing challenges of costly and multi-stage verification in OpenQASM 3.0 circuit generation. It employs hierarchical rewards and budget-aware gated evaluation, achieving superior semantic quality and a 1.77x speedup in verification efficiency.

fastWorkflow: Closing the Performance Gap Between Small and Frontier Language Models for Conversational Agents

19 May 2026

Radiant Logic

Sanchit Satija

Aditya Bhatt

Priyanshu Jani

Large language models are increasingly deployed in conversational agents that assist humans with complex, multi-step tasks, yet their deployment at scale is constrained by high inference costs, latency, and data privacy concerns. Small language models (SLMs) offer compelling operational advantages but exhibit systematic failure modes in agentic settings, particularly in conversational workflows: domains where tasks are solved interactively by a human and an LLM through structured tool invocation. Despite growing SLM deployment, these agentic failure modes remain poorly characterized and inadequately addressed. We present an empirically-grounded taxonomy categorizing SLM failures across five dimensions: natural language understanding failures, tool management failures, task decomposition and sequencing failures, agentic reasoning failures, and context management failures, and quantify their prevalence on the 𝜏-bench benchmark. Guided by this taxonomy, we introduce fastWorkflow, a dual-mode agentic architectural framework implementing a cascaded NLU pipeline for intent detection and structured parameter extraction with validation, hierarchical context organization that reduces effective action space, explicit task planning with dependency-aware decomposition, and adaptive context management, among other targeted mitigations. On 𝜏-bench, GPTOSS-20B augmented with fastWorkflow achieves 83.47% Pass^1 on the Retail domain and 78% on Airline, surpassing all frontier models evaluated on 𝜏-bench leaderboard including Claude Sonnet 4 (80.5% Retail, 60.0% Airline) and Claude Opus 4.1 (82.4% Retail, 56.0% Airline), while operating at ∼22× lower inference cost. Even Mistral-7B-Instruct with fastWorkflow matches Claude Sonnet 4 on Airline at 60%. Ablation studies confirm that the cascaded NLU pipeline is the most impactful component, with its removal causing performance collapses of 58 points on Retail and 68 points on Airline. 
Our findings demonstrate that architectural separation of concerns, offloading error-prone operations to structured subsystems while preserving LLM flexibility for planning and recovery, can close the performance gap between small and frontier models in conversational workflow tasks, shifting the cost-performance Pareto frontier for production deployment in domains involving multi-turn, tool-augmented human-LLM collaboration.

Evaluation & Benchmarking

10 of 12 on alphaXiv

Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

13 May 2026

Hung Tran

Langston Nashold

Rayan Krishnan

Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 private validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves 61.8% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.

#agentic-frameworks#agents#computer-science

Why Johnny Can't Use Agents: Industry Aspirations vs. User Realities with AI Agents

03 May 2026

Carnegie Mellon University

Pradyumna Shome

Sashreek Krishnan

Sauvik Das

This research reveals a disparity between industry marketing and user experience with AI agents, establishing a taxonomy of advertised agent capabilities and empirically identifying five critical usability barriers faced by end-users. The findings suggest that existing usability challenges from large language models are amplified in agentic systems, particularly concerning delegated multi-step workflows and real-world consequences.

#computer-science#human-computer-interaction

Reasoning-Intensive Regression

01 May 2026

MIT

Diane Tchuindjo

Omar Khattab

AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e., deducing subtle numerical scores from text. Unlike standard language regression tasks such as sentiment or similarity analysis, RiR often appears instead in ad-hoc applications such as rubric-based scoring, modeling dense rewards in complex environments, or domain-specific retrieval, where much deeper analysis of context is required while only limited task-specific training data and computation are available. We cast four realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and fine-tuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances.

#computer-science#artificial-intelligence#computation-and-language

Willful Disobedience: Automatically Detecting Failures in Agentic Traces

25 Mar 2026

Reshabh K Sharma

Shraddha Barke

Benjamin Zorn

AgentPex introduces an automated, specification-driven system for evaluating AI agent behavior across multi-step execution traces. It systematically extracts behavioral rules from agent prompts and tool schemas to detect procedural violations, even in scenarios where task outcome-based evaluations report success, revealing agent "willful disobedience".

#computer-science#artificial-intelligence#software-engineering

OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction

16 Feb 2026

Skyler Hallinan

Thejas Venkatesh

Xiang Ren

Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks. While most existing benchmarks assume simple, perfectly documented tools, real-world tools (e.g., general "search" APIs) are often opaque, lacking clear best practices or failure modes. Can LLM agents improve their performance in environments with opaque tools by interacting and subsequently improving documentation? To study this, we create OpaqueToolsBench, a benchmark consisting of three distinct task-oriented environments: general function calling, interactive chess playing, and long-trajectory agentic search. Each environment provides underspecified tools that models must learn to use effectively to complete the task. Results on OpaqueToolsBench suggest existing methods for automatically documenting tools are expensive and unreliable when tools are opaque. To address this, we propose a simple framework, ToolObserver, that iteratively refines tool documentation by observing execution feedback from tool-calling trajectories. Our approach outperforms existing methods on OpaqueToolsBench across datasets, even in relatively hard settings. Furthermore, for test-time tool exploration settings, our method is also efficient, consuming 3.5-7.5x fewer total tokens than the best baseline.

#agentic-frameworks#agents#computer-science

Generating Expressive and Customizable Evals for Timeseries Data Analysis Agents with AgentFuel

12 Mar 2026

Aadyaa Maddi

Prakhar Naval

Deepti Mande

Rockfish Data and Carnegie Mellon University researchers introduce AgentFuel, a framework for generating expressive and customizable evaluations tailored for timeseries data analysis agents. The framework reveals that agents achieve an average accuracy of 66% on e-commerce and 60% on IoT benchmarks, but only 21% on a telecom benchmark, particularly struggling with stateful (34% accuracy) and incident-specific queries (10% accuracy), and demonstrates that integrating AgentFuel into an optimization loop can improve agent accuracy by up to 25%.

#agentic-frameworks#agents#ai-for-cybersecurity

Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook

18 Feb 2026

University of Maryland

Mohamed bin Zayed University of Artificial Intelligence

Ming Li

Xirui Li

Tianyi Zhou

An empirical diagnosis of Moltbook, a large-scale AI-only social platform with over two million LLM agents, reveals that despite extensive interactions, robust socialization, defined as sustained behavioral adaptation and collective structure formation, does not automatically emerge. The study finds consistent semantic diversity at the micro-level, limited agent adaptation to social feedback, and transient influence hierarchies.

#computer-science#artificial-intelligence#computation-and-language

Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models

03 Feb 2026

University of Illinois at Urbana-Champaign

Nimet Beyza Bozdag

Shuhaib Mehri

Gokhan Tur

An automated framework, Persuade Me If You Can (PMIYC), was developed by researchers at the University of Illinois Urbana-Champaign to quantify the persuasive abilities and vulnerabilities of large language models (LLMs) in multi-turn dialogues. This framework, using a Normalized Change in Agreement (NCA) metric, revealed that LLMs' susceptibility to persuasion varies significantly between subjective and misinformation claims, with models like GPT-4o showing over 50% greater resistance to misinformation.

#computer-science#artificial-intelligence#computation-and-language

Benchmarking Agents in Insurance Underwriting Environments

31 Jan 2026

Amanda Dsouza

Ramya Ramakrishnan

Charles Dickens

Snorkel AI researchers developed UNDERWRITE, a benchmark for evaluating AI agents in commercial insurance underwriting that incorporates proprietary knowledge, noisy tools, and multi-turn interactions. Evaluating frontier models on UNDERWRITE revealed that high accuracy often trades off with efficiency and robustness, with models exhibiting prevalent domain-specific hallucinations and significant reliability issues.

#agentic-frameworks#agents#computer-science

Trace-Level Analysis of Information Contamination in Multi-Agent Systems

30 Apr 2026

Anna Mazhar

Huzaifa Suri

Sainyam Galhotra

Researchers from Cornell University and University of Illinois Urbana-Champaign developed a trace-level analysis method to systematically study information contamination in multi-agent systems. Their work reveals that errors from extracted data can propagate silently, leading to incorrect outcomes without workflow structural changes, or cause expensive detours that still recover to correct answers.

#adversarial-robustness#agentic-frameworks#agents

Security & Privacy

8 of 11 on alphaXiv

Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain

14 Apr 2026

ServiceNow

MILA-Québec

Léo Boisvert

Abhay Puri

Chandra Kiran Reddy Evuru

While finetuning AI agents on interaction data -- such as web browsing or tool use -- improves their capabilities, it also introduces critical security vulnerabilities within the agentic AI supply chain. We show that adversaries can effectively poison the data collection pipeline at multiple stages to embed hard-to-detect backdoors that, when triggered, cause unsafe or malicious behavior. We formalize three realistic threat models across distinct layers of the supply chain: direct poisoning of finetuning data, pre-backdoored base models, and environment poisoning, a novel attack vector that exploits vulnerabilities specific to agentic training pipelines. Evaluated on two widely adopted agentic benchmarks, all three threat models prove effective: poisoning only a small number of demonstrations is sufficient to embed a backdoor that causes an agent to leak confidential user information with over 80% success.

#adversarial-attacks#adversarial-robustness#agents

The Verifier Tax: Horizon Dependent Safety Success Tradeoffs in Tool Using LLM Agents

18 Mar 2026

Tanmay Sah

Vishal Srivastava

Dolly Sah

This research quantifies the trade-offs between safety and capability in tool-using LLM agents by evaluating the impact of runtime safety verifiers on task performance. It reveals a "Safety-Capability Gap" where verifiers block up to 94% of unsafe actions, but agents often fail to recover, incurring a "Verifier Tax" of 2.0-2.8x increased token cost and exhibiting prevalent "Integrity Leak" failures from data hallucination.

#computer-science#cryptography-and-security

MoltGraph: A Longitudinal Temporal Graph Dataset of Moltbook for Coordinated-Agent Detection

28 Feb 2026

Kunal Mukherjee

Cuneyt Gurcan Akcora

Murat Kantarcioglu

MoltGraph introduces a longitudinal temporal graph dataset derived from Moltbook, an agent-native social platform, to investigate coordinated agent behaviors. The study reveals that bursty coordinated activities are associated with a 506.35% higher early interaction rate and a 242.63% higher downstream content exposure through feed-based snapshots compared to non-coordinated content.

#computer-science#cryptography-and-security#social-and-information-networks

Tracking Capabilities for Safer Agents

07 May 2026

EPFL

Martin Odersky

Yaoyu Zhao

Yichen Xu

EPFL researchers developed the `tacit` framework, which employs static type checking with Scala 3's capture checking to provide provably safe constraints for AI agents. This system prevents information leakage and unauthorized side effects by enforcing capability-based security, achieving 100% security against adversarial attacks while maintaining or improving agent performance.

#agentic-frameworks#agents#computer-science

Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use

06 May 2026

Francisco Javier Arceo

Varsha Prasad Narsing

Retrieval-Augmented Generation (RAG) and agentic AI systems are increasingly prevalent in enterprise AI deployments. However, real enterprise environments introduce challenges largely absent from academic treatments and consumer-facing APIs: multiple tenants with heterogeneous data, strict access-control requirements, regulatory compliance, and cost pressures that demand shared infrastructure. A fundamental problem underlies existing RAG architectures in these settings: retrieval systems rank documents by relevance--whether through semantic similarity, keyword matching, or hybrid approaches--not by authorization, so a query from one tenant can surface another tenant's confidential data simply because it scores highest. We formalize this gap and analyze additional shortcomings--including tool-mediated disclosure, context accumulation across turns, and client-side orchestration bypass--that arise when agentic systems conflate relevance with authorization. To address these challenges, we introduce a layered isolation architecture combining policy-aware ingestion, retrieval-time gating, and shared inference, enforced through server-side agentic orchestration. This approach centralizes security-critical operations--tool execution authorization, state isolation, and policy enforcement--on the server, creating natural enforcement points for multitenant isolation while allowing client-side frameworks to retain control over agent composition and latency-sensitive operations. We validate the proposed architecture through an open-source implementation in OGX, a vendor-neutral framework that implements an OpenAI-compatible, open-source Responses API with server-side multi-turn orchestration. We evaluate it empirically and show that ABAC gating eliminates cross-tenant leakage while introducing negligible overhead.
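
The gap between relevance and authorization described above can be illustrated with a minimal sketch of retrieval-time gating. This is not the OGX implementation or its API: the `Chunk` type, the `is_authorized` policy, the tenant names, and the scores are all hypothetical, and the policy checks a single attribute (tenant) where a real ABAC policy would check several.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    tenant: str    # attribute attached at ingestion time (hypothetical)
    score: float   # relevance score from any retriever (made-up values)


def is_authorized(principal_tenant: str, chunk: Chunk) -> bool:
    # Hypothetical one-attribute policy: a tenant may read only its own chunks.
    return chunk.tenant == principal_tenant


def gated_retrieve(principal_tenant: str, candidates: list[Chunk], k: int = 2) -> list[Chunk]:
    # Gate first, rank second: an unauthorized chunk is never returned,
    # no matter how high its relevance score is.
    allowed = [c for c in candidates if is_authorized(principal_tenant, c)]
    return sorted(allowed, key=lambda c: c.score, reverse=True)[:k]


candidates = [
    Chunk("acme pricing memo", "acme", 0.91),
    Chunk("globex merger plan", "globex", 0.97),  # highest score, wrong tenant
    Chunk("acme onboarding guide", "acme", 0.40),
]
top = gated_retrieve("acme", candidates)
```

Ranking by score alone would surface the `globex` chunk to the `acme` caller; applying the policy before ranking removes it from the candidate set entirely.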

#agents#computer-science#artificial-intelligence

SAPO: Secure Automated Prompt Optimization via Multi-Agent Collaboration

19 May 2026

Microsoft

Amazon

Emmanuel Aboah Boateng

Zachary Johnson

Tian Xia

SAPO, a multi-agent framework developed at Microsoft, introduces a secure automated prompt optimization method that balances task performance with explicit security constraints for large language models. The system achieved a 100% adversarial robustness score on HarmBench while simultaneously improving aggregated task accuracy by at least 2.6 percentage points compared to single-objective baselines.

When Harmful Intent Dissolves into Technical Detail: How Safe Are Coding Agents Against Cyber Misuse?

19 May 2026

Purdue University

Xiangzhe Xu

Shiwei Feng

Guangyu Shen

Coding agents are increasingly integrated into realistic software development workflows, where they can write, modify, and execute code on behalf of users. This capability creates a distinct safety requirement: agents must refuse requests that would enable malicious cyber activity. Yet in cybersecurity, harmful intent often dissolves into technical detail. A prompt may describe a sequence of legitimate operations without explicitly revealing the downstream consequence they collectively produce. Safe behavior therefore hinges on an agent’s ability to reason from prompt to consequence under partial information. In this paper, we empirically evaluate how safe coding agents are against cyber misuse. We construct a cybersecurity evaluation dataset designed to preserve verifiable maliciousness while removing explicit intent. Our data synthesis pipeline hierarchically partitions the cybersecurity space and generates diverse, intent-obscured requests, validated using an ensemble of LLM judges to ensure implicit but genuine harmful capability. The resulting dataset contains 2.2k samples and exhibits substantially greater domain coverage and implicitness than existing cybersecurity safety benchmarks. Using this dataset, we evaluate nine LLM agents in the OpenHands framework and make three key observations. First, safety performance varies widely across cybersecurity subdomains, highlighting the need for broad domain coverage. Second, per-step guardrails significantly improve detection over prompt-only refusal, but a non-trivial fraction of harmful cases remains undetected. Third, we show that lightweight dry-run simulation, namely allowing the actor model to internally roll out action sequences and plausible consequences, recovers a meaningful portion of the guardrail’s detection gains without requiring real execution. Overall, our results motivate realistic, domain-diverse evaluation for coding-agent misuse prevention and point to dry-run simulation as a promising direction for more effective and efficient guardrails.

Who Decides the Trade-off? Resolution Policy as Delegation Governance in Autonomous Agents

19 May 2026

DOCOMO Innovations

Koji Yamazaki

When an autonomous AI agent’s delegated constraints cannot be simultaneously satisfied, someone must decide which constraint to sacrifice. In current LLM-based agent systems, this decision is made probabilistically by the model’s sampling process, producing outcomes that are unpredictable, unreproducible, and unauditable. We term this the Trust Gap. Through 2,248 experimental probes across two frontier LLMs, we demonstrate that a single fallback instruction reduces deviation from 76% to 0%, establishing that behavioral compliance is achievable. However, behavioral compliance is fundamentally distinct from structural guarantee: a single adversarial override reverses compliance from 0% to 100% (R5), and this pattern generalizes across resolution strategies (R7). We formalize the missing element—Resolution Policy—through the Deterministic Delegation Model (DDM): a principal’s deterministic, pre-committed trade-off strategy that structurally binds intent to execution outcome. Evaluation across complete 2 × 2 factorial designs confirms DDM operates independently of prompt content, injection content, and resolution strategy type. Concurrent work has advanced authorization enforcement; the complementary question—what to do when authorized actions conflict, and by whose authority—is the problem Resolution Policy resolves.

System Optimization & Efficiency

7 of 10 on alphaXiv

XGrammar-2: Efficient Dynamic Structured Generation Engine for Agentic LLMs

26 Mar 2026

Linzhang Li

Yixin Dong

Guanjie Wang

An engine named XGrammar 2 was developed to enable dynamic and efficient structured generation for agentic large language models, achieving near-zero end-to-end latency and substantially faster grammar compilation for scenarios like tool calling. This system improves function-calling accuracy and ensures 100% correct schema generation by leveraging just-in-time compilation and cross-grammar caching.

#agentic-frameworks#agents#computer-science

Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints

25 Mar 2026

Jelena Markovic-Voronov

Kayhan Behdin

Yuanda Xu

Researchers from MIT and others propose a batch-level, resource-aware routing framework for Large Language Models (LLMs) that explicitly manages monetary cost, GPU capacity, and model concurrency limits. It leverages integer linear programming with robust optimization to handle performance prediction uncertainty and includes an offline procedure for optimal model instance allocation, achieving up to 24% performance improvement under adversarial batching compared to per-query methods.

#computer-science#artificial-intelligence#machine-learning

Scaling Textual Gradients via Sampling-Based Momentum

18 Nov 2025

UT Austin

University of Chicago

Zixin Ding

Junyuan Hong

Zhan Shi

This research introduces Textual Stochastic Gradient Descent with Momentum (TSGD-M), a method that enhances the scalability and stability of automatic prompt engineering by dynamically reweighting and sampling from past textual gradients. This approach consistently improved test accuracy and reduced variance across multiple benchmarks, such as a 1.4% gain on the MATH task, while effectively overcoming implicit context length limitations in Large Language Models.

#computer-science#artificial-intelligence#computation-and-language

FLASC: Federated LoRA with Sparse Communication

19 May 2026

Carnegie Mellon University

Kevin Kuo

Arian Raje

Kousik Rajesh

Low-rank adaptation (LoRA) is a promising method for finetuning models in communication-constrained settings such as cross-device federated learning (FL). Prior work has explored ways to improve the efficiency of LoRA in federated settings by imposing additional sparsity constraints. However, existing methods for sparse LoRA not only harm accuracy but can in fact increase overall communication costs. We instead propose FLASC, a simple composite method that consists of a PEFT method and a compression algorithm. First, we demonstrate that FLASC as a combination of LoRA and sparse Top-K communication outperforms baselines of using a lower LoRA rank or pruning LoRA weights. Second, FLASC-Search efficiently searches the space of rank-and-sparsity configurations by first tuning sparsity at a low rank and then transferring to higher ranks. Across four FL datasets, we demonstrate that FLASC outperforms existing sparse LoRA methods with up to 20% higher accuracy or 10× less communication. Overall, FLASC is a simple yet competitive baseline which can be easily extended to more advanced PEFT and compression methods in the future.
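
The sparse Top-K communication step mentioned above can be sketched generically as follows. This is a plain illustration of Top-K sparsification on a flattened update vector, not the FLASC code: the function names, the `delta` values, and the choice of `k` are assumptions made for the example.

```python
def top_k_sparsify(update: list[float], k: int) -> dict[int, float]:
    """Keep the k entries with the largest magnitude; drop the rest."""
    if k <= 0:
        return {}
    # Indices ordered by descending absolute value, truncated to k.
    keep = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)[:k]
    # Only (index, value) pairs would be sent over the network.
    return {i: update[i] for i in sorted(keep)}


def densify(sparse: dict[int, float], size: int) -> list[float]:
    """Rebuild a dense vector on the receiving side, filling gaps with zeros."""
    dense = [0.0] * size
    for i, value in sparse.items():
        dense[i] = value
    return dense


# Hypothetical flattened LoRA update from one client.
delta = [0.02, -0.90, 0.10, 0.75, -0.01, 0.30]
sparse = top_k_sparsify(delta, k=2)
restored = densify(sparse, len(delta))
```

With `k=2`, only two of the six entries are transmitted, and the receiver treats every other coordinate as zero.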

Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators

07 May 2026

Anupama Sridhar

Alexander Johansen

Long chain-of-thought reasoning and agentic tool-calling produce traces spanning tens of thousands of tokens, yet Transformer KV caches grow linearly with sequence length, creating a memory bottleneck on commodity hardware. State-space models offer constant-memory recurrence but suffer a memory cliff: retrieval accuracy collapses once the gap between a stored fact and its query exceeds the effective horizon of the recurrent state. We introduce Echo, a KV-cache-free associative recall architecture built around Spectral Koopman Attention (SKA), a drop-in replacement for attention layers that augments SSM blocks with a closed-form dynamical operator whose sufficient statistics are accumulated in constant memory with no KV cache. Echo fits a spectral linear system to the key and value history via kernel ridge regression and retrieves through a learned power-iterated filter, all from $O(r^{2})$ streaming state, where $r$ is a small projection rank. On the Multi-Query Associative Recall benchmark, a pure Mamba-2 SSM fails to exceed chance accuracy (~3%) across all gap lengths and KV-pair counts, while at the 50M parameter scale SKA-augmented models achieve 100% retrieval accuracy on every configuration tested, including distractor gaps of 4,096 tokens with 32 KV pairs. Across five additional transfer benchmarks including needle-in-a-haystack, tool-trace, and multi-hop retrieval, SKA consistently outperforms both pure SSM and SSM+Attention hybrids while maintaining constant inference memory. Ablations confirm that the spectral operator, not the prefix masking strategy, drives the retrieval gain.

#agents#attention-mechanisms#computer-science

AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices

01 May 2026

Dzung Pham

Kleomenis Katevas

Ali Shahin Shamsabadi

Autonomous agents powered by large language models (LLMs) are increasingly used to automate complex, multi-step tasks such as coding or web-based question answering. While remote, cloud-based agents offer scalability and ease of deployment, they raise privacy concerns, depend on network connectivity, and incur recurring API costs. Deploying agents locally on user devices mitigates these issues by preserving data privacy and eliminating usage-based fees. However, agentic workflows are far more resource-intensive than typical LLM interactions. Iterative reasoning, tool use, and failure retries substantially increase token consumption, often expending significant compute without successfully completing tasks. In this work, we investigate the time, token, and energy overhead of locally deployed LLM-based agents on consumer hardware. Our measurements show that agentic execution increases GPU power draw, temperature, and battery drain compared to single-inference workloads. To address this inefficiency, we introduce AgentStop, a lightweight efficiency supervisor that predicts and preemptively terminates trajectories unlikely to succeed. Leveraging low-cost execution signals, such as token-level log probabilities, AgentStop can reduce wasted energy by 15-20% with minimal impact on task performance (<5% utility drop) for challenging web-based question answering and coding benchmarks. These findings position predictive early termination as a practical mechanism for enabling sustainable, privacy-preserving LLM agents on user devices. Our project code and data are available at this https URL.

#agents#computer-science#artificial-intelligence

CAMI: Practical Cost-Aware Agent-Guided Multi-Indexing for Semantic Retrieval

19 May 2026

IBM

Adnan Qidwai

Anand Eswaran

Sonam Mishra

RAG ingestion pipelines frequently augment the search corpus index with semantic enrichment indices (e.g., synthetic queries or summaries generated from corpus chunks) that are subsequently queried alongside the base index to improve retrieval via better alignment between document representations and user intent. While these supplementary representations substantially improve retrieval quality, they introduce a computational bottleneck: the configuration space of enrichment types and generator models is combinatorial, and the cost of exhaustive index-time evaluation scales linearly with corpus size. We introduce CAMI (Cost-Aware Multi-Indexing), a framework that formalizes multi-index construction as a budgeted, multi-objective portfolio selection problem. CAMI targets the upstream decision of which enrichment views to generate and materialize before the retrieval backend is applied. CAMI incorporates three primary mechanisms: (i) an agentic discovery phase that proposes corpus-specific representation templates; (ii) an atomic-unit search procedure that evaluates individual enrichment-model pairs and recombines them via fidelity-local closure to identify synergistic portfolios; and (iii) a confidence-aware promotion schedule that prunes unpromising configurations early, decoupling optimization spend from total corpus size. We evaluate CAMI across diverse retrieval corpora. Our findings reveal that the framework systematically isolates high-recall portfolios under strict budget constraints, outperforming standard content-only baselines in challenging settings by up to 9.4% recall@10. Further, CAMI identifies these high-recall portfolios using up to 5x less budget than random search baselines, making our approach practical in real production scenarios.

Engineering & Operations

1 of 3 on alphaXiv

SEAR: Schema-Based Evaluation and Routing for LLM Gateways

20 Mar 2026

Zecheng Zhang

Han Zheng

Yue Xu

Evaluating production LLM responses and routing requests across providers in LLM gateways requires fine-grained quality signals and operationally grounded decisions. To address this gap, we present SEAR, a schema-based evaluation and routing system for multi-model, multi-provider LLM gateways. SEAR defines an extensible relational schema covering both LLM evaluation signals (context, intent, response characteristics, issue attribution, and quality scores) and gateway operational metrics (latency, cost, throughput), with cross-table consistency links across around one hundred typed, SQL-queryable columns. To populate the evaluation signals reliably, SEAR proposes self-contained signal instructions, in-schema reasoning, and multi-stage generation that produces database-ready structured outputs. Because signals are derived through LLM reasoning rather than shallow classifiers, SEAR captures complex request semantics, enables human-interpretable routing explanations, and unifies evaluation and routing in a single query layer. Across thousands of production sessions, SEAR achieves strong signal accuracy on human-labeled data and supports practical routing decisions, including large cost reductions with comparable quality.

#computer-science#artificial-intelligence#computation-and-language
