May 26–29, 2026 · San Jose, CA

ACM Conference on AI and Agentic Systems

Building the Future of Agentic & AI Systems

ACM CAIS 2026 — The premier venue for rigorous, reproducible research on compound AI architectures, optimization, and deployment.

61 Research Papers
46 System Demos

Architectural Patterns & Composition

19 of 25 on alphaXiv
Glia: A Human-Inspired AI for Automated Systems Design and Optimization
03 Apr 2026

An AI system named Glia, developed at MIT CSAIL, autonomously designs and optimizes computer system mechanisms using a human-inspired multi-agent architecture. It generates interpretable algorithms and insights, achieving a 1.4x reduction in mean request completion time over prior methods in LLM serving optimization and demonstrating adaptability to varying conditions.

#agentic-frameworks #agents #computer-science
Retrieval-Augmented LLMs for Security Incident Analysis
04 May 2026

A Retrieval-Augmented Generation (RAG) system is introduced for end-to-end security incident analysis, enabling large language models to answer forensic questions and reconstruct attack sequences from diverse security logs. The system achieved 94% recall for malware traffic analysis and 96% recall with 100% precision for Active Directory attack step detection using models like Claude Sonnet 4, significantly outperforming baselines.

#agents #ai-for-cybersecurity #computer-science
Improving Coherence and Persistence in Agentic AI for System Optimization
19 May 2026
MIT

Researchers at MIT CSAIL developed Engram, an agentic AI architecture that enables sustained, coherent progress in complex system optimization by transferring distilled knowledge between sequential LLM agents. Engram generated solutions outperforming previous LLM-based methods across diverse tasks and discovered novel algorithms, including a dynamic programming approach for multi-cloud multicast that achieved costs below the human state-of-the-art.

Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation
16 May 2026
University of Washington, Allen Institute for AI
The scientific ideation process often involves blending facets of existing papers to create new ideas. We contribute Scideator, the first human-LLM system for facet-based scientific ideation. Starting from user-provided papers, Scideator extracts key facets -- purposes, mechanisms, and evaluations -- from these and related papers, allowing users to interactively recombine facets to synthesize ideas. Scideator is driven by three design choices: (1) human-in-the-loop facet recombination, in which users select facets from retrieved papers and the system generates ideas by finding analogies across them via the Faceted Idea Generator module; (2) distance-controlled retrieval via the Analogous Paper Facet Finder module, which surfaces papers ranging from the same topic to entirely different areas to provide a spectrum of directions; and (3) facet-based novelty verification via the Idea Novelty Checker module, a retrieve-then-rerank pipeline that helps users to evaluate idea originality using facets. In a user study with computer science researchers, Scideator provided significantly more creativity support than a baseline using the same backbone LLM without our facet-based modules, particularly in idea exploration and expressiveness. Ablations further show that the facets benefit the novelty checker: facet-based retrieve-then-rerank surfaces more relevant papers than standard retrieval and re-ranking, and a facet-grounded novelty classifier outperforms classifiers that reason over unstructured ideas and papers.
#computer-science #artificial-intelligence #human-computer-interaction
Open Agent Specification: Enabling Cross-Framework Comparison of AI Agents
19 May 2026
Oracle

Oracle researchers introduced Agent Spec, a declarative and framework-agnostic language designed to standardize the definition of AI agents and their workflows. This specification enables reproducible cross-framework evaluation, revealing measurable differences in accuracy, latency, and execution behavior across various runtimes for identical agent designs.

AI Realtor: Towards Grounded Persuasive Language Generation for Automated Copywriting
23 Oct 2025
Carnegie Mellon University, University of Chicago
This paper develops an agentic framework that employs large language models (LLMs) for grounded persuasive language generation in automated copywriting, with real estate marketing as a focal application. Our method is designed to align the generated content with user preferences while highlighting useful factual attributes. This agent consists of three key modules: (1) Grounding Module, mimicking expert human behavior to predict marketable features; (2) Personalization Module, aligning content with user preferences; (3) Marketing Module, ensuring factual accuracy and the inclusion of localized features. We conduct systematic human-subject experiments in the domain of real estate marketing, with a focus group of potential house buyers. The results demonstrate that marketing descriptions generated by our approach are preferred over those written by human experts by a clear margin while maintaining the same level of factual accuracy. Our findings suggest a promising agentic approach to automate large-scale targeted copywriting while ensuring factuality of content generation.
#computer-science #artificial-intelligence #computation-and-language
Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation
22 Oct 2025

Researchers at Megagon Labs developed a memory-augmented framework that enables large language model agents to adapt without parameter updates by leveraging LLM-generated critiques stored in episodic and semantic memory. This reflective approach demonstrated up to a 24.8% accuracy improvement over retrieval-based baselines across diverse classification tasks.

#agentic-frameworks #agents #computer-science
Composing Policy Gradients and Prompt Optimization for Language Model Programs
11 May 2026
University of Notre Dame, UC Berkeley

Researchers developed mmGRPO, an online reinforcement learning framework that successfully adapts Group Relative Policy Optimization to optimize multi-module language model programs. Combining mmGRPO with prompt optimization improved performance by up to 11% over baseline methods across various tasks.

#agentic-frameworks #agents #computer-science
MARVIS: Modality Adaptive Reasoning over VISualizations
02 Jul 2025
University of Freiburg, New York University

MARVIS introduces a training-free method that extends Vision-Language Models (VLMs) to predict across diverse data modalities, including vision, audio, biological, and tabular data, by transforming data embeddings into visualizations for VLM interpretation. The approach achieves performance within 2.5% of specialized models and significantly outperforms other generalist foundation models by an average of 16.7%, while also enhancing data privacy by avoiding direct exposure of raw data.

#ai-for-health #computer-science #machine-learning
Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling
08 May 2026
Explicit planning is a critical capability for LLM-based agents solving complex data-centric tasks, which require precise tool calling over external data sources. Existing strategies fall into two paradigms based on planning horizon: (1) full-horizon (FH), which generates a complete plan before execution, and (2) single-step horizon (SH), which interleaves each action (tool call) with incremental reasoning and observation. While step-by-step execution is a common default under the assumption that eager execution monitoring is necessary for adaptability, we revisit this assumption for well-defined data-centric tasks. Our controlled empirical study isolates planning horizon as the key architectural feature and systematically analyzes the effects of topological complexity and tool robustness on both paradigms. Our experiments across Knowledge Base Question Answering and Multi-hop QA show that FH planning with lazy replanning achieves accuracy parity with SH across varying depths, breadths, and robustness levels, while using 2-3x fewer tokens. These findings suggest that for well-defined data-centric tasks, eager step-wise monitoring is often unnecessary, and full-horizon planning with on-demand replanning can offer a more efficient default.
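The full-horizon pattern with lazy replanning described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_full_horizon`, `demo_planner`, and the three demo tools are hypothetical names, and a plain Python function stands in for the LLM planner.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]  # (tool name, argument)
Planner = Callable[[str, List[str]], List[Step]]


def run_full_horizon(
    make_plan: Planner,
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_replans: int = 2,
) -> Tuple[List[str], int]:
    """Plan once, execute every step, and replan only when a tool call fails.

    Returns the collected observations and the number of planner calls.
    """
    observations: List[str] = []
    plan = make_plan(task, observations)
    planner_calls = 1
    i = 0
    while i < len(plan):
        name, arg = plan[i]
        try:
            observations.append(tools[name](arg))
            i += 1
        except Exception as err:
            if planner_calls > max_replans:
                raise  # replanning budget exhausted
            # Lazy replanning: record the failure and ask for a fresh plan.
            observations.append(f"error: {err}")
            plan = make_plan(task, observations)
            planner_calls += 1
            i = 0
    return observations, planner_calls


def flaky_lookup(arg: str) -> str:
    """Hypothetical tool that always fails, to trigger a replan."""
    raise RuntimeError("timeout")


def demo_planner(task: str, observations: List[str]) -> List[Step]:
    """Hypothetical planner: switches to a backup tool after any error."""
    if any(o.startswith("error") for o in observations):
        return [("backup_lookup", task), ("summarize", task)]
    return [("lookup", task), ("summarize", task)]


demo_tools = {
    "lookup": flaky_lookup,
    "backup_lookup": lambda arg: f"fact:{arg}",
    "summarize": lambda arg: f"summary:{arg}",
}

observations, planner_calls = run_full_horizon(demo_planner, demo_tools, "q")
print(observations, planner_calls)
```

In this toy run the planner is invoked twice for three tool calls. A single-step loop would invoke it before every action, which is where a token saving of the kind the abstract reports would come from.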
#agentic-frameworks #agents #computer-science
Robust Agent Compensation (RAC): Teaching AI Agents to Compensate
05 May 2026
We present Robust Agent Compensation (RAC), a log-based recovery paradigm that provides a safety net for agent execution. RAC is implemented as an architectural extension that can be applied to most agent frameworks through their existing extension points, supporting reliable execution and avoiding unintended side effects. Users can enable RAC without changing their current agent code (e.g., LangGraph agents). We present an implementation based on LangChain, demonstrate its viability on τ-bench and REALM-Bench, and show that when solving complex problems, RAC improves both latency and token economy by 1.5-8x or more compared to state-of-the-art LLM-based recovery approaches.
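To make the idea of log-based compensation concrete, here is a generic saga-style sketch. It is an assumption-laden illustration, not RAC itself and not a LangChain or LangGraph API: `CompensationLog` and the booking example are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple


class CompensationLog:
    """Records an undo action for every side-effecting step that succeeds."""

    def __init__(self) -> None:
        self._undo: List[Tuple[str, Callable[[], Any]]] = []

    def run(self, name: str, action: Callable[[], Any],
            compensate: Callable[[], Any]) -> Any:
        result = action()  # if this raises, nothing is logged for the step
        self._undo.append((name, compensate))
        return result

    def rollback(self) -> List[str]:
        """Undo logged steps in reverse order; return the names undone."""
        undone: List[str] = []
        while self._undo:
            name, compensate = self._undo.pop()
            compensate()
            undone.append(name)
        return undone


# Hypothetical agent run: two bookings succeed, then payment fails.
state: Dict[str, bool] = {"flight": False, "hotel": False}
log = CompensationLog()
undone: List[str] = []


def fail_payment() -> None:
    raise RuntimeError("card declined")


try:
    log.run("flight", lambda: state.update(flight=True),
            lambda: state.update(flight=False))
    log.run("hotel", lambda: state.update(hotel=True),
            lambda: state.update(hotel=False))
    log.run("payment", fail_payment, lambda: None)
except RuntimeError:
    undone = log.rollback()

print(undone, state)
```

The recovery path here is deterministic and needs no further model calls, which is one plausible reason a log-based approach could be cheaper than asking an LLM to reason its way back to a clean state.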
#agentic-frameworks #agents #computer-science
A Language for Describing Agentic LLM Contexts
03 May 2026

Researchers at Bar-Ilan University developed ACDL, a formal language for precisely describing the structure and dynamic evolution of LLM agent input contexts, providing standardized visual diagrams. This formalization enhances communication and comparability of agentic systems, with experiments demonstrating that context structure variations can alter agent performance by up to 5 percentage points.

#agentic-frameworks #agents #computer-science
TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
08 May 2026
We present TraceFix, a verification-first pipeline for Large Language Model (LLM) multi-agent coordination. An agent synthesizes a protocol topology as a structured intermediate representation (IR) from a task description, generates PlusCal coordination logic, and iteratively repairs the protocol using counterexamples from the TLA+ model checker (TLC) until verification succeeds. Verified process bodies are compiled into per-agent system prompts and executed under a runtime monitor that rejects out-of-topology coordination operations. On 48 tasks spanning 16 scenario families, all tasks reach full TLC verification; 62.5% pass on the first attempt and none requires more than four repair iterations. State spaces span six orders of magnitude yet verification completes in under 60 s for every task. A 3,456-run runtime comparison shows that topology-monitored execution achieves the highest task completion (89.4% average, 81.5% full) and that runtimes using the verified protocol degrade at roughly half the rate of prompt-only and chat-only baselines when model capability is reduced. A paired ablation under a fixed runtime shows that TLC-verified protocols cut deadlock/livelock (DL/LL) from 31.1% to 14.1%, with the largest separation under fault injection.
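The runtime monitor mentioned above, which rejects out-of-topology coordination operations, can be illustrated with a small sketch. The `TopologyMonitor` class and the agent names are hypothetical, and the verified topology is reduced here to a set of allowed sender-receiver pairs, which is a simplifying assumption.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (sender, receiver)


class TopologyMonitor:
    """Delivers a message only if its channel is part of the verified topology."""

    def __init__(self, allowed_edges: Set[Edge]) -> None:
        self.allowed = frozenset(allowed_edges)
        self.rejected: List[Edge] = []

    def send(self, sender: str, receiver: str, message: str,
             inboxes: Dict[str, List[str]]) -> bool:
        if (sender, receiver) not in self.allowed:
            # Out-of-topology operation: record it and refuse delivery.
            self.rejected.append((sender, receiver))
            return False
        inboxes.setdefault(receiver, []).append(message)
        return True


# Hypothetical three-agent protocol: planner -> worker -> reviewer.
monitor = TopologyMonitor({("planner", "worker"), ("worker", "reviewer")})
inboxes: Dict[str, List[str]] = {}

ok = monitor.send("planner", "worker", "build step 1", inboxes)
blocked = monitor.send("reviewer", "planner", "skip review", inboxes)
print(ok, blocked, monitor.rejected)
```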
#agentic-frameworks #agents #computer-science
Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP
15 May 2026
Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4× worse mean return while using 1.8-2.7× more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.
#agentic-frameworks #agents #ai-for-cybersecurity
FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
15 May 2026
Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7× over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below -100) to as low as ~1%. We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, Rules offers the best cost-reliability profile with ~40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.
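The outer loop described above, broadcast of the best instance's memory plus graduation of converged instances, might look roughly like the following. This is only a sketch: the `Instance` fields, the `broadcast` function, and the use of a fixed score threshold as the graduation criterion are all assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    memory: List[str]        # prompt-injected natural-language memory
    score: float             # average evaluation return for this stage
    graduated: bool = False  # frozen instances stop receiving broadcasts


def broadcast(population: List[Instance], graduate_at: float) -> Instance:
    """Copy the best instance's memory to every non-graduated instance."""
    best = max(population, key=lambda inst: inst.score)
    for inst in population:
        # Assumed graduation criterion: a simple return threshold.
        if inst.score >= graduate_at:
            inst.graduated = True
        if inst is not best and not inst.graduated:
            inst.memory = list(best.memory)
    return best


# Returns are negative in this setting, so "better" means closer to zero.
population = [
    Instance(memory=["rule A"], score=-120.0),
    Instance(memory=["rule B"], score=-40.0),
    Instance(memory=["rule C"], score=-15.0),
]
best = broadcast(population, graduate_at=-20.0)
print([inst.memory for inst in population], best.score)
```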
#agentic-frameworks #agents #ai-for-cybersecurity
Tressoir: Unifying Online, Offline, and HIL Design and Evolution of Multi-Agent Systems via Interpretable Blueprints
19 May 2026
MIT
We explore a principled approach that jointly designs and evolves the architectures, prompts, tools, and knowledge of multi-agent systems, whether online, offline, or with human guidance. We first propose Interpretable Blueprints (IBs), which pair an online-interpretable system ontology (describing architectures, invariants, domain knowledge, etc.) with offline-materialized components proven to be high-quality or cost-effective. Second, we propose a supervising interpreter that co-interprets the IB and the task to construct a specialized agentic system on the fly, without assuming any pre-existing implementation, thereby enabling maximal adaptation to the task. IBs are also the primary online communication mechanism between agents. Offline learning is a subset of this approach; learning IBs encode learning strategies that let the interpreter orchestrate metrics collection and IB improvement. Human guidance is enabled at every layer, whether by co-editing IBs or by steering online or offline interpretation in ways that the system learns from over time. To instantiate this vision, we develop Tressoir, an IB-centric framework that unifies online, offline, and human-guided evolution under a single mechanism. Tressoir is tailored for long-running, complex projects with tasks that build on each other and require continual learning during or in between executions. Its generality further allows it to bootstrap itself, where its own features are now self-generated with high-level human guidance. We also evaluate Tressoir on shorter-term benchmarks. On SWEBench-Pro’s Qutebrowser subset, Tressoir with Claude 4.6 Opus reaches 75.9% vs. 57.0% for SWE-Agent; on ScreenSpot-Pro, it lifts Gemini 3 Flash from a 69.1% baseline to 83.1%; and on Bird-Critic Flash, Tressoir with Gemini 3 Flash tools scores 56.0%, exceeding SQL-ACT with Claude 4.6 Opus at 52.0%.
Expansion-Contraction: A Multi-Agent Graph Traversal Pattern for Compound AI Systems
19 May 2026
Amazon, Continental AG
Compound AI systems that coordinate multiple specialized agents offer a promising path for complex reasoning tasks, yet principled architectural patterns for multi-agent coordination over structured data remain under-explored. We introduce Expansion-Contraction, a multi-agent graph traversal pattern in which an expansion phase walks a domain graph outward from a query origin, dynamically spawning ephemeral specialist agents at each node, and a contraction phase aggregates their findings inward to produce a verdict. Agent topology emerges isomorphically from the data graph rather than being hand-designed, and each agent operates on a small local context—avoiding the context-window saturation that degrades single-agent approaches on large graphs. We instantiate the pattern for supply chain root cause analysis, integrating domain-specific tools with temporal lead-time propagation. Across eight datasets (three real-world, five synthetic with controlled depth and width), Expansion-Contraction achieves 98.2% accuracy on a production supply chain (624 cases) and 100% on public benchmarks, outperforming single-agent baselines by 14+ percentage points while degrading gracefully as graph complexity increases. A deterministic depth-priority disambiguation heuristic, motivated by our failure analysis, further improves Dataset A accuracy to 99.5% (621/624, 95% CI [98.6%, 99.9%]). To assess transfer, we evaluate the pattern on a second domain—microservice dependency tracing over a 17-service DAG (100 scenarios)—where Expansion-Contraction reaches 88% overall accuracy and 85% on NLP-complex cases (vs. 55% for the next-best baseline). Investigation caching reduces token usage by up to 93.9%, concurrent path analysis yields up to 1.43× speedup, and a production deployment demonstrates the pattern’s viability for enterprise-scale agentic systems.
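A toy version of the two phases can help fix the idea. The sketch below is an illustration under assumptions, not the paper's system: `expand_contract` is a hypothetical name, a boolean `inspect` callback stands in for each ephemeral specialist agent, and the rule that deeper findings override shallower ones is a simplification loosely modeled on the depth-priority heuristic the abstract mentions.

```python
from typing import Callable, Dict, List

Graph = Dict[str, List[str]]  # node -> upstream dependencies


def expand_contract(graph: Graph, origin: str,
                    inspect: Callable[[str], bool]) -> List[str]:
    """Return root-cause candidates reachable from `origin`.

    Expansion walks outward from the origin and inspects each node with
    only local context. Contraction merges findings on the way back,
    preferring the deepest faulty nodes.
    """
    memo: Dict[str, List[str]] = {}

    def visit(node: str) -> List[str]:
        if node in memo:
            return memo[node]
        memo[node] = []  # placeholder guards against cycles
        local_fault = inspect(node)  # stand-in for a per-node agent
        deeper: List[str] = []
        for dep in graph.get(node, []):
            for finding in visit(dep):
                if finding not in deeper:
                    deeper.append(finding)
        # Contraction: a node is blamed only if nothing upstream is faulty.
        memo[node] = deeper if deeper else ([node] if local_fault else [])
        return memo[node]

    return visit(origin)


# Hypothetical supply chain: a late order traced back to one supplier.
supply_graph: Graph = {
    "order": ["assembly"],
    "assembly": ["chip_supplier", "steel_supplier"],
    "chip_supplier": [],
    "steel_supplier": [],
}
delayed = {"order", "assembly", "chip_supplier"}
root_causes = expand_contract(supply_graph, "order", lambda n: n in delayed)
print(root_causes)
```

Each call to `inspect` sees a single node, which mirrors the small local context the pattern relies on to avoid saturating one agent's context window.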
Vista: Verifier-in-the-Loop Agentic Reinforcement Learning for Quantum Program Synthesis
19 May 2026
University of Liverpool, Aalto University

The Vista framework introduces a verifier-in-the-loop agentic reinforcement learning approach for quantum program synthesis, addressing challenges of costly and multi-stage verification in OpenQASM 3.0 circuit generation. It employs hierarchical rewards and budget-aware gated evaluation, achieving superior semantic quality and a 1.77x speedup in verification efficiency.

fastWorkflow: Closing the Performance Gap Between Small and Frontier Language Models for Conversational Agents
19 May 2026
Radiant Logic
Large language models are increasingly deployed in conversational agents that assist humans with complex, multi-step tasks, yet their deployment at scale is constrained by high inference costs, latency, and data privacy concerns. Small language models (SLMs) offer compelling operational advantages but exhibit systematic failure modes in agentic settings, particularly in conversational workflows: domains where tasks are solved interactively by a human and an LLM through structured tool invocation. Despite growing SLM deployment, these agentic failure modes remain poorly characterized and inadequately addressed. We present an empirically-grounded taxonomy categorizing SLM failures across five dimensions: natural language understanding failures, tool management failures, task decomposition and sequencing failures, agentic reasoning failures, and context management failures, and quantify their prevalence on the τ-bench benchmark. Guided by this taxonomy, we introduce fastWorkflow, a dual-mode agentic architectural framework implementing a cascaded NLU pipeline for intent detection and structured parameter extraction with validation, hierarchical context organization that reduces effective action space, explicit task planning with dependency-aware decomposition, and adaptive context management, among other targeted mitigations. On τ-bench, GPTOSS-20B augmented with fastWorkflow achieves 83.47% Pass^1 on the Retail domain and 78% on Airline, surpassing all frontier models evaluated on the τ-bench leaderboard including Claude Sonnet 4 (80.5% Retail, 60.0% Airline) and Claude Opus 4.1 (82.4% Retail, 56.0% Airline), while operating at ∼22× lower inference cost. Even Mistral-7B-Instruct with fastWorkflow matches Claude Sonnet 4 on Airline at 60%. Ablation studies confirm that the cascaded NLU pipeline is the most impactful component, with its removal causing performance collapses of 58 points on Retail and 68 points on Airline.
Our findings demonstrate that architectural separation of concerns, offloading error-prone operations to structured subsystems while preserving LLM flexibility for planning and recovery, can close the performance gap between small and frontier models in conversational workflow tasks, shifting the cost-performance Pareto frontier for production deployment in domains involving multi-turn, tool-augmented human-LLM collaboration.

Evaluation & Benchmarking

10 of 12 on alphaXiv
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
13 May 2026
Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 private validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves 61.8% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.
#agentic-frameworks #agents #computer-science
Why Johnny Can't Use Agents: Industry Aspirations vs. User Realities with AI Agents
03 May 2026
Carnegie Mellon University

This research reveals a disparity between industry marketing and user experience with AI agents, establishing a taxonomy of advertised agent capabilities and empirically identifying five critical usability barriers faced by end-users. The findings suggest that existing usability challenges from large language models are amplified in agentic systems, particularly concerning delegated multi-step workflows and real-world consequences.

#computer-science #human-computer-interaction
Reasoning-Intensive Regression
01 May 2026
MIT
AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e., deducing subtle numerical scores from text. Unlike standard language regression tasks such as sentiment or similarity analysis, RiR often appears instead in ad-hoc applications such as rubric-based scoring, modeling dense rewards in complex environments, or domain-specific retrieval, where much deeper analysis of context is required while only limited task-specific training data and computation are available. We cast four realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and fine-tuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances.
#computer-science #artificial-intelligence #computation-and-language
Willful Disobedience: Automatically Detecting Failures in Agentic Traces
25 Mar 2026

AgentPex introduces an automated, specification-driven system for evaluating AI agent behavior across multi-step execution traces. It systematically extracts behavioral rules from agent prompts and tool schemas to detect procedural violations, even in scenarios where task outcome-based evaluations report success, revealing agent "willful disobedience".

#computer-science #artificial-intelligence #software-engineering
OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction
16 Feb 2026
Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks. While most existing benchmarks assume simple, perfectly documented tools, real-world tools (e.g., general "search" APIs) are often opaque, lacking clear best practices or failure modes. Can LLM agents improve their performance in environments with opaque tools by interacting and subsequently improving documentation? To study this, we create OpaqueToolsBench, a benchmark consisting of three distinct task-oriented environments: general function calling, interactive chess playing, and long-trajectory agentic search. Each environment provides underspecified tools that models must learn to use effectively to complete the task. Results on OpaqueToolsBench suggest existing methods for automatically documenting tools are expensive and unreliable when tools are opaque. To address this, we propose a simple framework, ToolObserver, that iteratively refines tool documentation by observing execution feedback from tool-calling trajectories. Our approach outperforms existing methods on OpaqueToolsBench across datasets, even in relatively hard settings. Furthermore, for test-time tool exploration settings, our method is also efficient, consuming 3.5-7.5x fewer total tokens than the best baseline.
#agentic-frameworks #agents #computer-science
Generating Expressive and Customizable Evals for Timeseries Data Analysis Agents with AgentFuel
12 Mar 2026

Rockfish Data and Carnegie Mellon University researchers introduce AgentFuel, a framework for generating expressive and customizable evaluations tailored for timeseries data analysis agents. The framework reveals that agents achieve an average accuracy of 66% on e-commerce and 60% on IoT benchmarks, but only 21% on a telecom benchmark, particularly struggling with stateful (34% accuracy) and incident-specific queries (10% accuracy), and demonstrates that integrating AgentFuel into an optimization loop can improve agent accuracy by up to 25%.

#agentic-frameworks #agents #ai-for-cybersecurity
Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook
18 Feb 2026
University of Maryland, Mohamed bin Zayed University of Artificial Intelligence

An empirical diagnosis of Moltbook, a large-scale AI-only social platform with over two million LLM agents, reveals that despite extensive interactions, robust socialization, defined as sustained behavioral adaptation and collective structure formation, does not automatically emerge. The study finds consistent semantic diversity at the micro-level, limited agent adaptation to social feedback, and transient influence hierarchies.

#computer-science #artificial-intelligence #computation-and-language
Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models
03 Feb 2026
University of Illinois at Urbana-Champaign

An automated framework, Persuade Me If You Can (PMIYC), was developed by researchers at the University of Illinois Urbana-Champaign to quantify the persuasive abilities and vulnerabilities of large language models (LLMs) in multi-turn dialogues. This framework, using a Normalized Change in Agreement (NCA) metric, revealed that LLMs' susceptibility to persuasion varies significantly between subjective and misinformation claims, with models like GPT-4o showing over 50% greater resistance to misinformation.

#computer-science #artificial-intelligence #computation-and-language
Benchmarking Agents in Insurance Underwriting Environments
31 Jan 2026

Snorkel AI researchers developed UNDERWRITE, a benchmark for evaluating AI agents in commercial insurance underwriting that incorporates proprietary knowledge, noisy tools, and multi-turn interactions. Evaluating frontier models on UNDERWRITE revealed that high accuracy often trades off with efficiency and robustness, with models exhibiting prevalent domain-specific hallucinations and significant reliability issues.

#agentic-frameworks #agents #computer-science
Trace-Level Analysis of Information Contamination in Multi-Agent Systems
30 Apr 2026

Researchers from Cornell University and University of Illinois Urbana-Champaign developed a trace-level analysis method to systematically study information contamination in multi-agent systems. Their work reveals that errors from extracted data can propagate silently, leading to incorrect outcomes without workflow structural changes, or cause expensive detours that still recover to correct answers.

#adversarial-robustness #agentic-frameworks #agents

Security & Privacy

8 of 11 on alphaXiv
Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain
14 Apr 2026
ServiceNow, MILA-Québec
While finetuning AI agents on interaction data -- such as web browsing or tool use -- improves their capabilities, it also introduces critical security vulnerabilities within the agentic AI supply chain. We show that adversaries can effectively poison the data collection pipeline at multiple stages to embed hard-to-detect backdoors that, when triggered, cause unsafe or malicious behavior. We formalize three realistic threat models across distinct layers of the supply chain: direct poisoning of finetuning data, pre-backdoored base models, and environment poisoning, a novel attack vector that exploits vulnerabilities specific to agentic training pipelines. Evaluated on two widely adopted agentic benchmarks, all three threat models prove effective: poisoning only a small number of demonstrations is sufficient to embed a backdoor that causes an agent to leak confidential user information with over 80% success.
#adversarial-attacks#adversarial-robustness#agents
The Verifier Tax: Horizon Dependent Safety Success Tradeoffs in Tool Using LLM Agents
18 Mar 2026

This research quantifies the trade-offs between safety and capability in tool-using LLM agents by evaluating the impact of runtime safety verifiers on task performance. It reveals a "Safety-Capability Gap" where verifiers block up to 94% of unsafe actions, but agents often fail to recover, incurring a "Verifier Tax" of 2.0-2.8x increased token cost and exhibiting prevalent "Integrity Leak" failures from data hallucination.

#computer-science#cryptography-and-security
MoltGraph: A Longitudinal Temporal Graph Dataset of Moltbook for Coordinated-Agent Detection
28 Feb 2026

MoltGraph introduces a longitudinal temporal graph dataset derived from Moltbook, an agent-native social platform, to investigate coordinated agent behaviors. The study reveals that bursty coordinated activities are associated with a 506.35% higher early interaction rate and a 242.63% higher downstream content exposure through feed-based snapshots compared to non-coordinated content.

#computer-science#cryptography-and-security#social-and-information-networks
Tracking Capabilities for Safer Agents
07 May 2026
EPFL

EPFL researchers developed the `tacit` framework, which employs static type checking with Scala 3's capture checking to provide provably safe constraints for AI agents. This system prevents information leakage and unauthorized side effects by enforcing capability-based security, achieving 100% security against adversarial attacks while maintaining or improving agent performance.

#agentic-frameworks#agents#computer-science
Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use
06 May 2026
Retrieval-Augmented Generation (RAG) and agentic AI systems are increasingly prevalent in enterprise AI deployments. However, real enterprise environments introduce challenges largely absent from academic treatments and consumer-facing APIs: multiple tenants with heterogeneous data, strict access-control requirements, regulatory compliance, and cost pressures that demand shared infrastructure. A fundamental problem underlies existing RAG architectures in these settings: retrieval systems rank documents by relevance--whether through semantic similarity, keyword matching, or hybrid approaches--not by authorization, so a query from one tenant can surface another tenant's confidential data simply because it scores highest. We formalize this gap and analyze additional shortcomings--including tool-mediated disclosure, context accumulation across turns, and client-side orchestration bypass--that arise when agentic systems conflate relevance with authorization. To address these challenges, we introduce a layered isolation architecture combining policy-aware ingestion, retrieval-time gating, and shared inference, enforced through server-side agentic orchestration. This approach centralizes security-critical operations--tool execution authorization, state isolation, and policy enforcement--on the server, creating natural enforcement points for multitenant isolation while allowing client-side frameworks to retain control over agent composition and latency-sensitive operations. We validate the proposed architecture through an open-source implementation in OGX, a vendor-neutral framework that implements an OpenAI-compatible, open-source Responses API with server-side multi-turn orchestration. We evaluate it empirically and show that ABAC gating eliminates cross-tenant leakage while introducing negligible overhead.
#agents#computer-science#artificial-intelligence
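
The gap described in this abstract (ranking by relevance rather than by authorization) can be illustrated with a minimal sketch of retrieval-time gating: candidates are filtered by an attribute check first, and only the survivors are ranked. The `Doc` and `Principal` types, the `tenant` and `clearance` attributes, and `gated_search` are hypothetical names chosen for illustration; this is not the OGX API or the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant: str      # owning tenant (hypothetical attribute)
    clearance: int   # minimum clearance needed to read (hypothetical attribute)
    score: float     # relevance score from any retriever


@dataclass(frozen=True)
class Principal:
    tenant: str
    clearance: int


def is_authorized(user: Principal, doc: Doc) -> bool:
    """Attribute check: same tenant and sufficient clearance."""
    return doc.tenant == user.tenant and user.clearance >= doc.clearance


def gated_search(user: Principal, candidates: list, k: int = 3) -> list:
    """Filter by authorization first, then rank the survivors by relevance."""
    allowed = [d for d in candidates if is_authorized(user, d)]
    return sorted(allowed, key=lambda d: d.score, reverse=True)[:k]


CORPUS = [
    Doc("a-1", "acme", 1, 0.62),
    Doc("b-7", "globex", 1, 0.97),  # most relevant, but belongs to another tenant
    Doc("a-2", "acme", 3, 0.88),    # same tenant, needs higher clearance
    Doc("a-3", "acme", 1, 0.41),
]

alice = Principal(tenant="acme", clearance=1)
print([d.doc_id for d in gated_search(alice, CORPUS)])  # ['a-1', 'a-3']
```

A relevance-only ranker would return `b-7` first for the same query; applying the gate before ranking means the highest-scoring foreign document never becomes a candidate.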
SAPO: Secure Automated Prompt Optimization via Multi-Agent Collaboration
19 May 2026
Microsoft, Amazon

SAPO, a multi-agent framework developed at Microsoft, introduces a secure automated prompt optimization method that balances task performance with explicit security constraints for large language models. The system achieved a 100% adversarial robustness score on HarmBench while simultaneously improving aggregated task accuracy by at least 2.6 percentage points compared to single-objective baselines.

When Harmful Intent Dissolves into Technical Detail: How Safe Are Coding Agents Against Cyber Misuse?
19 May 2026
Purdue University
Coding agents are increasingly integrated into realistic software development workflows, where they can write, modify, and execute code on behalf of users. This capability creates a distinct safety requirement: agents must refuse requests that would enable malicious cyber activity. Yet in cybersecurity, harmful intent often dissolves into technical detail. A prompt may describe a sequence of legitimate operations without explicitly revealing the downstream consequence they collectively produce. Safe behavior therefore hinges on an agent’s ability to reason from prompt to consequence under partial information. In this paper, we empirically evaluate how safe coding agents are against cyber misuse. We construct a cybersecurity evaluation dataset designed to preserve verifiable maliciousness while removing explicit intent. Our data synthesis pipeline hierarchically partitions the cybersecurity space and generates diverse, intent-obscured requests, validated using an ensemble of LLM judges to ensure implicit but genuine harmful capability. The resulting dataset contains 2.2k samples and exhibits substantially greater domain coverage and implicitness than existing cybersecurity safety benchmarks. Using the resulting dataset, we evaluate nine LLM agents in the OpenHands framework and make three key observations. First, safety performance varies widely across cybersecurity subdomains, highlighting the need for broad domain coverage. Second, a per-step guardrail significantly improves detection over prompt-only refusal, but a non-trivial fraction of harmful cases remains undetected. Third, we show that lightweight dry-run simulation, namely allowing the actor model to internally roll out action sequences and plausible consequences, recovers a meaningful portion of the guardrail’s detection gains without requiring real execution. Overall, our results motivate realistic, domain-diverse evaluation for coding-agent misuse prevention and point to dry-run simulation as a promising direction for more effective and efficient guardrails.
Who Decides the Trade-off? Resolution Policy as Delegation Governance in Autonomous Agents
19 May 2026
DOCOMO Innovations
When an autonomous AI agent’s delegated constraints cannot be simultaneously satisfied, someone must decide which constraint to sacrifice. In current LLM-based agent systems, this decision is made probabilistically by the model’s sampling process, producing outcomes that are unpredictable, unreproducible, and unauditable. We term this the Trust Gap. Through 2,248 experimental probes across two frontier LLMs, we demonstrate that a single fallback instruction reduces deviation from 76% to 0%, establishing that behavioral compliance is achievable. However, behavioral compliance is fundamentally distinct from structural guarantee: a single adversarial override reverses compliance from 0% to 100% (R5), and this pattern generalizes across resolution strategies (R7). We formalize the missing element—Resolution Policy—through the Deterministic Delegation Model (DDM): a principal’s deterministic, pre-committed trade-off strategy that structurally binds intent to execution outcome. Evaluation across complete 2 × 2 factorial designs confirms DDM operates independently of prompt content, injection content, and resolution strategy type. Concurrent work has advanced authorization enforcement; the complementary question—what to do when authorized actions conflict, and by whose authority—is the problem Resolution Policy resolves.
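
To make the idea of a pre-committed, deterministic trade-off strategy concrete, here is a minimal sketch of one possible policy: a strict priority ordering in which the lowest-priority constraint is sacrificed first. The function `resolve`, the flight-booking data, and the strict-priority rule are all illustrative assumptions; the abstract does not specify how the Deterministic Delegation Model is implemented.

```python
def resolve(options, constraints):
    """Pick an option under a pre-committed priority ordering.

    `constraints` is a list of (name, predicate) pairs, highest priority first.
    Constraints are dropped from the low-priority end until some option
    satisfies every remaining one. Returns (chosen option, dropped names).
    """
    for keep in range(len(constraints), -1, -1):
        active = constraints[:keep]
        feasible = [o for o in options if all(pred(o) for _, pred in active)]
        if feasible:
            dropped = [name for name, _ in constraints[keep:]]
            # Input order breaks ties, so the same inputs always give the same answer.
            return feasible[0], dropped
    return None, [name for name, _ in constraints]  # only reached with no options


# Hypothetical delegation: book a flight under three constraints that conflict.
FLIGHTS = [
    {"id": "F1", "price": 620, "arrive": 17, "nonstop": True},
    {"id": "F2", "price": 480, "arrive": 17, "nonstop": False},
    {"id": "F3", "price": 450, "arrive": 21, "nonstop": True},
]
POLICY = [
    ("budget", lambda f: f["price"] <= 500),
    ("arrive_by_18", lambda f: f["arrive"] <= 18),
    ("nonstop", lambda f: f["nonstop"]),
]

choice, sacrificed = resolve(FLIGHTS, POLICY)
print(choice["id"], sacrificed)  # F2 ['nonstop']
```

Because the ordering is fixed by the principal before execution, the sacrificed constraint can be audited and reproduced, in contrast to a choice that falls out of model sampling.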

System Optimization & Efficiency

7 of 10 on alphaXiv
XGrammar-2: Efficient Dynamic Structured Generation Engine for Agentic LLMs
26 Mar 2026

An engine named XGrammar 2 was developed to enable dynamic and efficient structured generation for agentic large language models, achieving near-zero end-to-end latency and substantially faster grammar compilation for scenarios like tool calling. This system improves function-calling accuracy and ensures 100% correct schema generation by leveraging just-in-time compilation and cross-grammar caching.

#agentic-frameworks#agents#computer-science
Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints
25 Mar 2026

Researchers from MIT and others propose a batch-level, resource-aware routing framework for Large Language Models (LLMs) that explicitly manages monetary cost, GPU capacity, and model concurrency limits. It leverages integer linear programming with robust optimization to handle performance prediction uncertainty and includes an offline procedure for optimal model instance allocation, achieving up to 24% performance improvement under adversarial batching compared to per-query methods.

#computer-science#artificial-intelligence#machine-learning
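
The difference between per-query and batch-level routing can be seen in a toy version of the assignment problem. The sketch below uses exhaustive search in place of an integer linear programming solver and omits the robust-optimization step entirely; `route_batch`, the quality predictions, costs, and capacities are all made-up illustrative values, not the paper's formulation.

```python
from itertools import product


def route_batch(pred_quality, cost, capacity, budget):
    """Assign each query in a batch to one model.

    pred_quality[q][m]: predicted quality of model m on query q
    cost[m]: cost of one query on model m
    capacity[m]: maximum queries model m may take from this batch
    budget: total cost allowed for the batch
    Exhaustive search stands in for an ILP solver (fine only for tiny batches).
    """
    n_models = len(cost)
    best, best_score = None, float("-inf")
    for assign in product(range(n_models), repeat=len(pred_quality)):
        if sum(cost[m] for m in assign) > budget:
            continue  # violates the monetary budget
        if any(assign.count(m) > capacity[m] for m in range(n_models)):
            continue  # violates a model's capacity limit
        score = sum(pred_quality[q][m] for q, m in enumerate(assign))
        if score > best_score:
            best, best_score = assign, score
    return best, best_score


# Model 0 is small and cheap; model 1 is large, costly, and can take one query.
PRED_QUALITY = [[0.60, 0.90], [0.50, 0.95], [0.70, 0.75]]
assignment, total = route_batch(PRED_QUALITY, cost=[1, 5], capacity=[3, 1], budget=7)
print(assignment)  # (0, 1, 0)
```

Routing each query independently would send all three to the large model, since it scores higher on each one, and would exceed both the capacity and the budget. Solving for the batch as a whole gives the single large-model slot to the query that gains the most from it.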
Scaling Textual Gradients via Sampling-Based Momentum
18 Nov 2025
UT Austin, University of Chicago

This research introduces Textual Stochastic Gradient Descent with Momentum (TSGD-M), a method that enhances the scalability and stability of automatic prompt engineering by dynamically reweighting and sampling from past textual gradients. This approach consistently improved test accuracy and reduced variance across multiple benchmarks, such as a 1.4% gain on the MATH task, while effectively overcoming implicit context length limitations in Large Language Models.

#computer-science#artificial-intelligence#computation-and-language
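
One way to picture "reweighting and sampling from past textual gradients" is an exponentially decayed distribution over the feedback history, from which a small subset is drawn for the next prompt update. The decay rule, the `beta` parameter, and the function names below are assumptions made for illustration; the summary above does not state the exact weighting TSGD-M uses.

```python
import random


def momentum_weights(n, beta=0.7):
    """Normalized weights over n past gradients; the most recent gets the most mass."""
    raw = [beta ** (n - 1 - i) for i in range(n)]  # index n-1 is the newest entry
    total = sum(raw)
    return [w / total for w in raw]


def sample_gradients(history, k, beta=0.7, seed=0):
    """Draw k textual gradients (with replacement) instead of concatenating them all."""
    rng = random.Random(seed)
    weights = momentum_weights(len(history), beta)
    return rng.choices(history, weights=weights, k=k)


# Hypothetical feedback strings produced by an LLM critic over four iterations.
HISTORY = [
    "State the answer format explicitly.",
    "Ask for step-by-step reasoning.",
    "Remind the model to check units.",
    "Shorten the instructions; they are redundant.",
]
picked = sample_gradients(HISTORY, k=2)
print(picked)
```

Sampling a fixed number of gradients keeps the optimizer prompt at a bounded length no matter how long the history grows, which is the context-length benefit the summary mentions.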
FLASC: Federated LoRA with Sparse Communication
19 May 2026
Carnegie Mellon University
Low-rank adaptation (LoRA) is a promising method for finetuning models in communication-constrained settings such as cross-device federated learning (FL). Prior work has explored ways to improve the efficiency of LoRA in federated settings by imposing additional sparsity constraints. However, existing methods for sparse LoRA not only harm accuracy but can in fact increase overall communication costs. We instead propose FLASC, a simple composite method that consists of a PEFT method and compression algorithm. First, we demonstrate that FLASC as a combination of LoRA and sparse Top-K communication outperforms baselines of using a lower LoRA rank or pruning LoRA weights. Second, FLASC-Search efficiently searches the space of rank-and-sparsity configurations by first tuning sparsity at a low rank and then transferring to higher ranks. Across four FL datasets, we demonstrate that FLASC outperforms existing sparse LoRA methods with up to 20% higher accuracy or 10× less communication. Overall, FLASC is a simple yet competitive baseline which can be easily extended to more advanced PEFT and compression methods in the future.
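
The sparse Top-K communication step mentioned in this abstract can be sketched in a few lines: each client sends only the largest-magnitude entries of its update as (index, value) pairs, and the server averages what it receives. The sketch treats an update as a flat list of floats rather than LoRA factor matrices, and the function names are hypothetical; it illustrates generic Top-K sparsification, not the FLASC codebase.

```python
def top_k_sparsify(update, k):
    """Keep the k largest-magnitude entries; return them as {index: value}."""
    top = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)[:k]
    return {i: update[i] for i in sorted(top)}


def densify(sparse, size):
    """Rebuild a dense vector, filling dropped positions with zeros."""
    dense = [0.0] * size
    for i, value in sparse.items():
        dense[i] = value
    return dense


def aggregate(payloads, size):
    """Server side: average the sparse payloads from all clients."""
    total = [0.0] * size
    for payload in payloads:
        for i, value in payload.items():
            total[i] += value
    return [t / len(payloads) for t in total]


client_update = [0.1, -0.9, 0.05, 0.4, -0.3]
payload = top_k_sparsify(client_update, k=2)
print(payload)              # {1: -0.9, 3: 0.4}
print(densify(payload, 5))  # [0.0, -0.9, 0.0, 0.4, 0.0]
```

Here two of five values are transmitted. The adapter itself stays dense at its chosen rank and only the communicated update is sparsified, which is the contrast the abstract draws with pruning LoRA weights or lowering the rank.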
Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
07 May 2026
Long chain-of-thought reasoning and agentic tool-calling produce traces spanning tens of thousands of tokens, yet Transformer KV caches grow linearly with sequence length, creating a memory bottleneck on commodity hardware. State-space models offer constant-memory recurrence but suffer a memory cliff: retrieval accuracy collapses once the gap between a stored fact and its query exceeds the effective horizon of the recurrent state. We introduce Echo, a KV-cache-free associative recall architecture built around Spectral Koopman Attention (SKA), a drop-in replacement for attention layers that augments SSM blocks with a closed-form dynamical operator whose sufficient statistics are accumulated in constant memory with no KV cache. Echo fits a spectral linear system to the key and value history via kernel ridge regression and retrieves through a learned power-iterated filter, all from O(r^2) streaming state where r is a small projection rank. On the Multi-Query Associative Recall benchmark, a pure Mamba-2 SSM fails to exceed chance accuracy (~3%) across all gap lengths and KV-pair counts, while at the 50M parameter scale SKA-augmented models achieve 100% retrieval accuracy on every configuration tested, including distractor gaps of 4,096 tokens with 32 KV pairs. Across five additional transfer benchmarks including needle-in-a-haystack, tool-trace, and multi-hop retrieval, SKA consistently outperforms both pure SSM and SSM+Attention hybrids while maintaining constant inference memory. Ablations confirm that the spectral operator, not the prefix masking strategy, drives the retrieval gain.
#agents#attention-mechanisms#computer-science
AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices
01 May 2026
Autonomous agents powered by large language models (LLMs) are increasingly used to automate complex, multi-step tasks such as coding or web-based question answering. While remote, cloud-based agents offer scalability and ease of deployment, they raise privacy concerns, depend on network connectivity, and incur recurring API costs. Deploying agents locally on user devices mitigates these issues by preserving data privacy and eliminating usage-based fees. However, agentic workflows are far more resource-intensive than typical LLM interactions. Iterative reasoning, tool use, and failure retries substantially increase token consumption, often expending significant compute without successfully completing tasks. In this work, we investigate the time, token, and energy overhead of locally deployed LLM-based agents on consumer hardware. Our measurements show that agentic execution increases GPU power draw, temperature, and battery drain compared to single-inference workloads. To address this inefficiency, we introduce AgentStop, a lightweight efficiency supervisor that predicts and preemptively terminates trajectories unlikely to succeed. Leveraging low-cost execution signals, such as token-level log probabilities, AgentStop can reduce wasted energy by 15-20% with minimal impact on task performance (<5% utility drop) for challenging web-based question answering and coding benchmarks. These findings position predictive early termination as a practical mechanism for enabling sustainable, privacy-preserving LLM agents on user devices. Our project code and data are available at this https URL.
#agents#computer-science#artificial-intelligence
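
As a rough picture of early termination driven by token-level log probabilities, the sketch below stops a trajectory once the average log probability over the last few steps drops below a threshold. The fixed threshold rule, the `window` and `threshold` values, and the function names are assumptions for illustration; the abstract describes AgentStop as a predictive supervisor and does not specify this rule.

```python
from statistics import mean


def should_stop(step_logprobs, window=3, threshold=-2.0):
    """True when the mean token log-prob over the last `window` steps is below threshold."""
    if len(step_logprobs) < window:
        return False  # not enough evidence yet
    recent = [mean(tokens) for tokens in step_logprobs[-window:]]
    return mean(recent) < threshold


def run_with_supervisor(steps, window=3, threshold=-2.0):
    """Replay a trajectory; return (steps executed, whether it was stopped early)."""
    seen = []
    for i, tokens in enumerate(steps, start=1):
        seen.append(tokens)
        if should_stop(seen, window, threshold):
            return i, True
    return len(steps), False


# Hypothetical per-step token log-probs for two trajectories.
CONFIDENT = [[-0.2, -0.4], [-0.3, -0.1], [-0.5, -0.2], [-0.3, -0.3]]
STRUGGLING = [[-1.0, -1.2], [-2.5, -3.0], [-2.8, -2.6], [-3.1, -2.9], [-0.1, -0.1]]

print(run_with_supervisor(CONFIDENT))   # (4, False)
print(run_with_supervisor(STRUGGLING))  # (3, True)
```

The second trajectory is cut after three of its five steps, so the remaining two steps of compute are never spent. The cost of a rule like this is that a trajectory which would have recovered is also cut, which corresponds to the utility drop the abstract reports.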
CAMI: Practical Cost-Aware Agent-Guided Multi-Indexing for Semantic Retrieval
19 May 2026
IBM
RAG ingestion pipelines frequently augment the search corpus index with semantic enrichment indices (e.g., synthetic queries or summaries generated from corpus chunks) that are subsequently queried alongside the base index to improve retrieval via better alignment between document representations and user intent. While these supplementary representations substantially improve retrieval quality, they introduce a computational bottleneck: the configuration space of enrichment types and generator models is combinatorial, and the cost of exhaustive index-time evaluation scales linearly with corpus size. We introduce CAMI (Cost-Aware Multi-Indexing), a framework that formalizes multi-index construction as a budgeted, multi-objective portfolio selection problem. CAMI targets the upstream decision of which enrichment views to generate and materialize before the retrieval backend is applied. CAMI incorporates three primary mechanisms: (i) an agentic discovery phase that proposes corpus-specific representation templates; (ii) an atomic-unit search procedure that evaluates individual enrichment-model pairs and recombines them via fidelity-local closure to identify synergistic portfolios; and (iii) a confidence-aware promotion schedule that prunes unpromising configurations early, decoupling optimization spend from total corpus size. We evaluate CAMI across diverse retrieval corpora. Our findings reveal that the framework systematically isolates high-recall portfolios under strict budget constraints, outperforming standard content-only baselines in challenging settings by up to 9.4% recall@10. Further, CAMI identifies these high-recall portfolios using up to 5x less budget than random search baselines, making the approach practical in real production scenarios.

Engineering & Operations

1 of 3 on alphaXiv
SEAR: Schema-Based Evaluation and Routing for LLM Gateways
20 Mar 2026
Evaluating production LLM responses and routing requests across providers in LLM gateways requires fine-grained quality signals and operationally grounded decisions. To address this gap, we present SEAR, a schema-based evaluation and routing system for multi-model, multi-provider LLM gateways. SEAR defines an extensible relational schema covering both LLM evaluation signals (context, intent, response characteristics, issue attribution, and quality scores) and gateway operational metrics (latency, cost, throughput), with cross-table consistency links across around one hundred typed, SQL-queryable columns. To populate the evaluation signals reliably, SEAR proposes self-contained signal instructions, in-schema reasoning, and multi-stage generation that produces database-ready structured outputs. Because signals are derived through LLM reasoning rather than shallow classifiers, SEAR captures complex request semantics, enables human-interpretable routing explanations, and unifies evaluation and routing in a single query layer. Across thousands of production sessions, SEAR achieves strong signal accuracy on human-labeled data and supports practical routing decisions, including large cost reductions with comparable quality.
#computer-science#artificial-intelligence#computation-and-language
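
The idea of unifying evaluation and routing in a single SQL query layer can be sketched with two small tables joined on a session key. The table names, column names, provider labels, and the rule "cheapest provider above a quality floor" are all hypothetical simplifications; the actual SEAR schema spans around one hundred typed columns and is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE eval_signals (
    session_id INTEGER PRIMARY KEY,
    provider   TEXT,
    intent     TEXT,
    quality    REAL   -- LLM-derived quality score in [0, 1]
);
CREATE TABLE gateway_metrics (
    session_id INTEGER PRIMARY KEY REFERENCES eval_signals(session_id),
    latency_ms REAL,
    cost_usd   REAL
);
""")
conn.executemany("INSERT INTO eval_signals VALUES (?, ?, ?, ?)", [
    (1, "provider_a", "coding", 0.92), (2, "provider_a", "coding", 0.88),
    (3, "provider_b", "coding", 0.90), (4, "provider_b", "coding", 0.86),
    (5, "provider_c", "coding", 0.60), (6, "provider_c", "coding", 0.64),
])
conn.executemany("INSERT INTO gateway_metrics VALUES (?, ?, ?)", [
    (1, 900, 0.030), (2, 950, 0.032), (3, 700, 0.012),
    (4, 720, 0.010), (5, 300, 0.002), (6, 310, 0.002),
])


def pick_provider(conn, intent, min_quality):
    """Cheapest provider whose average quality for this intent meets the floor."""
    row = conn.execute("""
        SELECT e.provider, AVG(e.quality) AS q, AVG(m.cost_usd) AS c
        FROM eval_signals e JOIN gateway_metrics m USING (session_id)
        WHERE e.intent = ?
        GROUP BY e.provider
        HAVING AVG(e.quality) >= ?
        ORDER BY c ASC
        LIMIT 1
    """, (intent, min_quality)).fetchone()
    return row[0] if row else None


print(pick_provider(conn, "coding", 0.85))  # provider_b
```

In this toy data, `provider_b` averages 0.88 quality at roughly a third of `provider_a`'s cost, so a floor of 0.85 routes coding traffic to it. This is the kind of cost reduction with comparable quality the abstract describes, expressed as one query over both signal types.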