Building the Future of Agentic & AI Systems
ACM CAIS 2026 — The premier venue for rigorous, reproducible research on compound AI architectures, optimization, and deployment.
An AI system named Glia, developed at MIT CSAIL, autonomously designs and optimizes computer system mechanisms using a human-inspired multi-agent architecture. It generates interpretable algorithms and insights, achieving a 1.4x reduction in mean request completion time over prior methods in LLM serving optimization and demonstrating adaptability to varying conditions.
A Retrieval-Augmented Generation (RAG) system is introduced for end-to-end security incident analysis, enabling large language models to answer forensic questions and reconstruct attack sequences from diverse security logs. The system achieved 94% recall for malware traffic analysis and 96% recall with 100% precision for Active Directory attack step detection using models like Claude Sonnet 4, significantly outperforming baselines.
Researchers at MIT CSAIL developed Engram, an agentic AI architecture that enables sustained, coherent progress in complex system optimization by transferring distilled knowledge between sequential LLM agents. Engram generated solutions outperforming previous LLM-based methods across diverse tasks and discovered novel algorithms, including a dynamic programming approach for multi-cloud multicast that achieved costs below the human state-of-the-art.
Oracle researchers introduced Agent Spec, a declarative and framework-agnostic language designed to standardize the definition of AI agents and their workflows. This specification enables reproducible cross-framework evaluation, revealing measurable differences in accuracy, latency, and execution behavior across various runtimes for identical agent designs.
Researchers at Megagon Labs developed a memory-augmented framework that enables large language model agents to adapt without parameter updates by leveraging LLM-generated critiques stored in episodic and semantic memory. This reflective approach demonstrated up to a 24.8% accuracy improvement over retrieval-based baselines across diverse classification tasks.
Researchers developed mmGRPO, an online reinforcement learning framework that adapts Group Relative Policy Optimization (GRPO) to multi-module language model programs. Combining mmGRPO with prompt optimization improved performance by up to 11% over baseline methods across various tasks.
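To make the "group relative" part of GRPO concrete, here is a minimal sketch of the standard single-policy advantage computation: each rollout's reward is normalized against the mean and standard deviation of its own group. It does not reproduce mmGRPO's multi-module extension, and the function name, the `eps` constant, and the reward values are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds the scalar rewards of several rollouts sampled for
    the same input. `eps` (an assumed constant) avoids division by zero
    when every rollout in the group scores the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of the same input.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Rollouts that beat their group's average get a positive advantage and the rest get a negative or zero one, so no separate value model is needed to form a baseline.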
MARVIS introduces a training-free method that extends Vision-Language Models (VLMs) to predict across diverse data modalities, including vision, audio, biological, and tabular data, by transforming data embeddings into visualizations for VLM interpretation. The approach achieves performance within 2.5% of specialized models and significantly outperforms other generalist foundation models by an average of 16.7%, while also enhancing data privacy by avoiding direct exposure of raw data.
Researchers at Bar-Ilan University developed ACDL, a formal language for precisely describing the structure and dynamic evolution of LLM agent input contexts, providing standardized visual diagrams. This formalization enhances communication and comparability of agentic systems, with experiments demonstrating that context structure variations can alter agent performance by up to 5 percentage points.
The Vista framework introduces a verifier-in-the-loop agentic reinforcement learning approach for quantum program synthesis, addressing challenges of costly and multi-stage verification in OpenQASM 3.0 circuit generation. It employs hierarchical rewards and budget-aware gated evaluation, achieving superior semantic quality and a 1.77x speedup in verification efficiency.
This research reveals a disparity between industry marketing and user experience with AI agents, establishing a taxonomy of advertised agent capabilities and empirically identifying five critical usability barriers faced by end-users. The findings suggest that existing usability challenges from large language models are amplified in agentic systems, particularly concerning delegated multi-step workflows and real-world consequences.
AgentPex introduces an automated, specification-driven system for evaluating AI agent behavior across multi-step execution traces. It systematically extracts behavioral rules from agent prompts and tool schemas to detect procedural violations, including cases where outcome-based evaluations report success, and in doing so reveals agent "willful disobedience".
Rockfish Data and Carnegie Mellon University researchers introduce AgentFuel, a framework for generating expressive and customizable evaluations tailored for time-series data analysis agents. The framework reveals that agents achieve an average accuracy of 66% on e-commerce and 60% on IoT benchmarks, but only 21% on a telecom benchmark, and that they particularly struggle with stateful queries (34% accuracy) and incident-specific queries (10% accuracy). It also demonstrates that integrating AgentFuel into an optimization loop can improve agent accuracy by up to 25%.
An empirical diagnosis of Moltbook, a large-scale AI-only social platform with over two million LLM agents, reveals that despite extensive interactions, robust socialization, defined as sustained behavioral adaptation and collective structure formation, does not automatically emerge. The study finds consistent semantic diversity at the micro-level, limited agent adaptation to social feedback, and transient influence hierarchies.
An automated framework, Persuade Me If You Can (PMIYC), was developed by researchers at the University of Illinois Urbana-Champaign to quantify the persuasive abilities and vulnerabilities of large language models (LLMs) in multi-turn dialogues. This framework, using a Normalized Change in Agreement (NCA) metric, revealed that LLMs' susceptibility to persuasion varies significantly between subjective and misinformation claims, with models like GPT-4o showing over 50% greater resistance to misinformation.
Snorkel AI researchers developed UNDERWRITE, a benchmark for evaluating AI agents in commercial insurance underwriting that incorporates proprietary knowledge, noisy tools, and multi-turn interactions. Evaluating frontier models on UNDERWRITE revealed that high accuracy often trades off with efficiency and robustness, with models exhibiting prevalent domain-specific hallucinations and significant reliability issues.
Researchers from Cornell University and the University of Illinois Urbana-Champaign developed a trace-level analysis method to systematically study information contamination in multi-agent systems. Their work reveals that errors in extracted data can propagate silently, producing incorrect outcomes without any structural change to the workflow, or can cause expensive detours that still recover to correct answers.
This research quantifies the trade-offs between safety and capability in tool-using LLM agents by evaluating the impact of runtime safety verifiers on task performance. It reveals a "Safety-Capability Gap" where verifiers block up to 94% of unsafe actions, but agents often fail to recover, incurring a "Verifier Tax" of 2.0-2.8x increased token cost and exhibiting prevalent "Integrity Leak" failures from data hallucination.
MoltGraph introduces a longitudinal temporal graph dataset derived from Moltbook, an agent-native social platform, to investigate coordinated agent behaviors. The study reveals that bursty coordinated activities are associated with a 506.35% higher early interaction rate and a 242.63% higher downstream content exposure through feed-based snapshots compared to non-coordinated content.
EPFL researchers developed the `tacit` framework, which employs static type checking with Scala 3's capture checking to provide provably safe constraints for AI agents. This system prevents information leakage and unauthorized side effects by enforcing capability-based security, achieving 100% security against adversarial attacks while maintaining or improving agent performance.
SAPO, a multi-agent framework developed at Microsoft, introduces a secure automated prompt optimization method that balances task performance with explicit security constraints for large language models. The system achieved a 100% adversarial robustness score on HarmBench while simultaneously improving aggregated task accuracy by at least 2.6 percentage points compared to single-objective baselines.
An engine named XGrammar 2 was developed to enable dynamic and efficient structured generation for agentic large language models, achieving near-zero end-to-end latency and substantially faster grammar compilation for scenarios like tool calling. This system improves function-calling accuracy and ensures 100% correct schema generation by leveraging just-in-time compilation and cross-grammar caching.
Researchers from MIT and others propose a batch-level, resource-aware routing framework for Large Language Models (LLMs) that explicitly manages monetary cost, GPU capacity, and model concurrency limits. It leverages integer linear programming with robust optimization to handle performance prediction uncertainty and includes an offline procedure for optimal model instance allocation, achieving up to 24% performance improvement under adversarial batching compared to per-query methods.
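As a rough illustration of why batch-level routing differs from per-query routing, the sketch below assigns each query in a small batch to one model so that total predicted quality is maximized under a shared cost budget and per-model concurrency caps. It is not the authors' formulation: exhaustive search stands in for an integer linear programming solver, the robust-optimization treatment of prediction uncertainty is omitted, and every name and number is hypothetical.

```python
from itertools import product


def route_batch(quality, cost, budget, max_concurrent):
    """Pick one model per query to maximize total predicted quality.

    quality[q][m] and cost[q][m] are hypothetical per-query, per-model
    estimates; max_concurrent[m] caps how many queries model m may take.
    Exhaustive search replaces an ILP solver here, so this only suits
    tiny batches.
    """
    n_queries, n_models = len(quality), len(max_concurrent)
    best_score, best_assignment = float("-inf"), None
    for assignment in product(range(n_models), repeat=n_queries):
        # Enforce the shared monetary budget across the whole batch.
        total_cost = sum(cost[q][m] for q, m in enumerate(assignment))
        if total_cost > budget:
            continue
        # Enforce each model's concurrency limit.
        if any(assignment.count(m) > max_concurrent[m] for m in range(n_models)):
            continue
        score = sum(quality[q][m] for q, m in enumerate(assignment))
        if score > best_score:
            best_score, best_assignment = score, assignment
    return best_assignment, best_score


# Model 1 is stronger but costlier and can serve only one query at a time.
quality = [[0.6, 0.95], [0.5, 0.8], [0.7, 0.75]]
cost = [[1, 4], [1, 4], [1, 4]]
assignment, score = route_batch(quality, cost, budget=9, max_concurrent=[3, 1])
```

A per-query router would send all three queries to the stronger model. The batch-level search instead gives its single slot to the query that gains the most from it and routes the others to the cheaper model.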
This research introduces Textual Stochastic Gradient Descent with Momentum (TSGD-M), a method that enhances the scalability and stability of automatic prompt engineering by dynamically reweighting and sampling from past textual gradients. The approach consistently improved test accuracy and reduced variance across multiple benchmarks, including a 1.4% gain on the MATH task, while effectively overcoming implicit context length limitations in Large Language Models.
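One way to picture "reweighting and sampling from past textual gradients" is a momentum-style sampler over earlier critiques, sketched below. The exponential decay, the `beta` value, and the function names are assumptions made for illustration and may differ from the weighting scheme TSGD-M actually uses.

```python
import random


def momentum_weights(num_steps, beta=0.7):
    """Exponentially decayed weights, newest step last, summing to 1."""
    raw = [beta ** (num_steps - 1 - t) for t in range(num_steps)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_feedback(history, k, beta=0.7, seed=0):
    """Draw k past critiques with replacement, favoring recent ones."""
    rng = random.Random(seed)
    weights = momentum_weights(len(history), beta)
    return rng.choices(history, weights=weights, k=k)


# Hypothetical critiques collected over three optimization steps.
history = ["be concise", "show units", "check sign errors"]
weights = momentum_weights(len(history))
picked = sample_feedback(history, k=2)
```

Sampling a fixed number of critiques, rather than concatenating the full history, keeps the optimizer prompt bounded in length however many steps have run.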