AI Research Brief

Make agents use tools and reusable skills more reliably, Make RAG retrieval and knowledge-base QA more reliable, Strengthen multimodal understanding of charts, documents, and visual evidence

Sat, 06 Jun 2026 00:00:00 +0000

# Make agents use tools and reusable skills more reliably, Make RAG retrieval and knowledge-base QA more reliable, Strengthen multimodal understanding of charts, documents, and visual evidence

## What is worth tracking today

Today’s high-signal papers point to: make agents use tools and reusable skills more reliably, make RAG retrieval and knowledge-base QA more reliable, strengthen multimodal understanding of charts, documents, and visual evidence. Open the original paper, check the abstract, evaluation setup, and code/data availability before deciding whether to reproduce or adopt the idea.

## Featured papers: title, takeaway, and verification trail

### 1. make agents use tools and reusable skills more reliably

Self Evolving Agents for Tool Use Skills (Alice Chen, Bob Smith) 2606.00001 PDF

Make agents use tools and reusable skills more reliably. The abstract points to: Agents learn reusable tool use skills through iterative self improvement, unit tests, execution feedback, and evaluation. Verify whether the task setup is realistic, code or data are available, the evaluation covers complex scenarios, and the conclusion can transfer into real systems.

### 2. make RAG retrieval and knowledge-base QA more reliable

RAG Evaluation under Noisy Retrieval (Dan Wang) 2606.00003 PDF

Make RAG retrieval and knowledge-base QA more reliable. The abstract points to: A benchmark studies retrieval augmented generation reliability under noisy evidence, missing citations, and adversarial documents. Verify whether the task setup is realistic, code or data are available, the evaluation covers complex scenarios, and the conclusion can transfer into real systems.

### 3. strengthen multimodal understanding of charts, documents, and visual evidence

Multimodal Safety Evaluation for Vision Language Models (Eva Green) 2606.00004 PDF

Strengthen multimodal understanding of charts, documents, and visual evidence. The abstract points to: A safety evaluation suite measures multimodal models across risky visual prompts, jailbreak attempts, and alignment failures. Verify whether the task setup is realistic, code or data are available, the evaluation covers complex scenarios, and the conclusion can transfer into real systems.

### 4. improve model reasoning, planning, and verification

Efficient Long Context Inference with Cache Compression (Carol Li) 2606.00002 PDF

Improve model reasoning, planning, and verification. The abstract points to: A systems method reduces memory and latency during long context model inference while preserving code reasoning accuracy. Verify whether the task setup is realistic, code or data are available, the evaluation covers complex scenarios, and the conclusion can transfer into real systems.

### 5. improve code generation, execution feedback, and automated repair

Code Model Repair with Execution Feedback (Frank Moore) 2606.00005 PDF

Improve code generation, execution feedback, and automated repair. The abstract points to: Code models improve patch generation through execution feedback loops, repository tests, and API-aware repair. Verify whether the task setup is realistic, code or data are available, the evaluation covers complex scenarios, and the conclusion can transfer into real systems.

### 6. improve training-data curation, synthesis, and deduplication

Synthetic Data Curation for Post Training (Henry Liu) 2606.00007 PDF

Improve training-data curation, synthesis, and deduplication. The abstract points to: A data pipeline selects synthetic instruction data for fine-tuning and post-training with quality filters. Verify whether the task setup is realistic, code or data are available, the evaluation covers complex scenarios, and the conclusion can transfer into real systems.

## Other papers worth tracking

- [Red Teaming Open Source LLM Guardrails](https://arxiv.org/abs/2606.00017): Tracks model safety, guardrail routing, risk classification, or governance evaluation; useful for safety and policy workflows.
- [Preference Optimization for Safer Tool Agents](https://arxiv.org/abs/2606.00012): Tracks tool use, execution feedback, and reusable capabilities; useful for agent workflow reliability.
- [Database Native Retrieval for Enterprise RAG](https://arxiv.org/abs/2606.00013): Tracks retrieval, knowledge-base QA, and evidence reliability; useful for RAG evaluation and enterprise knowledge systems.
- [Agentic 3D Modeling through Code Execution](https://arxiv.org/abs/2606.00015): Tracks tool use, execution feedback, and reusable capabilities; useful for agent workflow reliability.
- [Low Rank Adapters as Model Memory Probes](https://arxiv.org/abs/2606.00018): Tracks a concrete training and post-training signal; useful for deciding whether the full paper deserves follow-up.
- [Robotics Policies with Memory Grounded Planning](https://arxiv.org/abs/2606.00006): Tracks a concrete robotics and embodied AI signal; useful for deciding whether the full paper deserves follow-up.
- [Mechanistic Attribution for Factual Editing](https://arxiv.org/abs/2606.00008): Tracks a concrete interpretability signal; useful for deciding whether the full paper deserves follow-up.
- [Chart Understanding for Vision Language Models](https://arxiv.org/abs/2606.00014): Tracks a concrete multimodal models signal; useful for deciding whether the full paper deserves follow-up.
- [Video Diffusion Models Need Temporal Tests](https://arxiv.org/abs/2606.00010): Tracks a concrete video generation signal; useful for deciding whether the full paper deserves follow-up.
- [Serving Quantized Models with Adaptive Batching](https://arxiv.org/abs/2606.00011): Tracks inference cost, latency, throughput, and deployment constraints; useful for systems optimization.
- [Training Data Deduplication for Foundation Models](https://arxiv.org/abs/2606.00016): Tracks tool use, execution feedback, and reusable capabilities; useful for agent workflow reliability.
- [Open Speech Agent Benchmark](https://arxiv.org/abs/2606.00009): Tracks tool use, execution feedback, and reusable capabilities; useful for agent workflow reliability.

## Reading boundaries

- Automated ranking favors papers with community, code, and applied-engineering signals.
- Briefs are based on titles, abstracts, and public metadata by default, not full-paper review.
- External API failures degrade optional signals and are reflected in internal records.

Strengthen multimodal understanding of charts, documents, and visual evidence, Make agents use tools and reusable skills more reliably, Make RAG retrieval and knowledge-base QA more reliable

Fri, 05 Jun 2026 00:00:00 +0000

# Strengthen multimodal understanding of charts, documents, and visual evidence, Make agents use tools and reusable skills more reliably, Make RAG retrieval and knowledge-base QA more reliable

## What is worth tracking today

Today’s high-signal papers point to: strengthen multimodal understanding of charts, documents, and visual evidence, make agents use tools and reusable skills more reliably, make RAG retrieval and knowledge-base QA more reliable. Open the original paper, check the abstract, evaluation setup, and code/data availability before deciding whether to reproduce or adopt the idea.

## Featured papers: title, takeaway, and verification trail

### 1. strengthen multimodal understanding of charts, documents, and visual evidence

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models (Youqi Wu, Mohammad Jalali, Farzan Farnia) 2606.04180 PDF

Strengthen multimodal understanding of charts, documents, and visual evidence. The abstract points to: Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. Verify whether the task setup is realistic, code or data are available, the evaluation covers complex scenarios, and the conclusion can transfer into real systems.

### 2. make agents use tools and reusable skills more reliably

The Impact of Configuring Agentic AI Coding Tools on Build-vs-Buy Decisions: A Study Protocol (Jai Lal Lulla, Matthias Galster, Jie M. Zhang, Sebastian Baltes, Christoph Treude) 2606.03907 PDF

Make agents use tools and reusable skills more reliably. The abstract points to: Agentic AI coding tools write code with increasing autonomy and in doing so decide when to import a library and when to implement functionality from scratch. Verify whether the task setup is realistic, code or data are available, the evaluation covers complex scenarios, and the conclusion can transfer into real systems.

### 3. make RAG retrieval and knowledge-base QA more reliable

Automating Information Extraction and Retrieval for Industrial Spare Parts Pooling (Dyuman Bulloni, Rocco Felici, Oliver Avram, Anna Valente) 2606.03367 PDF

Make RAG retrieval and knowledge-base QA more reliable. The abstract points to: Maintenance organizations in manufacturing try to avoid downtime and unnecessary purchasing by reusing existing assets, but the main obstacle is not a lack of parts but a lack of actionable visibility across sites and partners. Verify whether the task setup is realistic, code or data are available, the evaluation covers complex scenarios, and the conclusion can transfer into real systems.

### 4. make RAG retrieval and knowledge-base QA more reliable

Stationarity-Aware Retrieval-Augmented Time Series Forecasting (Shiqiao Zhou, Holger Schöner, Zipeng Wu, Edouard Fouché, IAG Wilson, Shuo Wang) 2606.04135 PDF

Make RAG retrieval and knowledge-base QA more reliable. The abstract points to: Time series forecasting relies on historical patterns, but real-world series often exhibit non-stationarity and regime shifts that challenge fully parametric forecasters. Verify whether the task setup is realistic, code or data are available, the evaluation covers complex scenarios, and the conclusion can transfer into real systems.

### 5. make agents use tools and reusable skills more reliably

Entropy Gate: Entropy Quenching for Near-Lossless Token Compression in LLM Pipelines (Justice Owusu Agyemang, Jerry John Kponyo, Kwame Opuni-Boachie Obour Agyekum, Francisca Adoma Acheampong, Kwame Agyeman-Prempeh Agyekum, James Dzisi Gadze) 2606.03739 PDF

Make agents use tools and reusable skills more reliably. The abstract points to: LLM pipelines waste substantial token budgets on low-information content: repeated context, verbose responses, and redundant boilerplate. Verify whether the task setup is realistic, code or data are available, the evaluation covers complex scenarios, and the conclusion can transfer into real systems.

## Other papers worth tracking

- [VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring](https://arxiv.org/abs/2606.03954): Tracks model safety, guardrail routing, risk classification, or governance evaluation; useful for safety and policy workflows.
- [MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A](https://arxiv.org/abs/2606.04231): Tracks retrieval, knowledge-base QA, and evidence reliability; useful for RAG evaluation and enterprise knowledge systems.
- [MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments](https://arxiv.org/abs/2606.04171): Tracks retrieval, knowledge-base QA, and evidence reliability; useful for RAG evaluation and enterprise knowledge systems.
- [When Autoregressive Consistency Hurts Safety Alignment](https://arxiv.org/abs/2606.04168): Tracks model safety, guardrail routing, risk classification, or governance evaluation; useful for safety and policy workflows.
- [End-to-End Text Line Detection and Ordering](https://arxiv.org/abs/2606.04166): Tracks retrieval, knowledge-base QA, and evidence reliability; useful for RAG evaluation and enterprise knowledge systems.
- [Expert-Aware Refusal Steering](https://arxiv.org/abs/2606.04160): Tracks model safety, guardrail routing, risk classification, or governance evaluation; useful for safety and policy workflows.
- [HighTide: An Agent-Curated Open-Source VLSI Benchmark Suite](https://arxiv.org/abs/2606.04126): Tracks tool use, execution feedback, and reusable capabilities; useful for agent workflow reliability.
- [UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing](https://arxiv.org/abs/2606.04101): Tracks inference cost, latency, throughput, and deployment constraints; useful for systems optimization.
- [MAOAM: Unified Object and Material Selection with Vision-Language Models](https://arxiv.org/abs/2606.04880): Tracks retrieval, knowledge-base QA, and evidence reliability; useful for RAG evaluation and enterprise knowledge systems.
- [Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill](https://arxiv.org/abs/2606.03980): Tracks tool use, execution feedback, and reusable capabilities; useful for agent workflow reliability.
- [Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning](https://arxiv.org/abs/2606.03965): Tracks tool use, execution feedback, and reusable capabilities; useful for agent workflow reliability.
- [AgenticRL: Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation](https://arxiv.org/abs/2606.03963): Tracks tool use, execution feedback, and reusable capabilities; useful for agent workflow reliability.
- [SEAOTTER: Sensor Embedded Autoencoding with One-Time Transcode for Efficient Reconstruction](https://arxiv.org/abs/2606.03940): Tracks a concrete multimodal models signal; useful for deciding whether the full paper deserves follow-up.
- [Visual Instruction Tuning Aligns Modalities through Abstraction](https://arxiv.org/abs/2606

Strengthen multimodal understanding of charts, documents, and visual evidence, Make agents use tools and reusable skills more reliably, Make RAG retrieval and knowledge-base QA more reliable

Wed, 03 Jun 2026 00:00:00 +0000

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models (Youqi Wu, Mohammad Jalali, Farzan Farnia) 2606.04180 PDF

The Impact of Configuring Agentic AI Coding Tools on Build-vs-Buy Decisions: A Study Protocol (Jai Lal Lulla, Matthias Galster, Jie M. Zhang, Sebastian Baltes, Christoph Treude) 2606.03907 PDF

Automating Information Extraction and Retrieval for Industrial Spare Parts Pooling (Dyuman Bulloni, Rocco Felici, Oliver Avram, Anna Valente) 2606.03367 PDF

Stationarity-Aware Retrieval-Augmented Time Series Forecasting (Shiqiao Zhou, Holger Schöner, Zipeng Wu, Edouard Fouché, IAG Wilson, Shuo Wang) 2606.04135 PDF

Strengthen multimodal understanding of charts, documents, and visual evidence, Make agents use tools and reusable skills more reliably, Make RAG retrieval and knowledge-base QA more reliable

Tue, 02 Jun 2026 00:00:00 +0000

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models (Youqi Wu, Mohammad Jalali, Farzan Farnia) 2606.04180 PDF

The Impact of Configuring Agentic AI Coding Tools on Build-vs-Buy Decisions: A Study Protocol (Jai Lal Lulla, Matthias Galster, Jie M. Zhang, Sebastian Baltes, Christoph Treude) 2606.03907 PDF

Automating Information Extraction and Retrieval for Industrial Spare Parts Pooling (Dyuman Bulloni, Rocco Felici, Oliver Avram, Anna Valente) 2606.03367 PDF

Stationarity-Aware Retrieval-Augmented Time Series Forecasting (Shiqiao Zhou, Holger Schöner, Zipeng Wu, Edouard Fouché, IAG Wilson, Shuo Wang) 2606.04135 PDF

Make agents use tools and reusable skills more reliably, Identify and reduce safety, jailbreak, and alignment risks

Mon, 01 Jun 2026 00:00:00 +0000

# Make agents use tools and reusable skills more reliably, Identify and reduce safety, jailbreak, and alignment risks

## What is worth tracking today

Today’s high-signal papers point to: make agents use tools and reusable skills more reliably, and identify and reduce safety, jailbreak, and alignment risks. Open the original paper, check the abstract, evaluation setup, and code/data availability before deciding whether to reproduce or adopt the idea.

## Featured papers: title, takeaway, and verification trail

### 1. make agents use tools and reusable skills more reliably

Cosmos 3: Omnimodal World Models for Physical AI (Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, et al.) 2606.02800 PDF

Make agents use tools and reusable skills more reliably. The abstract points to: We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. Verify whether the task setup is realistic, code or data are available, the evaluation covers complex scenarios, and the conclusion can transfer into real systems.

### 2. make agents use tools and reusable skills more reliably

Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models (Simone Caldarella, Davide Talon, Rahaf Aljundi, Elisa Ricci, Massimiliano Mancini) 2606.02835 PDF

Make agents use tools and reusable skills more reliably. The abstract points to: Large Reasoning Models (LRMs) improve performance by generating explicit intermediate reasoning traces through increased test-time compute, yet the assumption that longer reasoning is consistently beneficial remains under-examined. Verify whether the task setup is realistic, code or data are available, the evaluation covers complex scenarios, and the conclusion can transfer into real systems.

### 3. identify and reduce safety, jailbreak, and alignment risks

Breaking the Information Silo: Semantic Personas for Cross-Domain Recommendation (Jonathan Mayo, Moshe Unger, Konstantin Bauman) 2606.01783 PDF

Identify and reduce safety, jailbreak, and alignment risks. The abstract points to: Digital platforms increasingly operate as isolated information silos, limiting their ability to construct comprehensive user representations across domains. Verify whether the task setup is realistic, code or data are available, the evaluation covers complex scenarios, and the conclusion can transfer into real systems.

### 4. make agents use tools and reusable skills more reliably

KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators (Taras Sereda, Burak Bartan, Ankita Nayak, Tom St. John, Natalie Serrino, Zain Asgar) 2606.02963 PDF

Make agents use tools and reusable skills more reliably. The abstract points to: Production inference increasingly targets a heterogeneous mix of accelerators. Verify whether the task setup is realistic, code or data are available, the evaluation covers complex scenarios, and the conclusion can transfer into real systems.

### 5. improve code generation, execution feedback, and automated repair

EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement (Hui Li, Yangfan Gao, Junlin Shang, Changhao Jiang, Tao Gui, Qi Zhang, et al.) 2606.02739 PDF

Improve code generation, execution feedback, and automated repair. The abstract points to: Audio tokenizers serve as the discrete interface between continuous audio and Audio Language Models (ALMs), but existing tokenizers often struggle to support both understanding and generation. Verify whether the task setup is realistic, code or data are available, the evaluation covers complex scenarios, and the conclusion can transfer into real systems.

## Other papers worth tracking

- [Large AI Models in Dental Healthcare: From General-Purpose Systems to Domain-Specific Foundation Models](https://arxiv.org/abs/2606.02914): Tracks task design, metrics, and failure cases; useful for model evaluation and regression testing.
- [GloResNet: A lightweight 3D CNN with global topological features for preterm brain injury prediction](https://arxiv.org/abs/2606.02498): Tracks tool use, execution feedback, and reusable capabilities; useful for agent workflow reliability.
- [MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence](https://arxiv.org/abs/2606.02463): Tracks tool use, execution feedback, and reusable capabilities; useful for agent workflow reliability.
- [AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations](https://arxiv.org/abs/2606.02240): Tracks tool use, execution feedback, and reusable capabilities; useful for agent workflow reliability.
- [OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents](https://arxiv.org/abs/2606.02031): Tracks tool use, execution feedback, and reusable capabilities; useful for agent workflow reliability.
- [What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents](https://arxiv.org/abs/2606.02965): Tracks tool use, execution feedback, and reusable capabilities; useful for agent workflow reliability.
- [ATLAS: A Large-Scale Evaluation Benchmark for Adversarial LiDAR Perception](https://arxiv.org/abs/2606.02924): Tracks retrieval, knowledge-base QA, and evidence reliability; useful for RAG evaluation and enterprise knowledge systems.
- [Tiny Collaborative Inference for Occlusion-Robust Object Detection](https://arxiv.org/abs/2606.02894): Tracks retrieval, knowledge-base QA, and evidence reliability; useful for RAG evaluation and enterprise knowledge systems.
- [Do Transformers Need Three Projections? Systematic Study of QKV Variants](https://arxiv.org/abs/2606.04032): Tracks inference cost, latency, throughput, and deployment constraints; useful for systems optimization.
- [Pathway-Structured Privileged Distillation for Deployable Computational Pathology](https://arxiv.org/abs/2606.02877): Tracks retrieval, knowledge-base QA, and evidence reliability; useful for RAG evaluation and enterprise knowledge systems.
- [RRISE: Robust Radius Inference via a Surrogate Estimator](https://arxiv.org/abs/2606.02876): Tracks task design, metrics, and failure cases; useful for model evaluation and regression testing.
- [Toward a Modular Architecture for Embedded AI Agent Systems at the Edge](https://arxiv.org/abs/2606.02862): Tracks tool use, execution feedback, and reusable capabilities; useful for agent workflow reliability.
- [Which Defense Closes Which Threat? Attributing OWASP-LLM-Top-10 Coverage and Its Brittleness Under Paraphrasing](https://arxiv.org/abs/2606.02822): Tracks retrieval, knowledge-base QA, and evidence reliability; useful for RAG evaluation and enterprise knowledge systems.
- [Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection](https://arxiv.org/abs/2606.02812): Tracks