Making agents use tools and reusable skills more reliably; making RAG retrieval and knowledge-base QA more reliable; strengthening multimodal understanding of charts, documents, and visual evidence

What is worth tracking today

The leading themes in today’s high-signal papers are: making agents use tools and reusable skills more reliably; making RAG retrieval and knowledge-base QA more reliable; and strengthening multimodal understanding of charts, documents, and visual evidence. Before deciding whether to reproduce or adopt an idea, open the original paper and check the abstract, the evaluation setup, and the availability of code and data.

Featured papers: title, takeaway, and verification trail

1. Make agents use tools and reusable skills more reliably

Self Evolving Agents for Tool Use Skills (Alice Chen, Bob Smith) 2606.00001 PDF

The abstract describes agents that learn reusable tool-use skills through iterative self-improvement, unit tests, execution feedback, and evaluation. Before adopting the idea, check whether the task setup is realistic, whether code or data are available, whether the evaluation covers complex scenarios, and whether the conclusions transfer to real systems.
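To make the "unit tests as a gate for reusable skills" idea concrete, here is a minimal sketch. It is not the paper's method: the `SkillLibrary` class, its `propose` and `call` methods, and the temperature-conversion skill are all hypothetical names chosen for illustration. The only assumption taken from the abstract is that a candidate skill is kept for reuse only after it passes its tests.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Skill = Callable[..., object]
TestCase = Tuple[tuple, object]  # (positional arguments, expected result)


class SkillLibrary:
    """Hypothetical store that admits a skill only if all of its unit tests pass."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def propose(self, name: str, skill: Skill, tests: Sequence[TestCase]) -> bool:
        """Run every test; register the skill and return True only if all pass."""
        for args, expected in tests:
            try:
                if skill(*args) != expected:
                    return False  # wrong result: reject the candidate
            except Exception:
                return False  # a crash counts as a failed test
        self._skills[name] = skill
        return True

    def call(self, name: str, *args: object) -> object:
        """Reuse a previously admitted skill."""
        return self._skills[name](*args)

    def names(self) -> List[str]:
        return sorted(self._skills)


library = SkillLibrary()
tests = [((0,), 32), ((100,), 212)]
accepted = library.propose("celsius_to_f", lambda c: c * 9 / 5 + 32, tests)
rejected = library.propose("bad_celsius_to_f", lambda c: c + 32, tests)
```

When reading the paper, a useful question is what plays the role of `tests` here: who writes them, and whether a skill can pass them without being correct in general.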

2. Make RAG retrieval and knowledge-base QA more reliable

RAG Evaluation under Noisy Retrieval (Dan Wang) 2606.00003 PDF

The abstract describes a benchmark that studies the reliability of retrieval-augmented generation under noisy evidence, missing citations, and adversarial documents. Before adopting the idea, check whether the task setup is realistic, whether code or data are available, whether the evaluation covers complex scenarios, and whether the conclusions transfer to real systems.
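As a rough illustration of what "reliability under noisy evidence" can mean in practice, the sketch below mixes distractor passages into the gold evidence and measures how often a system still answers correctly. It is not the benchmark's protocol, and it covers only the noisy-evidence case, not missing citations or adversarial documents. `inject_noise`, `accuracy_under_noise`, and the toy `first_passage_reader` are hypothetical names, and the example data is invented.

```python
import random
from typing import Callable, List, Sequence, Tuple

Case = Tuple[str, List[str], str]  # (question, gold evidence passages, gold answer)
AnswerFn = Callable[[str, List[str]], str]


def inject_noise(evidence: Sequence[str], distractors: Sequence[str],
                 k: int, rng: random.Random) -> List[str]:
    """Return the evidence plus k distractors in shuffled order (k <= len(distractors))."""
    noisy = list(evidence) + rng.sample(list(distractors), k)
    rng.shuffle(noisy)
    return noisy


def accuracy_under_noise(cases: Sequence[Case], distractors: Sequence[str],
                         answer_fn: AnswerFn, k: int, seed: int = 0) -> float:
    """Fraction of cases answered correctly when k distractors are mixed in."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    correct = 0
    for question, evidence, gold in cases:
        passages = inject_noise(evidence, distractors, k, rng)
        if answer_fn(question, passages) == gold:
            correct += 1
    return correct / len(cases)


def first_passage_reader(question: str, passages: List[str]) -> str:
    """Toy stand-in for a RAG system: answers with the last word of the first passage."""
    return passages[0].rstrip(".").split()[-1]


cases = [("What is the capital of France?", ["The capital of France is Paris."], "Paris")]
distractors = ["The capital of France is Lyon.", "Bananas are yellow.",
               "The capital of Spain is Madrid."]
clean_accuracy = accuracy_under_noise(cases, distractors, first_passage_reader, k=0)
noisy_accuracy = accuracy_under_noise(cases, distractors, first_passage_reader, k=2)
```

Comparing the clean and noisy scores across several values of `k` gives a degradation curve, which is one thing to look for in the paper's evaluation setup.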

3. Strengthen multimodal understanding of charts, documents, and visual evidence

Multimodal Safety Evaluation for Vision Language Models (Eva Green) 2606.00004 PDF

The abstract describes a safety evaluation suite that measures multimodal models across risky visual prompts, jailbreak attempts, and alignment failures, so its focus is safety evaluation rather than chart or document understanding as such. Before adopting the idea, check whether the task setup is realistic, whether code or data are available, whether the evaluation covers complex scenarios, and whether the conclusions transfer to real systems.

4. Improve model reasoning, planning, and verification

Efficient Long Context Inference with Cache Compression (Carol Li) 2606.00002 PDF

The abstract describes a systems method that reduces memory and latency during long-context inference while preserving code reasoning accuracy; its tie to this theme is the preserved reasoning accuracy, not a new reasoning technique. Before adopting the idea, check whether the task setup is realistic, whether code or data are available, whether the evaluation covers complex scenarios, and whether the conclusions transfer to real systems.
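The abstract does not say how the cache is compressed, so the sketch below is only a generic eviction heuristic, shown to make the memory trade-off tangible: keep the first few entries and a fixed window of the most recent ones, and drop everything in between. `WindowedCache`, `n_prefix`, and `n_recent` are hypothetical names, and the string payloads stand in for real key/value tensors.

```python
from collections import deque
from typing import Deque, List, Tuple

Entry = Tuple[int, str]  # (token position, placeholder for the cached key/value)


class WindowedCache:
    """Keep the first `n_prefix` entries plus the most recent `n_recent` entries."""

    def __init__(self, n_prefix: int, n_recent: int) -> None:
        self.n_prefix = n_prefix
        self.prefix: List[Entry] = []
        # A bounded deque discards its oldest item when a new one is appended.
        self.recent: Deque[Entry] = deque(maxlen=n_recent)

    def append(self, entry: Entry) -> None:
        if len(self.prefix) < self.n_prefix:
            self.prefix.append(entry)
        else:
            self.recent.append(entry)

    def entries(self) -> List[Entry]:
        """Everything still held, in position order."""
        return self.prefix + list(self.recent)


cache = WindowedCache(n_prefix=2, n_recent=3)
for position in range(10):
    cache.append((position, f"kv{position}"))
kept_positions = [position for position, _ in cache.entries()]
```

Memory here is bounded by `n_prefix + n_recent` no matter how long the input grows. The question to bring to the paper is what its method discards and how the reported accuracy holds up when the needed information sits in the discarded part.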

5. Improve code generation, execution feedback, and automated repair

Code Model Repair with Execution Feedback (Frank Moore) 2606.00005 PDF

The abstract describes code models that improve patch generation through execution feedback loops, repository tests, and API-aware repair. Before adopting the idea, check whether the task setup is realistic, whether code or data are available, whether the evaluation covers complex scenarios, and whether the conclusions transfer to real systems.
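An execution feedback loop can be sketched in a few lines: run the tests, pass the failure message to a patch generator, and retry up to a fixed budget. This is an illustration under assumptions, not the paper's system. `run_tests`, `repair_loop`, and `toy_patch` are hypothetical names, and `toy_patch` is a hard-coded stand-in for a code model. The sketch uses `exec` for brevity; a real system would run candidates in a sandbox.

```python
from typing import Callable, Optional, Tuple

PatchFn = Callable[[str, str], str]  # (current source, failure message) -> new source


def run_tests(source: str, tests: str) -> Optional[str]:
    """Execute the candidate and its tests; return None on success, else the error text."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # define the candidate functions (unsafe for untrusted code)
        exec(tests, namespace)   # failing assertions raise AssertionError
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def repair_loop(source: str, tests: str, propose_patch: PatchFn,
                max_rounds: int = 3) -> Tuple[str, bool]:
    """Alternate between running tests and patching until they pass or the budget ends."""
    feedback = run_tests(source, tests)
    rounds = 0
    while feedback is not None and rounds < max_rounds:
        source = propose_patch(source, feedback)
        feedback = run_tests(source, tests)
        rounds += 1
    return source, feedback is None


def toy_patch(source: str, feedback: str) -> str:
    """Stand-in for a code model: applies one hard-coded fix and ignores the feedback."""
    return source.replace("a - b", "a + b")


buggy = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5"
fixed, passed = repair_loop(buggy, tests, toy_patch)
```

The points worth checking against the paper are the ones this sketch glosses over: how much of the failure output reaches the model, how many rounds are allowed, and whether the tests used for feedback are the same ones used for scoring.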

6. Improve training-data curation, synthesis, and deduplication

Synthetic Data Curation for Post Training (Henry Liu) 2606.00007 PDF

The abstract describes a data pipeline that uses quality filters to select synthetic instruction data for fine-tuning and post-training. Before adopting the idea, check whether the task setup is realistic, whether code or data are available, whether the evaluation covers complex scenarios, and whether the conclusions transfer to real systems.
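For orientation, a curation pass of this general kind can be sketched as a quality filter followed by exact deduplication on normalized text. The filter rule, the `min_chars` threshold, and the function names `normalize`, `passes_quality`, and `curate` are assumptions made for illustration; the abstract does not specify the paper's filters.

```python
import hashlib
import re
from typing import Dict, List, Set

Example = Dict[str, str]  # expects "instruction" and "response" keys


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def passes_quality(example: Example, min_chars: int = 20) -> bool:
    """Hypothetical filter: non-empty instruction and a response of at least min_chars."""
    return bool(example["instruction"].strip()) and len(example["response"].strip()) >= min_chars


def curate(examples: List[Example]) -> List[Example]:
    """Keep the first copy of each normalized pair that passes the quality filter."""
    seen: Set[str] = set()
    kept: List[Example] = []
    for example in examples:
        if not passes_quality(example):
            continue
        joined = normalize(example["instruction"]) + "\x00" + normalize(example["response"])
        key = hashlib.sha256(joined.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        kept.append(example)
    return kept


raw = [
    {"instruction": "Sum 2 and 3", "response": "The sum of 2 and 3 is 5, since 2 + 3 = 5."},
    {"instruction": "sum  2 and 3 ", "response": "The sum of 2 and 3 is 5, since 2 + 3 = 5."},
    {"instruction": "Explain", "response": "ok"},
]
curated = curate(raw)
```

Exact hashing only catches copies that match after normalization. Whether the paper also handles near-duplicates, and how it scores quality, is something to confirm from the original.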

Other papers worth tracking

Reading boundaries