Method
This project implements a verifiable information pipeline: collect a broad paper pool from primary sources, rank it with a multi-signal rules engine, and generate compact bilingual daily briefs.
Collection
The default collector covers AI-related arXiv categories including cs.AI, cs.CL, cs.LG, cs.CV, cs.MA, cs.IR, cs.RO, cs.SD, cs.MM, cs.HC, cs.SE, cs.DC, stat.ML, and eess.AS. Categories and limits are configuration-driven.
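As an illustration of how a configuration-driven category list could translate into a paginated request, here is a minimal sketch. It assumes the collector talks to the public arXiv API query endpoint; `DEFAULT_CATEGORIES`, `build_arxiv_query_url`, and the default page size are hypothetical names and values, not taken from the project.

```python
from urllib.parse import urlencode

# Categories listed above; assumed to be loaded from configuration in practice.
DEFAULT_CATEGORIES = [
    "cs.AI", "cs.CL", "cs.LG", "cs.CV", "cs.MA", "cs.IR", "cs.RO",
    "cs.SD", "cs.MM", "cs.HC", "cs.SE", "cs.DC", "stat.ML", "eess.AS",
]

ARXIV_API = "http://export.arxiv.org/api/query"


def build_arxiv_query_url(categories, start=0, max_results=100):
    """Build one paginated arXiv API URL covering all given categories."""
    if not categories:
        raise ValueError("at least one category is required")
    # arXiv's search syntax joins per-category filters with OR.
    search_query = " OR ".join(f"cat:{c}" for c in categories)
    params = {
        "search_query": search_query,
        "start": start,              # pagination offset
        "max_results": max_results,  # page size (hypothetical default)
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"
```

Calling `build_arxiv_query_url(DEFAULT_CATEGORIES, start=100)` would produce the URL for the second page under these assumptions; the function only builds the URL and performs no network request.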
Ranking Signals
- Institution background, matched against a configured list of top institutions.
- Community recommendation through optional HF Daily Papers enrichment.
- Community heat through HF upvote tiers.
- Top conference signals such as NeurIPS, ICML, ICLR, ACL, and CVPR.
- Code availability, detected from paper text or GitHub metadata.
- Practitioner relevance keywords such as deployment, inference, agents, RAG, safety, and evaluation.
- Academic impact through optional Semantic Scholar citation counts.
- Open-source heat through optional GitHub stars and trending signals.
- arXiv category weight from configured AI categories.
- Novelty and duplicate penalties for repeated titles.
- Penalties for recently repeated topics, to reduce back-to-back concentration on the same topic.
- Safety, ethics, and governance keywords for additional visibility.
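One way to combine signals like those listed above is an additive score with a per-signal breakdown, which keeps each ranking decision traceable. The sketch below is an assumption about the general shape of such a rules engine, not the project's actual implementation: the `Paper` class, the signal names, and every weight and penalty value are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical weights; real values would come from configuration.
WEIGHTS = {
    "institution": 2.0,
    "hf_daily": 2.0,
    "conference": 1.5,
    "code": 1.0,
    "practitioner": 1.0,
    "safety": 0.5,
}
DUPLICATE_PENALTY = 3.0      # hypothetical
TOPIC_REPEAT_PENALTY = 1.0   # hypothetical


@dataclass
class Paper:
    title: str
    signals: set = field(default_factory=set)  # names of signals that fired
    topic: str = ""


def score(paper, seen_titles, recent_topics):
    """Return (total, breakdown) so every point in the score is traceable."""
    # Sum the weight of each known signal that fired for this paper.
    breakdown = {
        name: WEIGHTS[name] for name in sorted(paper.signals) if name in WEIGHTS
    }
    # Penalize titles already seen (compared case-insensitively).
    if paper.title.strip().lower() in seen_titles:
        breakdown["duplicate_penalty"] = -DUPLICATE_PENALTY
    # Penalize topics that appeared in recent output.
    if paper.topic and paper.topic in recent_topics:
        breakdown["topic_repeat_penalty"] = -TOPIC_REPEAT_PENALTY
    return sum(breakdown.values()), breakdown
```

Ranking then reduces to sorting papers by the first element of the returned tuple, while the breakdown dictionary can be written out alongside the result for later inspection.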
Cadence
The production pipeline computes the publication date in Beijing/Taipei time (UTC+8) and fetches papers for that target date, falling back to the nearest usable date when arXiv returns no usable rows for it. Without external API keys, production runs continue with arXiv metadata alone; mock-run is reserved for fixed demo data and offline validation.
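The date logic described above can be sketched as follows. Both function names are hypothetical, and the fallback rule (smallest distance in days, ties broken toward the earlier date) is an assumption about what "nearest usable date" means rather than the project's confirmed behavior.

```python
from datetime import date, datetime, timedelta, timezone

# Beijing and Taipei both observe UTC+8 with no daylight saving time,
# so a fixed offset is sufficient here.
UTC8 = timezone(timedelta(hours=8))


def publication_date(now=None):
    """Return the calendar date in Beijing/Taipei time.

    `now` should be timezone-aware; it defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    return now.astimezone(UTC8).date()


def resolve_fetch_date(target, usable_dates):
    """Return target if usable, else the nearest usable date (assumed rule).

    Ties are broken toward the earlier date. Returns None when no usable
    date exists at all.
    """
    if not usable_dates:
        return None
    return min(usable_dates, key=lambda d: (abs((d - target).days), d))
```

For example, a run at 17:00 UTC on 1 January falls on 2 January in Beijing/Taipei time, so the target date is 2 January even though the UTC calendar day has not yet changed.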
Transparency
The frontend presents only the compact brief. Machine-readable details, run reports, and QA results remain in the data directory for inspecting collection scale, pagination, dedupe, and fallback behavior.
Limits
Briefs are generated from titles, abstracts, and public metadata by default. They do not replace full-paper reading, and arXiv preprints must not be described as verified conclusions or conference acceptances.