TL;DR
Model Releases
Tools & Products
Research Papers
Model Releases
Apple has unveiled a new AI architecture built around Google's Gemini models.
Quasar-Preview is a new preview model from silx-ai. The preview release offers early access so that user feedback can inform further development and refinement.
Tools & Products
Minimal coding agent written in Rust, optimized for memory footprint and performance
Run Claude Design locally as an Agent Skill — Cursor, Claude Code & more. Produce polished UI mockups, prototypes, decks & wireframes as self-contained HTML, without claude.ai/design. Best with Opus 4.8.
Open-source observability tool that uses AI agents to self-heal your software
A free, open-source mole.fit: a native macOS GUI for the Mole CLI (mo) that can clean, uninstall, optimize, analyze disk, and watch live status. It also offers long-range history and an MCP server for AI agents.
Mobile-first web terminal for AI coding agents (Claude Code, Codex, OpenCode). Server-side virtual terminal enables session persistence & auto-reconnect — disconnect, sleep, refresh, and pick up exactly where you left off. With customizable shortcut keyboard, file workspace & live web preview.
BrandDocs is a set of agent skills that learn your existing Word, PowerPoint and Excel templates and generate new on-brand documents from them. Unlike generic AI document generators, it preserves brand, structure, styles and formulas by construction. Built for Claude Code, Codex and compatible AI agents.
VC Boom scores your pitch deck in under 90 seconds and tells you the single fastest fix, matches you with the right investors from 47,000+ (each with a one-line reason they fit), then drafts personalized cold emails you send from your own inbox. Prep for each investor, then book the calls. Founders using VC Boom have already raised $95M. Built by an 8-year VC who raised hundreds of millions and deployed across 47 startups. Free to start, no subscription.
Sibyl Memory Plugin for Hermes enables persistent memory across long time horizons and relational context that was previously unavailable. Self-learning and automatic skill creation produce an agent that grows with you. Local SQLite, structured tiers, no vector DB. SDK, CLI, MCP server, Hermes plugin.
The 30 MB open-source AI agent runtime for edge devices. Offline by default — GPIO, UART, MQTT as first-class nodes. Industrial protocols (OPC-UA, Modbus) on the roadmap.
The world can't build compute fast enough to keep up with AI demand. So we took a different path. ZeroGPU is AI infrastructure powered by small language models running on a hybrid edge network reusing compute that already exists. Not every task needs a frontier model. Our purpose-built, edge-optimized models run 10x faster, 50% cheaper and offload 70–80% of production tasks to small models with frontier-level accuracy.
A benchmark for evaluating LLM × harness performance.
Kimi Work is a desktop agent for knowledge work. It connects to local files, uses WebBridge for browser automation, runs scheduled tasks, coordinates agent swarms, creates PPT/Excel/Word/PDF outputs, and includes native finance data tools.
You can tell when an app was vibecoded. So can your users. That generic purple gradient, the pills and badges, emojis everywhere, it all screams: "an AI made this in 20 minutes." Uiverse Design is a library of AI-first design systems you can drop into any project. Each one defines real typography, spacing, color, images and component treatment. All of them ship with a DESIGN.md instructions file, so that your agent knows exactly how to use it. You just sit back, and watch your app transform.
You land in a new city. You open every app you know. Two hours later you're still scrolling, still unsure, still guessing. TravelMind was built for that moment. Swipe through places, tell us what you love — the AI does the rest. It learns your taste and finds the right spot before you even know to look. Your taste. Every city. Live now on iOS and Android.
Most workout apps give you a generic plan and call it personalized. Whistle actually knows you. It reads your Apple Health data and builds a real training plan around your fitness level, recovery, and goals. Detailed workouts, smart progression, all on your iPhone and Apple Watch. Whether you're just getting started or pushing toward a new personal best, your AI coach figures out what you need and when you need it. Your data. Your plan. Your pace.
Research Papers
This article explores how agent harnesses and grep-based techniques are reshaping agentic search methodologies in AI systems.
This research investigates whether large language models can match or outperform classical hyperparameter optimization algorithms.
Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions and the latency-memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attent...
Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games s...
Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable...
World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may...
Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Pr...
Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task ma...
Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long to...
Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this...
Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adapt...
We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation difference...
Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the DeepSeek-V4 architecture. Rather than passively attending to all historical tokens, LSA proactively predicts future context demands and preserves only the query-critical KV chunks in the GPU memory. Crucially, we instantiate t...
Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models, a persistent 3D cache that stores scene information directl...
AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specif...
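A side note on the RLVR item above: its point that identical rewards within a group yield no gradient signal can be seen in a few lines. The sketch below assumes a GRPO-style estimator that normalizes each reward by its group's mean and standard deviation; the abstract does not specify the exact form, and the function name and `eps` value are illustrative only.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center each reward on the group mean and scale by the group std.

    Assumed form (GRPO-style normalization); not taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes: correct traces get positive advantages,
# incorrect traces get negative ones.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A group where every trace gets the same reward: every advantage is zero,
# so the policy-gradient update for this prompt vanishes.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

Under this assumed estimator, the uniform group contributes nothing to the update regardless of how much the underlying reasoning traces differ, which is the gap the abstract describes.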
Industry News
Microsoft's open-source tools were compromised in a security breach that let attackers steal passwords from AI developers.