Watch 9 videos showing the capabilities of Gemini Omni and Gemini 3.5, announced at Google I/O 2026.
TL;DR
Model Releases
Research Papers
Industry News
Model Releases
Tools & Products
The agentic HTML editor: your local AI agent writes the HTML, you ship it. 75 Skills × 9 Surfaces (magazine · deck · poster · XHS / tweet · prototype · data report · Hyperframes). Sandboxed preview · 1-click to WeChat / X / Zhihu / HTML / PNG. Zero API key: Claude Code / Cursor / Codex / Gemini / Copilot / OpenCode / Qwen / Aider.
Bot that bridges Feishu/Lark messenger with a local Claude Code CLI: streaming cards, per-chat sessions, multiple workspaces
A better /goal for Codex and Claude Code
A new approach enables real-time LLM inference on standard GPUs, achieving throughput of 3,000 tokens per second per request.
Ava is an AI BDR that runs your entire outbound on autopilot. She sources leads from 250M+ professionals, runs multi-channel outreach, and books qualified meetings. Fully autonomously.
/monitor notifies your agent via webhook the moment pages or sites change. Use up to 90% fewer LLM tokens by only ingesting what changes on a page.
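The idea of ingesting only what changed on a page can be illustrated with a small diff between two snapshots. This is a rough sketch, not the product's implementation or API: the function name `changed_lines` and the sample page text are hypothetical, and the comparison uses Python's standard `difflib` on plain text lines.

```python
import difflib
from typing import List


def changed_lines(old_text: str, new_text: str) -> List[str]:
    """Return only the lines of new_text that were added or modified.

    Assumption: snapshots are plain text compared line by line.
    Deleted lines are ignored in this sketch.
    """
    old = old_text.splitlines()
    new = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    out: List[str] = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            out.extend(new[j1:j2])
    return out


# Hypothetical snapshots of the same page at two points in time.
before = "Pricing\nBasic: $10\nPro: $30\nContact us"
after = "Pricing\nBasic: $10\nPro: $35\nContact us\nNew: Enterprise tier"

print(changed_lines(before, after))
# prints ['Pro: $35', 'New: Enterprise tier']
```

Only the two changed lines would be passed on to the model, rather than the full page; the actual savings depend on how much of a page changes between checks.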
Ava Studio researches your product, develops hooks and creative angles, then generates 50+ editable short-form ad variants ready for TikTok, Reels, Meta, and any platform you want to ship on.
Point MCP Bridge at any REST, GraphQL, SOAP, or gRPC API. It auto-generates MCP tool definitions with typed schemas, auth, rate limiting, and response processing. Your LLM agents call enterprise APIs through one standard interface.
FireCoach.ai is the fastest way to clone your sales methodology and coach every rep on your team at scale, without adding headcount. Build custom AI sales bots trained on your playbook, run rep roleplays, get scored feedback, and identify coaching gaps before they show up on a lost deal.
Integuru generates fast, reliable APIs for any platform, without browsers or RPA. API calls complete in ~3 seconds with 99.9%+ success. Most agents today use browser automation to control web apps that lack official APIs, but this is slow and brittle. Integuru replaces browsers entirely and connects directly with the backend. Integuru covers authentication and edge cases. Integrations get auto-healing, API docs, and a 24/7 on-call maintenance team. Each API is generated end-to-end in minutes.
Terminal UI for personal finance: Plaid sync, CSV import, AI assistant, and MCP server
How Braintrust engineers use Codex with GPT-5.5 to run experiments and code faster.
AISlop is a new command-line tool designed to detect and identify code smells commonly found in AI-generated code.
OpenAI launches Rosalind Biodefense, expanding trusted access to GPT-Rosalind for vetted developers and U.S. government partners advancing biodefense, public health, and pandemic preparedness through frontier AI.
Research Papers
Activation-based control steers large language models (LLMs) by intervening on their internal representations during inference, and has emerged as an effective paradigm for controlling behaviors such as persona and style. However, existing methods often rely on fixed steering directions or task-specific intervention modules, making them difficult to adapt to fine-grained concepts and compositional constraints. We propose UniSteer, a text-guided activation flow matching model that learns a condit...
Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perceptio...
As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, ...
Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines that are expensive to deploy at RL scale and often unreliable for rare-entity facts, w...
Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving the quantitative capacity limits and underlying dynamics of exact parametric memory largely unexplored. To bridge this gap, we employ LoRA as a controlled memory capacity probe within the latent space to systematically qu...
Recent advances in Vision-Language Models (VLMs) have achieved impressive performance across many tasks, yet prior studies report unsatisfactory performance when applying large language or multimodal models to finding abnormal patterns in sequential data. Public anomaly detection benchmarks typically provide interval annotations but not natural-language rationales, making it difficult to fine-tune VLMs to produce grounded, interpretable decisions. To address this gap, we construct VisAnomBench, ...
We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a represe...
This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its atten...
The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) combining on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which ta...
We address the task of generating physically accurate and visually faithful 4D Human-Object Interaction (HOI). Given a static 3D human and target object represented as 3D Gaussian Splats (3DGS), our goal is to synthesize dynamic scenes where the human actively engages with the object through actions, such as punching or kicking, in accordance with a given input text. To this end, we introduce PhyGenHOI, a novel framework that couples generative human motion with an explicit physical object simul...
Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that...
Autoregressive video diffusion models generate streaming video by producing frames sequentially, conditioning each chunk on previously generated content. These models are structurally anchored to the first frame: its key-value representation occupies a privileged position in the attention cache and serves as the primary scene reference throughout generation. As the cleanest and most error-free position in the cache, this anchor draws disproportionate attention, suppressing video dynamics, and lo...
Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. However, training such agents is currently bottlenecked by a reliance on scraped external repositories, which limits domain diversity, environment controllability, and the targeting of specific capability deficits. We introduce LiteCoder-Terminal-Gen, a zero-dependency synthesis pipeline that autonomously generates executable and verifiable terminal ...
One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay that walks the plan through the environment to its first invalid transition, then one LLM call that resumes from the verified prefix. RePoT costs at most one extra LLM call on the ~14% of problems where PoT fails. RePoT beats PoT by +3 to +11pp across four closed-model confi...
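The replay-then-resume loop described in the RePoT summary can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the grid world, the `step` function, and the `resume` callback (a stand-in for the single LLM call) are all hypothetical.

```python
from typing import Callable, List, Optional, Set, Tuple

State = Tuple[int, int]

# Hypothetical toy environment: a square grid with walls.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def step(state: State, action: str, walls: Set[State], size: int) -> Optional[State]:
    """Return the next state, or None if the action is invalid here."""
    if action not in MOVES:
        return None
    dx, dy = MOVES[action]
    nxt = (state[0] + dx, state[1] + dy)
    in_bounds = 0 <= nxt[0] < size and 0 <= nxt[1] < size
    return nxt if in_bounds and nxt not in walls else None


def verified_replay(
    plan: List[str], start: State, walls: Set[State], size: int
) -> Tuple[List[str], State, bool]:
    """Walk the plan deterministically up to its first invalid transition.

    Returns (verified prefix, state after the prefix, whether the whole plan was valid).
    """
    state, prefix = start, []
    for action in plan:
        nxt = step(state, action, walls, size)
        if nxt is None:
            return prefix, state, False
        prefix.append(action)
        state = nxt
    return prefix, state, True


def recover(
    plan: List[str],
    start: State,
    walls: Set[State],
    size: int,
    resume: Callable[[List[str], State], List[str]],
) -> List[str]:
    """Keep a valid plan as-is; otherwise call `resume` once from the verified prefix."""
    prefix, state, ok = verified_replay(plan, start, walls, size)
    if ok:
        return plan
    return prefix + resume(prefix, state)


# Stand-in for the single LLM call; a real system would prompt a model here.
def fake_llm(prefix: List[str], state: State) -> List[str]:
    return ["right", "up"]


walls = {(1, 0)}
plan = ["down", "right", "up", "right"]  # "up" would move into the wall at (1, 0)
print(recover(plan, (0, 0), walls, 3, fake_llm))
# prints ['down', 'right', 'right', 'up']
```

The extra call is made only when replay finds an invalid action, which matches the summary's point that the added cost is limited to the problems where the one-shot plan fails. This sketch does not re-verify the resumed suffix.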
Reinforcement learning (RL) post-training has been shown to improve reasoning in large language models (LLMs). However, there has been little exploration of the problem of data contamination in RL post-training, which potentially undermines generalization and evaluation reliability of the training process itself. Existing detection methods primarily rely on output-level signals such as likelihood or entropy, which become unreliable for RL-trained models since RL shapes behavior through trajectory-level re...
Industry News
Sam Altman and Dario Amodei have recently retreated from earlier catastrophic predictions about AI eliminating jobs in the near term. Both leaders are now adopting a more cautious stance on the timeline and severity of AI-driven employment disruption.
Key announcements and insights were shared at the Mistral AI Now Summit held in Paris, showcasing the latest developments from the Mistral AI team.
Amazon discontinued its AI leaderboard to prevent workers from becoming overly focused on chasing usage metrics rather than genuine productivity.
Microsoft's internal data reveals that deploying AI tools is often more costly than hiring additional human workers for the same tasks.
Boston Children's Hospital uses OpenAI technology to improve patient care, reduce operational burden, and help diagnose more than 40 rare disease cases.
Discussion
A mysterious LLM named Hy3 has unexpectedly dominated OpenRouter's model rankings by a significant margin, raising questions about its capabilities and origins.
The article explores various code smells and anti-patterns commonly found in LLM-generated and LLM-influenced code.
An analysis examines whether AI is pushing frontend development into a period of stagnation similar to the industry's previous lost decade.
Protestware is emerging as a concept for coding agents, potentially incorporating protest or resistance mechanisms into AI-driven development tools.
University of Waterloo students develop AI prototypes like sign language tutors to reshape the future of education and work.