Cainew - Curated AI news for developers

Model Releases

UnDUNE II

RSS

Tools & Products

raiyanyahya/how-to-train-your-gpt

Build a modern LLM from scratch. Every line commented. Explained like we are five.

GitHub

tolibear/goalbuddy

A better /goal for Codex and Claude Code

GitHub

ttnear/Clarc

Native macOS client for Claude Code — a GUI desktop app built with SwiftUI

GitHub

HermannBjorgvin/Clawdmeter

ESP32 desk dashboard that shows Claude Code usage

GitHub

he-yufeng/RepoWiki

Open-source DeepWiki alternative — generate comprehensive wiki documentation for any codebase from terminal or browser

GitHub

Claude Platform on AWS

Claude Platform is now available on AWS, enabling developers to access Anthropic's AI models directly through Amazon Web Services' infrastructure.

RSS

Unagi-cq/cdp-bridge-mcp

MCP server that bridges clients to a real browser through CDP and a companion extension.

GitHub

Jotform Claude App: Build, edit, and analyze forms directly in Claude

Build, edit, and analyze forms directly inside Claude using simple conversations. Create forms, edit fields, add logic, search submissions, and get insights, all by describing what you want. No manual setup or switching tools.

ProductHunt

display.dev: Publish agent-generated HTML behind company auth

display.dev is the easiest way to publish agent-generated artifacts behind company authentication. One command gives your HTML and Markdown files a permanent URL. Your colleagues sign in securely via OTP or Google/Microsoft SSO, and drive iteration with in-line comments.

ProductHunt

Show HN: OpenGravity – A zero-install, BYOK vanilla JS clone of Antigravity

OpenGravity is a vanilla JavaScript implementation of Antigravity that requires no installation and allows users to provide their own API keys (BYOK).

GitHub

Research Papers

Interaction Models

RSS

TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

Test-time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rounds, and verification-based feedback. However, existing structured test-time scaling methods either weakly coordinate parallel reasoning trajectories or rely on noisy historical information without ex...

HuggingFace

Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-...

HuggingFace
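
To make the baseline in the Prompt-Activation Duality abstract concrete, here is a minimal sketch of plain residual-stream steering, where a scaled direction is added to each hidden state. It is not GCAD. The function names, the toy vectors, and the `alpha` value are assumptions made up for illustration.

```python
# Illustrative residual-stream steering: h' = h + alpha * v.
# All names and numbers are hypothetical; this is NOT the GCAD method.

def steer(hidden, direction, alpha):
    """Add a scaled steering direction to one hidden-state vector."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must have the same length")
    return [h + alpha * d for h, d in zip(hidden, direction)]


def steer_sequence(hidden_states, direction, alpha):
    """Apply the same steering vector at every token position."""
    return [steer(h, direction, alpha) for h in hidden_states]


hidden_states = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]  # toy 2-token, 3-dim states
direction = [0.0, 1.0, 0.0]                          # toy steering direction
steered = steer_sequence(hidden_states, direction, alpha=2.0)
print(steered)
```

In a real model the steered states would also be written into the KV cache and reused on later turns, which is the contamination failure mode the abstract describes.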

RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

Memory is a critical component of robotic intelligence, as robots must rely on past observations and actions to accomplish long-horizon tasks in partially observable environments. However, existing robotic memory benchmarks still lack multimodal annotations for memory formation, provide limited task coverage and structural complexity, and remain restricted to simulation without real-world evaluation. We address this gap with RoboMemArena, a large-scale benchmark of 26 tasks, with average traject...

HuggingFace

Model Merging Scaling Laws in Large Language Models

We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves...

HuggingFace
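
The shape described in the model merging abstract, a size-dependent floor plus a tail with diminishing returns in the number of experts, can be illustrated with a toy function. The functional form and every constant below are assumptions chosen for illustration; the excerpt does not give the paper's actual parameterization.

```python
# Toy "floor + tail" power law. The form and all constants are hypothetical,
# not taken from the paper.

def merged_loss(n_params, n_experts, l_inf=1.5, b_coef=50.0, alpha=0.3,
                a_coef=0.4, beta=1.0):
    """Cross-entropy as a size-dependent floor plus a merging tail."""
    floor = l_inf + b_coef * n_params ** (-alpha)  # shrinks as the model grows
    tail = a_coef / n_experts ** beta              # shrinks as experts are added
    return floor + tail


# Loss for one model size as the number of merged experts increases.
losses = [merged_loss(1e9, k) for k in (1, 2, 3, 4)]
# Improvement from each additional expert.
gains = [a - b for a, b in zip(losses, losses[1:])]
print(losses)
print(gains)
```

Under these assumed constants each added expert lowers the loss by less than the previous one, and a larger `n_params` lowers the floor the curve converges to.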

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

We introduce Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forked and replayed. The system forks the agent process and its filesystem 5× faster than Docker, achieving >95% prompt-cache reuse on replay. We demonstrate the model through three applications. First...

HuggingFace

Can Muon Fine-tune Adam-Pretrained Models?

Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We investigate this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that the mismatch disrupts pretrained knowledge, and that this disruption s...

HuggingFace

Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework...

HuggingFace

Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?

Does a lexical retriever suffice as large language models (LLMs) become more capable in an agentic loop? This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have better reasoning and tool-use abilities. To support researchers asking the same question, we introduce Pi-Serini, a search agent equipped with three tools for retrieving, browsing, and reading documents. Our results show that, on BrowseComp-Plus, a well-configured lex...

HuggingFace
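
For readers unfamiliar with the lexical retriever named in the Pi-Serini abstract, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not Pi-Serini's implementation. The `k1` and `b` defaults, the whitespace tokenization, and the toy corpus are assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Lucene-style IDF, which stays positive even for common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "muon optimizer for pretraining".split(),
    "lexical retrieval with bm25".split(),
    "dense retrieval with embeddings".split(),
]
query = "bm25 lexical retrieval".split()
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

In an agentic loop, a ranking like this would back the retrieval tool, with the LLM deciding which queries to issue and which hits to read.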

DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning

Agent-compiled knowledge bases provide persistent external knowledge for large language model (LLM) agents in open-ended, knowledge-intensive downstream tasks. Yet their quality is systematically limited by incompleteness, incorrectness, and redundancy, manifested as missing evidence or cross-document links, low-confidence or imprecise claims, and ambiguous or coreference resolution issues. Such defects compound under iterative use, degrading retrieval fidelity and downstream task performance. W...

HuggingFace

WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

Commercial video generation systems such as Seedance 2.0 and Veo 3.1 have rapidly improved, strengthening the view that video generators may be evolving into "world simulators." Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose s...

HuggingFace

PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning

Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature ...

HuggingFace

Pixal3D: Pixel-Aligned 3D Generation from Images

Recent advances in 3D generative models have rapidly improved image-to-3D synthesis quality, enabling higher-resolution geometry and more realistic appearance. Yet fidelity, which measures pixel-level faithfulness of the generated 3D asset to the input image, still remains a central bottleneck. We argue this stems from an implicit 2D-3D correspondence issue: most 3D-native generators synthesize shape in canonical space and inject image cues via attention, leaving pixel-to-3D associations ambiguo...

HuggingFace

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter b...

HuggingFace
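
The DECO abstract's point that MoE adds capacity without proportional compute can be seen in a generic top-k routing sketch: only the selected experts run for a given input. This is standard top-k gating, not DECO's architecture. The toy experts, router logits, and function names are assumptions for illustration.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Combine outputs of the k routed experts; the others are never called."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Four toy scalar "experts"; a real MoE layer would use feed-forward networks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token
routes = top_k_route(logits, k=2)
output = moe_forward(3.0, experts, logits, k=2)
print(routes, output)
```

All four experts must still be stored even though only two run per token, which is the storage and memory-access cost the abstract identifies for end-side deployment.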

G-Zero: Self-Play for Open-Ended Generation from Zero Data

Self-evolving LLMs excel in verifiable domains but struggle in open-ended tasks, where reliance on proxy LLM judges introduces capability bottlenecks and reward hacking. To overcome this, we introduce G-Zero, a verifier-free, co-evolutionary framework for autonomous self-improvement. Our core innovation is Hint-δ, an intrinsic reward that quantifies the predictive shift between a Generator model's unassisted response and its response conditioned on a self-generated hint. Using this signal, a Pr...

HuggingFace

Industry News

GM just laid off IT workers to hire those with stronger AI skills

General Motors has laid off IT workers while prioritizing hiring employees with strong artificial intelligence skills, reflecting the automotive industry's strategic shift toward AI-driven capabilities.

RSS

ICE Agents Have List of 20M People on Their iPhones Thanks to Palantir

Immigration and Customs Enforcement agents have access to a list of 20 million individuals stored on their iPhones through data provided by Palantir Technologies.

RSS

'A consistent pattern of lying': trial exposes what insiders think of Sam Altman

Trial proceedings reveal that insiders describe "a consistent pattern of lying" by Sam Altman, raising questions about his credibility and leadership.

RSS

Discussion

I let AI build a tool to help me figure out what was waking me up at night

A user had AI build a diagnostic tool that helped identify what was waking them up at night.

RSS