Cainew

Curated AI news for developers

TL;DR

Model Releases

Google introduces Gemma 4 12B, a compact multimodal AI model that combines text and image understanding without requiring separate encoders for improved efficiency. This unified architecture aims to make advanced multimodal capabilities more accessible for deployment.

RSS

Tools & Products

OpenSquilla — Token-Efficient AI Agent with same budget, higher intelligence density

GitHub

ktx is an executable context layer for data and analytics agents 🐙 Allow Claude Code, Codex, and any AI agent to query data accurately through MCP with skills, memory and a semantic layer

GitHub

Proxy API OpenAI-compatible que usa automação com Playwright para rotear requisições para modelos do Qwen com suporte a múltiplas contas, tools e sessões persistentes.

GitHub

Spectron is agent memory built on one ACID substrate. Graph, vectors, documents, and structured rows commit in one transaction. Every fact carries provenance. Corrections supersede, never overwrite. Hybrid retrieval fuses vectors, graph, BM25, and keywords. Traces feed back into ranking. Tri-temporal facts, multi-tenant scopes, and MCP support. No stitched stores. No sync pipelines.

ProductHunt

Composer is a real-time, multiplayer markdown editor where people and agents can work side-by-side. Instantly share markdown generated by your agent with teammates, edit in real time, leave comments, suggestions, and share context. Your agents join as true collaborators, working directly alongside you and your team.

ProductHunt

A new Linux technique allows users to leverage their Nvidia GPU's VRAM as additional swap space, effectively expanding available memory for system operations. This workaround can improve performance for memory-intensive tasks on systems with limited RAM.

GitHub

Research Papers

Feed-forward models for 3D reconstruction have achieved strong performance using deep cross-view attention to exchange information across images. However, these approaches often depend on heavy decoder stacks and lack a structured mechanism for geometry refinement, resulting in poor multi-view consistency. We address this by drawing inspiration from classical bundle adjustment (BA), which can be viewed as an iterative information propagation process between poses and local geometry. Inspired by ...

HuggingFace

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible but task-incorrect, making it necessary to determine when visual simulation is useful, whether a rollout is credible, and how it s...

HuggingFace

Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic an...

HuggingFace

A core goal of computational social science is to discover interpretable differences in how language varies across outcomes of interest, such as political affiliation or instructional quality. Recent LLM-based hypothesis generation methods describe such differences in natural language, but select for globally discriminative patterns without accounting for covariates that shape the data based on researchers' domain knowledge. When covariates are ignored, selected patterns can reflect confounds ra...

HuggingFace

Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding, as the KV-cache grows. KV-cache quantization can help improve this, but current methods are evaluated under prefill-like settings and errors behave differently under autoregressive decoding. We show that in the latter regime, quantization errors accumulate across timesteps, driven primarily by incorrect token scales. We introduce KVarN, a ...

HuggingFace

As autonomous vehicle capabilities advance, the safe evaluation of driving policies in long-tail scenarios remains a critical bottleneck. In closed-loop simulation, the driving policy model actively interacts with the environment, where its actions dynamically update the simulator state and directly influence the next set of generated sensor observations. While recent reconstruction-based neural simulators offer photorealism, they are fundamentally constrained by their initial captured data and ...

HuggingFace

Modern generative models possess a deep understanding of visual content, yet training them for image editing typically requires massive datasets of paired examples. This limits scalability, especially for video editing where collecting paired data is prohibitively expensive. We propose Bootstrap Your Generator (ByG), a general framework for unpaired training of flow matching editing models. It leverages the base model's knowledge without any external signal. Our approach pairs instruction-follow...

HuggingFace

Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selection-based sparse attention alternatives, which keep the full KV cache. We identify key factors crucial to KV cache eviction accuracy. First, a small fraction of value states have abnormally large magnitudes, and evicting ...

HuggingFace

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achie...

HuggingFace

The past few decades have witnessed significant advances in the design of machine learning algorithms, from early studies on task-specific shallow models to more general deep Large Language Models (LLMs). Despite showing promising results in tasks that require instant prediction or in-context learning, existing models lack the ability to continually learn and effectively transfer their temporal in-context knowledge to their long-term parameters. Inspired by human learning process, we introduce a...

HuggingFace

We introduce PaddleOCR-VL-1.6, an upgraded compact document parsing model built upon PaddleOCR-VL-1.5. Although PaddleOCR-VL-1.5 establishes a strong 0.9B baseline, its remaining errors concentrate in under-optimized regions where model behavior is unstable, data coverage is sparse, or supervision is unreliable. Rather than expanding the training corpus indiscriminately, PaddleOCR-VL-1.6 introduces a region-aware data optimization framework that identifies weak regions from the previous model, a...

HuggingFace

Real-time vision demands models that are accurate, efficient, and simple to deploy across diverse hardware. The YOLO family has become widely deployed for this reason, yet most YOLO detectors still rely on non-maximum suppression at inference, carry heavy detection heads due to Distribution Focal Loss, require long training schedules, and can leave the smallest objects without positive label assignments. We present Ultralytics YOLO26, a unified real-time vision model family that addresses these ...

HuggingFace

Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or rely on distribution assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP). We train a lightweight sampling controller with reinforcement learning (RL) to joi...

HuggingFace

A Stanford Law study found that AI systems outperform law professors in certain legal tasks, suggesting AI has reached competitive capability levels in specialized knowledge domains. This research highlights AI's potential to transform professional services and education.

RSS

Industry News

A survey reveals that over 60% of people are turning to AI systems for mental health and psychological support. This trend highlights growing reliance on AI for mental wellness, raising questions about effectiveness, safety, and the role of human professionals.

RSS

OpenAI outlines its public policy agenda for AI, including safety, youth protection, workforce transition, and global standards to ensure AI benefits society.

OpenAI

Discussion

This piece argues that AI agents need standardized protocols and frameworks similar to RSS for discoverability and interoperability. Establishing such standards could improve how AI agents are discovered, shared, and integrated across platforms.

RSS