AI News Roundup — April 2026
April has been another explosive month for AI.
It was not just “new chatbot, bigger benchmark, better coding model.” It was world models, 3D, voice, video, layout, agents, biological research, motion, memory, and tools that are starting to feel less like demos and more like engines.
The big shift this month was physicality. Not physical like robots everywhere tomorrow, but physical in the sense that models are now trying to understand space, time, objects, screens, hands, motion, voices, games, and worlds.
LLMs
OpenAI introduced GPT-Rosalind, a science-facing model aimed at biology and research reasoning. This one feels different from a normal chat model. It is more about using AI as a research partner for hard scientific problems, not just a general assistant.
Moonshot released Kimi K2.6, continuing the push around long-context, agentic, code-capable models coming out of China. Kimi has become one of the names to watch when the conversation turns to agents that can actually do work.
Xiaomi released MiMo-V2.5-Pro, a huge open Mixture-of-Experts model. This is interesting because Xiaomi is clearly not treating AI as a side experiment. MiMo points toward a broader stack around agents, coding, long context, and open model releases.
DeepSeek returned with DeepSeek V4, one of the bigger LLM releases of the month. The important part is not only benchmark performance. It is the full package: code, long context, agent work, efficiency, and price pressure on everyone else.
Tencent released HY3, and this is worth correcting clearly: HY3 is an LLM, not a 3D model. It belongs here with the core models. It is another large MoE-style release aimed at reasoning, coding, and long-context use.
Alibaba had a packed Qwen month. Qwen3.6 was the broad release, while Qwen3.6-27B and Qwen3.6-35B-A3B showed how much effort is going into strong smaller coding and agent models. Qwen3.5-Omni also matters because it keeps pushing the Qwen family deeper into multimodal work.
Z.AI shipped GLM-5.1 and pushed GLM-5V-Turbo. The GLM direction is worth tracking because it leans into agents, visual-language tasks, and practical deployment instead of only chasing leaderboard screenshots.
Anthropic released Claude Opus 4.7, continuing its focus on coding, long-running work, reliability, and high-quality agent behaviour. Anthropic also had two interesting research and culture notes this month: Claude Mythos / Glasswing, and the research post on emotion concepts. The Claude Code leaked prompt note made the rounds too, but I would treat that more as AI culture/drama than as a core model release.
Google had multiple practical releases. Gemma 4 matters because local and edge models matter. Not everything useful has to be frontier-cloud-scale. Gemini 3.1 Flash Live is more directly about real-time multimodal agents, with voice, video, and live interaction. Gemini 3.1 Flash TTS pushed Google further into controllable speech.
Meta also showed up with Muse Spark / MSL, pointing toward AI systems for creative ideation and multimodal design work.
Image, editing, and vision got more practical
Image models got smarter and more usable this month.
OpenAI’s Image 2.0 was one of the biggest image releases of the month. The important part is not just prettier images. It is usability: stronger prompt following, better text rendering, cleaner layouts, and more reliable results for posters, diagrams, product concepts, UI-style visuals, and marketing assets. That matters because image models are becoming real design tools, not just art toys.
Baidu released ERNIE-Image, one of the strongest open image releases in this cluster. The big thing here is structured visual generation: text inside images, posters, comic panels, storyboards, and layouts. That matters for actual creative work because most real design work is not “make me a pretty fantasy landscape.” It is signs, flyers, event assets, UI concepts, diagrams, posters, and social media graphics.
Google DeepMind’s Vision Banana is one of the most important research signals on the list. The idea is strange but simple: image generators can become general vision learners. If a model can generate and edit images well enough, it may also learn useful internal representations for segmentation, depth, geometry, and other visual tasks.
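To make that concrete, here is a minimal sketch of the general recipe, not anything from the Vision Banana work itself: freeze the generator, pull an intermediate feature map, and train a tiny probe on top of it for a classic vision task. The feature shape, the depth task, and the probe below are all stand-ins.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of "generators as vision learners": a single 1x1 conv
# probe on frozen generator features. Any useful structure has to already
# live in those features for this to work.
class LinearDepthProbe(nn.Module):
    def __init__(self, feat_channels: int):
        super().__init__()
        self.head = nn.Conv2d(feat_channels, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats)  # [B, 1, H, W] predicted depth

probe = LinearDepthProbe(feat_channels=256)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

# Stand-in data: in practice `feats` would come from the frozen generator
# and `depth` from ground-truth labels.
feats = torch.randn(4, 256, 32, 32)
depth = torch.rand(4, 1, 32, 32)

loss = nn.functional.l1_loss(probe(feats), depth)
loss.backward()
opt.step()
print(loss.item())
```

If a probe this small gets good depth or segmentation out of generator features, that is the signal: the generator already learned something about geometry on its own.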
EditCrafter and SpatialEdit point toward a more controllable editing future, where changes are spatial and intentional instead of just prompt-and-hope.
UniGenDet sits in the detection and vision space. TokenLight is worth checking if you follow lighting, tokens, and visual control. Prompt Relay is interesting because the prompt itself starts acting more like a relay between model stages instead of one static instruction.
There were also several useful projects in the image and design layer: FML, PSDesigner, RealRestorer, See-Through, TokenDial, and Gen-Searcher.
The pattern is clear: image AI is becoming less about one-shot generation and more about controlled editing, restoration, search, layout, lighting, and visual reasoning.
Video is merging with performance, physics, and interaction
Video was another huge part of April.
DreamLite is one of the video-generation projects worth checking. AniGen focuses more on animation generation, while Vanast and MMPhysVideo sit closer to video, motion, and physical consistency.
CoInteract was one of the more interesting human-object interaction projects. This is exactly the kind of thing that matters if AI video is going to move beyond clips and into believable scenes. It is not enough that a human appears in a frame. The model has to understand contact, object handling, timing, and how bodies interact with things.
OmniShow also belongs in this video cluster, especially if you care about multimodal controllable generation. Motif VideoShowcase is more product-facing, but still worth tracking because the commercial layer is moving just as fast as the research pages.
Large Performance Model is one of my favorite April signals. It is basically about character performance: not just lip sync, not just talking heads, but listening, speaking, reacting, moving, and maintaining identity over time. That is a big deal for AI hosts, guides, NPCs, interactive installations, live avatars, and game characters.
NVIDIA Research released Kimodo, focused on controllable human motion generation. Anima also sits in the character and animation lane. VGGRPO points toward better training and reward optimization for visual generation.
Alibaba’s Wan 2.7 also belongs here. Wan has become one of those video model names that keeps showing up in creative workflows, and 2.7 adds another step in the broader video generation and editing race.
There were a lot of motion and human-centric projects worth saving too: HandX, LGTM, MegaFlow, Pulse of Motion, ActionPlan, RealMaster, LumosX, daVinci MagiHuman, and Meta Tribe v2.
The big takeaway: video is not just getting prettier. It is getting more performative, more physical, and more connected to characters, motion, and interaction.
World models were the real April story
This was probably the biggest shift.
World models are not just video generators. They are systems that try to model space, state, memory, interaction, and what happens next.
Tencent’s HY-World 2.0 is one of the clearest examples. It is about reconstructing, generating, and simulating 3D worlds, not just making a short video clip.
Alibaba’s Happy Oyster also points in this direction. It is framed around interactive 3D environments where users can direct, explore, and steer generated spaces.
NVIDIA Research’s Lyra 2.0 is another major one. It generates camera-controlled walkthroughs and lifts them toward 3D, which starts to feel less like “AI video” and more like “AI scene exploration.”
GameWorld, MultiWorld, InSpatio-World, Waypoint 1.5, and Matrix Game V3 all point toward the same thing: interactive worlds are becoming a serious AI category.
Video to World and WorldAgents are especially interesting because they connect world modeling to agents and environment understanding. If agents are going to do real work, they need some kind of internal model of the space or task they are operating in.
Then there is VOID, one of the cleanest examples of “video generation is becoming simulation.” VOID is about removing objects from video while accounting for interactions and physical consequences. Not just “paint over the pixels,” but “what should happen if this thing was never there?”
Hybrid Memory for Dynamic Video World Models tackles another key world-model problem: memory. If something leaves the frame and comes back later, the model should remember it. That sounds obvious to humans, but for video models it is still hard.
OpenGame belongs in this broader section too. It is more agentic game creation than world simulation, but it points toward the same convergence: AI systems that can generate playable environments, not just static media.
World models feel like the place where video, games, robotics, 3D, and agents start collapsing into one big category.
3D and spatial intelligence kept quietly getting stronger
3D also had a strong month.
WildDet3D is one of the most practical spatial AI projects here. It predicts 3D bounding boxes from a single image and supports different prompt types. That matters for robotics, AR, object understanding, and any creative system that needs to understand the physical world.
UniMesh tries to unify 3D mesh understanding and generation. This is exactly the kind of direction I want to see more of: not just generating a mesh, but understanding it, editing it, and looping between text, image, and 3D.
UniGeo focuses on 3D indoor object detection and geometry-aware learning. It is more technical, but it belongs here because indoor 3D understanding is important for AR, robotics, scanning, game design, and spatial workflows.
NUMINA also sits in the embodied and visual-spatial lane. RetimeGS and LagerNVS are worth checking if you follow Gaussian splatting, novel view synthesis, and dynamic scene representation.
Audio and voice had a real month too
Audio is getting more controllable and more open.
Google’s Gemini 3.1 Flash TTS is important because it treats speech less like “read this text” and more like directed performance. Tone, timing, mood, speaker setup, and delivery style are becoming part of the prompt.
ACE-Step 1.5 and the ACE-Step 1.5 XL collection are worth checking on the music generation side. This is especially relevant if you care about open music models and creative audio workflows.
OmniVoice points toward more general voice generation. LongCat-AudioDiT is another technical audio model worth watching because diffusion-based audio generation is moving quickly.
PrismAudio sits in the video-to-audio direction, which is going to matter more as AI video becomes more common. A silent video model is not enough. The next step is matching sound, ambience, voice, and timing.
Cohere released Cohere Transcribe, pushing into enterprise speech recognition. That is not as flashy as music or voice acting, but transcription is one of those boring tools that becomes extremely important once you start building real workflows.
Efficiency, compression, and benchmarks mattered more than people think
The less flashy technical side was also busy.
TurboQuant from Google Research is one of the big efficiency stories. Compression matters because every jump in model ability creates a new memory and deployment problem.
RotorQuant is another one to check, especially if you care about KV-cache compression and local inference. It is not a glamorous category, but it is one of the places where “can we actually run this?” gets solved.
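As a rough picture of what KV-cache compression buys you, here is a plain int8 quantization sketch over a toy cache. This is deliberately the simplest possible version, not RotorQuant's actual method; it only shows the shape of the problem and the memory math.

```python
import torch

def quantize_kv_int8(kv: torch.Tensor):
    # Symmetric int8 quantization with one scale per head dimension,
    # computed over the sequence axis. kv: [batch, heads, seq, head_dim].
    scale = kv.abs().amax(dim=-2, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(kv / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

kv = torch.randn(1, 8, 4096, 64)               # toy fp32 cache
q, scale = quantize_kv_int8(kv)
err = (dequantize_kv(q, scale) - kv).abs().mean()
print(err.item())                              # mean reconstruction error
# int8 storage cuts the cache to roughly 25% of fp32 (or 50% of fp16),
# which is often the difference between fitting a long context locally or not.
```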
Ternary Bonsai is also important because it pushes extreme compression with 1.58-bit ternary language models. That matters for edge devices, local AI, education, workshops, and anyone who cannot throw datacenter hardware at every problem.
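If the "1.58-bit" number looks odd, it is just log2(3): a ternary weight takes one of three values, so it carries about 1.58 bits of information. A minimal sketch of the usual absmean-style ternarization, which may or may not be Ternary Bonsai's exact recipe, looks like this:

```python
import math
import torch

# Why "1.58-bit": log2(3) ~ 1.585 bits per ternary weight.
print(math.log2(3))

def ternary_quantize(w: torch.Tensor):
    # Absmean-style ternarization to {-1, 0, +1} plus one floating-point
    # scale per tensor. Generic recipe, not a specific paper's method.
    scale = w.abs().mean().clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -1, 1)
    return q, scale

w = torch.randn(256, 256)
q, scale = ternary_quantize(w)
print(q.unique())                    # tensor([-1., 0., 1.])
print((w - q * scale).abs().mean())  # quantization error
```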
ARC-AGI-3 belongs here as a benchmark signal. Static puzzle solving is one thing. Interactive environments and world-model learning are a much harder test.
CUA Suite is another useful benchmark/tooling signal for computer-use agents. If models are going to operate screens and apps, we need better ways to test that.
Tools and creator workflows
ML-Intern is interesting because it points toward agents that can work inside machine learning workflows, not just write generic code.
Open CoDesign is more of a creative workflow tool than a foundation model, but it belongs in the roundup because it connects AI to collaborative design work. That matters for studios, workshops, and teams that want AI to sit inside a process instead of replacing the process.
The ComfyUI Dynamic VRAM post also matters. This is not a new model, but it is very relevant to creators. Better VRAM handling means bigger workflows, fewer crashes, and more practical local experimentation.