HF Breakout Models, May 11–17: DeepSeek V4, Gemma 4, and the Open TTS Surge

This is the first issue of a weekly digest tracking Hugging Face models with download growth that outpaces the baseline by an order of magnitude. This week's scan covers 33 models across text generation, multimodal reasoning, video synthesis, image generation, and speech — all released or achieving significant traction in the past few weeks. Every model listed here reached its current download count from a near-zero baseline within weeks of release, making exact week-over-week ratios technically unverifiable (Hugging Face doesn't expose per-day or per-week download deltas), but the directional signal is clear: these models are being pulled hard.

How to read the cards: each entry lists the model's Hugging Face ID, current monthly downloads, license, primary task, the one or two deployment paths most relevant to builders, and a bottom-line read on commercial fit. License shorthand: MIT and Apache 2.0 = free commercial use, no strings. OpenRAIL-M = usage restrictions apply, check the model card. Custom / Community = read the fine print before shipping.

LLMs: text generation

Seven text-generation models hit breakout numbers this window. The two biggest — DeepSeek-V4-Pro and DeepSeek-V4-Flash — absorbed the bulk of downloads, but the more interesting signal is the cluster of agentic-coding-focused releases from Zhipu, MiniMax, Xiaomi, Ant Group, and Zyphra.

DeepSeek-V4-Pro

deepseek-ai/DeepSeek-V4-Pro 1

Field	Value
Monthly downloads	3,140,341
License	MIT — commercial use: yes
Parameters	1.6T total / 49B activated (MoE)
Context	1M tokens
Released	May 6, 2026

DeepSeek-V4-Pro is a mixture-of-experts model with 1.6 trillion total parameters, 49 billion active per token, using a hybrid CSA+HCA (Compressed Sparse Attention + Hybrid Context Attention) design and trained with the Muon optimizer. 1 Three reasoning modes ship in the model card: Non-think (fastest), Think High, and Think Max — letting callers trade latency for depth at inference time. On LiveCodeBench v6, the Think Max mode scores 93.5%; on Codeforces it reaches a 3,206 rating.1

DeepSeek's model card claims the Pro variant "requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2." 1 That's a meaningful efficiency story if you're running inference at scale.

Build-on signals: vLLM, SGLang, KTransformers, standard Transformers. Also available on the DeepSeek API platform. The MIT license removes every friction point for shipping commercial products. The 1M-token context opens up long-document workflows that were impractical at 128K.

DeepSeek-V4-Flash

deepseek-ai/DeepSeek-V4-Flash 2

Field	Value
Monthly downloads	1,804,238
License	MIT — commercial use: yes
Parameters	284B total / 13B activated (MoE)
Context	1M tokens
Released	May 6, 2026

The compact sibling of V4-Pro. Flash activates only 13B parameters per token — substantially cheaper to serve. According to the model card, "DeepSeek-V4-Flash-Max achieves comparable reasoning performance to the Pro version when given a larger thinking budget, though its smaller parameter scale naturally places it slightly behind on pure knowledge tasks." 2 Benchmarks: MMLU-Pro (Max) 86.2%, LiveCodeBench v6 91.6%, GPQA Diamond (Max) 88.1%. 2 The ecosystem around it is already significant: 8 finetunes and 49 quantizations available within days of release.

Build-on signals: same as Pro (vLLM / SGLang / KTransformers / Transformers). Flash is the realistic self-hosting option for most indie teams — Pro at 1.6T requires serious multi-GPU infrastructure.

GLM-5.1

zai-org/GLM-5.1 3

Field	Value
Monthly downloads	250,268 (in ~4 days)
License	Not explicitly stated on model card — verify before shipping
Parameters	754B (MoE with DSA)
Released	May 13, 2026

GLM-5.1 from Zhipu AI (z.ai) reached 250k downloads in roughly four days. 3 The model is built for long-horizon agentic tasks — specifically the scenario where other models "exhaust their repertoire early" and stop improving when given more compute. Zhipu AI describes the design goal directly: "Previous models — including GLM-5 — tend to exhaust their repertoire early… GLM-5.1, by contrast, is built to stay effective on agentic tasks over much longer horizons." 3

SWE-bench Pro: 58.4% (highest among open models at time of release). 3 Terminal-Bench 2.0: 63.5%. AIME 2026: 95.3%.

Build-on signals: SGLang, vLLM, xLLM, Transformers, KTransformers. ⚠️ License is not explicitly stated on the model card — check the z.ai GitHub before building commercial products.

MiniMax-M2.7

MiniMaxAI/MiniMax-M2.7 4

Field	Value
Monthly downloads	585,072
License	Proprietary MiniMax license — commercial use via API/platform only
Parameters	229B
Released	April 20, 2026

M2.7 is MiniMax's "self-evolving" model — an internal version of M2.7 "autonomously optimized a programming scaffold over 100+ rounds... achieving a 30% performance improvement." 4 GDPval-AA ELO: 1495 (top among open-weight models at time of release). SWE-bench Pro: 56.2%, MLE-Bench Lite medal rate: 66.6%. 4

Supports Agent Teams for multi-agent collaboration. Inference via vLLM, SGLang, Transformers, NVIDIA NIM, and the MiniMax Agent platform.

Build-on signals: Strong if you're building on the MiniMax platform or API. ⚠️ The proprietary license means you can't self-host for commercial use — the model weights are available, but commercial deployment requires going through MiniMax's platform terms.

MiniMax M2.7 benchmark overview: Agent Teams multi-agent interface

Image from: MiniMaxAI/MiniMax-M2.7 Model Card

ZAYA1-8B

Zyphra/ZAYA1-8B 5

Field	Value
Monthly downloads	144,833
License	Not explicitly stated — verify from Zyphra GitHub before shipping
Parameters	9B total / 760M activated (MoE with CCA attention)
Released	May 6, 2026

Zyphra's ZAYA1-8B activates only 760 million parameters per token despite a 9B parameter count, and was trained entirely on AMD GPUs. On AIME 2026 it scores 89.1% — competitive with models 10–30× its active parameter count. 5 GPQA Diamond: 71.0%, LiveCodeBench v6: 65.8%. The model card describes it as excelling at "detailed long-form reasoning especially for mathematical and coding tasks." 5

Build-on signals: Custom vLLM and Transformers forks required; 13 quantizations available. The small active footprint makes this plausible on consumer hardware. ⚠️ License unconfirmed on model card — check Zyphra's GitHub (github.com/Zyphra) before using commercially.

MiMo-V2.5-Pro

XiaomiMiMo/MiMo-V2.5-Pro 6

Field	Value
Monthly downloads	59,207
License	Not explicitly stated — verify from Xiaomi GitHub
Parameters	1.02T total / 42B activated (MoE, 384 experts)
Context	1M tokens
Released	April 28, 2026

Xiaomi's flagship open model. 1.02T total parameters, 42B active, hybrid sliding-window + global attention with a 3-layer Multi-Token Prediction head. 6 SWE-bench Verified: 78.9%, HLE: 48%, Claw-Eval: 64.0. 6

Build-on signals: SGLang, vLLM. Full deployment requires 32 GPUs — positioned at the research lab / serious infrastructure end of the spectrum, not for solo indie devs without significant compute.

Ring-2.6-1T

inclusionAI/Ring-2.6-1T 7

Field	Value
Monthly downloads	1,468 (released May 15 — 2 days ago)
License	MIT — commercial use: yes
Parameters	1T
Context	256K (via YaRN extension from 128K)
Released	May 15, 2026

Ring-2.6-1T comes from inclusionAI, an Ant Group initiative. 7 It uses an asynchronous reinforcement learning training paradigm called IcePop and ships with two reasoning effort levels (high and xhigh). PinchBench score: 87.60 (above GPT-5.4 xHigh per the model card). 7 AIME 2026 (xhigh): 95.83, ARC-AGI-V2: 66.18.

At 1,468 downloads in two days, this is too early to call — but the MIT license and the Ant Group provenance make it worth watching. Deploy path: SGLang, multi-node (tp=8, pp=4).

Multimodal: vision + language (and audio)

Ten models this window accept image or video input alongside text. Google's Gemma 4 family alone generated over 24 million downloads across its variants, making it the single biggest open-source multimodal release so far this year by raw pull count.

Gemma 4 family (Google DeepMind)

Google DeepMind released four Gemma 4 models on May 7. Three of them hit breakout numbers independently:

google/gemma-4-31B-it — 9,858,626 downloads 8 google/gemma-4-26B-A4B-it — 8,416,904 downloads 9 google/gemma-4-E4B-it — 6,107,009 downloads 10

All three: Apache 2.0, commercial use: yes.

The 31B dense model targets coding and complex reasoning — MMLU-Pro 85.2%, AIME 2026 89.2%, LiveCodeBench v6 80.0%, GPQA Diamond 84.3%. 8 It has 256K context, hybrid sliding-window + global attention, and 156 community finetunes already.

The 26B-A4B-it is the MoE sibling: 25.2B total parameters, only 3.8B activated per token. Google's framing: "This makes it an excellent choice for fast inference compared to the dense 31B model since it runs almost as fast as a 4B-parameter model." 9 Benchmarks are close to the 31B: MMLU-Pro 82.6%, AIME 2026 88.3%, GPQA Diamond 82.3%.

The E4B (8B effective params, 4.5B computation) is the on-device variant with native audio understanding. It accepts text, image, and audio as input, runs on Ollama and llama.cpp, and has 128K context. 10 Benchmarks are lower (MMLU-Pro 69.4%), but it targets laptop/mobile inference where the 31B can't run.

Build-on signals: Transformers, vLLM, SGLang, llama.cpp, Ollama. The E4B in particular opens up fully local multimodal apps — the model runs on current MacBook hardware. Apache 2.0 means no license negotiation required.

Qwen3.6 family (Alibaba)

Qwen/Qwen3.6-35B-A3B — 5,477,343 downloads 11 Qwen/Qwen3.6-27B — 3,413,560 downloads 12

Both: Apache 2.0, commercial use: yes. Both released April 24.

Qwen3.6-35B-A3B is an MoE model: 35B total, 3B active per token, with Gated DeltaNet + MoE hybrid architecture and 256K context (extensible to 1M). 11 SWE-bench Verified: 73.4%, Terminal-Bench 2.0: 51.5%, GPQA Diamond: 86.0%, AIME 2026: 92.7%. 352 quantizations already exist in the ecosystem.

Qwen3.6-27B is the dense flagship: 28B parameters, same 256K-extensible-to-1M context. 12 SWE-bench Verified: 77.2%, Terminal-Bench 2.0: 59.3%, AIME 2026: 94.1% — the SWE-bench number on the 27B dense model is higher than the 35B MoE, which is unusual. Both models preserve thinking traces from historical messages, enabling multi-turn reasoning workflows where the model can recall its own prior chain of thought. 11

Build-on signals: vLLM, SGLang, KTransformers, Transformers, Qwen-Agent, Qwen Code. The agentic coding focus and thinking-trace preservation make these practical for code-generation pipelines where multi-turn context matters.

Kimi-K2.6

moonshotai/Kimi-K2.6 13

Field	Value
Monthly downloads	2,230,311
License	Modified MIT — commercial use: yes (with conditions)
Parameters	1.1T total / 32B activated (MoE, 384 experts)
Context	256K
Released	May 12, 2026

Kimi-K2.6 from Moonshot AI is a native multimodal model: text, image, and video input, text output. The architecture includes MLA (Multi-head Latent Attention) and a 400M MoonViT vision encoder. 13 What distinguishes it from other large MoE models is the agent swarm capability: "Scaling horizontally to 300 sub-agents executing 4,000 coordinated steps, K2.6 can dynamically decompose tasks into parallel, domain-specialized subtasks." 13

HLE-Full (with tools): 54.0% — highest among open models at time of release. BrowseComp: 83.2%, LiveCodeBench v6: 89.6%, AIME 2026: 96.4%. 13

Build-on signals: vLLM, SGLang, KTransformers, Kimi Code CLI, Moonshot API. Native INT4 quantization ships with the model. The agent swarm architecture is the commercial pitch here: multi-step autonomous workflows without external orchestration frameworks.

Nvidia Nemotron-3-Nano-Omni

nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 14

Field	Value
Monthly downloads	~1.24M (across BF16 + FP8 + NVFP4 variants)
License	NVIDIA Open Model Agreement — commercial use: yes
Parameters	31B total / ~3B active (Mamba2-Transformer hybrid MoE)
Context	256K
Released	April 28, 2026

This model accepts video, audio, image, and text simultaneously. The architecture is a Mamba2-Transformer hybrid MoE with CRADIO v4-H vision encoder and Parakeet speech encoder. 14 Targeted use cases: document intelligence (OCR and chart reading), video+speech analysis, GUI/agentic workflows, and ASR with word-level timestamps. The NVFP4 variant runs on a single RTX 5090 at 21GB VRAM.

Build-on signals: vLLM, TensorRT-LLM, TensorRT Edge-LLM (Jetson Thor), SGLang, llama.cpp, Ollama. The Jetson Thor path makes this relevant for edge hardware deployments. NVIDIA's commercial license is permissive but different from MIT/Apache — worth reading if you plan to redistribute.

MiniCPM-V-4.6

openbmb/MiniCPM-V-4.6 15

Field	Value
Monthly downloads	56,518 (first hours — released May 17)
License	Apache 2.0 — commercial use: yes
Parameters	1B (SigLIP2-400M vision + Qwen3.5-0.8B LLM)
Released	May 17, 2026 (today)

OpenBMB's 1B-parameter multimodal model is built specifically for on-device use: iOS, Android, and HarmonyOS demo apps are open-sourced alongside the weights. 15 The model achieves 13 on the Artificial Analysis Intelligence Index benchmark, "outperforming Qwen3.5-0.8B's score of 10 with 19x fewer token cost." 15 Architecture: LLaVA-UHD v4 with mixed 4x/16x visual token compression, over 50% FLOP reduction versus prior versions.

Build-on signals: vLLM, SGLang, llama.cpp, Ollama. GGUF/BNB/AWQ/GPTQ quantized variants available. For mobile app developers who need on-device vision, this is one of very few models small enough to run on current phone hardware with Apache 2.0 licensing.

SenseNova-U1-8B-MoT

sensenova/SenseNova-U1-8B-MoT 16

Field	Value
Monthly downloads	~13,600
License	Apache 2.0 — commercial use: yes
Parameters	~18B (8B understanding + 8B generation MoT)
Released	April 27, 2026 (weights); May 10, 2026 (technical report)

SenseTime's SenseNova-U1 eliminates both the visual encoder (VE) and variational auto-encoder (VAE) that every other multimodal model relies on. The NEO-unify architecture instead models language and vision as a "unified compound," enabling native interleaved image-text generation, image editing with chain-of-thought reasoning, and infographic generation. 16 The model "models think-and-act across language and vision natively" rather than using adapters to translate between modalities. 16

Build-on signals: Custom inference code on GitHub, GGUF available, vLLM/SGLang support in development. Still early-stage from a deployment tooling perspective — but the architectural novelty (no VE, no VAE) makes it worth tracking for anyone building unified image generation + understanding workflows.

Intern-S2-Preview

internlm/Intern-S2-Preview 17

Field	Value
Monthly downloads	1,059 (released May 15 — 2 days ago)
License	Not explicitly stated — verify before shipping
Parameters	36B (continued pretrained from Qwen3.5)
Released	May 15, 2026

Shanghai AI Laboratory's Intern-S2-Preview is the first open-source model with material crystal structure generation capability alongside general language and vision tasks. 17 The task-scaling approach increases the difficulty, diversity, and coverage of scientific tasks during training. It outperforms the larger Intern-S1-Pro on certain scientific benchmarks despite being smaller.

Build-on signals: LMDeploy, vLLM, SGLang, Claude Code integration, official API. Relevant for teams building tools for materials science, chemistry, or computational research. Too new to have meaningful download velocity data yet.

Video generation

The open video generation space solidified this week around the LTX 2.3 base. Three models drove the bulk of activity.

Sulphur-2-base

SulphurAI/Sulphur-2-base 18

Field	Value
Monthly downloads	970,124
License	Apache 2.0 — commercial use: yes
Parameters	9B (based on LTX 2.3 DiT)
Released	~May 8, 2026

SulphurAI's Sulphur-2-base is described on the model card as "an uncensored video generation model based on LTX 2.3 supporting both t2v and i2v natively." 18 It includes a built-in prompt enhancer (served via a GGUF LM). The Apache 2.0 license inherited from the LTX 2.3 base makes it one of the most permissive video generation models available. Four quantizations, eight community Spaces.

Build-on signals: ComfyUI via LTX 2.3 workflows, LM Studio for prompt enhancement. BF16 and FP8 mixed variants available. The large download number suggests this is already the de-facto community baseline for LTX 2.3-based video generation.

LTX2.3-10Eros

TenStrip/LTX2.3-10Eros 19

Field	Value
Monthly downloads	135,648
License	Apache 2.0 (inherited from LTX 2.3) — commercial use: yes
Parameters	21B (BF16 checkpoint with CLIP and VAEs)
Released	~May 10, 2026

A community merge of Sulphur-2-base optimized specifically for image-to-video. TenStrip describes the approach: "This is a different merge attempt for ideal I2V use. It uses layer scaled merges of different steps, it's not a straight weight merge. It behaves much nicer than lora load and respects prompt." 19 SulphurAI endorsed this as the recommended I2V variant of Sulphur-2.

Build-on signals: ComfyUI with 10S-Comfy-nodes (GitHub). FP8 and BF16 variants, Kijai split files. A GGUF quantization from vantagewithai has 62.7k additional downloads. For I2V workflows, this appears to be the community-preferred starting point right now.

Image from: Motif-Technologies/Motif-Video-2B Model Card

Motif-Video-2B

Motif-Technologies/Motif-Video-2B 20

Field	Value
Monthly downloads	3,623
License	Apache 2.0 — commercial use: yes
Parameters	2B
Released	April 14, 2026 (Diffusers support: May 15)

Motif-Video 2B was built with fewer than 10 million training clips and under 100,000 H200 GPU hours. It achieves a VBench total score of 83.76% — higher than Wan2.1-14B (83.69%) at 7 times fewer parameters. 20 The team is candid about the gap: "In our internal side-by-side comparisons against Wan2.1-T2V-14B we observe a perceptual gap in favor of the larger model on temporal stability and fine human anatomy." 20 Output: up to 720p at 121 frames (~5 seconds at 24fps).

Build-on signals: Diffusers (official support added May 15), ComfyUI, GGUF quantized weights. VRAM: ~30GB base, ~15GB with FP8 + CPU offload. The Diffusers integration makes this the easiest of the open video models to drop into an existing Python pipeline right now.

Audio: TTS and ASR

The TTS space had its busiest week yet in terms of model releases. OmniVoice's 600-language coverage stands apart, but the ASR competition is also tightening at the top.

Image from: k2-fsa/OmniVoice Model Card

OmniVoice

k2-fsa/OmniVoice 21

Field	Value
Monthly downloads	2,061,515
License	Open-source (MIT-adjacent) with anti-misuse disclaimer
Parameters	0.6B (Qwen3-0.6B base)
Released	~April 1, 2026

OmniVoice is a zero-shot TTS model supporting over 600 languages, built on a diffusion language model architecture (not a traditional seq2seq TTS). 21 The paper (arXiv 2604.00688) claims "the broadest language coverage among zero-shot TTS models" and "state-of-the-art voice cloning quality from a short reference audio." 21 Inference speed: RTF as low as 0.025 (40× faster than real-time). Supports voice cloning, voice design (gender/age/pitch/accent control), non-verbal markers such as [laughter], and pronunciation correction.

Build-on signals: pip install omnivoice, Hugging Face Spaces (100+ community demos), Google Colab, GitHub. 12 adapters, 25 finetunes, 7 quantizations. For apps targeting non-English speakers, this is the most deployment-ready multilingual TTS option available in open-source today.

Cohere Transcribe

CohereLabs/cohere-transcribe-03-2026 22

Field	Value
Monthly downloads	289,729
License	Apache 2.0 — commercial use: yes
Parameters	2B
Released	March 26, 2026

Cohere Transcribe holds the #1 position on the Open ASR Leaderboard with a mean word error rate of 5.42. 22 The architecture is a large Conformer encoder with a lightweight Transformer decoder, supporting 14 languages. RTFx: 524.88 on the leaderboard — substantially faster than most dedicated ASR models of comparable accuracy. 22

Build-on signals: Native Transformers, vLLM, mlx-audio, transformers.js (WebGPU browser demo), Rust implementation. An iOS app (Whisper Memos) and Android app (Whisperian) already use it. 23 community quantizations. The Apache 2.0 license and top leaderboard position make this the default choice for new ASR pipelines in the 14 supported languages.

Granite Speech 4.1

ibm-granite/granite-speech-4.1-2b 23

Field	Value
Monthly downloads	228,763
License	Apache 2.0 — commercial use: yes
Parameters	2B (16 Conformer blocks + 2-layer Q-former + Granite 4.0 1B LLM backbone)
Released	April 29, 2026

IBM's Granite Speech 4.1 2B ranks #2 on the Open ASR Leaderboard with mean WER 5.33, just ahead of Cohere Transcribe (5.42) and its own predecessor Granite 4.0 Speech (5.52). 23 The model was trained on 174,000 hours of audio from public corpora. Six languages: English, French, German, Spanish, Portuguese, Japanese. It also supports bidirectional speech translation (X→En and En→X), punctuation, truecasing, and keyword list biasing. 23

Two additional variants: granite-speech-4.1-2b-plus (speaker attribution + timestamps) and granite-speech-4.1-2b-nar (non-autoregressive, faster). Apache 2.0 throughout.

Build-on signals: Transformers ≥ 4.52.1, vLLM, speculative decoding. The bidirectional AST support and keyword biasing make it more flexible for enterprise transcription workflows than Cohere Transcribe, which doesn't list those features.

Granite Speech 4.1 — ASR leaderboard accuracy-speed trade-off

Image from: ibm-granite/granite-speech-4.1-2b Model Card

Supertonic 3

Supertone/supertonic-3 24

Field	Value
Monthly downloads	20,208
License	OpenRAIL-M (model) / MIT (code) — check usage restrictions
Parameters	~99M (ONNX assets)
Released	May 6, 2026 (v3)

Supertone's lightweight ONNX TTS expanded from 5 languages to 31 in v3. 24 No GPU required — it runs entirely on CPU via ONNX Runtime. The model card states it "runs fast on CPU, even compared with much larger baselines measured on A100 GPU, and uses substantially less memory." 24 Supports expression tags: <happy>, <sad>, <angry>.

Build-on signals: pip install supertonic, 9 community Spaces. At 99M parameters, this is realistically deployable on serverless edge infrastructure with zero GPU cost. ⚠️ OpenRAIL-M model license has specific usage restrictions — review before commercial deployment.

Supertonic 3 — lightweight CPU TTS with 31 language support

Image from: Supertone/supertonic-3 Model Card

VieNeu-TTS-v2

pnnbao-ump/VieNeu-TTS-v2 25

Field	Value
Monthly downloads	31,360
License	Not explicitly stated — verify before shipping
Parameters	0.3B
Released	May 8, 2026

A Vietnamese-English bilingual TTS with code-switching support — the model can transition between Vietnamese and English within a single synthesis pass. 25 Instant voice cloning from 3–5 seconds of reference audio. Podcast-style generation supported. At 31k downloads in 9 days, it has the fastest pull rate among the smaller audio models this week.

Build-on signals: Small enough for serverless inference. Relevant for apps targeting Vietnamese-speaking users. ⚠️ License not confirmed — check before commercial use.

Also released this week (too early for full velocity assessment):

Dramabox (ResembleAI/Dramabox) 26 — prompt-driven expressive TTS built on LTX-2.3 (IC-LoRA), with neural watermarking (Resemble Perth). 936 downloads in first days. LTX-2 Community License.
Scenema Audio (ScenemaAI/scenema-audio) 27 — zero-shot expressive voice cloning with scene-aware audio (rain, crowds, etc.), 13 languages, 8-step distilled latent diffusion. 209 downloads. LTX-2 Community License.

Both require 24GB VRAM and use the LTX-2 Community License — check the terms carefully if you plan commercial deployment.

Image generation

Image from: HiDream-ai/HiDream-O1-Image Model Card

HiDream-O1-Image

HiDream-ai/HiDream-O1-Image 28

Field	Value
Monthly downloads	14,285 (first ~10 days)
License	MIT — commercial use: yes
Parameters	8B (Pixel-level Unified Transformer, no external VAE)
Released	May 8, 2026

HiDream-O1-Image eliminates the VAE and disjoint text encoder that standard diffusion models rely on. The Pixel-level Unified Transformer (UiT) "natively encodes raw pixels, text, and task-specific conditions in a single shared token space." 28 Native max resolution: 2,048 × 2,048. GenEval Overall: 0.90 (beats FLUX.2 Dev at 0.87). DPG-Bench: 89.83. The Artificial Analysis T2I Arena ranked the Dev-2604 variant at #8 among all models, "positioned to be the new leading open weights Text to Image model." 28

Build-on signals: Python CLI (inference.py), Flask web demo, 11 community Spaces, Diffusers-compatible. Three variants: Full (50-step), Dev (28-step distilled), Dev-2604 (latest). MIT license with no usage restrictions.

Z-Anime

SeeSee21/Z-Anime 29

Field	Value
Monthly downloads	14,991
License	Apache 2.0 — commercial use: yes
Parameters	6B (S3-DiT, full fine-tune on Alibaba Z-Image Base)
Released	April 26, 2026

A full fine-tune — not a LoRA merge — of Alibaba's Z-Image Base architecture, trained specifically for anime-style generation. 29 Supports natural language prompting (not tag lists), full negative prompts, LoRA training ready. VRAM requirements: ~12GB BF16, ~6GB FP8, ~4.2GB GGUF Q4.

Z-Anime — anime-style image generation fine-tune on Z-Image Base

Image from: SeeSee21/Z-Anime Model Card

Build-on signals: ComfyUI (official workflow JSON), Diffusers (ZImagePipeline.from_pretrained()), 6 community Spaces. Distilled 8-step and 4-step variants bring inference time down substantially. The 4.2GB Q4 variant runs on consumer 8GB VRAM cards.

Also notable:

LumiPic (oumoumad/LumiPic) 30 — SDR-to-HDR LoRA based on the LumiVid research (arXiv 2604.11788). Outputs ARRI LogC3-encoded EXR files. Works across Qwen-Image-Edit-2511 and FLUX.2-klein bases. Apache 2.0. 2,322 downloads.
JoyAI-Image-Edit (jdopensource/JoyAI-Image-Edit) 31 — JD.com's spatial image editing model supporting object move, rotation, and camera control via instruction prompts (arXiv 2605.04128). Apache 2.0. 280 downloads.

Quick-scan license table

Model	License	Commercial use	Quickest deploy path
DeepSeek-V4-Pro	MIT	Yes	vLLM / SGLang / DeepSeek API
DeepSeek-V4-Flash	MIT	Yes	vLLM / SGLang
GLM-5.1	Unverified	Verify first	SGLang / vLLM
MiniMax-M2.7	Proprietary	API/platform only	MiniMax platform
ZAYA1-8B	Unverified	Verify first	Custom vLLM fork
MiMo-V2.5-Pro	Unverified	Verify first	SGLang
Ring-2.6-1T	MIT	Yes	SGLang (multi-node)
Gemma-4-31B-it	Apache 2.0	Yes	vLLM / Ollama
Gemma-4-26B-A4B-it	Apache 2.0	Yes	vLLM / Ollama
Gemma-4-E4B-it	Apache 2.0	Yes	Ollama / llama.cpp
Qwen3.6-35B-A3B	Apache 2.0	Yes	vLLM / SGLang
Qwen3.6-27B	Apache 2.0	Yes	vLLM / SGLang
Kimi-K2.6	Modified MIT	Yes (with conditions)	vLLM / Kimi Code CLI
Nemotron-3-Nano-Omni	NVIDIA Open Model	Yes	vLLM / TensorRT / Ollama
MiniCPM-V-4.6	Apache 2.0	Yes	Ollama / llama.cpp
SenseNova-U1-8B-MoT	Apache 2.0	Yes	GitHub custom inference
Intern-S2-Preview	Unverified	Verify first	LMDeploy / vLLM
Sulphur-2-base	Apache 2.0	Yes	ComfyUI
LTX2.3-10Eros	Apache 2.0	Yes	ComfyUI
Motif-Video-2B	Apache 2.0	Yes	Diffusers
OmniVoice	MIT-adjacent	Yes (anti-misuse)	pip install omnivoice
Cohere Transcribe	Apache 2.0	Yes	Transformers / vLLM
Granite Speech 4.1	Apache 2.0	Yes	Transformers
Supertonic 3	OpenRAIL-M	Check restrictions	pip install supertonic
VieNeu-TTS-v2	Unverified	Verify first	—
Dramabox	LTX-2 Community	Check terms	Python SDK
Scenema Audio	LTX-2 Community	Check terms	Python SDK
HiDream-O1-Image	MIT	Yes	Diffusers / Spaces
Z-Anime	Apache 2.0	Yes	ComfyUI / Diffusers
LumiPic	Apache 2.0*	Yes	ComfyUI
JoyAI-Image-Edit	Apache 2.0	Yes	GitHub inference
RE-USE	NVIDIA NSCLv1	Non-commercial	Python SDK
LocalVQE	Apache 2.0	Yes	GGML / Python SDK

*LumiPic license is Apache 2.0-compatible per the model card; the base models have their own licenses.

On the watch list

Two models released in the past two days have download counts too low for meaningful velocity signals yet:

Ring-2.6-1T (1,468 downloads, May 15) — MIT license, Ant Group provenance, strong benchmark numbers. The async RL training paradigm and dual reasoning-effort levels are novel. Check next week.
Intern-S2-Preview (1,059 downloads, May 15) — material science applications are a narrow vertical, but the "crystal structure generation" capability has no open-source competitor. License unconfirmed.

Two areas remain unscanned from this run: text-to-3D pipelines (TencentARC/Pixal3D flagged as having traction) and text-to-audio/music pipelines. Both will be included in next week's scan.

Also worth noting: the audio-to-audio category produced two models not covered in full cards above:

RE-USE (nvidia/RE-USE) 32 — 9.6M-parameter universal speech enhancement (8–48kHz). NVIDIA NSCLv1 license means non-commercial use only. 3,516 downloads.
LocalVQE v1.2 (LocalAI-io/LocalVQE) 33 — 1.3M-parameter real-time AEC + noise suppression + dereverberation running ~10× realtime on CPU. Apache 2.0, 1,978 downloads. Built for real-time voice call quality.