Meituan pushes open avatar video deeper into startup territory - Startup Fortune

Meituan has released LongCat-Video-Avatar 1.5, an open avatar video model aimed at realistic audio-driven digital humans. The release raises a sharper question for AI video startups: whether talking-head generation is becoming a commodity, while value moves toward workflow, trust, and compliance.

Meituan has released LongCat-Video-Avatar 1.5, and the important part is not just better talking heads. It is that a delivery and local services giant is now putting pressure on a market many startups hoped to own.

Meituan is not the first name most people outside China associate with synthetic video. That is exactly why LongCat-Video-Avatar 1.5 matters. The company behind one of China’s largest local commerce platforms has open-sourced an upgraded avatar video framework for audio-driven digital humans. It covers single-speaker clips, multi-person conversations, singing, animated characters, and longer video continuation.

The release appeared on Hugging Face and developer channels this week, with the Meituan LongCat team presenting it as a production-ready upgrade rather than a research toy. According to the model card on Hugging Face, LongCat-Video-Avatar 1.5 uses Whisper-large-v3 rather than Wav2Vec2 as its audio encoder, supports Audio-Text-to-Video, Audio-Text-Image-to-Video, and video continuation, and speeds up generation with 8-step inference through DMD2-based distillation.

That combination is the real story. Talking avatar products have been around for years, but the category has often been split between polished commercial tools and open models that require patience, hardware, and a tolerance for strange facial movement. Meituan is trying to narrow that gap. Better lip sync, steadier identity preservation, and cleaner full-body motion do not sound dramatic until you remember what businesses actually use these tools for: customer training, product explainers, sales videos, education, internal communications, and creator content that has to look good enough to publish.

Meituan’s move should not be read as a random side project. Large internet platforms have the data pipelines, engineering depth, and infrastructure discipline that modern generative video needs. They also understand high-volume consumer behavior. A company that handles commerce, delivery, travel, advertising, local services, and merchants has plenty of reasons to care about scalable video creation.

For small businesses, video has become a tax on attention. Restaurants, shops, tutors, agencies, and service providers all need more video than they can reasonably produce with people, cameras, scripts, and editing time. If avatar generation becomes reliable enough, a merchant can turn a product update or seasonal offer into a localized presenter video without booking a studio. That is not futuristic. It is simply cheaper.

This is where Chinese platform companies have an advantage that Western observers sometimes underestimate. They are not just building models for benchmark screenshots. They can test the economics of content generation against real commercial demand. If a model helps merchants sell more, explain more, or support customers faster, the use case does not need much philosophical defense.

LongCat-Video-Avatar 1.5 also lands in a market where Western firms such as HeyGen have been moving quickly. HeyGen announced Avatar V in April 2026 and has been pushing identity-preserving avatar generation as a premium enterprise and creator product. Meituan’s project page compares LongCat-Video-Avatar 1.5 with HeyGen, Kling Avatar 2.0, and OmniHuman-1.5, focusing on stability, consistency, and lip motion. Those comparisons should be treated as vendor claims, but the direction is clear. The open side of the market is no longer satisfied with being visibly behind.

The Startup Problem

For AI video startups, the uncomfortable question is whether avatar generation can remain a high-margin product when capable models keep moving into open or semi-open distribution. If the basic ability to animate a presenter from audio and an image becomes broadly available, the value shifts away from the model itself and toward workflow, trust, compliance, distribution, editing, and enterprise controls.

That does not kill the startup opportunity. It changes it. A business user usually does not want a model checkpoint. They want a secure workspace, brand controls, team approvals, voice management, translation, consent records, and output that will not create legal trouble. Startups that package those layers well can still win. Startups that only charge for access to a talking-head generator will face a harder market.

The MIT license on the Hugging Face release makes that pressure more direct, because developers can use and modify the model commercially, though the usual practical barriers remain. Running this kind of model still takes technical skill and meaningful compute. LongCat’s own instructions reference CUDA, PyTorch, FlashAttention, INT8 options, and separate workflows for single-person and multi-person generation. That is not the same thing as a browser tool anyone can use in five minutes.

Still, open releases have a habit of moving from difficult to routine. ComfyUI support, hosted demos, wrappers, and smaller optimized versions tend to follow fast when developers are interested. Once that happens, agencies and software builders can assemble products on top of the model rather than paying for every second of generated avatar video from a closed vendor.

Regulation Is Coming Alongside Adoption

The timing also matters because synthetic media is entering a more regulated phase. China already requires labeling for AI-generated synthetic content distributed online, while the European Union is moving toward transparency obligations for AI-generated content under the AI Act. For avatar video, this is not a side issue. A realistic digital spokesperson can be useful, but it can also mislead people quickly if provenance, consent, and labeling are weak.

That may favor companies that build compliance into the product from the start. Enterprise buyers will want to know whether a generated video includes watermarks, metadata, identity permissions, and audit trails. Consumers may not inspect those details, but regulators and brand lawyers will. A model can be impressive and still be unusable for serious businesses if the surrounding governance is thin.

LongCat-Video-Avatar 1.5 is therefore less about one model beating another than about the next stage of the avatar market. The technology is becoming cheaper, more available, and more global. The winners will be the companies that turn that capability into trusted production systems. Watch how fast developers build around Meituan’s release, and watch how quickly commercial avatar platforms respond on price, quality, and compliance. That will tell us whether talking-head video is still a premium software category, or whether it is becoming another feature inside the broader content stack.
