Wan

Open source

Alibaba Cloud's open-source AI suite for generating and editing images and videos with precise control over text, color, and characters.

Paid · video · #image-generation #video-generation #open-source #api

Contains affiliate link

Our score: 8.0

Quick verdict

Wan 2.7 delivers powerful open-source image and video generation with deep multi-image control, though pricing transparency is lacking.

At a glance

Best for: Creative studios and developers needing controllable AI video and image generation
Not for: Buyers requiring transparent self-serve pricing before signup
Standout feature: Multi-image reference with up to 9-image fusion
Pricing range: Not disclosed in source
Free tier: Yes (quotas and limits not disclosed in source)
Primary use case: Controlled video and image generation with references

What is Wan?

Wan (万相) is Alibaba Cloud’s flagship generative AI platform for images and video, anchored by the Wan 2.7 model suite. It offers text-to-image, image-to-image, text-to-video, and video-to-video generation alongside advanced editing tools, and it targets professional creators, marketing agencies, e-commerce operators, and developers who need fine-grained control over visual outputs rather than simple one-shot generation.

Unusually for a major cloud vendor, Alibaba Cloud has open-sourced the Wan models, hosting weights on GitHub while also offering a managed API and a web interface through its Alibaba Cloud console. Teams can therefore either self-host for maximum data sovereignty and customization or consume the service via API for scalability. The open-source release also lowers the barrier for academic researchers and independent developers who want to experiment with or fine-tune the underlying architecture.

The marketing emphasis is on controllability: precise color proportions, multi-image reference fusion, pixel-level editing, and temporal continuity in video. That suggests a product philosophy that treats AI as a production tool rather than a novelty. With text rendering in twelve languages, chart rendering, and video narrative extension, Wan is aimed at global commercial use cases ranging from advertising campaign development to interactive media production.

Integration with the broader Alibaba Cloud ecosystem implies access to enterprise-grade security, storage, and billing infrastructure, but it also means users must navigate Alibaba’s account and console framework. For organizations already embedded in Alibaba Cloud services, Wan is a natural extension of their existing stack.

How it works

Users access Wan through two primary channels: a browser-based creative console labeled “立即体验” (Try Now) and a developer-facing API platform.

In the web UI, creators select a modality, such as image generation, video editing, or portrait customization, and provide prompts alongside optional reference assets. For image tasks, users can upload up to nine reference images to guide composition, style, and subject consistency, or use the interactive box-selection tool to restrict edits to specific regions. For video, the system accepts multi-dimensional instruction prompts that modify plot, environment, and camera work, while the temporal extension feature lets users anchor generation to a starting frame, an ending frame, or both to ensure narrative continuity. The “creative video replication” mode analyzes a reference video’s motion dynamics, effects, and camera movements and applies them to new content.

Developers consuming the API integrate these same capabilities programmatically, likely through standard REST endpoints authenticated via Alibaba Cloud credentials, though the scraped content did not specify protocol details. Outputs appear to be managed within the Alibaba Cloud ecosystem, implying that storage, billing, and quota management are handled through the vendor’s existing cloud console.

The workflow is inherently asset-heavy: instead of relying solely on text prompts, Wan expects users to supply reference images, frame anchors, or source videos to steer the model. This design rewards users who maintain organized media libraries and understand visual storytelling structure, but it may feel complex to novices accustomed to single-prompt generation. Once assets are uploaded, the model processes them through the selected pipeline (image fusion, video continuation, or portrait customization) and returns results that can be iteratively refined using the same control tools.
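Because the source does not document the API, the following is only a client-side sketch of the nine-image reference limit described above. The `ImageJob` class, its field names, and the payload keys are all hypothetical and are not taken from Alibaba Cloud documentation.

```python
from dataclasses import dataclass, field

# Limit stated in this review; the real service may enforce it differently.
MAX_REFERENCE_IMAGES = 9


@dataclass
class ImageJob:
    """Hypothetical container for one reference-guided image request."""
    prompt: str
    reference_images: list[str] = field(default_factory=list)

    def to_payload(self) -> dict:
        """Validate the job and return a plain dict (key names are illustrative)."""
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if len(self.reference_images) > MAX_REFERENCE_IMAGES:
            raise ValueError(
                f"at most {MAX_REFERENCE_IMAGES} reference images allowed, "
                f"got {len(self.reference_images)}"
            )
        return {
            "prompt": self.prompt,
            "reference_images": list(self.reference_images),
        }
```

Checking the limit before upload avoids spending a request on a job that would presumably be rejected; the actual request format should be taken from the vendor's API documentation.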

Key features

01 Multi-dimensional video editing with multi-image reference

Edit existing video through natural-language instructions that target plot, environment, and visual style simultaneously. The system supports multi-image reference for precise control over subjects and scenes, enabling creators to reshoot concepts without a physical camera. For example, a director can change a scene from daytime to rainy night, add a background character, and adjust the color grade using only text prompts and reference stills. This matters for post-production teams that need to alter backgrounds or insert characters while preserving motion integrity, dramatically reducing the cost of reshoots and visual effects compositing.

02 Creative video replication with motion and camera transfer

Upload a reference video and transfer its dynamic characteristics—including action choreography, special effects, and camera movements—to new footage. This allows creators to clone the cinematic grammar of a source clip onto different subjects or settings, dramatically speeding up stylistic iteration for advertising and short-form content. For instance, a brand could replicate the sweeping camera motion and glow effects of a flagship product video across an entire SKU catalog without re-filming each item. The result is a unified visual language across multiple assets that would normally require expensive motion-capture or manual keyframe work.

03 Temporal extension with first and last frame anchoring

Control video narrative flow by fixing a first frame, last frame, or both, then generating the intervening or subsequent footage. The continuation modes include simple extension, extension plus tail frame, and open-ended generation, giving editors timeline precision that reduces the need for manual stitching. A filmmaker can provide an opening establishing shot and a closing reaction shot, then let Wan generate the middle sequence to bridge the two moments with coherent motion and lighting. This capability turns the model into a generative in-betweening tool, useful for animatics, storyboard previsualization, and social-media clip expansion.
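The anchoring logic can be pictured as a small decision over which frames are supplied. This is a minimal sketch: the function name and the returned labels are hypothetical and do not correspond to Wan's own mode names or parameters.

```python
from typing import Optional


def plan_anchoring(first_frame: Optional[str], last_frame: Optional[str]) -> str:
    """Return an illustrative label for the generation strategy implied by the anchors."""
    if first_frame and last_frame:
        return "bridge"      # generate the footage between the two anchors
    if first_frame:
        return "extend"      # continue onward from the opening frame
    if last_frame:
        return "lead_in"     # generate footage that arrives at the closing frame
    return "unanchored"      # no frame constraints; prompt-only generation
```

For the filmmaker example above, supplying both an establishing shot and a reaction shot would fall into the first branch.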

04 Ultra-strong text rendering in 12 languages and charts

Generate images containing readable text in up to twelve languages, as well as charts, mathematical formulas, and infographics. Unlike many diffusion models that garble typography, Wan targets information-design use cases such as presentation visuals, educational content, and product infographics. A marketing team could generate a slide containing bilingual headers, a bar chart, and body copy in a single pass, then refine the layout using the box-selection editor instead of rewriting the prompt dozens of times. This stability makes the tool practical for business communications where accuracy matters more than artistic abstraction.

05 Precise color proportion control for brand consistency

Specify exact color distributions before generation rather than relying on prompt luck. This “no color blind box” approach lets brand teams lock Pantone-level consistency across campaign assets, ensuring that generated images match corporate identity requirements without extensive post-processing. For example, a retailer launching a seasonal sale can force 60 percent crimson, 30 percent white, and 10 percent gold across every generated banner, eliminating the random hue shifts common in standard diffusion samplers. The result is a predictable palette that aligns with strict brand guidelines.
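To make the 60/30/10 example concrete, here is a small sketch that checks a palette sums to 100 percent and converts it to approximate pixel counts for a given canvas. The `pixel_budget` helper is hypothetical; how Wan itself accepts or enforces color proportions is not described in the source.

```python
def pixel_budget(proportions: dict[str, float], width: int, height: int) -> dict[str, int]:
    """Convert percentage shares per color into approximate pixel counts."""
    if any(share < 0 for share in proportions.values()):
        raise ValueError("shares must be non-negative")
    total = sum(proportions.values())
    if abs(total - 100.0) > 1e-6:
        raise ValueError(f"shares must sum to 100, got {total}")
    pixels = width * height
    # Rounding each share independently means the counts may differ
    # from the canvas total by a few pixels for awkward percentages.
    return {name: round(pixels * share / 100) for name, share in proportions.items()}


banner = pixel_budget({"crimson": 60, "white": 30, "gold": 10}, 1200, 628)
# 1200 x 628 = 753,600 pixels: 452,160 crimson, 226,080 white, 75,360 gold
```

A check like this is also usable after generation, by comparing the requested budget against a measured color histogram of the output.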

06 Interactive box-selection editing for pixel-level changes

Draw a bounding box over any region of an image and apply edits only within that area. This pixel-level alignment tool lets users add objects, change textures, or modify facial features without regenerating the entire canvas, saving time and preserving background detail. A portrait photographer could refine a subject’s eye color or add earrings inside a masked region while keeping the original lighting, hair, and clothing untouched, maintaining production continuity across a large set of assets. It effectively brings Photoshop-style masking into a generative workflow without requiring external compositing software.
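Conceptually, a box selection is a binary mask: pixels inside the box are eligible for editing and everything else is preserved. The sketch below illustrates that idea in plain Python; the `box_mask` helper and its `(left, top, right, bottom)` convention are assumptions, and Wan's internal representation is not documented in the source.

```python
def box_mask(width: int, height: int, box: tuple[int, int, int, int]) -> list[list[int]]:
    """Build a row-major mask with 1 inside the box and 0 outside.

    The box is (left, top, right, bottom), with right and bottom exclusive.
    """
    left, top, right, bottom = box
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError("box must lie inside the image and have positive area")
    return [
        [1 if left <= x < right and top <= y < bottom else 0 for x in range(width)]
        for y in range(height)
    ]
```

In the portrait example, the mask would cover only the eyes or ears, which is why the lighting, hair, and clothing outside the box stay untouched.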

Pricing breakdown

Web Experience

Price: Not specified

Creators exploring Wan 2.7 capabilities before committing to integration.

Generation quotas not disclosed in source
Requires Alibaba Cloud account login
Feature set may be subset of API

API (Popular)

Price: Not specified

Developers building image and video generation into production applications.

Per-request pricing not disclosed
Rate limits not visible in source
Requires Alibaba Cloud API credentials

Reality check: The scraped homepage did not contain specific per-image, per-video, or token-based pricing, nor did it list subscription tiers, overage penalties, or free-credit allowances. Buyers should verify current rates on the Alibaba Cloud API platform before budgeting.

Pros & cons

What works

+ Supports up to 9-image fusion for complex compositional reference control
+ First/last frame anchoring gives precise temporal continuity in video generation
+ Accurately renders text, charts, formulas, and 12 languages inside images
+ Interactive box-selection editing enables pixel-level regional modifications
+ Open-source model weights available alongside managed cloud API access

What doesn't

− Pricing tables and API rate limits absent from public marketing page
− No disclosed maximum video duration, resolution, or frame-rate specifications
− Commercial licensing terms and usage rights not visible in scraped content
− Full feature access appears gated behind Alibaba Cloud account ecosystem

Best use cases

Creative studios and ad agencies

Perfect fit

Multi-image reference, precise color control, and video replication align directly with campaign production pipelines that demand brand consistency.

Comic and storyboard artists

Perfect fit

The 12-image consecutive group generation mode supports visual narrative sequencing from a single aesthetic foundation.

E-commerce product designers

Good fit

Strong text rendering and color accuracy help generate product infographics and branded assets, though SKU integration depends on API workflow.

Application developers

Good fit

Open-source weights and API access provide flexible deployment options, but pricing transparency is needed for scalable architecture planning.

Casual hobbyists

Mixed fit

The feature depth is powerful, but the Alibaba Cloud account requirement and lack of visible free-tier details may create friction for occasional users.

Who should skip Wan

Honest no-go cases — save your trial period.

→ Buyers who need transparent self-serve pricing without contacting sales
→ Users requiring public documentation on maximum video length and resolution before signup
→ Teams wanting simple one-click generation without uploading reference images or frames
→ Projects needing clearly stated commercial usage rights on the marketing page

Alternatives to consider

Runway Gen-3 Alpha
Pick it when: You need a polished English-first video editing UI with known subscription tiers and community tutorials.
Skip it when: You require open-source model weights, multi-image reference control, or Alibaba Cloud ecosystem integration.

Kling AI
Pick it when: You want Chinese-market-optimized video generation with strong human motion and physical simulation.
Skip it when: You need precise color proportion control, 12-language text rendering, or open-source deployment.

Stable Diffusion / Stable Video Diffusion
Pick it when: You need fully self-hosted, community-driven open-source image and video generation with no vendor lock-in.
Skip it when: You want integrated managed APIs, interactive box-editing tools, and video replication in one platform.

Midjourney
Pick it when: You prioritize high-fidelity artistic static images and an active Discord-based creative community.
Skip it when: You need video generation, precise regional editing, text/chart rendering, or API-first integration.

Frequently asked questions

Is Wan open source?

Yes. Alibaba Cloud open-sourced the Wan models; weights are available on GitHub, while the company also offers a managed API and web interface.

How many reference images can I use?

Wan supports up to nine images for creative fusion and reference, plus up to twelve consecutive images for continuous group storytelling.

Can Wan generate readable text and charts inside images?

Yes. The model supports ultra-long text, twelve languages, charts, mathematical formulas, and infographics with stable output.

Does Wan support video continuation?

Yes. It offers first-frame anchoring, last-frame anchoring, open-ended continuation, and continuation-plus-tail-frame modes.

Can I edit only a specific part of an image?

Yes. Interactive box-selection editing lets you draw a boundary and apply changes only inside that region for pixel-level precision.

What does the API cost?

Specific pricing was not present in the scraped marketing page. Contact Alibaba Cloud or visit the API platform for current rates and quotas.

The bottom line

Wan 2.7 is an easy recommendation for professional creative teams, developers, and e-commerce operators who need controllable, reference-driven image and video generation. Its standout combination of multi-image fusion, pixel-level box editing, temporal frame anchoring, and robust text rendering places it among the most feature-complete generative media suites currently available from a major cloud provider. The open-source release is a significant differentiator, giving enterprises the option to self-host sensitive workloads while still benefiting from Alibaba Cloud’s managed infrastructure.

However, the lack of visible pricing, rate limits, and output specifications on the public marketing page creates procurement friction. Buyers evaluating Wan against tools like Runway or Kling will need to contact Alibaba Cloud sales or navigate the API console to understand true costs. Additionally, teams outside the Alibaba Cloud ecosystem should factor in integration overhead.

You should adopt Wan if your workflow demands precise visual control, such as brand-locked color palettes, multi-character consistency, or video narrative extension, and you are comfortable within Alibaba Cloud’s infrastructure. You should skip it if you need immediate pricing transparency or simple no-account access, or if your stack requires tight integration with non-Alibaba environments and you do not want to self-host the open-source weights. Our assessment would change if the public pages added clear per-unit pricing, disclosed maximum video durations and resolutions, and clarified commercial licensing terms without requiring a sales conversation.
