Keling VS Jimeng: A Preliminary Exploration of 'Multimodality'

11/04 2025

Currently, the two hottest AI-generated video platforms in China are undoubtedly Keling and Jimeng.

As a film and television outsider and AI enthusiast, I decided to form a pure AI 'film and television team' to see how it would perform.

Before starting, there was one more question: text-to-image + image-to-video or text-to-video?

Both platforms offer these functions, so which path should be taken?

AI's answer: Using the 'text-to-image + image-to-video' method offers higher controllability, while the 'text-to-video' method makes the video more 'dynamic.'

Considering cost and efficiency, I chose to prioritize controllability.
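The two routes can be sketched as a tiny pipeline. This is purely illustrative: `generate_image`, `animate_image`, and `generate_video` are hypothetical stand-ins for whatever API a platform exposes, not actual Keling or Jimeng functions.

```python
# Hypothetical sketch of the two generation routes. All three
# generators below are stubs; neither Keling nor Jimeng publishes
# these function names.

def generate_image(prompt: str) -> str:
    """Stub: text-to-image. Returns a fake asset id."""
    return f"img:{hash(prompt) % 10000}"

def animate_image(image_id: str, motion_prompt: str) -> str:
    """Stub: image-to-video, animating a fixed first frame."""
    return f"vid:{image_id}:{hash(motion_prompt) % 10000}"

def generate_video(prompt: str) -> str:
    """Stub: direct text-to-video."""
    return f"vid:{hash(prompt) % 10000}"

def route_controllable(scene: str, motion: str) -> str:
    # Text-to-image + image-to-video: the first frame is locked in
    # and reviewed before any motion is generated.
    first_frame = generate_image(scene)
    return animate_image(first_frame, motion)

def route_dynamic(scene: str, motion: str) -> str:
    # Text-to-video: composition and motion are both left to the model.
    return generate_video(f"{scene} {motion}")
```

The point of the first route is that the first frame becomes a checkpoint: a bad composition can be rejected before spending credits on video generation.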

01 Step 1: AI Screenwriter, Writing the Script

To make a movie, you first need a screenwriter to write the script.

I took a previously published article from my public account and sent it as a PDF to Gemini 2.5 Pro, widely recognized for its powerful performance.

I have to admit that, in terms of writing, AI is more than capable of being a screenwriter.

The storyboard script was well-written, something I, as a layman, could never have done.

Especially the image-to-video instructions, which covered professional aspects like scene, action, camera movement, and style, solved a major headache.

02 Step 2: AI Artist, Drawing the 'Storyboard'

As mentioned earlier, using 'text-to-image + image-to-video' enhances controllability.

Now that the script is ready, the next task is to draw the 'storyboard.'

I handed over the first-frame image instructions written by the AI screenwriter to Tencent Hunyuan, an AI text-to-image model.

Compared to text, AI's ability in the image domain is noticeably weaker.

Fortunately, Hunyuan's artistic skills were reliable, and most of the images it produced were of good quality, meeting expectations.

Of course, there were also some 'flawed' results:

For example, 'A horizontal glowing progress bar at the bottom of the screen, with the slider at the starting position. The background is blurred colorful light.'

AI clearly failed to understand the instructions.

And, 'A highly precise automotive production line with countless robotic arms working in sync, sparks flying during welding, full of industrial beauty.'

This time, it was a logical error—a group of robotic arms seemed to be destroying a finished car.

03 Step 3: AI Director, Making the Images 'Move'

I sent the static images and image-to-video instructions generated earlier to the directors of this shoot—Keling and Jimeng.

However, soon after 'filming' began, I ran into a gap between expectation and reality.

Initially, I was quite satisfied with the few 5-second videos generated by AI.

After all, the dynamic effects and lighting changes were impressive to a layman, and the footage was very smooth.

But as more videos were generated, the directors began to reveal their flaws, each producing some strange results.

Problem 1: 'The Director Doesn't Follow the Script'

This was the most common issue—intolerable 'disobedience.'

Let's look at a laughable example:

Image-to-video instruction:

Scene description and action: The car's headlights activate, starting as a thin line and then suddenly brightening, emitting a sharp beam. A faint energy glow flows along the car's aerodynamic lines.

Camera movement: Slow and dramatic tilt-up shot, starting from the front wheel and moving up to the windshield, making the car feel powerful.

Visual style and texture: 'Hero close-up.' Cinematic, refined, high-end. Add a slight lens flare effect.

The instructions clearly stated that the car's headlights activate, but in Keling's generated video, a beam shot out from the middle of the car, which was somewhat puzzling.

In comparison, Jimeng's video effect was slightly better.

Problem 2: Physical and Logical Errors That 'Would Make Newton Silent'

AI is adept at solving physics problems, but when generating videos, it seems not to have fully mastered the physical rules of the real world.

'Object penetration' was a common issue, with both Jimeng and Keling generating videos that had this problem, such as:

Image-to-video instruction:

Scene description and action: All machines work with astonishing speed and perfect coordination, demonstrating extreme efficiency. Robotic arms grab packages, while unmanned vehicles smoothly avoid and navigate.

Camera movement: A long, smooth tracking shot (long take) inside the warehouse, showing the entire process in one continuous shot.

Visual style and texture: Industrial aesthetics, technological, orderly. Clean footage, smooth movements.

Additionally, physical motion was sometimes rendered implausibly:

Image-to-video instruction:

Scene description and action: A shiny golden stone is thrown into the water, creating large, vibrant, colorful ripples that spread rapidly, instantly illuminating the entire water surface.

Camera movement: Top-down view, slow zoom-in.

Visual style and texture: Poetic, joyful. Use the burst of ripples to symbolize the instant release of dopamine.

In Jimeng's generated video, the golden stone didn't get thrown in but rather emerged directly from the water:

Keling, on the other hand, completed the instruction better:

Problem 3: Short-Term Amnesia

AI-generated videos have a major flaw—their consistency is extremely poor.

Within just 5 seconds, the protagonist of a shot could undergo significant changes. For example:

Scene description and action: The minute hand on the dial rotates smoothly at high speed. As soon as it stops, a soft, glowing pulse animation appears on the watch screen.

Camera movement: Static close-up.

Visual style and texture: Modern, simple, efficient. The pulse animation is crisp, representing a 'delivery' reminder.

In Jimeng's generated video, let's not discuss how well the 'high speed' and 'pulse animation' were realized—the dial itself changed completely:

The same issue didn't occur in Keling's generated video:

If consistency can't be guaranteed within such a short time, the overall viewing experience of the video will certainly suffer.

04 Usage Experience

AI image-to-video platforms are indeed powerful tools, but they are not yet qualified directors.

First, let's talk about Keling. Its performance was relatively better.

It did a decent job simulating the physical world and creating dynamic realism, generally conforming to real-world physics.

Secondly, Keling had a deeper understanding of the concepts in the instructions and was more capable of artistic interpretation. It could grasp not just the literal meaning but also the abstract concepts and emotions behind the text to a certain extent.

Moreover, Keling didn't seem to be a rigid machine that only did what it was told. Some scenes in its generated videos weren't explicitly described in the text but still reflected the theme to varying degrees.

In this sense, it leaned more toward being an 'artist' willing to push boundaries and experiment.

Additionally, according to feedback from 'film critic' Gemini 2.5 Pro, Keling demonstrated strong mastery of camera language, successfully executing complex camera movements like 'push-pull zoom' and 'tilt up.'

For Keling's generation of abstract CG scenes, the film critic believed it had reached professional standards in terms of technical quality and aesthetics.

However, while this 'artist' let its imagination run wild, it also brought some issues:

Lower image fidelity and frequent scene reconstruction.

Selective execution of user instructions and off-track creativity.

These are the inevitable costs of Keling's 'director philosophy,' as the generated videos may differ significantly from the envisioned scenes.

Now, let's talk about Jimeng. Compared to an artist, it was more conservative.

Jimeng's advantage in video generation lay in its extremely high image fidelity and stability.

The main subject of each shot rarely underwent significant distortion or deformation, and the footage was relatively stable.

This meant that the quality of videos generated by Jimeng's 'image-to-video' function largely depended on the quality of the images.

Furthermore, Jimeng could more accurately execute instructions for complex compositions, showing greater reliability in understanding and following instructions.

However, Jimeng's drawback was its difficulty in achieving physical realism and its lack of dynamic logic.

Many of the bizarre scenes mentioned earlier were mostly from Jimeng. This reflected its insufficient depth of concept understanding and lack of narrative ability.

Additionally, the 'film critic' pointed out that Jimeng's understanding and execution of camera language were relatively weak, almost unable to perform more complex film camera movements, reducing the expressiveness of the videos.

05 Final Thoughts

The videos generated by both models have proven that for general platform users, high-difficulty instructions often lead to failed results, and the technological boundaries have not yet been broken.

From a technological perspective:

In the field of AI-generated video, there is still a trade-off between the two core technological routes of 'fidelity' and 'creativity'; no current model manages to deliver both.

Additionally, video duration is currently a major limitation.

Most AI video generation platforms, both domestic and foreign, strictly control the duration of individual videos to within 5-10 seconds.

The content that a single video can express is limited, and generating long, coherent videos remains a significant challenge in this field.

For users, this increases the difficulty of writing prompts.

If the prompt is too detailed, the model may not understand it, and a few seconds of video cannot fully express the content;

If written too vaguely, the content generated by the model often deviates significantly from the user's intent.

From a cost perspective:

Deploying a model locally on powerful hardware and then fine-tuning it may be the only reliable way to generate high-quality videos.

However, this cost is not something the average user can afford.

Even for these two online application platforms, the membership prices are not cheap.

If purchasing credits individually, the most basic configuration on Jimeng—using the Video 3.0 model + 720P + 5-second video—costs 1 yuan per video;

On Keling, using the standard mode + 5-second video costs 2 yuan per video.

But based on my experience, to generate videos that reach the level of average short videos, configuration upgrades and multiple generations with constant debugging are definitely necessary.

And during this process, costs will inevitably rise.
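A back-of-envelope calculation makes the cost concrete. The per-clip prices come from the article (Jimeng: 1 yuan, Keling: 2 yuan per 5-second clip); the shot count and retry rate below are illustrative assumptions, not figures from either platform.

```python
# Rough cost of a short film assembled from 5-second AI clips.
# Prices per clip are from the article; shots and retries are
# illustrative assumptions.

def film_cost(price_per_clip: float, shots: int, retries_per_shot: int) -> float:
    """Total spend: every shot is generated (1 + retries) times."""
    return price_per_clip * shots * (1 + retries_per_shot)

# A 2-minute film at 5 s per clip needs about 24 shots; assume each
# shot takes 3 extra attempts before it looks acceptable.
jimeng_total = film_cost(1.0, shots=24, retries_per_shot=3)  # 96.0 yuan
keling_total = film_cost(2.0, shots=24, retries_per_shot=3)  # 192.0 yuan
```

Even under these modest assumptions, a single 2-minute draft runs to tens or hundreds of yuan, before any configuration upgrades.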

Therefore, directors, cinematographers, and post-production professionals can breathe a sigh of relief.

To let AI generate visually appealing films, we might as well exercise a bit more patience.
