Analysis: Effects of CTN Caption Structure in AI Video Generation

Experimental Setup

Kapwing Platform

Kapwing is a video creation platform that adheres strictly to the structure of the input text.


Examples and Analysis

Example 1 - Car Incident

Concatenated:

"a car drove recklessly through an open field flipping over the car was severely damaged and a group of guys started playing beer pong"

LLM Fused:

"A car flipped while driving recklessly through a field, then some guys started playing beer pong."

[Videos: Concatenated Version, LLM Fused Version]

Example 2 - Game Fatality

Concatenated:

"a player performs a fatality move in mortal kombat another character is killed in the game"

LLM Fused:

"A player performs a fatality move in Mortal Kombat, killing another character."

[Videos: Concatenated Version, LLM Fused Version]

Example 3 - Paper Airplane

Concatenated:

"a man is folding a piece of paper a paper airplane is being created"

LLM Fused:

"A man folds a piece of paper into a paper airplane."

[Videos: Concatenated Version, LLM Fused Version]

Example 4 - Soccer Goal

Concatenated:

"a soccer player kicked the ball with precision the ball successfully went into the goal"

LLM Fused:

"A soccer player kicks the ball precisely into the goal."

[Videos: Concatenated Version, LLM Fused Version]

Example 5 - Stage Performance

Concatenated:

"a boy decided to perform on stage the audience watched and listened to his singing"

LLM Fused:

"A boy performs on stage, singing to the audience."

[Videos: Concatenated Version, LLM Fused Version]

Key Findings

Aspect              | Concatenated Approach           | LLM Fusion Approach
Temporal Ordering   | Explicit and clear              | Less distinct transitions
Causal Relationship | Strongly preserved              | Partially weakened
Video Generation    | More accurate scene transitions | Merged scenes with less distinction
Narrative Structure | Clear separation of events      | Smoother but less structured

Impact on AI Video Generation

In all five examples, the concatenated captions produce videos that represent the temporal sequence and causal relationships more accurately than the LLM-fused captions. This further supports our choice of maintaining an explicit temporal-causal structure in CTN captions.
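
To make the "concatenated" form concrete, here is a minimal sketch of how such a caption could be assembled. It assumes each caption is built from separate event clauses joined in order by a single space with no added punctuation, which is what the examples above suggest. The function name `concatenate_clauses` and the split of Example 3 into a cause clause and an effect clause are hypothetical, introduced only for illustration.

```python
def concatenate_clauses(clauses):
    """Join event clauses in their given order with single spaces.

    No punctuation or connective words are added, so each clause stays
    a distinct, ordered unit inside the final caption.
    """
    return " ".join(clause.strip() for clause in clauses if clause.strip())


# Hypothetical split of Example 3 into a cause clause and an effect clause.
cause = "a man is folding a piece of paper"
effect = "a paper airplane is being created"

caption = concatenate_clauses([cause, effect])
print(caption)
```

Because the clauses are joined verbatim, the order of the input list is the only thing that encodes which event comes first. An LLM-fused caption, by contrast, rewrites both clauses into a single sentence.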