Supplementary CTN Caption Structure · AI Video Generation

Effects of CTN Caption Structure in AI Video Generation

A side-by-side comparison of concatenated cause-effect prompts and LLM-fused phrasings, and how each shapes the temporal and causal fidelity of generated video.

§ I · Experimental setup

Platform
Kapwing

A video creation platform that maintains strict adherence to input text structure.

§ II · Side-by-side examples

Examples and analysis.

01 Car Incident

Concatenated

a car drove recklessly through an open field flipping over the car was severely damaged and a group of guys started playing beer pong

LLM Fused

A car flipped while driving recklessly through a field, then some guys started playing beer pong.

[Video pair: Concatenated Version (CTN) | LLM Fused Version (Fused)]

02 Game Fatality

Concatenated

a player performs a fatality move in mortal kombat another character is killed in the game

LLM Fused

A player performs a fatality move in Mortal Kombat, killing another character.

[Video pair: Concatenated Version (CTN) | LLM Fused Version (Fused)]

03 Paper Airplane

Concatenated

a man is folding a piece of paper a paper airplane is being created

LLM Fused

A man folds a piece of paper into a paper airplane.

[Video pair: Concatenated Version (CTN) | LLM Fused Version (Fused)]

04 Soccer Goal

Concatenated

a soccer player kicked the ball with precision the ball successfully went into the goal

LLM Fused

A soccer player kicks the ball precisely into the goal.

[Video pair: Concatenated Version (CTN) | LLM Fused Version (Fused)]

05 Stage Performance

Concatenated

a boy decided to perform on stage the audience watched and listened to his singing

LLM Fused

A boy performs on stage, singing to the audience.

[Video pair: Concatenated Version (CTN) | LLM Fused Version (Fused)]
§ III · Findings

Key findings.

| Aspect | Concatenated Approach | LLM Fusion Approach |
| --- | --- | --- |
| Temporal Ordering | Explicit and clear | Less distinct transitions |
| Causal Relationship | Strongly preserved | Partially weakened |
| Video Generation | More accurate scene transitions | Merged scenes with less distinction |
| Narrative Structure | Clear separation of events | Smoother but less structured |

Impact on AI video generation.

Across all five examples, the concatenated approach yields a more accurate representation of temporal sequence and causal relationships than the LLM-fused phrasing. This further validates our choice to maintain an explicit temporal-causal structure in CTN captions.
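
As a rough illustration of what "concatenated" means for the prompts above, the sketch below joins a cause clause and an effect clause into a single string. The function name `concatenate_caption` and its normalization rules (lowercasing, dropping the trailing period, joining with a single space and no connective) are assumptions inferred from how the example prompts on this page look, not a description of the actual captioning pipeline.

```python
def concatenate_caption(cause: str, effect: str) -> str:
    """Join a cause clause and an effect clause into one prompt string.

    Hypothetical helper: the rules here (lowercase, strip the trailing
    period, join with one space) only mirror the look of the page's
    concatenated examples; the real pipeline may differ.
    """
    parts = [
        cause.strip().rstrip(".").lower(),
        effect.strip().rstrip(".").lower(),
    ]
    # Skip empty clauses so a missing effect does not leave a stray space.
    return " ".join(p for p in parts if p)


prompt = concatenate_caption(
    "A man is folding a piece of paper.",
    "A paper airplane is being created.",
)
print(prompt)
```

Because the two clauses stay intact and in order, the cause and the effect remain separately visible in the prompt, which is the property the comparison above attributes to the concatenated approach.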