Supplementary CTN Caption Structure · AI Video Generation

Effects of CTN Caption Structure in AI Video Generation

A side-by-side comparison of concatenated cause-effect prompts and LLM-fused phrasings, and how each shapes the temporal and causal fidelity of generated video.

§ I · Experimental setup


Platform

Kapwing

A video creation platform that maintains strict adherence to input text structure.


§ II · Side-by-side examples

Examples and analysis.

01 · Car Incident

Concatenated

a car drove recklessly through an open field flipping over the car was severely damaged and a group of guys started playing beer pong

LLM Fused

A car flipped while driving recklessly through a field, then some guys started playing beer pong.

Concatenated Version (CTN)

LLM Fused Version (Fused)

02 · Game Fatality

Concatenated

a player performs a fatality move in mortal kombat another character is killed in the game

LLM Fused

A player performs a fatality move in Mortal Kombat, killing another character.

Concatenated Version (CTN)

LLM Fused Version (Fused)

03 · Paper Airplane

Concatenated

a man is folding a piece of paper a paper airplane is being created

LLM Fused

A man folds a piece of paper into a paper airplane.

Concatenated Version (CTN)

LLM Fused Version (Fused)

04 · Soccer Goal

Concatenated

a soccer player kicked the ball with precision the ball successfully went into the goal

LLM Fused

A soccer player kicks the ball precisely into the goal.

Concatenated Version (CTN)

LLM Fused Version (Fused)

05 · Stage Performance

Concatenated

a boy decided to perform on stage the audience watched and listened to his singing

LLM Fused

A boy performs on stage, singing to the audience.

Concatenated Version (CTN)

LLM Fused Version (Fused)

§ III · Findings

Key findings.

Aspect	Concatenated Approach	LLM Fusion Approach
Temporal Ordering	Explicit and clear	Less distinct transitions
Causal Relationship	Strongly preserved	Partially weakened
Video Generation	More accurate scene transitions	Merged scenes with less distinction
Narrative Structure	Clear separation of events	Smoother but less structured

Impact on AI video generation.

In all five examples, the concatenated approach represents temporal sequence and causal relationships more accurately than the LLM-fused phrasing. This supports our choice to keep an explicit temporal-causal structure in CTN captions.
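
As a rough illustration of what "concatenated" means here, the sketch below joins a cause clause and an effect clause with a single space, cause first, and reproduces the concatenated prompt from example 03. The function name `concatenate_caption` and the split of the prompt into these two clauses are assumptions made for illustration; they are not taken from the CTN pipeline.

```python
def concatenate_caption(cause: str, effect: str) -> str:
    """Join a cause clause and an effect clause, cause first.

    Hypothetical helper: the clauses stay intact and in order, so the
    temporal-causal sequence remains explicit in the prompt text.
    """
    return f"{cause.strip()} {effect.strip()}"


# Assumed cause/effect split of the concatenated prompt in example 03.
cause = "a man is folding a piece of paper"
effect = "a paper airplane is being created"

caption = concatenate_caption(cause, effect)
print(caption)
```

The fused phrasing for the same example ("A man folds a piece of paper into a paper airplane.") merges both clauses into one event, which is the contrast the table above summarizes.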