Overview

Existing video captioning benchmarks and models lack coherent representations of causal-temporal narrative, i.e., sequences of events linked through cause and effect, unfolding over time and driven by characters or agents. This absence restricts models' ability to generate text descriptions that capture the causal and temporal dynamics inherent in video content. To address this gap, we propose NarrativeBridge, an approach comprising: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark generated using a large language model and few-shot prompting, which explicitly encodes cause-effect temporal relationships in video descriptions and is evaluated automatically to ensure caption quality and relevance; and (2) a dedicated Cause-Effect Network (CEN) architecture with separate encoders that capture cause and effect dynamics independently, enabling effective learning and generation of captions with causal-temporal narrative. Extensive experiments demonstrate that CEN articulates the causal and temporal aspects of video content more accurately than the second-best model (GIT), improving CIDEr by 17.88 and 17.44 points on the MSVD and MSR-VTT datasets, respectively. The proposed framework understands and generates nuanced text descriptions with the intricate causal-temporal narrative structures present in videos, addressing a critical limitation in video captioning.

Try the Prompt

You can also try the example prompt used for CTN caption generation by visiting this link.

You are an advanced language model tasked with generating causal-temporal narrative captions for a video. However, you cannot directly access the video itself. Instead, you will be provided with a series of captions that outline the key events and scenes in the video. Your task is to generate a concise Cause and Effect scenario, based on the information provided in the descriptive captions. Be careful, your generated Cause and Effect statements should fulfill the following requirements: 
1. Your narrative should be grounded in the information provided by the descriptive captions. 
2. Cause and Effect scenario is relevant. 
3. It should not introduce any new events or details not mentioned. 
4. Avoid implying conclusions. 
5. Maintain temporal consistency with the provided captions. 
6. Use plain English and direct sentences. 
7. Cause and Effect statements each limited to a maximum of 15 words. 
8. Do not include any additional text before or after the JSON object. 

Here are the examples of Cause and Effect: 
[Examples]: 
[{'Cause': 'the student overslept due to a malfunctioning alarm clock', 'Effect': 'missed catching the bus to school'}, {'Cause': 'she absentmindedly skipped applying moisturizer after taking a long hot shower', 'Effect': 'her skin became dry and flaky'}, {'Cause': 'he carelessly neglected taking his prescribed allergy medication', 'Effect': 'suffered a severe sneezing fit'}, {'Cause': 'the exhausted soccer player recklessly fouled an opponent in the penalty area', 'Effect': 'the opposing team was awarded a crucial penalty kick'}, {'Cause': 'due to unforeseen road closures they found themselves stuck in heavy traffic', 'Effect': 'missed out on experiencing the opening act of the concert'}] 
Now please generate only one Cause and Effect presented in a JSON format based on the following descriptive captions. 
[Descriptive Captions]:
1. 'a car crashes and guys play beer pong'
2. 'a car driving through an open field kicking up dirt'
3. 'a car flipping over'
4. 'a car get wracked'
5. 'a car is being flipped over'
6. 'a dirt vehicle riding and rolling'
7. 'a dune buggy flipping over'
8. 'a four wheeler wrecking'
9. 'a monster truck flips on its side then several young men shout while playing beer pong'
10. 'a person drives an offroad car around a field'
11. 'a person flipping a go kart while a crowd cheers'
12. 'a race truck is crashing'
13. 'a truck rolls over itself and boys cheer on a friend'
14. 'a truck tumbles over on itself'
15. 'a tumbler crashes on a dirt road and then a group of guys play beer pong'
16. 'a type of monster truck crashes and men are shown celebrating'
17. 'a vehicle flips over'
18. 'an off road vehicle crashing'
19. 'crashing of a car while driving'
20. 'footage from a monster truck style event followed by a frat party'
[Causal Temporal Narrative]:
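
For reference, below is a minimal sketch of how this prompt could be assembled and queried programmatically. The call_llm placeholder, the truncated instruction text, and the JSON parsing are illustrative assumptions, not the released pipeline code.

import json

# Hypothetical helper: substitute whichever LLM client is used (the exact model
# and API behind the CTN benchmark are not specified on this page).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

FEW_SHOT_EXAMPLES = [
    {"Cause": "the student overslept due to a malfunctioning alarm clock",
     "Effect": "missed catching the bus to school"},
    # ... remaining Cause/Effect examples from the prompt above ...
]

def build_ctn_prompt(descriptive_captions):
    # Assemble the few-shot prompt shown above for a single video.
    numbered = "\n".join(f"{i}. '{c}'" for i, c in enumerate(descriptive_captions, 1))
    return (
        "You are an advanced language model tasked with generating causal-temporal "
        "narrative captions for a video. ...\n"  # full instruction text as shown above
        f"[Examples]:\n{json.dumps(FEW_SHOT_EXAMPLES)}\n"
        "Now please generate only one Cause and Effect presented in a JSON format "
        "based on the following descriptive captions.\n"
        f"[Descriptive Captions]:\n{numbered}\n"
        "[Causal Temporal Narrative]:"
    )

def generate_ctn_caption(descriptive_captions):
    # Query the LLM and parse the single JSON object it is asked to return.
    raw = call_llm(build_ctn_prompt(descriptive_captions))
    ctn = json.loads(raw)  # expected shape: {"Cause": "...", "Effect": "..."}
    assert {"Cause", "Effect"} <= set(ctn.keys())
    return ctn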

Download Benchmarks

Our work introduces a novel benchmark for video captioning called Causal-Temporal Narrative (CTN) captions, generated using a large language model for the popular MSR-VTT and MSVD datasets.

The MSRVTT-CTN and MSVD-CTN benchmark datasets are licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.

Motivation

Comparison of original vs. Causal-Temporal Narrative (CTN) ground truth captions to illustrate the inclusion of causal-temporal narrative.

CTN Captions Benchmark Generation

CTN caption generation pipeline. θ indicates a threshold.

Comparison of LLMs for CTN Caption Generation

Automatic Evaluation for CTN Caption Generation
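
The evaluation procedure is only summarised at a high level on this page: generated captions are checked for quality and relevance against a threshold θ, as indicated in the pipeline figure. As one illustrative possibility, a relevance check could be implemented with sentence-embedding similarity; the encoder choice, the threshold value, and treating the CTN caption as a single string are assumptions here, not the actual evaluation used for the benchmark.

from sentence_transformers import SentenceTransformer, util

# Assumption: relevance is scored as the best cosine similarity between the
# generated CTN caption and the source descriptive captions; captions scoring
# below the threshold theta would be regenerated. The actual criteria may differ.
_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice

def is_relevant(ctn_caption, descriptive_captions, theta=0.5):
    ctn_emb = _encoder.encode(ctn_caption, convert_to_tensor=True)
    src_emb = _encoder.encode(descriptive_captions, convert_to_tensor=True)
    return util.cos_sim(ctn_emb, src_emb).max().item() >= theta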

Labeling Unlabeled Videos with CTN Caption Generation

CTN caption generation application.
Our CTN caption generation approach can be effectively applied to label unlabeled videos. By extracting frames from an unlabeled video and generating image captions using a state-of-the-art model (e.g., GIT), we can create input for our LLM-based CTN caption generation pipeline. The LLM generates a coherent and contextually relevant CTN caption that captures the cause-effect relationships and temporal dynamics in the video, enabling accurate and informative labeling of unlabeled video content. This application demonstrates the versatility and effectiveness of our approach in streamlining the process of labeling large-scale video datasets and facilitating video understanding and retrieval tasks.
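
A minimal sketch of this labeling workflow is given below, assuming OpenCV for frame sampling and the Hugging Face GIT checkpoint microsoft/git-base-coco for frame captioning; the checkpoint, sampling rate, and generation settings are illustrative choices rather than the exact pipeline used in the paper. The resulting frame captions can then be passed to the LLM prompt shown earlier.

import cv2
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

# Illustrative setup: any strong image captioner can stand in for GIT here.
processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
git_model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

def sample_frames(video_path, num_frames=20):
    # Uniformly sample RGB frames from the video with OpenCV.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(total // num_frames, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

def caption_frames(frames):
    # Generate one short caption per sampled frame.
    pixel_values = processor(images=frames, return_tensors="pt").pixel_values
    with torch.no_grad():
        ids = git_model.generate(pixel_values=pixel_values, max_length=30)
    return processor.batch_decode(ids, skip_special_tokens=True)

# frame_captions = caption_frames(sample_frames("unlabeled_video.mp4"))
# ctn = generate_ctn_caption(frame_captions)  # LLM prompt from the section above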

CEN Architecture

CEN Architecture.
The two-stage Cause-Effect Network (CEN) architecture. Stage 1: Separate Cause (Ecause) and Effect (Eeffect) video encoders, pretrained using CLIP-ViT, learn specialized video representations. Corresponding text encoders (Tcause and Teffect) encode the cause and effect portions of the CTN caption. Contrastive losses are applied to align the video and text embeddings. Stage 2: The learned cause and effect video features are encoded separately (Enccause and Enceffect) and concatenated before being input to the decoder, which generates the final CTN caption.
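
To make the two stages concrete, here is an illustrative PyTorch sketch of the layout described above. The linear projections stand in for the CLIP-ViT-initialised video encoders and the text encoders, the caption decoder is abstracted away, and all dimensions, names, and the temperature are assumptions rather than the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE over matched video/text pairs in a batch.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

class CENSketch(nn.Module):
    def __init__(self, video_dim=512, text_dim=512, hidden=512):
        super().__init__()
        # Stage 1: separate cause/effect video and text projections
        # (standing in for E_cause, E_effect, T_cause, T_effect).
        self.e_cause, self.e_effect = nn.Linear(video_dim, hidden), nn.Linear(video_dim, hidden)
        self.t_cause, self.t_effect = nn.Linear(text_dim, hidden), nn.Linear(text_dim, hidden)
        # Stage 2: re-encode the learned cause/effect video features
        # (standing in for Enc_cause and Enc_effect).
        self.enc_cause, self.enc_effect = nn.Linear(hidden, hidden), nn.Linear(hidden, hidden)

    def stage1_loss(self, video_feats, cause_text_feats, effect_text_feats):
        # Align video features with the cause and effect caption portions separately.
        return (contrastive_loss(self.e_cause(video_feats), self.t_cause(cause_text_feats))
                + contrastive_loss(self.e_effect(video_feats), self.t_effect(effect_text_feats)))

    def stage2_features(self, video_feats):
        # Concatenated cause/effect features that a text decoder would attend over
        # when generating the final CTN caption.
        return torch.cat([self.enc_cause(self.e_cause(video_feats)),
                          self.enc_effect(self.e_effect(video_feats))], dim=-1)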

Quantitative Results

Method        MSVD                          MSR-VTT
              ROUGE-L   CIDEr   SPICE       ROUGE-L   CIDEr   SPICE
SEM-POS       25.39     37.16   14.46       20.11     26.01   12.09
AKGNN         25.11     35.08   14.55       21.42     25.90   11.99
GIT           27.51     45.63   15.58       24.51     32.43   13.70
CEN (Ours)    31.46     63.51   19.25       27.90     49.87   15.76
Comparison of our CEN architecture against SOTA methods on the MSVD and MSR-VTT datasets, reported as ROUGE-L, CIDEr, and SPICE scores. The best results in each category are in bold.

Qualitative Results

Qualitative examples across scenarios like video games, paper folding, soccer, and singing. CEN (Ours) captions accurately capture causal narratives and temporal sequences from ground truth, outperforming SOTA video captioning methods.

Model Weights

Download the model weights from here. We will release the code soon.