RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives

What is RoadSocial

RoadSocial is a large-scale, diverse VideoQA dataset tailored for generic road event understanding from social media narratives. It differentiates itself from existing datasets by capturing the global complexity of road events with varied geographies, camera viewpoints (CCTV, handheld, drones) and rich social discourse. RoadSocial highlights:

14M frames, 414K social comments
13.2K videos (7.9K minutes)
674 unique video tags (total 100K+)
260K high-quality socially-informed QA pairs
Scalable QA generation pipeline using social video narratives
12 challenging video QA tasks for generic road event understanding
New tasks to test robustness of Video LLMs to hallucination
Improves generic road event understanding capability of Video LLMs
Critical insights onto zero-shot capabilities of 18 Video LLMs

Dataset Examples

Dataset Statistics

VideoQA Leaderboard

GPT-3.5 score (scale: 0-100) is reported for all tasks except Temporal Grounding (TG). Overall scores are reported for ALL QA tasks, Road-event Tasks (RT), Generic QAs, and Specific QAs.

Abbreviations: F, C, I, H, WR, KE, VP, DS, WY, CQ, AD, IN, CF, AV, IC.

#	Model	Params	Factual			Complex				Imaginative			Hallucination		Overall	Overall	Overall	Overall
#	Model	Params	WR	KE	VP	DS	WY	CQ	TG	AD	IN	CF	AV	IC	(ALL)	(RT)	(Generic)	(Specific)
1	Dolphin	9B	61.3	34.5	67.8	35.8	25.2	37.2	0.01	49.8	39.1	45.5	71.8	21.3	40.8	42.5	29.8	46.5
2	GPT-4o	-	77.0	66.6	84.3	70.2	70.8	72.1	7.8	77.7	76.4	77.0	90.0	67.6	69.8	70.0	69.5	74.4
3	Gemini-1.5-Pro	-	77.7	56.7	85.4	61.9	61.4	60.1	18.6	72.1	70.2	75.7	72.3	48.7	63.4	64.7	60.1	68.3
4	InternVL2	76B	72.4	51.3	81.4	57.1	59.0	62.1	1.07	70.5	67.0	69.2	58.6	27.6	56.4	59.1	55.5	65.1
5	Qwen2-VL	72B	76.6	56.6	85.1	60.2	64.0	67.6	0.01	71.9	72.4	71.6	37.0	40.2	58.6	60.3	58.3	68.8
6	LLaVA-Video	72B	75.8	52.4	76.8	52.4	55.0	52.2	9.94	68.3	63.7	64.9	83.5	24.7	56.7	59.6	51.1	63.3
7	LLaVA-OV	72B	75.1	54.1	78.7	53.0	53.3	54.1	3.99	67.8	61.9	63.1	45.1	19.9	52.5	55.5	51.8	63.0
8	VITA	8x7B	66.6	52.1	71.6	48.1	55.6	56.3	2.27	66.7	66.0	62.4	56.3	22.0	52.2	54.9	49.8	60.4
9	Tarsier	34B	73.7	58.1	78.2	58.2	59.0	58.8	0.32	71.6	71.1	67.4	83.2	82.3	63.5	61.8	58.4	66.1
10	ARIA	25.3B	75.4	53.1	86.2	58.4	56.9	70.2	8.96	75.1	74.7	74.0	86.4	29.2	62.4	65.4	56.7	68.5
11	InternVL2	8B	67.7	51.7	78.0	55.7	59.3	60.9	0.77	66.7	66.8	70.0	68.1	26.1	56.0	58.7	53.7	64.0
12	Mini-CPM-V 2.6	8B	77.7	57.6	80.6	55.0	50.5	57.5	0.4	61.6	52.3	59.3	73.5	30.0	54.7	56.9	51.0	62.0
13	IXC-2.5	7B	78.5	58.7	85.4	61.7	65.3	68.5	0.69	73.9	75.6	75.7	85.8	29.2	63.3	66.4	60.7	70.3
14	Tarsier	7B	69.9	54.7	72.3	52.0	53.4	55.2	0.11	69.5	69.3	63.5	79.1	67.3	58.9	58.1	54.0	61.7
15	LongVU	7B	73.0	53.0	76.3	51.1	50.2	55.0	0.84	59.7	55.8	58.2	48.9	32.7	51.2	52.9	47.7	59.7
16	Qwen2-VL	7B	75.5	52.8	76.1	52.7	57.7	56.4	0.59	69.2	71.6	65.9	37.5	39.6	54.6	56.0	52.6	63.9
17	LLaVA-Video	7B	74.6	50.1	76.7	52.1	50.1	50.3	1.43	60.4	53.8	58.7	61.8	23.5	51.1	53.6	47.6	59.7
18	LLaVA-OV	7B	73.4	51.2	77.2	50.7	51.7	51.2	0.97	62.8	55.4	58.6	45.4	21.1	50.0	52.6	48.4	59.8
19	LLaVA-OV ft.	7B	80.9	64.1	85.7	64.1	68.7	65.1	4.49	74.2	70.9	71.7	95.4	87.6	69.4	67.8	65.1	69.7

Submit your results: This email address is being protected from spambots. You need JavaScript enabled to view it.

Video Presentation

References: InternVL2 [4]; MM-AU [5]; VITA [6]; BDD-X [8]; LLaVA-OV [10]; ARIA [11]; Dolphin [13]; DRAMA [15]; LingoQA [16]; GPT-4o [18]; Rank2Tell [24]; LongVU [25]; DriveLM [26]; ROAD [27]; Gemini-1.5-Pro [28]; Tarsier [30]; Qwen2-VL [31]; SUTD-TrafficQA [32]; BDD-OIA [33]; Mini-CPM-V [35]; IXC-2.5 [36]; LLaVA-Video [37]

Citation

@misc{parikh2025roadsocialdiversevideoqadataset, 

     author = {Chirag Parikh and Deepti Rawat and Rakshitha R. T. and Tathagata Ghosh and Ravi Kiran Sarvadevabhatla}, 

     year = {2025}, 

     eprint = {2503.21459}, 

     archivePrefix = {arXiv}, 

     primaryClass = {cs.CV}, 

    url = {https://arxiv.org/abs/2503.21459},

RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives

Chirag Parikh, Deepti Rawat,Rakshitha R. T., Tathagata Ghosh, Ravi Kiran Sarvadevabhatla

iHub-Data|CVIT|IIIT Hyderabad

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

[Paper] [arxiv] [Dataset] [Code] [Checkpoints]