Vol. 12 (2025)
			A Multimodal and Dynamically Updatable Benchmark for Aviation Question Answering with Large Language Models
		
Department of Standard and Data Technology Research, China Aero-Polytechnology Establishment, Beijing 100028, China
School of Computer Science, Fudan University, Shanghai 200438, China
				Abstract
With the rapid advancement of artificial intelligence, large language models (LLMs) have demonstrated strong capabilities in open-domain question answering, knowledge retrieval, and decision support. However, in safety-critical and knowledge-intensive industries such as aviation, existing evaluation benchmarks fall short in domain adaptation, comprehensiveness, and dynamic updating. As aviation increasingly integrates intelligent automation and robotic systems for maintenance, inspection, and manufacturing, reliable language-model evaluation becomes crucial for ensuring the safety and autonomy of such systems. This paper proposes a multimodal, multi-level benchmark dataset tailored to aviation QA tasks, alongside an automated updating mechanism and a multi-dimensional evaluation framework. The methodology integrates knowledge extraction from multimodal aviation documents, diverse QA-pair generation, iterative complexity enhancement, and quality validation. Furthermore, dynamic updating is achieved via a hybrid strategy combining imitation and expansion, complemented by differentiated filtering and prompt optimization. To ensure rigorous assessment, a ten-dimension evaluation framework is introduced, covering accuracy, completeness, relevance, explainability, and safety, among other criteria. By providing a reliable and dynamically evolvable benchmark, this work supports the integration of LLMs into robotic and automated decision-support systems in aviation, enabling more intelligent, autonomous, and safety-assured operations. Experimental results on aviation textbooks confirm the effectiveness of the proposed approach in generating high-quality, dynamically evolvable QA datasets. This work provides both methodological innovation and practical tools for evaluating LLMs in aviation, with potential extension to other knowledge-intensive domains.
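The multi-dimensional evaluation described above can be illustrated with a minimal sketch: each model answer receives a per-dimension score, and an overall score is computed as a weighted aggregate, allowing safety-critical dimensions to be emphasized. This is an illustrative Python sketch under assumptions, not the paper's implementation; the abstract names only five of the ten dimensions (accuracy, completeness, relevance, explainability, safety), so the remaining dimension names below are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# First five dimensions are named in the abstract; the rest are
# hypothetical placeholders for the unnamed dimensions.
DIMENSIONS = [
    "accuracy", "completeness", "relevance", "explainability", "safety",
    "consistency", "fluency", "conciseness", "timeliness", "robustness",
]

@dataclass
class EvaluationResult:
    """Per-dimension scores in [0, 1] for one QA item."""
    scores: Dict[str, float]

    def overall(self, weights: Optional[Dict[str, float]] = None) -> float:
        """Weighted mean across all ten dimensions (uniform by default)."""
        weights = weights or {d: 1.0 for d in DIMENSIONS}
        total = sum(weights[d] for d in DIMENSIONS)
        return sum(self.scores[d] * weights[d] for d in DIMENSIONS) / total

# Example: uniform scores of 0.8, with a safety-weighted variant
result = EvaluationResult(scores={d: 0.8 for d in DIMENSIONS})
uniform = result.overall()                         # 0.8
safety_first = result.overall({d: (3.0 if d == "safety" else 1.0)
                               for d in DIMENSIONS})
print(round(uniform, 3), round(safety_first, 3))
```

Up-weighting the safety dimension, as in the second call, reflects the benchmark's emphasis on safety-critical aviation use; with identical per-dimension scores the two aggregates coincide, but they diverge as soon as the safety score differs from the rest.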
	
				