		
					
				Vol. 12 (2025)
			
		
		
			A Multimodal and Dynamically Updatable Benchmark for Aviation Question Answering with Large Language Models
		
										
												
				
											
															
									Department of Standard and Data Technology Research, China Aero-Polytechnology Establishment, Beijing 100028, China; School of Computer Science, Fudan University, Shanghai 200438, China
																	
																											 
											
															
									Department of Standard and Data Technology Research, China Aero-Polytechnology Establishment, Beijing 100028, China
																	
																											 
											
															
									Department of Standard and Data Technology Research, China Aero-Polytechnology Establishment, Beijing 100028, China
																	
																											 
											
															
									Department of Standard and Data Technology Research, China Aero-Polytechnology Establishment, Beijing 100028, China
																	
																											 
											
															
									Department of Standard and Data Technology Research, China Aero-Polytechnology Establishment, Beijing 100028, China
																	
																											 
									 
			 
			
				
	
	
				Abstract
		With the rapid advancement of artificial intelligence, large language models (LLMs) have demonstrated strong capabilities in open-domain question answering, knowledge retrieval, and decision support. However, in safety-critical, knowledge-intensive industries such as aviation, existing evaluation benchmarks fall short in domain adaptation, comprehensiveness, and dynamic updating. As aviation increasingly integrates intelligent automation and robotic systems for maintenance, inspection, and manufacturing, reliable evaluation of language models becomes crucial to the safety and autonomy of such systems. This paper proposes a multimodal, multi-level benchmark dataset tailored to aviation question answering (QA) tasks, together with an automated updating mechanism and a multi-dimensional evaluation framework. The methodology integrates knowledge extraction from multimodal aviation documents, generation of diverse QA pairs, iterative complexity enhancement, and quality validation. Dynamic updating is achieved through a hybrid strategy that combines imitation and expansion, complemented by differentiated filtering and prompt optimization. To ensure rigorous assessment, a ten-dimension evaluation framework is introduced, covering accuracy, completeness, relevance, explainability, and safety, among others. Experimental results using aviation textbooks confirm the effectiveness of the proposed approach in generating high-quality, dynamically evolvable QA datasets. By providing a reliable and dynamically evolvable benchmark, this work supports the integration of LLMs into robotic and automated decision-support systems in aviation, enabling more intelligent, autonomous, and safety-assured operations. It contributes both a methodology and practical tools for evaluating LLMs in aviation, with potential extension to other knowledge-intensive domains.
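		As a rough illustration of how per-dimension scores from such a framework might be combined, the following Python sketch averages the five dimensions named in the abstract. The equal weighting, the [0, 1] score range, the `safety_floor` gate, and the function name `aggregate_score` are all assumptions made for this example; the abstract does not list the remaining five dimensions or state how the paper aggregates scores.

```python
from statistics import mean

# Only the five dimensions named in the abstract; the other five of the
# ten-dimension framework are not listed there and are left out here.
NAMED_DIMENSIONS = (
    "accuracy",
    "completeness",
    "relevance",
    "explainability",
    "safety",
)


def aggregate_score(scores: dict[str, float], safety_floor: float = 0.5) -> float:
    """Combine per-dimension scores in [0, 1] into one number.

    Assumptions (not taken from the paper): equal weights, and a
    hypothetical gate that zeroes the result when the safety score
    falls below `safety_floor`.
    """
    missing = [d for d in NAMED_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in NAMED_DIMENSIONS:
        if not 0.0 <= scores[d] <= 1.0:
            raise ValueError(f"score for {d!r} must be in [0, 1]")
    # Hypothetical safety gate: an unsafe answer is not rescued by
    # high scores on the other dimensions.
    if scores["safety"] < safety_floor:
        return 0.0
    return mean(scores[d] for d in NAMED_DIMENSIONS)


if __name__ == "__main__":
    example = {
        "accuracy": 0.8,
        "completeness": 0.6,
        "relevance": 1.0,
        "explainability": 0.6,
        "safety": 1.0,
    }
    print(round(aggregate_score(example), 3))
```

		Gating on safety rather than simply averaging it in is one possible design choice for a safety-critical domain; a weighted mean without a gate would be an equally valid reading of the abstract.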
	
				
			
		
											
				
		
		