Test and Evaluation of Large Language Models to Support Informed Government Acquisition ; Annual Acquisition Research Symposium Proceedings & Presentations

Title:	Test and Evaluation of Large Language Models to Support Informed Government Acquisition ; Annual Acquisition Research Symposium Proceedings & Presentations
Authors:	Chandrasekaran, Jaganmohan; Mayer, Brian B.; Frase, Heather; Lanus, Erin; Butler, Patrick; Adams, Stephen C.; Gregersen, Jared; Ramakrishnan, Naren; Freeman, Laura J.
Publication Year:	2025
Collection:	VTechWorks (VirginiaTech)
Description:	As large language models (LLMs) continue to advance and find applications in critical decision-making systems, robust and thorough test and evaluation (T&E) of these models will be necessary to ensure we reap their promised benefits without the risks that often come with LLMs. Most existing applications of LLMs are in specific areas like healthcare, marketing, and customer support and thus these domains have influenced their T&E processes. When investigating LLMs for government acquisition, we encounter unique challenges and opportunities. Key challenges include managing the complexity and novelty of Artificial Intelligence (AI) systems and implementing robust risk management practices that can pass muster with the stringency of government regulatory requirements. Data management and transparency are critical concerns, as is the need for ensuring accuracy (performance). Unlike traditional software systems developed for specific functionalities, LLMs are capable of performing a wide variety of functionalities (e.g., translation, generation). Furthermore, the primary mode of interaction with an LLM is through natural language. These unique characteristics necessitate a comprehensive evaluation across diverse functionalities and accounting for the variability in the natural language inputs/outputs. Thus, the T&E for LLMs must support evaluating the model's linguistic capabilities (understanding, reasoning, etc.), generation capabilities (e.g., correctness, coherence, and contextually relevant responses), and other quality attributes (fairness, security, lack of toxicity, robustness). T&E must be thorough, robust, and systematic to fully realize the capabilities and limitations (e.g., hallucinations and toxicity) of LLMs and to ensure confidence in their performance.
This work aims to provide an overview of the current state of T&E methods for ascertaining the quality of LLMs and structured recommendations for testing LLMs, thus resulting in a process for assuring warfighting capability. ...
Document Type:	conference object
File Description:	application/pdf
Language:	English
Relation:	https://hdl.handle.net/10919/141680
Availability:	https://hdl.handle.net/10919/141680
Rights:	Public Domain (U.S.) ; http://creativecommons.org/publicdomain/mark/1.0/
Accession Number:	edsbas.E9160C47
Database:	BASE