
LLM Self-Correction with DECRIM: DECOMPOSE, CRITIQUE, AND REFINE for Enhanced Following of Instructions with Multiple Constraints

Title:	LLM Self-Correction with DECRIM: DECOMPOSE, CRITIQUE, AND REFINE for Enhanced Following of Instructions with Multiple Constraints
Authors:	Palmeira Ferraz, Thomas; Mehta, Kartik; Lin, Yu-Hsiang; Chang, Haw-Shiuan; Oraby, Shereen; Liu, Sijia; Subramanian, Vivek; Chung, Tagyoung; Bansal, Mohit; Peng, Nanyun
Contributors:	Télécom Paris; Institut Mines-Télécom Paris (IMT)-Institut Polytechnique de Paris (IP Paris); Amazon; Meta AI; University of Massachusetts Amherst (UMass Amherst); University of Massachusetts System (UMASS); Princeton University; Université de Caroline du Nord à Chapel Hill = University of North Carolina Chapel Hill (UNC-Chapel Hill); University of North Carolina System (UNC); University of California Los Angeles (UCLA); University of California (UC); Association for Computational Linguistics
Source:	Findings of the Association for Computational Linguistics: EMNLP 2024 ; https://hal.science/hal-04912541 ; Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Nov 2024, Miami, United States. pp.7773-7812, ⟨10.18653/v1/2024.findings-emnlp.458⟩
Publisher Information:	CCSD; Association for Computational Linguistics
Publication Year:	2024
Subject Terms:	few-shot generation; system 2; self-correction; human evaluation; automatic evaluation; LLM-as-a-judge; analysis; prompting; benchmarking; language resources; automatic creation and evaluation of language resources; NLP datasets; evaluation and metrics; [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE]; [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing
Subject Geographic:	Miami; United States
Description:	Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions containing multiple constraints (e.g. a request to create a social media post "in a funny tone" with "no hashtag"). Despite this, most evaluations focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions by leveraging queries real users asked AI assistants. We also investigate model-based evaluation as a cost-effective alternative to human annotation for this task. Our findings reveal that even the proprietary GPT-4 model fails to meet at least one constraint on over 21% of instructions, highlighting the limitations of state-of-the-art models. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which enhances LLMs' ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM's response needs refinement. Our results show that DeCRIM improves Mistral’s performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback. Moreover, we demonstrate that with strong feedback, open-source LLMs with DeCRIM can outperform GPT-4 on both benchmarks.
Document Type:	conference object
Language:	English
Relation:	info:eu-repo/semantics/altIdentifier/arxiv/2410.06458; ARXIV: 2410.06458
DOI:	10.18653/v1/2024.findings-emnlp.458
Availability:	https://hal.science/hal-04912541; https://hal.science/hal-04912541v1/document; https://hal.science/hal-04912541v1/file/Instructions_with_multiple_constraints_EMNLP2024__Camera_ready.pdf; https://doi.org/10.18653/v1/2024.findings-emnlp.458
Rights:	https://about.hal.science/hal-authorisation-v1/ ; info:eu-repo/semantics/OpenAccess
Accession Number:	edsbas.772E401E
Database:	BASE
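
Illustrative sketch (not part of the record): the description says DeCRIM decomposes an instruction into a list of constraints and uses a Critic to decide when and where the response needs refinement. The Python below is a minimal sketch of that control flow only. Every name in it (`decrim`, `toy_decompose`, `toy_generate`, `toy_critique`, `toy_refine`, `CHECKS`, `max_rounds`) is hypothetical, the rule-based functions merely stand in for the LLM calls used in the paper, and the round limit and the word-count constraint are assumptions added for the example.

```python
from typing import Callable, Dict, List


def decrim(
    instruction: str,
    generate: Callable[[str], str],
    decompose: Callable[[str], List[str]],
    critique: Callable[[str, str], bool],
    refine: Callable[[str, str, List[str]], str],
    max_rounds: int = 3,  # assumption: the abstract does not state a round limit
) -> str:
    """Decompose, critique, refine: loop until the critic accepts every constraint."""
    constraints = decompose(instruction)
    response = generate(instruction)
    for _ in range(max_rounds):
        # "Where": collect the constraints the critic judges unsatisfied.
        failed = [c for c in constraints if not critique(response, c)]
        # "When": stop as soon as nothing is flagged.
        if not failed:
            break
        response = refine(instruction, response, failed)
    return response


# Hypothetical rule-based stand-ins; in the paper these roles are played by models.
CHECKS: Dict[str, Callable[[str], bool]] = {
    "no hashtag": lambda text: "#" not in text,
    "at most 8 words": lambda text: len(text.split()) <= 8,
}


def toy_decompose(instruction: str) -> List[str]:
    # Keep only the known constraints that literally appear in the instruction.
    return [name for name in CHECKS if name in instruction]


def toy_generate(instruction: str) -> str:
    # A fixed first draft that violates both constraints.
    return "Mondays are just weekends in denial, honestly #relatable #mood"


def toy_critique(response: str, constraint: str) -> bool:
    return CHECKS[constraint](response)


def toy_refine(instruction: str, response: str, failed: List[str]) -> str:
    if "no hashtag" in failed:
        response = " ".join(w for w in response.split() if not w.startswith("#"))
    if "at most 8 words" in failed:
        response = " ".join(response.split()[:8])
    return response


result = decrim(
    "Write a post with no hashtag and at most 8 words.",
    toy_generate,
    toy_decompose,
    toy_critique,
    toy_refine,
)
print(result)
```

Passing the four roles as callables keeps the loop independent of any particular model or API, so a weaker or stronger critic can be swapped in without touching the control flow.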