Researchers at the University of Wisconsin-Madison propose a fine-tuning approach built on a carefully designed synthetic dataset of numerical key-value retrieval tasks.

https://arxiv.org/abs/2406.19292

LLMs often struggle to retrieve relevant information from the middle of long input contexts, exhibiting what is known as “lost-in-the-middle” behavior. The research paper addresses this critical issue in large language models’ (LLMs) handling of long-context inputs. In particular, models such as GPT-3.5 Turbo and Mistral 7B often fail to accurately retrieve information and maintain their reasoning capabilities across large-scale text, which limits their effectiveness on tasks that require processing and reasoning over long passages, such as multi-document question answering (MDQA) and flexible-length question answering (FLenQA).

Current methods for improving LLM performance in long-context settings typically involve fine-tuning on real-world datasets. However, these datasets often contain outdated or irrelevant information, which can introduce hallucinations and other inaccuracies. Evaluations on benchmarks such as MDQA and FLenQA have shown that LLMs tend to exhibit “lost-in-the-middle” behavior: performance is strongest when relevant information appears at the beginning or end of the input context and degrades when it sits in the middle.

A team of researchers from the University of Wisconsin-Madison proposes a novel fine-tuning approach that uses a carefully designed synthetic dataset to address these challenges. The dataset consists of numerical key-value retrieval tasks intended to strengthen LLMs’ ability to process long contexts. By using synthetic data that avoids the pitfalls of outdated or irrelevant information, the researchers aim to improve LLMs’ information retrieval and reasoning abilities without inducing hallucinations.

The proposed synthetic dataset consists of simple key-value retrieval tasks over dictionaries, where each task presents multiple dictionaries, each holding a few keys. For example, the dataset for Mistral 7B contains 350 examples, each with 85 dictionaries, yielding prompts of approximately 3,900 tokens. Fine-tuning is performed only on the answer portion of each task, with the remaining tokens masked out of the loss to focus the model’s learning on retrieval.
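
To make the setup concrete, here is a minimal sketch of how one such example might be generated. The prompt template, key/value ranges, and dictionary sizes below are illustrative assumptions; the paper’s released data may differ in exact format.

```python
import json
import random

def make_example(num_dicts=85, keys_per_dict=4, value_range=10**5):
    """Build one synthetic retrieval example: many small dictionaries of
    random integers, plus a question about a single key."""
    # Draw globally unique keys so every query has exactly one answer.
    keys = random.sample(range(10**5, 10**6), num_dicts * keys_per_dict)
    dictionaries = [
        {str(k): random.randrange(value_range)
         for k in keys[i * keys_per_dict:(i + 1) * keys_per_dict]}
        for i in range(num_dicts)
    ]
    # Ask for one key chosen at random from a random dictionary.
    target_dict = random.choice(dictionaries)
    target_key = random.choice(list(target_dict))
    prompt = (
        "Below are some dictionaries.\n"
        + "\n".join(f"Dictionary {i}: {json.dumps(d)}"
                    for i, d in enumerate(dictionaries))
        + f"\nWhat is the value of key {target_key}?"
    )
    return {"prompt": prompt, "response": str(target_dict[target_key])}

dataset = [make_example() for _ in range(350)]  # 350 examples, as for Mistral 7B
```

The “fine-tune only on the answer” detail corresponds to standard label masking: in Hugging Face-style training, tokens labeled -100 are ignored by the cross-entropy loss, so only the answer tokens contribute gradient. The snippet below is a hedged sketch of that idea; the tokenizer choice and the exact token-boundary handling are assumptions.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
example = dataset[0]
full_ids = tokenizer(example["prompt"] + " " + example["response"])["input_ids"]
prompt_len = len(tokenizer(example["prompt"])["input_ids"])
# -100 is the loss's ignore index: prompt tokens are hidden from
# training, so the model learns only to produce the retrieved value.
labels = [-100] * prompt_len + full_ids[prompt_len:]
```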

Experiments show that this approach significantly improves LLM performance on long-context tasks. For example, fine-tuning GPT-3.5 Turbo on the synthetic data yielded a 10.5% improvement on the 20-document MDQA benchmark at position 10, i.e., when the relevant document sits in the middle of the context. The method also mitigates the lost-in-the-middle phenomenon and reduces primacy bias, producing more accurate information retrieval across the entire input context. Models fine-tuned on the synthetic data were compared with models fine-tuned on real datasets, and the synthetic approach proved better at maintaining consistent accuracy across context positions.
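
The per-position evaluation behind these numbers can be sketched in a few lines. The harness below is hypothetical (it is not the paper’s evaluation code, and `query_model` is a placeholder): it places the answer-bearing “gold” document at each position among distractors, so plotting accuracy against position exposes primacy bias and the lost-in-the-middle dip.

```python
def build_mdqa_prompt(distractors, gold_doc, gold_position, question):
    """Insert the gold document at a chosen position among distractors,
    mimicking the 20-document MDQA setup."""
    docs = list(distractors)
    docs.insert(gold_position, gold_doc)
    body = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))
    return f"{body}\n\nQuestion: {question}"

# Toy inputs; a real run would use MDQA's documents and questions.
distractors = [f"Filler passage number {i}." for i in range(19)]
gold = "The Eiffel Tower is 330 metres tall."
question = "How tall is the Eiffel Tower?"

# Sweep the gold document across all 20 positions and record accuracy
# at each one; a U-shaped curve is the lost-in-the-middle signature.
for pos in range(20):
    prompt = build_mdqa_prompt(distractors, gold, pos, question)
    # answer = query_model(prompt)  # placeholder for an API/model call
```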

The study presents an innovative approach to fine-tuning LLMs with synthetic data, significantly improving their performance on long contexts. By addressing the lost-in-the-middle phenomenon and reducing primacy bias, the proposed method shows clear gains over traditional fine-tuning techniques. This research highlights the potential of synthetic datasets for overcoming the limitations of real-world data and paves the way for more effective and reliable LLMs in processing large-scale textual information.


Visit the Paper. All credit for this research goes to the researchers of this project.

Shreya Maji is a consulting intern at MarktechPost. She earned her Bachelor of Technology degree from the Indian Institute of Technology (IIT), Bhubaneswar. As an AI enthusiast, she likes to stay updated on the latest developments. Shreya is particularly interested in the real-world applications of cutting-edge technology, especially in the field of data science.
