Together AI Unveils RedPajama v2: A 30 Trillion Token Dataset for Advanced Language Models

Image: together.ai

Together AI, the research organization behind the RedPajama project, has unveiled a substantial addition to its dataset collection: RedPajama v2. The new dataset, containing a staggering 30 trillion tokens, is aimed at further advancing the capabilities of large language models (LLMs).

High-quality data is a fundamental requirement for training advanced open LLMs, and RedPajama v2 is built to address exactly that need. The quality of the training data has a direct bearing on the performance of language models such as Llama, Mistral, Falcon, and MPT, which are pivotal in tasks involving understanding and generating natural language. Ensuring that such models are trained on high-quality data is therefore paramount.

Creating a top-notch dataset for training LLMs is not a straightforward process. Many challenges arise from the nature of web data: converting HTML to plain text can introduce artifacts, and web sources often contain low-quality content with inherent biases. Gathering the right data, cleaning and refining it, and ensuring its suitability for LLM training is a resource-intensive endeavor. Several community projects have tackled this challenge, such as C4, RedPajama-1T, RefinedWeb (Falcon), Dolma (AI2), and SlimPajama, but they typically cover only a subset of the CommonCrawl web crawls and provide limited options for data filtering.

Earlier in 2023, researchers from Together AI introduced RedPajama-1T, a dataset comprising 1 trillion high-quality English tokens. RedPajama-1T, which has since been downloaded more than 190,000 times, marked a significant step in providing quality data for LLM training. It was a notable milestone, but the researchers decided to go further: their commitment to improving the landscape of LLM training data led to the creation of RedPajama v2.

RedPajama v2 is a massive dataset containing an impressive 30 trillion tokens, making it the largest publicly available dataset released specifically for LLM training. It is assembled from 84 CommonCrawl web crawls plus additional publicly accessible web data. The dataset consists of raw plain-text documents, complemented by more than 40 pre-computed quality annotations and deduplication clusters. These annotations are generated by a range of machine learning classifiers that appraise the data's quality. Additionally, MinHash signatures are provided to enable fuzzy deduplication.
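To make the fuzzy-deduplication idea concrete, here is a minimal sketch using the open-source datasketch library: it computes MinHash signatures over each document's unique tokens and uses locality-sensitive hashing to flag near-duplicates. The sample documents, similarity threshold, and number of permutations are illustrative choices, not the settings Together AI actually uses.

```python
from datasketch import MinHash, MinHashLSH

def signature(text, num_perm=128):
    # Build a MinHash signature from the document's unique whitespace tokens.
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf8"))
    return m

# Illustrative documents; in practice these would come from the dataset.
docs = {
    "doc-0": "the quick brown fox jumps over the lazy dog",
    "doc-1": "the quick brown fox jumps over the very lazy dog",
    "doc-2": "an entirely different piece of web text",
}

# LSH index that groups documents whose estimated Jaccard similarity
# exceeds the chosen threshold (0.8 here is an arbitrary example value).
lsh = MinHashLSH(threshold=0.8, num_perm=128)
near_duplicates = []
for doc_id, text in docs.items():
    sig = signature(text)
    matches = lsh.query(sig)  # previously indexed documents that look similar
    if matches:
        near_duplicates.append((doc_id, matches))
    lsh.insert(doc_id, sig)

print(near_duplicates)  # e.g. [('doc-1', ['doc-0'])]
```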

One of the standout features of RedPajama v2 is its broad coverage of CommonCrawl, making it a substantial resource for language model training data. The dataset includes text in five languages: English, French, Spanish, German, and Italian, and this multilingual scope makes it useful to a broader audience.

The researchers from Together AI recognize the value of annotations in understanding and filtering the dataset effectively. These annotations are instrumental in the quality control and optimization of the dataset. Furthermore, the team has plans to expand the set of high-quality annotations by including additional signals that can be beneficial to the LLM developer community.
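As a rough illustration of how such annotations could drive filtering, the sketch below keeps only documents whose quality signals clear some simple thresholds. The field names (word_count, frac_duplicate_lines) and cut-off values are hypothetical placeholders, not the dataset's actual schema.

```python
# Hypothetical documents carrying per-document quality signals.
documents = [
    {"text": "A long, well-formed article about open datasets ...",
     "quality_signals": {"word_count": 812, "frac_duplicate_lines": 0.02}},
    {"text": "menu menu menu login",
     "quality_signals": {"word_count": 4, "frac_duplicate_lines": 0.75}},
]

def keep(doc, min_words=50, max_dup_lines=0.30):
    # Keep documents that are long enough and not dominated by repeated lines.
    s = doc["quality_signals"]
    return s["word_count"] >= min_words and s["frac_duplicate_lines"] <= max_dup_lines

filtered = [d for d in documents if keep(d)]
print(len(filtered))  # 1
```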

RedPajama v2 stands to have a significant impact on the field of large language models. Its extensive coverage of web data and the quality of its annotations make it a valuable resource for training LLMs. Moreover, the dataset is constructed in a way that allows users to filter, customize, and reweight the data according to their specific requirements.
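One possible way to reweight such a corpus, sketched below under the assumption that each document carries a scalar quality score, is to draw the training mixture with probability proportional to that score; the scores and sampling scheme are illustrative, not a prescribed recipe.

```python
import random

# Illustrative documents paired with hypothetical quality scores.
corpus = ["doc a", "doc b", "doc c", "doc d"]
scores = [0.9, 0.1, 0.6, 0.4]

def sample_mixture(documents, weights, k, seed=0):
    # Draw k documents (with replacement) with probability proportional to
    # their quality score, so higher-quality text is seen more often.
    rng = random.Random(seed)
    return rng.choices(documents, weights=weights, k=k)

print(sample_mixture(corpus, scores, k=10))
```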

Together AI's dedication to providing high-quality data for training language models like Llama, Mistral, Falcon, MPT, and RedPajama is evident in the release of RedPajama v2. As the development of language models continues to advance, the availability of a dataset of this magnitude will be a significant contribution to the research community and the broader field of natural language processing.
