AI2 releases Dolma, the largest open dataset for language model training

The Allen Institute for AI (AI2) has announced the release of Dolma, a new open dataset for training large language models. Dolma is the largest open dataset of its kind to date, containing 3 trillion tokens drawn from web content, academic publications, code, books, and encyclopedic sources. It is intended to serve as the training corpus for AI2’s planned open language model, OLMo, which will be free for the research community to use and modify.

Why Dolma?

AI2’s researchers said that they created Dolma to address the lack of access and transparency in the current landscape of language models. Most of the existing large language models, such as GPT-4 and Claude, are trained on proprietary datasets that are not publicly available or documented. This makes it difficult for researchers to study, improve, or critique these models and their data sources.

AI2’s goal is to create a more open and collaborative platform for language model research, where anyone can inspect, use, or contribute to the dataset and the model. Dolma is the first data artifact that AI2 is making available for this purpose, and it will be followed by OLMo, the open language model that will be trained on Dolma.

What’s in Dolma?

Dolma is a massive collection of text data that covers a wide range of domains and topics. It includes:

  • Web content from Common Crawl, such as news articles, blogs, forums, reviews, and social media posts.
  • Academic publications from Semantic Scholar, such as papers, abstracts, citations, and metadata.
  • Code from GitHub, such as repositories, README files, comments, and documentation.
  • Books from Project Gutenberg and Open Library, such as fiction, non-fiction, poetry, and classics.
  • Encyclopedic materials from Wikipedia and Wikidata, such as articles, summaries, facts, and links.

Dolma is a text-only corpus: every source is processed into plain text before inclusion, and the dataset does not contain images or other modalities.

AI2’s researchers said they curated Dolma with a multi-stage pipeline to ensure its quality and diversity. They applied filtering and deduplication to remove low-quality or redundant content, used sampling and weighting strategies to balance the distribution of sources and domains, and documented their choices and rationales in a datasheet that accompanies Dolma.
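
The announcement does not spell out the exact heuristics, but a minimal sketch of this style of curation, pairing exact deduplication via a content hash with a couple of toy quality rules, might look like the following Python. The thresholds and rules here are illustrative assumptions, not AI2’s actual pipeline.

    import hashlib

    def is_low_quality(text, min_words=50, max_symbol_ratio=0.3):
        """Toy quality filter: drop very short or symbol-heavy documents.
        Both thresholds are illustrative assumptions, not Dolma's rules."""
        words = text.split()
        if len(words) < min_words:
            return True
        symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
        return symbols / max(len(text), 1) > max_symbol_ratio

    def dedupe_and_filter(documents):
        """Yield documents that pass the quality filter, skipping exact
        duplicates detected with a SHA-256 content hash."""
        seen = set()
        for doc in documents:
            digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
            if digest in seen or is_low_quality(doc):
                continue
            seen.add(digest)
            yield doc

Production pipelines typically layer fuzzy deduplication (for example, MinHash over document shingles) and language identification on top of exact matching like this.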

How to use Dolma?

Dolma is openly available for download on the Hugging Face Hub under AI2’s ImpACT license; a minimal download sketch follows the list below. The ImpACT license is a new license that AI2 created for medium-risk artifacts, such as datasets and models. It requires users of Dolma to:

  • Provide contact information and intended use cases
  • Disclose any Dolma-derivative creations
  • Distribute those derivatives under the same license
  • Agree not to apply Dolma to various prohibited areas, such as surveillance or disinformation
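
In practice, pulling the data down might look like the sketch below, which uses the Hugging Face datasets library in streaming mode so the 3-trillion-token corpus is never fully materialized on disk. The repo id allenai/dolma and the text field name are assumptions here; check the Hub listing, and note that a gated dataset may require logging in and accepting the ImpACT terms first.

    # pip install datasets
    from datasets import load_dataset

    # Stream instead of downloading the whole corpus up front. The repo id
    # "allenai/dolma" is an assumption; confirm it on the Hub, and accept
    # the ImpACT license there first if the dataset is gated.
    dolma = load_dataset("allenai/dolma", split="train", streaming=True)

    # Peek at the first few documents; the "text" field name is assumed.
    for i, doc in enumerate(dolma):
        print(doc.get("text", "")[:200])
        if i >= 2:
            break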

AI2 also provides a removal request form for anyone who worries that their personal data may have been included in Dolma.

Dolma can be used for training or evaluating large language models, as well as for studying how the composition of pretraining data shapes model behavior, the kind of research that proprietary datasets make impossible. Its first large-scale application will be OLMo, AI2’s planned open language model.

AI2 hopes that Dolma will enable more research and innovation in the field of natural language processing. They invite the research community to join them in building OLMo, the open language model that will be based on Dolma.
