A year in the making, BigScience’s AI language model is finally available

After more than a year of planning and training, a volunteer-led project has produced an open source language model that its organizers claim is as powerful as OpenAI’s GPT-3, but free and open for anyone to use (if they have the computing power). Dubbed Bloom, the model is available as open source along with the code and datasets used to create it. Brooklyn-based AI startup Hugging Face has released a free web app that lets anyone try Bloom without having to download it.

Bloom is the brainchild of BigScience, an international, community-powered project with the goal of making large natural language models widely available for research. Large language models, or “LLMs” for short, can translate, summarize and write text with humanlike nuance — more or less. (See GPT-3.) But they’ve been historically costly to create, keeping them out of reach of researchers and firmly within the hands of Big Tech companies like Meta, Google and Microsoft.

That’s finally changing, thanks in part to the efforts of BigScience. The group’s more than 1,000 volunteer researchers — supported by ethicists, philosophers, legal scholars and engineers from startups and large tech companies alike — spent months working toward Bloom, which rivals in scale LLMs made by firms like OpenAI and Alphabet’s DeepMind. One of the largest open source models to work across multiple languages, Bloom is designed to be applied in a range of research applications, such as extracting information from historical texts.

“Bloom is able to generate text in 46 natural languages and dialects and 13 programming languages,” reads a blog post shared with TechCrunch ahead of the release. “Although it was never trained on any of those specific tasks, Bloom can be asked to produce summaries or translations of text, output code from instructions, and follow prompts to perform original tasks such as writing recipes, extracting information from a news article, or composing sentences using a newly-defined invented word … Bloom’s performance will continue to improve as the workshop continues to experiment and advance on top of Bloom.”

BigScience’s backers also hope that Bloom will spur new investigations into ways to combat the problems that plague all LLMs, including bias and toxicity. LLMs have a tendency to spout falsehoods and exhibit prejudices against religions, sexes, races and people with disabilities. They also struggle with the basic tenets of writing, often changing the subject of a conversation without a segue and endlessly repeating — or even contradicting — themselves.

“[Bloom] shows the continued power of open source and open science even for expensive, large foundational models,” Richard Socher, the CEO of You.com and formerly chief scientist at Salesforce, told TechCrunch via email. Socher isn’t involved with BigScience. “It also shows that in AI, no organization has a major edge for very long. Once an organization shows something is doable, the same capabilities will appear six to 12 months after in other places.”

Humble beginnings

BigScience’s origins lie in discussions years ago between Hugging Face chief science officer Thomas Wolf, GENCI’s Stéphane Requena and IDRIS’ Pierre-François Lavallée. The founders envisioned creating software, datasets, LLMs and tools to explore the social impact of AI, which only in recent years has received increased attention from the research community.

Soon, steering committees were formed to give members of BigScience — who hailed from more than 60 countries and 250 institutions — scientific and general advice, design collaborative tasks and organize workshops, hackathons and public events. Different working groups were charged with tackling challenges like data governance, proving theorems in mathematics and archival strategies, as well as privacy and informed consent and other legal issues.

Bloom is the sum total of their work. It was trained using $7 million worth of publicly funded compute time, provided through grants, on the Jean Zay supercomputer near Paris, France, which ranks among the most powerful machines in the world.

A robust discussion is ongoing in academic circles about the carbon impact of AI training; data centers aren’t particularly environmentally friendly. But BigScience says that Jean Zay, thanks to its unique cooling system and nuclear power source, was able to train Bloom with a carbon footprint equivalent to a Paris-to-New York flight.

Like all language models, Bloom is essentially a statistical tool to predict words. Fed an enormous number of examples from a 1.6-terabyte training dataset, Bloom learned how likely words are to occur based on patterns, including the semantic context of surrounding text. For example, given a typical email ending in the fragment “Looking forward…” Bloom might complete it with “… to hearing back.”
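
To make the idea of word prediction concrete, here is a toy sketch in Python. It is not how Bloom is built: it only counts which word follows which in a four-sentence corpus invented for this illustration, and every name in it (CORPUS, train_bigram_counts, complete) is hypothetical. The underlying idea of estimating likely next words from observed text is the same, at a vastly smaller scale.

```python
from collections import Counter, defaultdict

# Toy corpus invented for illustration; Bloom's real training data is 1.6 TB.
CORPUS = [
    "looking forward to hearing back",
    "looking forward to hearing back soon",
    "looking forward to hearing from you",
    "thanks for getting back to me",
]


def train_bigram_counts(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probabilities(counts, prev):
    """Turn the raw follow-up counts for `prev` into probabilities."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}


def complete(counts, prompt, max_words=3):
    """Greedily append the most frequent next word, up to max_words times."""
    words = prompt.split()
    for _ in range(max_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # no observed continuation for this word
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


counts = train_bigram_counts(CORPUS)
print(next_word_probabilities(counts, "hearing"))
print(complete(counts, "looking forward"))
```

With this corpus, the prompt “looking forward” is completed as “looking forward to hearing back,” because “back” follows “hearing” more often than “from” does. A model like Bloom weighs far more context than the single preceding word used here.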

One goal of the BigScience working groups was to collect data that was sufficiently representative to train Bloom. Because of systemic biases in public data sources, non-English LLMs traditionally haven’t performed as well as their English-language counterparts. Drawing on books, academic publications, radio transcriptions, podcasts and websites, the 341-billion-word dataset used to train Bloom aims to encode different cultural contexts across languages, including Swahili, Catalan, Bengali and Vietnamese.

The BigScience groups hand-picked nearly two-thirds of the dataset from 500 sources, soliciting suggestions from community groups including the African natural-language-processing community Masakhane, LatinX in AI and Machine Learning Tokyo. They redacted for privacy and filtered for quality, for example attempting to reduce an over-representation of porn sites, which tend to contain sexist associations.

Bloom isn’t completely bias-free — no LLM is. But the hope is that by maintaining transparency around the training data, it’ll be easier for researchers to get to the root of Bloom’s predictions and decision making.

Large in size

At 176 billion parameters, Bloom is roughly the size of GPT-3. Parameters in machine learning are the parts of the LLM learned from training data and tend to correlate with the effectiveness of the model on a task like generating text.
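
A back-of-the-envelope calculation shows why a model this size demands serious hardware. The short Python sketch below estimates how much storage the parameters alone would need. The bytes-per-parameter figures are assumptions for illustration; the article does not say which numeric format Bloom's weights use.

```python
PARAMETERS = 176_000_000_000  # parameter count reported for Bloom

# Assumed storage formats (not stated in the article), in bytes per parameter.
BYTES_PER_PARAMETER = {
    "32-bit float": 4,
    "16-bit float": 2,
    "8-bit integer": 1,
}


def weights_size_gb(parameters, bytes_per_parameter):
    """Return the size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return parameters * bytes_per_parameter / 1e9


for name, size in BYTES_PER_PARAMETER.items():
    print(f"{name}: {weights_size_gb(PARAMETERS, size):,.0f} GB")
```

Under the 16-bit assumption the weights come to 352 GB, well beyond the memory of a single consumer graphics card, and that is before counting the working memory needed to run the model.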

Generally speaking, higher-parameter models require more compute power to train. A 2020 study from AI21 Labs pegged the expenses for developing a text-generating model with only 1.5 billion parameters at as much as $1.6 million; Bloom trained on 384 Nvidia A100 GPUs for three months. Costs on that scale have made it difficult for the community to use large, state-of-the-art language models like Microsoft and Nvidia’s Megatron-Turing Natural Language Generation (MT-NLG), which has 530 billion parameters.
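
For a rough sense of scale, the training run can be converted into GPU-hours. The Python sketch below assumes that “three months” means about 90 days of continuous use on all 384 GPUs. That is a simplification: the article does not give exact dates or utilization, so the result is an order-of-magnitude estimate only.

```python
GPUS = 384          # Nvidia A100 GPUs, from the article
ASSUMED_DAYS = 90   # assumption: "three months" treated as 90 days
HOURS_PER_DAY = 24


def gpu_hours(gpus, days, hours_per_day=HOURS_PER_DAY):
    """Total GPU-hours if every GPU runs for the whole period."""
    return gpus * days * hours_per_day


total = gpu_hours(GPUS, ASSUMED_DAYS)
print(f"Approximate training compute: {total:,} GPU-hours")
```

Under those assumptions the run comes to 829,440 GPU-hours.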

BigScience claims that researchers will have the ability to use Bloom for less than $40 per hour on a cloud provider. But aiming to remove even this barrier to access, the organization plans to release smaller, less hardware-intensive versions of Bloom and is developing a distributed system that will allow labs to share the model across their servers. An API is also in the works.

Bloom joins a burgeoning ecosystem of open source, highly capable LLMs with wide commercial and research uses. In February, open AI research group EleutherAI released GPT-NeoX-20B, which at the time outperformed other public language models across several benchmarks. Months later, Meta open-sourced OPT-175B, which the company claimed was the first 175-billion-parameter language model to be made available to the AI community.

They’ve been put to good use — businesses have already sprung up around EleutherAI’s models. But some researchers fear abuse. At the University of Maryland, researchers discovered that it’s possible for LLMs to generate false news and cybersecurity reports that are convincing enough to fool experts. Another paper co-authored by researchers at Meta explores the potential harm that might arise from LLMs that give poor advice, particularly medical or psychological prognoses.

Many companies that offer access to LLMs through an API, like OpenAI, apply filters to weed out problematic text. But open source models obviously have no such protections.

In recognition of the potential for misuse, Bloom comes with documentation that outlines its capabilities and limitations. Using it requires agreeing to a legal license that commits researchers to not use the model for malicious ends. BigScience plans to monitor how the model is applied and adjust the license and documentation as necessary.

“We’re slated to add more languages, make the model smaller so it’s easier to use at the same level of performance, and we’ll support community efforts to expand it,” the blog post continues. “Bloom is a living family of models that will grow, not a one-and-done model.”