Here’s Proof You Can Train AI Models Without Abusing Copyrighted Material

In 2023, OpenAI told the UK Parliament that it would be “impossible” to train leading AI models without using copyrighted material. It’s a popular stance in the AI world, where OpenAI and other major players have used online content to train the models that power chatbots and image generators, sparking a wave of lawsuits alleging copyright infringement.

Two announcements on Wednesday provide evidence that large language models can in fact be trained without the permissionless use of copyrighted material.

A group of researchers backed by the French government has released the largest AI training dataset composed entirely of public domain text. And the nonprofit Fairly Trained announced that it has awarded its first certification to a large language model built without copyright infringement, showing that technology like the one behind ChatGPT can be built differently from the AI industry’s controversial norm.

“There is no fundamental reason why someone couldn’t do LLM training properly,” says Ed Newton-Rex, CEO of Fairly Trained. He founded the nonprofit in January 2024 after leaving his executive role at image generation startup Stability AI because he disagreed with its policy of scraping content without permission.

Fairly Trained offers certification to companies that want to prove they have trained their AI models on data they own, have licensed, or that is in the public domain. When the nonprofit launched, some critics pointed out that it had not yet identified a large language model that met those requirements.

Today, Fairly Trained announced it has certified its first large language model. It’s called KL3M and was developed by Chicago-based legal tech consulting startup 273 Ventures, using a curated training dataset of legal, financial, and regulatory documents.

Company co-founder Jillian Bommarito says the decision to train KL3M this way stemmed from the company’s “risk-averse” clients, such as law firms. “They are concerned about provenance, and they need to know that the output is not based on tainted data,” she says. “We are not relying on fair use.” Those clients were interested in using generative AI for tasks like summarizing legal documents and drafting contracts, but didn’t want to get dragged into intellectual property lawsuits the way OpenAI, Stability AI, and others have.

Bommarito says 273 Ventures had not worked on a large language model before but decided to build one as an experiment. “Our test is to see if this is possible,” she says. The company created its own training dataset, the Kelvin Legal Datapack, which contains thousands of legal documents reviewed for compliance with copyright law.

Although the dataset is small (about 350 billion tokens, or units of data) compared with those compiled by OpenAI and others that have extensively scraped the internet, Bommarito says the KL3M model performed far better than expected, something she credits to how carefully the data was vetted beforehand. “Clean, high-quality data can mean you don’t need to build the model so big,” she says. Curating the dataset also helps tailor a finished AI model to the task it is designed for. 273 Ventures is now offering spots on a waiting list to clients who want to purchase access to this data.
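For readers unfamiliar with the unit, a token is simply a short chunk of text, often a word or word fragment, and a training corpus is measured by how many of these chunks it contains. Here is a minimal sketch of how that counting works, using the open source Hugging Face transformers library and GPT-2’s public tokenizer purely for illustration; this is an assumption for the example and not the tooling behind KL3M or any model named here.

```python
from transformers import AutoTokenizer

# Illustrative only: GPT-2's publicly available tokenizer, not the one
# used by 273 Ventures or OpenAI.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Clean, high-quality data can mean you don't need to build the model so big."
tokens = tokenizer.tokenize(text)    # split the sentence into sub-word tokens
token_ids = tokenizer.encode(text)   # map those tokens to the integer IDs a model trains on

print(len(tokens))   # how many tokens this one sentence contributes to a corpus
print(tokens[:8])    # the first few tokens, typically words or word fragments
```

Summed over every document in a dataset, counts like this are what figures such as “350 billion tokens” refer to.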

Clean Sheet

Companies hoping to emulate KL3M may have more help in the future in the form of freely available, infringement-free datasets. On Wednesday, researchers released what they claim is the largest available AI dataset for language models made up entirely of public domain content. The Common Corpus, as it is called, is a collection of text roughly the same size as the data used to train OpenAI’s GPT-3 text generation model, and it was posted on the open source AI platform Hugging Face.

The dataset was created from sources such as the US Library of Congress and public domain newspapers digitized by the National Library of France. Pierre-Carl Langlais, project coordinator of the Common Corpus, calls it “a corpus large enough to train cutting-edge LLMs.” In AI parlance, the dataset contains 500 million tokens; OpenAI’s most capable models are widely believed to have been trained on several trillion.
