Two Problems
1. Current A.I.'s are trained on material in ways that are often illegal (copyright infringement). Similar to humans, they learn patterns by seeing large amounts of data over time. Large language models (e.g., ChatGPT) need massive amounts of data, often hundreds of gigabytes or terabytes, to reach top performance. They get most of that data from collections of books, public-facing web sites, and public-facing databases. A.I. developers just download, or scrape, all the data right off of those sites. They also bundle collections of files to share with each other. Given the benefits of strong A.I., it must be legal to train the foundational models on any copyrighted data. Countries that make that legal will get far ahead of other countries in these capabilities and markets.
2. Copyright holders want to control their works to maximize the profits they make. They fought against all forms of free file sharing in the past. They also pour a ton of money into lobbying to keep copyright laws strong. Any legal proposal must ensure that they retain their ability to control and profit from their works to the maximum extent. They'll also want the proposal to give them opportunities to make more money in the new markets.
Our Goals
A.I. suppliers need as much training data as possible at the lowest possible cost. The training must be cost-effective. Even human beings only cost $150k-$350k to educate. From childhood into adulthood, the law also usually allows them to consume any publicly-available information or displays for free, and that becomes knowledge they can apply. They learn from textbooks, schools, TV, the Internet, etc. Their own generated works are considered novel if they're not too close to the content they consumed. A top-notch A.I. should have the same rights to copyrighted works for training at a cost no higher than humans pay. Then its outputs should be as legal (or illegal) as a human's outputs would be in the same circumstances.
In parallel, our goal is for the A.I. laws to benefit copyright holders. This proposal will allow copyright holders to retain all existing agreements, legal protections, and revenue streams. Their work can generate revenue every time a new A.I. supplier uses it. That revenue will go to the same beneficiaries as before. They will be able to enter a lucrative market. They will also have upsell opportunities by making their works ready for A.I. training (i.e., pre-processing the data). Most content creators will achieve their goals, whether profit or public benefit, while A.I. suppliers will achieve theirs.
All Copyrighted Material is Legal
All copyrighted material is legal for training software models, including A.I.'s, with no restrictions imposed by suppliers of copyrighted works. If an A.I. supplier can access, rent, or buy it, then they can use it for training. What's illegal for humans to access is illegal for A.I.'s to access. If a human is allowed to use a work, any terms restricting that work's use for A.I. training are void by law. There are no restrictions on how A.I. developers can use copyrighted content during the training phase. They can pre-process it, filter for quality, change the order, do multiple rounds of training, and so on. There is no DMCA or reverse-engineering violation if the copyrighted works are used solely for training A.I.'s. This law is also retroactive so that all A.I. models previously trained on publicly-available, copyrighted works are themselves legal (e.g., Falcon-40B).
Any copyrighted and freely-available work, once legally published, is forever free to download (i.e. scrape) for use in training A.I.'s. Companies are still allowed to limit or block scrapers that they detect. This rule just reinforces that it's legal to use those works.
Optional (more risk): Any copyrighted and freely-available work, once legally published, can be transferred in original or pre-processed form, individually or in data bundles, for use in training A.I.'s. Any other use is subject to existing law on content or file sharing. This rule is helpful because current A.I.'s already use data sources obtained this way. Those include Common Crawl, The Pile, and Proof-Pile-2. Protecting developers' ability to share them will help both in independent verification of claims about A.I. performance and in rapidly experimenting with models using field-proven data sets. If scraping is legal (prior rule), many A.I. developers scraping the same data sources will waste bandwidth and storage on lots of redundant data. This extra rule eliminates that by letting developers just share their existing training data. Whole markets are already forming to host, process, and distribute these collections.
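To show how such a shared bundle is consumed in practice, here is a minimal sketch, assuming the Hugging Face datasets library. The allenai/c4 corpus is used purely as an illustrative, publicly-hosted stand-in for any shared bundle; the proposal doesn't name it.

```python
# Minimal sketch: streaming a shared, field-proven text bundle for training.
# Assumes the `datasets` library; "allenai/c4" is only an illustrative example
# of a publicly-hosted bundle, not a data set required by the proposal.
from datasets import load_dataset

# Streaming avoids every developer re-downloading and re-hosting terabytes
# of the same redundant data.
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, record in enumerate(stream):
    text = record["text"]   # raw document text from the bundle
    # ...hand `text` to the tokenizer / training pipeline here...
    if i >= 2:              # only peek at a few records in this sketch
        break
```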
Distribution of Copyrighted Works for A.I.
Copyright holders shall not commit price discrimination against A.I.-training customers. They must make any publicly-available work available to A.I. developers at the same or a discounted price relative to non-A.I. uses. They can bundle data together at discounted prices to both meet the high data needs of A.I. training and increase their own revenue per customer. They can offer, at a higher rate, optional pre-processed versions of their works that are easier for A.I. developers to use. A.I.-ready content might become a large revenue stream for content creators.
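As a rough illustration of what "A.I.-ready", pre-processed content could look like, here is a minimal sketch that cleans a plain-text work and emits chunked JSON Lines. The file names, chunk size, and output format are assumptions made for the example, not requirements of the proposal.

```python
# Minimal sketch of producing an "A.I.-ready" version of a work:
# cleaned text, split into fixed-size chunks, written as JSON Lines.
import json
import re

def preprocess(raw_path: str, out_path: str, chunk_words: int = 512) -> None:
    """Turn one plain-text work into training-ready JSONL chunks."""
    with open(raw_path, encoding="utf-8") as f:
        text = f.read()
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace artifacts
    words = text.split(" ")
    with open(out_path, "w", encoding="utf-8") as out:
        for i in range(0, len(words), chunk_words):
            chunk = " ".join(words[i:i + chunk_words])
            out.write(json.dumps({"source": raw_path, "text": chunk}) + "\n")

# Example usage (file names are hypothetical):
# preprocess("the_work.txt", "the_work.ai-ready.jsonl")
```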
Any existing, legal copy of a work owned by a private individual or a company can be used to train an A.I. They can digitize it themselves if needed, with no DMCA violations for that use case. They can use one copy of a work for multiple models so long as those models are their own. This is to incentivize each A.I. supplier to buy copyrighted material from authors while getting plenty of value in return. Just as people can sell books and CD's, they can also transfer their copies of copyrighted works to A.I. companies to use. That follows the same rule of one legal copy per work per A.I. supplier. Those rules will create secondary markets like we see with used textbooks, cars, and so on. This combination benefits the most people and companies.
Transformative Works
Using a set of training data to create a generative model is, by law, a transformative work. This rule supports allowing everyone who has or buys quality data to make and distribute models using that data. Fine-tuning an existing model with training data may or may not make much change in the model's behavior. For now, fine-tuning a model produces a transformative work if the behavior of the two models differs significantly. That wording is intentionally vague to leave more precise definitions to future laws and court decisions. This shouldn't be a problem since most fine-tunings are done on open-source models or on proprietary models with the permission of the model owners.
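For readers unfamiliar with the mechanics, here is a minimal sketch of the fine-tuning step the rule refers to, assuming the Hugging Face transformers and datasets libraries. The base model (gpt2) and the local corpus file are illustrative assumptions, not models or works named in the proposal.

```python
# Minimal sketch: fine-tuning an existing, open model on a new corpus.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "gpt2"                                        # illustrative open model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical local corpus that the fine-tuner holds a legal copy of.
data = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # whether the result "differs significantly" is the legal question
```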
Are Outputs Infringing?
Existing copyright law will decide whether model outputs are infringing. Courts will use the same standards they use to assess whether a human's use of their knowledge is novel or infringing. Outputs that differ significantly from their inputs are always independent creations. The outputs of models trained only on public-domain data are always independent creations.
A.I.'s are not moral agents: they are merely tools built on mathematics. The humans operating A.I.'s are the moral agents. What an A.I. does, including the damage it causes, is the responsibility of the person or company using it. They should choose, use, and maintain A.I.'s with the same care that they would any other tool in their endeavors. The expected level of accuracy, safety, and security should be similar to what they expect of non-A.I. tools in the same circumstances. This rule will reduce the damage from poorly-constructed A.I.'s without trying to control the development of the A.I.'s themselves. Instead, users are incentivized to choose the A.I.'s that will reduce their own liability. The liability requirement might also preclude the use of A.I.'s whose behavior is too unpredictable to justify the risk of using them.
If my above proposal is rejected, and the use of copyrighted works isn't allowed and/or models aren't always transformative, then...
All A.I. models are legal, transformative uses of copyrighted material so long as their use doesn't compete with the original authors. For instance, a book on software security can be used to train an A.I. to spot or correct security defects. It can't be used to educate people on software security. An A.I. educated with K-12-through-college and medical knowledge can use all of that training to support medical billing or to diagnose patients (esp. reading X-rays or MRI's). An A.I. trained on marketing material for specific products and services can write marketing material for different products and services. Under this rule, copyright holders can still issue more permissive licenses for any and all uses. A.I. suppliers would likely use a mix of permissive and restricted sources for training data. Compared to making all copyrighted data legal for use, though, this rule will still put a country and its models behind any country that adopts the original proposal. I've included it as a fall-back option.
If you know legislators or can talk to them, please send them this proposal so we can be sure our A.I.'s are legal. Once they are, we'll immediately see more applications with higher quality and lower risk. The kinds of people who will produce those are waiting to know their work is legal.