It's like those old pyramid diagrams. You start with a foundation. You keep layering things on top of it. At the bottom, we have data that can be used for any reason without even citing it. Data without worries! At the top, the data is highly restricted but still useful. The models start out as free as they can be, then get smarter, but more restricted, as they learn more stuff. Totally different from people.
We need to start with a trustworthy foundation. The base model should use information that's public domain, legally clear, diverse in its content, and large. Sites like Project Gutenberg have many works in the public domain. One can't always be sure about their status, though. To be cautious, one might get a list of the oldest works whose copyright definitely expired. More cautious people might cross-check multiple sites to confirm their titles, authors, and dates. Train an AI model on this mix in a way that squeezes all you can out of smaller data collections. With fine-tuning (esp expert examples), this foundational model can be used for anything from basic tasks, like summarizing papers, to making training data for other AI models. If the models are small and meant for making training data, one might fine-tune hundreds or thousands of them, each aimed at solving specific problems on specific topics.
(Note: If public domain, the age of these documents might also require marking what era or type of English each source uses, with instruct or story prompts saying to stick with modern English. If not, the output might look like those Americans that spontaneously start speaking British or Middle English in mid-conversation.)
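As a minimal sketch of that curation step, assuming a local catalog of candidate works (the catalog.json file and all its fields are hypothetical), one might keep only works that are safely old and cross-checked across multiple sites, tagging each with its era of English so prompts can request modern output:

```python
import json

# Conservative cutoff: works published before this year are assumed safely
# public domain (an assumption; verify for your jurisdiction).
CUTOFF_YEAR = 1900

def era_of_english(year):
    """Rough era label so prompts can say 'respond in modern English'."""
    if year < 1500:
        return "middle"
    if year < 1800:
        return "early-modern"
    return "modern"

def curate(catalog_path):
    with open(catalog_path) as f:
        works = json.load(f)  # hypothetical list of {title, author, year, sources}
    curated = []
    for w in works:
        # Keep only works old enough that copyright definitely expired,
        # and only if at least two independent sites agree on the date.
        if w["year"] < CUTOFF_YEAR and len(w.get("sources", [])) >= 2:
            w["era"] = era_of_english(w["year"])
            curated.append(w)
    return curated

if __name__ == "__main__":
    for work in curate("catalog.json"):
        print(work["title"], work["era"])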
Highly-Permissive Licenses
The next layer of models will use content with highly-permissive, business-friendly licenses. Think BSD or Apache. They may require the license to be mentioned and/or the author to be cited. The training system for such works might need to label every input with its citation (or encapsulate it). Permissively-licensed code, like GitHub's, was the main data I had in mind for this tier. If nothing else, we might be able to use it for auto-complete like many use Copilot. Initially, we might keep costs to a minimum by specializing models to one language (eg Python, INTERCAL) or a narrow niche within one (eg Flask apps).
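A minimal sketch of that labeling, assuming each training sample carries its license and attribution alongside the text (the field names are my invention):

```python
import json

def label_sample(text, license_id, author, source_url):
    """Wrap one permissively-licensed input with the citation its license requires."""
    return {
        "text": text,
        "license": license_id,   # eg "Apache-2.0" or "BSD-3-Clause"
        "attribution": author,
        "source": source_url,
    }

# Example: one labeled sample written out as a JSON line for the training set.
sample = label_sample(
    "def add(a, b):\n    return a + b\n",
    "Apache-2.0",
    "Example Author",
    "https://example.com/repo",
)
print(json.dumps(sample))
```

Keeping the citation next to the text means the trainer can emit an attribution file for the finished model, or filter by license, without re-crawling anything.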
Copyleft Licenses
The other type of open-source code is copyleft code. These licenses require releasing the code under the same or a compatible license. For copyleft models, we can add data ranging from Wikipedia to GPL code: whatever allows use with re-releasing under compatible licenses. We mix them together, releasing under one license. If good sources don't allow that, we might create a separate instance of the generic, copyleft model to then add the other copyleft data to. The resulting model is released under that license. This already tells you that it will be handy to have a fire-and-forget infrastructure for training models this way (a minimal compatibility check is sketched below).

I wondered about how to use copyrighted work that's not open source. My plan was to train models like we train people from children to adults. Adults start out giving the kids instruction that's akin to labeled material. The kids also get to see random things by exploring. They also play, which has implications for attention, making synthetic data, and motivation. They're disciplined to instill in them what they will or won't do, with reinforcement of the concepts on new data. They eventually contemplate what they learn, too. They rapidly take in information on the fundamentals, stabilize a bit with general learning, develop some specialties, and eventually highly specialize in college. Our K-12 education teaches concepts in a layered way where upper grades re-teach more complex versions of what lower grades taught. And they reinforce them.
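Here's that sketch: a toy compatibility table plus a check that decides whether a set of data sources can be mixed under one model license. The table is illustrative only, not legal advice, and real compatibility rules are more subtle:

```python
# Simplified, illustrative table: which data licenses can feed a model
# released under a given copyleft license. NOT legal advice.
COMPATIBLE = {
    "GPL-3.0": {"GPL-3.0", "LGPL-3.0", "Apache-2.0", "CC0", "public-domain"},
    "CC-BY-SA-4.0": {"CC-BY-SA-4.0", "CC-BY-4.0", "CC0", "public-domain"},
}

def model_license_for(datasets, target):
    """Return target license if every data set can be mixed under it, else None."""
    for name, lic in datasets:
        if lic not in COMPATIBLE[target]:
            print(f"{name} ({lic}) is incompatible with {target}; "
                  f"train a separate model instance for it.")
            return None
    return target

datasets = [("wikipedia-dump", "CC-BY-SA-4.0"), ("misc-prose", "CC0")]
print(model_license_for(datasets, "CC-BY-SA-4.0"))  # -> CC-BY-SA-4.0
```

A fire-and-forget pipeline would run a check like this up front, then automatically spin off one training job per incompatible license group.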
To imitate that human development, I'd train a model using K-12, college, specialist, and conversational data. It would regularly review new data using existing knowledge. For Christian materials, I'd ask homeschooling or college organizations. For secular materials, edX or other online educators are possibilities. Increase its specialist skill with works on problem-solving, fallacies (eg Art of Deception), statistics (and How to Lie With Statistics), creativity, good coding (eg Code Complete), risk assessment, persuasive writing... all sorts of things that might make AI's wiser and more effective. On top of diverse works, Christian publishers also have good ethics built into their texts on many levels. I'd love to see many works like this available for models. How?
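As a rough sketch of that layered curriculum, with `fine_tune` and `review` as hypothetical stand-ins for whatever training stack gets used:

```python
# Hypothetical stand-ins for a real training stack.
def fine_tune(model, dataset):
    print(f"fine-tuning on {dataset}")
    return model

def review(model, old_dataset):
    """Re-run earlier material so new learning reinforces, not replaces, it."""
    print(f"reviewing {old_dataset}")
    return model

# Layered curriculum: fundamentals first, then general, then specialist,
# with each stage reviewing what came before (like K-12 re-teaching).
CURRICULUM = ["k12-basics", "k12-advanced", "college-general",
              "specialist-texts", "conversational"]

model = "base-model"
seen = []
for stage in CURRICULUM:
    model = fine_tune(model, stage)
    for earlier in seen:
        model = review(model, earlier)
    seen.append(stage)
```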
First, let's just ask publishers for help. The basic license is for any author or publisher to allow anyone to use their content to train an A.I. They might require the resulting models be released under open-source licenses. Since AI suppliers say their models learn like humans, the default rule can be that anything that comes out of one is held to the same standards: fair use if partly used, infringement if a verbatim quote. I'd prefer that anyone who can read a work can use it in training a model. If it's a book or something, I'd rather any regulation say that they can't charge more per copy for A.I. training than they do for human use. They can charge less for volume deals given the huge data required for A.I.'s. They can also charge more for separate, curated versions that are better for A.I. training. This will open up revenue streams for publishers. If the model is proprietary, any publisher allowing their data to be used in it gets a free copy of it for internal use.
Before our next models, let's look at why content owners don't want verbatim quoting: content extraction and competition. They fear people will put their content in AI models, cause the models to generate the same content (or equivalent), and then compete with them as if they had the original content. I've found even non-profits and charities hold onto their "free" content tightly to benefit their brand, control how it's used, get donations, and so on. Others, usually academics or independents, are less motivated by money so much as wanting their work to be their work: others, humans or AI's, must mix it into original works or at least cite their contributions. So, our solution for obtaining and using their training data must either eliminate or reduce these risks. We must serve them first before serving others using their work.
There are a few routes. First, we might be able to accumulate large amounts of licensed training data if using traceable models with restricted outputs. These models would be proprietary with agreements that can limit how they're used. For outputs, a pre-existing prompt for model users might say to use only the information inside a data source, like a paper or database. Pre-existing operations might include summarizing, outlining, translating, code modification, and so on. One usage I had in mind was extracting text from a PDF, outlining its points, and summarizing its contributions (or spotting flaws). The model would be instructed to only reference what's in the paper in its output. Likewise, a software tool generating tests for just the input code is outputting content that's very close to what the user already has. Likewise, a search feature would only mention content in the folders on your computer in its results.
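A minimal sketch of one such restricted operation, with `generate` as a hypothetical stand-in for a call to the traceable model:

```python
def generate(prompt):
    """Hypothetical stand-in for a call to the traceable, proprietary model."""
    return "<model output>"

RESTRICTED_TEMPLATE = (
    "Use ONLY the information in the document below. Do not draw on "
    "anything else you were trained on, and do not quote other sources.\n"
    "Task: {task}\n"
    "--- DOCUMENT START ---\n{document}\n--- DOCUMENT END ---"
)

def restricted_op(task, document):
    """Run one pre-approved operation (summarize, outline, etc.) on user-supplied content."""
    return generate(RESTRICTED_TEMPLATE.format(task=task, document=document))

paper_text = "..."  # eg text extracted from the user's own PDF
print(restricted_op("Outline the paper's points and summarize its contributions.",
                    paper_text))
```

The agreement side would then say the tool ships only with these pre-approved operations, so outputs stay close to content the user already has.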
What all the uses above have in common is low risk of loss to publishers. The users are mostly seeing their existing content, or close derivatives of it, in the models' outputs. The publishers might even make money if the models advertise their data sources. I recommend publishers begin processing their data to be ready for use in AI models (esp LLM's). That way, they can easily license it individually, in groups (eg categories), or all of it. They should also start looking at how people use models that require strong training in language, general or niche-specific. They can market their data sets for models that are restricted to those uses. I imagine who will or won't participate will vary depending on how each supplier balances risk vs reward.
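To illustrate, here's a hypothetical manifest format (every field name is my invention) a publisher could ship alongside preprocessed data, so buyers license items individually, by category, or all at once:

```python
# Hypothetical manifest a publisher could ship with preprocessed data.
MANIFEST = {
    "publisher": "Example Press",
    "categories": {
        "textbooks": ["algebra-1", "chemistry-intro"],
        "trade": ["pop-sci-essays"],
    },
    "terms": {
        "per-item": "standard license, priced per title",
        "per-category": "volume discount",
        "all": "full-catalog license",
        "allowed-uses": ["summarization", "search", "translation"],
    },
}

def items_for(selection):
    """Resolve a license selection ('all', a category, or an item) to item IDs."""
    if selection == "all":
        return [i for items in MANIFEST["categories"].values() for i in items]
    return MANIFEST["categories"].get(selection, [selection])

print(items_for("textbooks"))  # -> ['algebra-1', 'chemistry-intro']
```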
(Note: There's another layer, like porting Python code to Rust or C++ (which I did with GPT), that is leakier. In Rust's case, the linked-list article was one many people were imitating at one point. In those scenarios, the training data might be more likely to end up in the outputs. Such leaky usages might require more discussion or negotiation with publishers.)
Another proprietary data source with potential is low-selling or out-of-print works. Publishers probably want to make extra money on them. Companies might buy them cheaply for use in their data lakes. Google has lots of books they've already scanned that they may or may not be using. Companies, non-profits, and individuals can sponsor both the public-domain release of existing works and the creation of new ones. These can be used for non-AI applications, too, such as free education. In AI, I could imagine suppliers competing on the quality of their data sources while buyers mix and match them.
Now, for the most democratic route: we do it! Many of us have been making content for a long time. Although some is professional, it's mostly casual comments on places like Slashdot, Reddit, and Hacker News. Our dialogue often fails to meet God's standards of righteousness and love. Despite our faults, we've been blessed with a tremendous number of comments that would be great training data for AI's. I'm talking packed with information, full of compassion, people asking the right questions, exemplary interviews... it's all out there just sitting on servers. Perhaps generating the occasional quarter of ad revenue.
Like Wikipedia did, I'm calling for users to contribute their own content. Put it all out under public-domain-equivalent licenses (eg CC0). When it gets started, people might temporarily change their profiles on Hacker News, Reddit, etc. to say all their comments are licensed that way. Then, have sites like Archive.org collect a snapshot as proof of that license before sending it to whoever is collecting it all. If we have proof, we might also just extract those users' comments out of existing collections of site data. Companies and communities doing open-source AI can set up their own collection sites that also distribute what they collected. I'd say limit it to whatever was published before GPT came out, too, just in case. Before SCIgen if really paranoid.
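A minimal sketch of that extraction, assuming a site dump of comments plus a list of users whose archived profiles prove the CC0 grant (the field names and cutoff date are placeholders):

```python
from datetime import date

# Users whose archived profiles (eg an Archive.org snapshot) prove a CC0 grant.
CC0_USERS = {"alice", "bob"}

# Only take comments from before generated text flooded the web.
# Placeholder date; pick whatever cutoff you trust.
CUTOFF = date(2020, 6, 1)

def extract_cc0(comments):
    """Pull only provably-licensed, pre-cutoff comments from a site dump."""
    for c in comments:  # each c: {"user": ..., "date": date, "text": ...}
        if c["user"] in CC0_USERS and c["date"] < CUTOFF:
            yield c["text"]

dump = [
    {"user": "alice", "date": date(2015, 3, 2), "text": "great explanation..."},
    {"user": "carol", "date": date(2016, 1, 5), "text": "no license proof"},
]
print(list(extract_cc0(dump)))  # -> ['great explanation...']
```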
What about models we use but don't distribute? There are data sets online that let people retrieve them just to read them. They aren't licensed for distribution, but you can link to them. For those, we might have open-source scripts to pull that data, or even paid downloads of the whole thing to support the suppliers. Like the platform earlier, each user of this software would pull the data, preprocess it, apply local instructions (eg filters), and combine it with local data. Then, they'd use other software to train the model on that data. This could push the open-source ecosystems to make all of that easier. Or copyright holders would push the legal system to make it harder. Who knows what would happen!
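A minimal sketch of that local flow, where every function is a hypothetical stand-in. The point is the shape: pull, filter, combine with local data, train, never redistribute:

```python
def pull(url):
    """Hypothetical: fetch a read-only data set the supplier lets you link to."""
    return [f"record from {url}"]

def apply_filters(records, banned_words):
    """Local instructions: drop anything the user doesn't want trained on."""
    return [r for r in records if not any(w in r for w in banned_words)]

def train_local(records):
    """Hypothetical stand-in for the actual training software."""
    print(f"training on {len(records)} records (kept local, never redistributed)")

sources = ["https://example.com/dataset-a", "https://example.com/dataset-b"]
local_data = ["my own notes"]

records = [r for u in sources for r in pull(u)]
records = apply_filters(records, banned_words=["spam"])
train_local(records + local_data)
```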
Note: I'm interested in working with people doing this if the work is getting open-sourced. Alternatively, a sponsorship or contract to explore every option above ranging from verifying public-domain works to talking to K-12/college suppliers about their data. Maybe build a 3B or 7B test model down the road. If that sounds good, email me about it.
(The licenses are open problems. This is mostly brainstorming.)
If it's already public, please license it for use in AI's (or any software). I'm talking about content that the public is already allowed to read and cite. It just doesn't currently have any license provision for sharing, remixing, or using in AI models. Two versions could exist: one letting you use it in AI's however you want; one allowing fair use like for a human reader. In the latter, the output has citations, copyright applies like any other usage, and whatever users publish might be in a situation similar to existing law. If I understand the current rules right, they would only protect content that humans generate with the AI content merely mixed into it. Such use cases might be easier to evaluate under existing law.
Now, let's talk training data. It's expensive to put together, expensive to use, and people have legal and ethical concerns. For training data, a group might do something similar to copyleft with their terms of service. You can use the training data if your base model is permissive, you release all of your artifacts permissively, and one version of each model is compatible with open-source tooling. That's my baseline just to grow the open-source ecosystem. For legal or ethical concerns, the terms might say the trainer must include in training or prompts whatever might ensure legal or ethical behavior. Both parties would define and agree to that.
What about the issues of content extraction and competition that I already mentioned? How do we address them with licensing? Here are some ideas:
1. Percentage of total data. The copyrighted work must not be larger than N% of the total training data put into a model. If it's tiny enough, one might be able to argue it only adds so much weight to the outputs. What if it's the only data of its type, though? (Several of these terms are mechanically checkable; see the sketch after this list.)
2. Number of epochs. The data set normally goes into the LLM a specific number of times in training, with the entire thing repeated each epoch. More repetition might increase the odds that specific data leaks out of it. Some content owners might want to limit the number of epochs for their content. The training set would be generated in a way that respects per-epoch limits.
3. Merged with similar data. The copyrighted work must be one of multiple examples of the same type of data. For instance, there might be many examples given to the model about what files are, how to generate them, doing it in Python, and specific examples in Python. When it generates Python code, any or all of this might have contributed to it.
4. Ratio of data set size to number of parameters. The content owners might want the training data to exceed the number of parameters by a multiplier N. For instance, a multiplier of 10 means at least 10GB of data going into a 1B-parameter model; a multiplier of 100 means 100GB.
5. Diverse data. The content owner might want a wide range of data on many topics to go into the model. They might even specify certain data sets, a minimum number of topics, or even a number of word vectors per word they use (their keywords). Once again, the odds the model is just repeating one piece of data go down as the number of data sets and similar words in the model goes up.
6. "For non-commercial use clause." We'll probably have an easier time asking for proprietary content when whatever we're producing can't make money.
7. Non-compete clause. Since I called out OpenAI's, I'll clarify that I mean a narrower limitation, such as textbook companies letting us use their data in models that do almost anything except make textbooks or other educational materials. There could be companies sitting on massive piles of data who focus on specific market segments and would sell us that data so long as we stayed out of that market. These non-compete terms are for specific models built largely on proprietary content that include a restriction not to compete with that content or its supplier. The closest thing in software are licenses like Redis' RSAL. They might even let us train AI models with the outputs if we stay within the restrictions.
8. We'll advertise your company and products. I'm really hesitant to mention this one. I'd rather talk about how to do it right. The goal is to get good marketing, legal docs, technical presentations, art, and so on from publishers. For example, I found book-length documents by hardware companies with both classroom-grade information on electrical engineering and nuggets of wisdom you can't learn in college. The AI models will only need a small percentage of what each company has, collected from a large number of companies. Since the model cites them, ask them for the content in exchange for the advertising benefits it gives them. Just limit the products, training inputs, and usages to those which make large contributions to training data, both by volume and quality.
9. "If you buy or invest in our company." This isn't exactly a license term. Every now and then, you're needing terabytes of data, the companies that have it are worth a few billion dollars, and you're sitting on $10 billion for whatever reason. Just buy the companies. If not, give them a massive amount of money for their content. Also, give them a free copy of the resulting model for internal use. If it's textbooks, edX, or Coursera, maybe add to the deal that all future courses go into the AI models or they start making new ones the AI company needs. If it's companies selling must-have research, you'll pay them to license those papers and they'll be in the model they get, too. If it's chat AI, there could be companies out there with billions of user comments that only make $350 million a year. Maybe their best strategy right now is trying to charge an extra quarter for a thousand API requests. Then, you show up dropping a check that gives them money like that right now and they'll also get an internal copy of the model to use. Hard to say no to offers like that. Over 99.999% of the market won't be able to use No. 9 but I imagine a few companies can. At least a five-letter acronym's worth.
For proprietary content, I'd also love to see shrink-wrap licenses like they have at License Zero for different levels of content licensing by publishers. I encourage people who write licenses like that to start experimenting with them using the concepts above.
So, those are the strategies. Use what you want to. I'm interested in your feedback on them, too.