What is copyright? Most people who work are compensated one time for their time or deliverable. From that moment forward, others can do whatever they want with those ideas or other creations. At some point, rulers decided that creative works were special enough to merit monopolies on displaying, copying, or reusing them. The idea was that those laws would boost innovation immediately while still making those works free for public use later on. Originally, copyright in the U.S. lasted only 14 years, with a 14-year renewal possible. Laws kept extending that toward the lifetime of the author. Later, corrupt lawmakers extended it to around 70 years after the author's death. Why after death? So that companies that owned the works could continue profiting off of restricting those ideas while society got nothing in return. Today, the U.S. system is a mix of a concept society finds acceptable (rewarding content creators) plus legal additions bought and maintained with bribes.
People responded in different ways. Many continued sharing content with each other because that's natural for them. It also has clear benefits: the Wired documentary on Shenzhen, an area that constantly remixes ideas, shows how innovative that natural flow can be. While many were unaware it was illegal, others willingly broke copyright laws as civil disobedience against corruption or just for selfish gain. Others, especially librarians, stayed within the law by fighting back in court to protect our rights. They argued for fair use, the public domain, and the first-sale doctrine on textbooks. The gains courts allowed in these areas created tremendous benefits for society: preserving human knowledge (e.g. Archive.org), inexpensive education, increased creativity (especially in software and video), improved productivity (e.g. Google), and even national R&D, which is otherwise choking on paywalls.
Some copyright holders fought for more power. They sued people for reusing anything that became culturally trendy. They attempted schemes like UltraViolet to control your content (app stores today). They forced unskippable content, especially warning screens and advertisements, onto movies we had already paid for. Even paying customers started using pirated versions simply to watch the movie. Others couldn't reliably use Windows due to its copy protection, with some wanting versions without it just to dependably run software they paid $100 for. With all these problems, the pro-copyright lobby bribed Congress to pass the DMCA so you'd be a criminal if you turned any of that off. The DMCA also let them take your content off the Internet by claiming it's technically theirs. Later, they tried to grab more power with SOPA, but Internet users fought back.
Copyright law is currently corrupt and damages innovation. Personally, I think we'd be better off abolishing copyrights and patents in their current form to force free-market competition on ideas, with iterations happening quickly. People will continue to invent since they have many motivations: glorifying Christ, social good, money, ego, and so on. If we want to reward creators, then limit copyright terms to 1-5 years to match how long most product life cycles last. Limit patents to the time it takes the fastest player to deploy a quality product using the patented ideas. If a new phone comes out every 2 years, then the patent lasts 2 years. Those with lots of money can keep funding expensive risk-taking. This will optimize the market to reward with profit those who sell whatever is in demand, new or stable. They will be working, cooperating, and competing. Right now, many suppliers try to serve their customers as little as possible while restricting others' innovations.
Also, lock-in is a big problem. If we want a free market, we should modify our laws to reduce switching costs so you can easily swap out suppliers. I recommend that any paid offering be required to have open data formats and open API's. Open data formats let the user and competing solutions make use of the user's data. Open API's let the software be extended by the user or other suppliers to do more than it was originally designed for. Entire ecosystems can form this way. The supplier should have permissively-licensed tools that produce these files for any version that they publish. After Apple's lawsuits, I'll add that other companies should be able to copy a product's appearance or interface to reduce training costs. We can require that they make it clear that it's a different piece of software. Buyers should get the right to repair or modify, but they waive the supplier's liability in that case. Finally, no copyright or patents should block independent creation of software or hardware.
Patents
One of the biggest risks you can face is patent suits, especially patent trolling. This is a problem even outside of AI-generated works since you could step on someone's patent at any time. Many compare it to walking through a minefield. What I will say is that any output of an AI that you use commercially might get you hit with a patent suit. If it's patented, it's probably not your I.P. no matter what copyright law or the A.I. company says. Also, many CompSci and medical papers are about patented works, which the paper may or may not mention. AI's whose training data include patented works will probably generate infringing outputs. Violating a patent after looking at it, or knowingly violating it, leads to higher damage awards (willful infringement). That some AI's are actually trained on the patent databases is hilarious when considering these risks.
My later sections will largely ignore patents. So, I'll mention a few quick ideas to address the problem here. We might make sure all training data for foundational models used for R&D or coding are at least 20 years old, since U.S. utility patents expire roughly 20 years after filing. There are still patents that can get you, especially in the U.S. and U.K., but that would reduce most risk. Then, layer works that are potentially patented on top of those in a way where the AI cites anything it uses. Those papers or names can be checked against patent databases. We can even have separate sets of research papers that are already verified as having no patents or invalidated patents. Patent-immune AI's can be trained on those papers. This might have the side effect of either discouraging patents or encouraging building on non-patented work. A minimal sketch of such a filter follows.
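Here's a toy Python sketch of that filtering idea. Everything in it is an assumption for illustration: the Doc fields, the cutoff, and especially patent_lookup, which stands in for a real patent-database search (e.g., against USPTO records) that someone would still have to build.

    from dataclasses import dataclass, field
    from datetime import date

    CUTOFF_YEARS = 20  # U.S. utility patents expire roughly 20 years after filing

    @dataclass
    class Doc:
        published: date
        citations: list = field(default_factory=list)  # titles of works it builds on

    def patent_safe(doc, patent_lookup):
        """Admit a document into the training set only if it's old enough
        that any patents on it have expired, or if every work it cites
        clears the patent database. `patent_lookup(title)` is hypothetical:
        it returns True when a live patent turns up for that title."""
        age_years = (date.today() - doc.published).days / 365.25
        if age_years >= CUTOFF_YEARS:
            return True  # base layer: patent-expired material
        return not any(patent_lookup(t) for t in doc.citations)

    # Stub lookup for demonstration; a real one would query patent search APIs.
    no_live_patents = lambda title: False
    old_paper = Doc(published=date(1995, 6, 1))
    new_paper = Doc(published=date(2022, 3, 1), citations=["Some Method X"])
    print(patent_safe(old_paper, no_live_patents))  # True: older than 20 years
    print(patent_safe(new_paper, no_live_patents))  # True: citations cleared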
(Note: There might be patents on the model architectures, parameters, etc. I'd use patent-free architectures or license the critical part of the best ones. I'm also collecting alternative architectures partly for this reason.)
Trademarks
Trademarks can include slogans, images, and more. The AI's might use these. What are we to do about that?
Fair Competition Laws
I think it's relevant to bring up non-competes. Some firms, including OpenAI, added non-compete clauses to their terms of service that say you can't use their outputs to compete with that company. That's a broad statement. There are numerous laws in the United States that exist to promote fair competition. Examples here. These services might be violating some of those. Meanwhile, companies building on open-sourced AI models shouldn't have a problem here.
I asked Bing (GPT-4) if it knew about AI products without the non-compete agreements. Bing dodged the question with (paraphrasing): "Hmm. Let's try changing the topic." or "That's awkward. Let's try changing the topic." It looks like they programmed the AI to dodge questions about their shady legal terms. That's often a sign of dishonesty.
Organizational Character
God designed us to love Him and reflect who He is, and to love others as ourselves. He'll mainly judge us on our character. If we've done evil, only Christ can save us by His character. In this life, we must also consider the character of those we do business with.
"You shall have just balances, just weights, a just ephah, and a just
hin. I am the LORD your God, who brought you out of the land of Egypt."
(Leviticus 19:36)
Jesus said: "He who is faithful in a very little is faithful also in much.
He who is dishonest in a very little is also dishonest in much. If
therefore you have not been faithful in the wicked money, who will trust
you with the true riches? If you have not been faithful in that which is
another’s, who will give you that which is your own?" (in Luke
16)
The AI companies are always talking about ethics. They say they're about the public good, hoping more people will benefit from AI. Then, they restrict competition with their services. Although warning of AI's dangers, they keep raising billions to expand the capabilities of the AI's they own. They take others' work and ignore its terms, then put restrictions and legal terms on their own works that they want customers to follow. Recently, there's evidence OpenAI has even been secretly reducing the capabilities of their products while keeping the price and marketed benefits the same. For the future, there's the risk that power and money corrupt. Most of the major players are either chasing billions in funding or trying to make billions in profit. If either succeeds, they'll wield the power of superhuman intelligence over others.
These same companies want to be the ethical guardians of A.I. They've proven to be liars, thieves, and hypocrites who also restrict others' rights for selfish gain. Those are the last people we should trust to decide AI ethics! If considering law and regulation, we should be even more critical of their character. We should entrust regulation to those who are righteous, care about others, understand the subject, and have a track record of making ethical and effective decisions. Look for those people and organizations to help decide A.I.'s future.
That said, these A.I.'s offer enormous benefits for society. They can act as teachers, creativity/productivity boosters, and problem solvers; improve safety/security (including U.S. infrastructure); and more. I couldn't begin to summarize all the ways people have used products like ChatGPT. At a national level, whatever country has strong A.I. will also have many competitive advantages over other countries. The biggest obstacles to developing good A.I.'s are:
(a) Access to a large amount of high-quality training data that's 100% legal to use for training A.I.'s.
(b) Compute costs for training models, which are driven by the need for hundreds to thousands of expensive, supply-limited GPU's.
There's also a large, growing ecosystem of open-source A.I. It follows a model that defaults to no copyright, fair use, and open access. The innovation has been so high that there are thousands of free models on HuggingFace alone. On training, those supporting open-source tools drove the cost down from maybe tens of millions for GPT to $200,000 for smaller models like MPT-7B. A collaboration of 1,000 people produced a GPT-3-sized model that they released for free. The U.A.E. and China released their top models, too. The fine-tuned models from the open-source community are also matching or outperforming commercial models despite costing almost nothing. Even big companies know the open-source community is racing ahead of them. That's why Facebook open-sourced their model to ride that momentum. Even the free tools (example) for using proprietary A.I.'s are better than what the suppliers themselves offer.
Bottom line: A.I. development will race unstoppably forward regardless of what U.S. law says, the benefits of open-source A.I.'s are enormous, and the best choice is to make it legal to use copyrighted works to train A.I.'s.
The main threat to A.I. training is copyright law. For that reason, this section will mention the arguments from A.I. developers that their models are legal under copyright law.
Their Arguments
The first argument is that they're generative just like human brains. While they aren't brains, most of these AI's learn from example like the brain does: they receive tons of information (often unstructured), mix that data through mathematical algorithms, form an internal representation that's totally different from the original form, and use that internal representation to generate new information in response to user input ("prompts"). Each execution of an A.I. model can produce outputs containing pre-existing content, new content, hallucinated content, or total nonsense. What's shared or different can vary word by word, pixel by pixel, and beat by beat. What goes in is clearly not what's coming out in many cases, showing their originality. However, what goes in does come out in other cases and might be a fragment of a copyrighted work. Proponents argue that, since they're doing the same things, the A.I.'s should be treated under copyright law just like people making new works.
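To make that concrete, here's a toy Python sketch of my own (nothing like a real LLM): a word-level bigram model whose internal representation is just transition counts. That representation looks nothing like the source text, yet sampling from it can emit either novel recombinations or verbatim training fragments.

    import random
    from collections import defaultdict

    # Training text: the "copyrighted works" in this toy example.
    corpus = ("the quick brown fox jumps over the lazy dog "
              "the lazy dog sleeps while the quick brown fox runs").split()

    # Internal representation: each word mapped to the words seen after it.
    model = defaultdict(list)
    for prev, nxt in zip(corpus, corpus[1:]):
        model[prev].append(nxt)

    def generate(seed="the", length=9):
        words = [seed]
        for _ in range(length - 1):
            followers = model.get(words[-1])
            if not followers:
                break
            words.append(random.choice(followers))
        return " ".join(words)

    # Some outputs are chains never seen in training; others reproduce a
    # fragment verbatim, e.g. "the quick brown fox jumps over the lazy dog".
    print(generate())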
Another argument is built on the generation process. Copyright law says that a use which substantially transforms a work can legally create a new work. I've shown that the A.I.'s take in data, substantially transform it, and then generate new data. The A.I. proponents argue that their works are substantially transformative. I think we should make a distinction on whether it's the ideas themselves or the mechanics used to generate them that are substantially transformative. If it's the ideas, then this might be tested by showing how the A.I.'s keep producing new ideas that would be original works if humans made them. If we count the mechanics, then the complex process that generates A.I. outputs might already count as transformative enough by law. There are also tools that visualize how neural networks work layer by layer, letting us see more clearly how transformative they are.
Their next argument on top of that is that, so far, courts have ruled that computer-generated content cannot be copyrighted. This is definitely computer-generated content. It's also copyrighted works driving that content-generation process in a way that can sometimes reproduce the copyrighted works. Considering that, courts might not treat this computer-generated content the same way as prior generated content. For example, if the prior generators couldn't reproduce copyrighted works, then that wouldn't have been a consideration in that ruling. If they could, the old ruling could be reused in this case.
One more argument is fair use. The Pile paper said that "non-commercial, not-for-profit use of copyright media is preemptively fair use." Some groups are using free data collections to train models they're releasing for free, often for personal use. The data collections have many copyrighted works in them. They argue their collection, data curation, and AI training are non-commercial and non-profit. Therefore, it's a fair use of that data. What's unclear is what happens legally when that same data comes out of the AI. Also, even open AI's are being used commercially, with most licensed in ways allowing commercial use. Will that be fair use?
Let's use a simple example. The MPT and LLaMA-2 models are under free licenses permitting wide use. That might be considered non-profit and non-commercial at the model level. However, the companies will make money off of the models. Some will profit directly; most will profit indirectly. Is even an Apache-2-licensed model considered non-commercial if it's designed to make a business money? Would the organization producing the models and the organization profiting on them have to be different?
Let's look at that another way. Let's say someone downloaded the data from the web sites themselves in a way that respects their terms. The data is often copyrighted, but non-profit, non-commercial use is fair use. So, they make a model from all that copyrighted data, which they then release for free under an Apache license. Others use that model commercially to produce outputs that might also compete with the content creators. Are both the commercial use of that model and its outputs legally clear?
Let's use the same setup with a different goal. We have that model generate high-quality training data for cheaper models. Researchers already do this with GPT-3/4 using training sets like ShareGPT. Will the resulting model be totally in the clear, where you could never get sued for using it or what it outputs? If it is, we can put one huge investment into models like BLOOM-176B or Falcon-40B that continues paying off by making other models better. If that risk is there, though, using training data from any model with copyrighted works might legally contaminate the other models.
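For clarity, here's a minimal Python sketch of that distillation setup. The names are placeholders I made up, not a real API: big_model.generate() stands in for whatever large "teacher" model is queried.

    import json

    def make_training_pairs(big_model, prompts):
        """Use a large 'teacher' model to produce instruction/response
        pairs, the way ShareGPT-style sets are built from GPT-3/4."""
        pairs = []
        for prompt in prompts:
            response = big_model.generate(prompt)  # hypothetical call
            pairs.append({"instruction": prompt, "response": response})
        return pairs

    def save_dataset(pairs, path="distilled_train.jsonl"):
        # A smaller 'student' model is then fine-tuned on this file.
        # If the teacher saw copyrighted works, the legal status of these
        # pairs, and of every student trained on them, is the open question.
        with open(path, "w") as f:
            for p in pairs:
                f.write(json.dumps(p) + "\n")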
(Note: I'd love expert opinions from I.P. lawyers on these issues.)
Let's go back and summarize their main argument: what LLM-style AI's do is similar to how humans produce original work from prior knowledge, is heavily transformative, produces outputs that can't legally be copyrighted, and is at least partly fair use. Therefore, both the training and generation processes should be legal under copyright law with no lawsuits allowed against any A.I. producers or users. Also, no copyright claims can be made on A.I. outputs which, as generated works, can't have a copyright anyway.
(Note: One author of The Pile said that AI models aren't copyrightable. He also claimed that they're preparing to take Meta/Facebook to court to establish that precedent. His stated goal is to make life better for researchers and programmers sharing or using their own works. The resulting precedent would allow AI models to be built on almost any data before being used for any purpose. That includes both reproducing and improving that data.)
Their Risks
First, it's hard to say they're transformative in situations where they produce content that matches data in the training set either verbatim or nearly so. What's often illegal for people should also be illegal for software whose creators say it works like people. Second, products like ChatGPT and Claude are definitely not non-commercial because they charge for their products and raise investor money. Open-source models often generate value for their creators and users in many ways: value-added services, grants, paid hosting, consulting, padded resumes, fame, etc. All of them use copyrighted works, which were intended to benefit those content creators, to benefit everyone except those content creators. When arguing fair use, it looks like they dropped "fair" but went all in on "use."
Next section: Proving Wrongdoing.
(Navigation: Go to top-level page in this series.)
(Learn the Gospel of Jesus Christ with proof it's true and our stories. Learn how to live it.)