Meta Rips Off The Author And Passes The Savings On To Skynet

It turns out that Meta, AKA Facebook, used a giant database of pirated books known as “Books3” for their generative AI training efforts.

Indeed, you can now search an index to see who was ripped off.

Did they rip me off? Not by name, as I have no published novels, but they did rip off Mike Ashley’s The Mammoth Book of Extreme Science Fiction, which has my story “Crucifixion Variations” in it, so yeah.

They ripped off Howard Waldrop:

  • Dream Factories and Radio Pictures
  • Going Home Again: Stories
  • Horse of a Different Color
  • Other Worlds, Better Lives
  • Things Will Never Be the Same

They ripped off a whole lot of Joe R. Lansdale.

They ripped off a whole lot of George R. R. Martin (in multiple languages).

There’s already been a lawsuit filed against Meta by Richard Kadrey, Sarah Silverman and Christopher Golden over using their material for training AIs, but there seems to be no mention of pirated books or Books3.

The fact that Meta is not only training AI on authors’ works without their permission, but using pirated copies to do so, adds insult to injury.

And probably additional monetary damages from the resulting lawsuits.

I expect the latest piracy revelations to lead to a whole host of new lawsuits…


    9 Responses to “Meta Rips Off The Author And Passes The Savings On To Skynet”

    1. Meatwood Flack says:

      In other words, Mark “serial IP thief” Zuckerberg has struck again. He’s lucky too, what with CA’s repeal of the 3 strikes law.

    2. 10x25mm says:

      You have to wonder whether FakeBook’s ripoff occurred in the United States or another country. The Marshall Islands appears to offer no copyright protection whatsoever.

      Disney really screwed legitimate AI training in the United States with their 95 / 120 year Copyright Act enhancement now enshrined in Title 17 of the United States Code. Since most other countries have much shorter copyright protection periods, legitimate AI development is likely to migrate to foreign countries.

    3. […] Meta Rips Off The Author And Passes The Savings On To Skynet. “The fact that Meta is not only training AI on author’s works without their permission, but […]

    4. Georg Felis says:

      Interesting how they picked real writers and their best books to train the AI rather than ‘award-winning’ trash like “If you were a dinosaur, my love”

    5. Book3 and pirated books *are* mentioned in the class action lawsuit brought against OpenAI by the Authors Guild https://authorsguild.org/news/you-just-found-out-your-book-was-used-to-train-ai-now-what/

      One of my books is in that list.

    6. CardanoCrusader says:

      Given the vast input an AI LLM needs to do its job, it would be REALLY hard to argue that the AI is merely derivative, especially given the proprietary algorithms required to produce the output. If anything qualifies for “fair use”, certainly transforming 500 pages as part of a 20 million page data set would qualify. Not only do the proprietary statistical algorithms add value and differentiate the output, but even the original word-to-number conversion is proprietary.

      The conversion of the words of the original work into numbers is already a differentiation. It is an add-on value given to the work by the people who assign the numbers, the weightings and the numerical categorizations to the words. Arguably, the number string derived from a given work is its own entity, unique in value from the original work, and that’s BEFORE it is fed into the algorithms. At that point, the original author of the original work arguably no longer has a copyright claim.

      So, then this proprietary number string, with its unique weightings and categorizations, is fed through proprietary algorithms. The output is unique to the algorithms, the weighting and the original number conversion. So, what’s left to copyright? The output stream? How?

    7. jabrwok says:

      Larry Correia shows up in that searchable index. I wonder if he’s contemplating a lawsuit. Or maybe Baen Books could do so on behalf of all its authors.

    8. Paul says:

      No, 500 pages or a 500 page book is not fair use. Fair use is based in part on how much of a work is being used compared to the totality of that particular work. So if I grab half a page of a book and footnote it and use it in my work, that’s fair use. Or 2 paragraphs of a 6 paragraph blog post. Since Facebook is using *all* 500 pages of a 500 page book, that would *not* be fair use. That’s using the whole bloody thing. That’s blatant theft of intellectual property.
      They are stealing all 20 million pages.
      Now, if they grabbed page 10 from 5 thousand books, you *might* have a fair use argument, but that would only get them 5,000 pages, not 20 million.

      As for “derivative”… does the LLM exist and function without the 20 million pages of input? No, it does not. It requires the input. Without it, the LLM does not exist. It becomes a tLM (tiny Language Model)(@copyright me). So it is obviously 90% derivative from the body of text it is being fed. Some programmers worked on the algorithms for what? a hundred man-years? The books they are stealing comprise the work of probably hundreds of thousands of man-years of labor and love.

      Next, are these companies making money from the use of the entirety of multiple authors’ works? Yes, they are. So they should be completely screwed and bankrupted when this goes to court.

      Additionally, LLMs in science, current events, news, pop culture, etc., will quickly age if they don’t continually get up-to-date input. So how is that going to work? You can’t have an AI write an article on the Israeli election without input from somewhere. Where are they receiving that information?

      But companies could build LLMs on their own data. Microsoft Press owns the copyright to hundreds or thousands of technical manuals. They are free to use those to build a technical LLM. Similarly, the US military has thousands of manuals and textbooks for machinery, ships, vehicles, weapons, military tactics, strategy, and logistics. DoD should build an internal LLM using all of that plus selected non-DoD works that are either expired copyright or where they compensate the authors appropriately.

      If someone were to digitize every novel and book printed, say, pre-1900, and build an LLM from that, then go for it. Those copyrights should all be expired by now (see Project Gutenberg).

      That they don’t consider an AI built on that information viable because it doesn’t have current information shows that the current authors’ works are critical to the success of their systems.

    9. JBalconi says:

      Dean Koontz could destroy them. They pirated not only his own novels but collaborations.
