Meta’s Use of Pirated Material to Train AI, and Why You Should Care


It all started with a piece in The Atlantic by Alex Reisner ( https://www.theatlantic.com/technology/archive/2025/03/libgen-meta-openai/682093/ ) revealing that Meta, the organisation behind social media sites such as Facebook and Instagram, have been using a library of pirated written material to train their generative AI. Of course, this is a bit of a simplistic starting point. There has been ongoing outrage throughout creative communities for years now, including legal cases brought by artists against companies including DeviantArt and Midjourney for their use of copyrighted images to train AI ( https://www.theartnewspaper.com/2024/05/10/deviantart-midjourney-stable-diffusion-artificial-intelligence-image-generators ). Similarly, a group of authors, including Paul Tremblay and Mona Awad, brought a lawsuit against OpenAI for book scraping ( https://www.theguardian.com/books/2023/jul/05/authors-file-a-lawsuit-against-openai-for-unlawfully-ingesting-their-books ), which was partially dismissed in February 2024 ( https://www.theguardian.com/books/2024/feb/14/two-openai-book-lawsuits-partially-dismissed-by-california-court ). But the recent furore, and the writing community's sense of betrayal, is suddenly very much focused on this issue.

The article came about when information was released in relation to a separate copyright legal case against Meta (the one mentioned above), which revealed that they had used the pirate BitTorrent library LibGen to train their new generative AI engine, Llama 3. The Atlantic also provided a link to a searchable snapshot of LibGen’s archives ( https://www.theatlantic.com/technology/archive/2025/03/search-libgen-data-set/682094/ ), so writers could check whether their writing had been pirated, and whether there was a chance it had been used in AI training. It should be noted that inclusion on this list of works doesn’t necessarily mean that a work has been used by Meta, as the library is ever-growing and changing, and we don’t know at what point Meta accessed LibGen’s archives.

For those not in the know, pirated libraries such as LibGen are somewhat common, although many authors probably didn’t realise the extent of their archives until The Atlantic’s article made it clear. Anecdotally, the issue is so widespread that I feel like I am perhaps the only person in my social and academic groups who hasn’t had their work pirated this way (I’m sure it’s nothing personal). The so-called libraries are a legal infringement in and of themselves, and several lawsuits have already been brought against the originators of these libraries. However, fines and orders to shut them down have, so far, been basically ignored.

According to Reisner’s sources, when faced with a decision between going through legal or questionably legal procedures to obtain large quantities of copyrighted material, Meta’s management decided that the legal process was both too slow and too expensive for their needs. Their employees recognised that this course of action posed a “medium-high legal risk”, but decided it was worth it. The end result is that thousands of pieces of pirated copyrighted written material – including fiction, non-fiction, and academic writing – have been used to train an AI product that stands to make Meta a lot of money.

But how exactly have these works been used? Is it such a big deal? Well, there are two sides to this argument. The existing legal argument proposed by Meta and other organisations currently in legal disputes is that their AI doesn’t reproduce copyrighted material, so it isn’t active plagiarism. The way that generative AI is trained is comparable to an aspiring writer reading a butt ton of writing in order to recognise what makes good writing, how normal people speak, and what information is important – except on a massive scale. The issue is that if you, a new writer, were to read some books to learn your craft, you would acquire them through legal means such as borrowing from a library or buying from a store (or I really hope you would!). I’m going to work on the assumption that piracy is something you’re not okay with. Meta looked at the legal equivalent of doing this and decided that it wasn’t for them. This means that the authors and publishers haven’t been paid to have their material scraped and potentially plagiarised.

But, as you might know if you’ve seen the popularly shared statements from authors like Adam Nevill ( https://adamlgnevill.com/blogs/blog/ai-an-authors-thoughts-after-the-atlantic-magazine-broke-the-meta-heist-story-part-1 ), authors have far more reasons than this to be angry about pirated data scraping. There is the ethical issue of using authors’ material to train AIs that might replace them. Already, many publications that operate on open submissions are being swamped by AI-generated pieces of fiction. There is the issue that Meta stand to make a significant amount of money from these writers’ hard work – work that is poorly paid to begin with, even if acquired legally. And in general there is a deep and real anxiety about maintaining rights for human creatives in a market soon likely to be swamped with material generated, in some form, by AI.

In response to these concerns, various authors and organisations have been protesting to demand accountability, recognition, and rapid legislation. The Society of Authors have organised a petition ( https://www.change.org/p/protect-authors-livelihoods-from-the-unlicensed-use-of-their-work-in-ai-training ) to “protect authors’ livelihoods from the unlicensed use of their work in AI training”, and at time of writing they have 25,000 signatures – a not insignificant number when you consider that there aren’t actually that many writers. A protest march was also rapidly organised and took place on 4th April. A group of writers and supporters marched to Meta’s registered office in London, led by A.J. West ( www.linktree/ajwest ), author of The Spirit Engineer and The Betrayal of Thomas True, who gave a stirring speech to those who attended. West stated, “Never have British authors suffered such a wholesale invasion of their right to own our own words. Our own stories. Our own voices,” before calling directly on the Culture Secretary, Lisa Nandy, and Arts Minister, Sir Chris Bryant, to arrange “an urgent face to face meeting with government so that immediate, decisive action can be taken to protect British writers from the monster that threatens to destroy us”.

Which brings us to why this matters to the writing community, and I think particularly to early career writers. If this were purely an issue of legislation and legalities, it would arguably be best left to law courts and politicians – although as the past year has shown, both are proving slow to respond to the fast-paced development of generative AI. But this isn’t just a concern for lawmakers, for a few reasons. Firstly, who better to articulate the personal fury and ethical issues of generative AI training than our world’s writers? Not only are we the ones being used for this data scraping, but our community has been writing ‘AI goes wrong’ stories for decades. There was a time when AI was science fiction, and our community are the ones who conceived of these stories. If anyone knows the potential ethical, social, and environmental dangers of AI, it’s authors!

But more than this, early career writers in particular are the people likely to be targeted by generative AI that claims to make you a better writer. It was perhaps notable that the most shared section of Adam Nevill’s blog on the AI issue was one castigating those who would use generative AI to call themselves writers. As an editor, I know that we live in a time when so many writers look to the internet to ‘learn’ how to write rather than just writing. Without proper information and legislation, these writers are left exposed to tools marketed in exactly this way. It is therefore important to always know where these tools have come from, who they have trodden on in order to become profitable, and the extent to which they can actually help you to become a good writer rather than a derivative one.
