Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124

Meta CEO Mark Zuckerberg appears to have used YouTube’s battle to remove pirated content to defend his own company’s use of a data set containing copyrighted e-books, newly released snippets of a deposition he gave late last year reveal.
The deposition, which was part of a complaint submitted to the court by the plaintiffs’ attorneys, is related to the AI copyright case Kadrey v. Meta. It’s one of many lawsuits winding through the US court system that pit AI companies against authors and other IP holders. In most of these cases, the defendants, the AI companies, claim that training on copyrighted content is “fair use.” Many copyright holders disagree.
“For example, YouTube, I think, may host some things that people pirate for a while, but YouTube is trying to take that stuff away,” Zuckerberg said during his deposition, according to part of a transcript made available Wednesday night. “And most of the stuff on YouTube, I would guess, is kind of good and they have a license to do that.”
Snippets from the deposition provide some clues to Zuckerberg’s thinking about copyrighted content and fair use. However, it should be noted that a full transcript of the deposition has not been released. TechCrunch has reached out to Meta for additional context and will update the article when the company responds.
Based on the deposition snippets, Zuckerberg appears to be defending Meta’s use of a data set of e-books called LibGen to train its family of AI models known as Llama. Meta’s Llama competes against flagship models from AI companies like OpenAI.
LibGen, which describes itself as a “link aggregator,” provides access to copyrighted works from publishers including Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. LibGen has been sued numerous times for copyright infringement, ordered to shut down, and fined millions of dollars.
According to court filings released this week, Zuckerberg cleared the use of LibGen to train at least one of Meta’s Llama models despite concerns among the company’s AI executives and research team about the legal implications.
Counsel for the plaintiffs, who include bestselling authors Sarah Silverman and Ta-Nehisi Coates, cited Meta staff as referring to LibGen as a “data set we know to be pirated” and flagging that its use could harm “[Meta’s] negotiating position with regulators,” according to a legal filing.
During his deposition, Zuckerberg claimed he had “not really heard” of LibGen.
“I understand you’re trying to get me to give an opinion about LibGen, which I haven’t really heard of,” Zuckerberg said during the deposition. “It’s just that I don’t have knowledge of that particular thing.”
Under questioning by one of the plaintiffs’ attorneys, David Boies, Zuckerberg explained why he thought it would be unreasonable to impose a blanket ban on the use of data sets like LibGen.
“So do I want a policy against people using YouTube because some content might be copyrighted? No,” he said. “[T]here are cases where having such a ban may not be the right thing to do.”
Zuckerberg did say, however, that Meta should be “pretty careful” about training on copyrighted material.
“You know, [if there’s] anyone who’s providing a website and they’re intentionally trying to infringe on people’s rights … obviously that’s something that we want to be cautious about or careful about how we engage with it or even prevent our teams from engaging with it,” Zuckerberg said in his deposition, according to the transcript.
Attorneys for the plaintiffs in Kadrey v. Meta have amended the complaint multiple times since it was filed in 2023 in the United States District Court for the Northern District of California, San Francisco Division. The latest amended complaint, filed by plaintiffs’ counsel late Wednesday, includes new allegations against Meta, including that the company cross-referenced some pirated books on LibGen with copyrighted books available for license. The lawyers allege that Meta used this approach to determine whether it made sense to pursue a licensing agreement with a publisher.
Meta allegedly used LibGen to train its latest family of Llama models, Llama 3, according to the redacted filing. The plaintiffs also allege that Meta is using the data set to train its next-generation Llama 4 models.
According to the redacted filing, Meta researchers allegedly tried to hide the fact that the Llama models were trained on copyrighted material by inserting “supervised samples” into Llama’s fine-tuning. Meta also downloaded pirated e-books from another source, Z-Library, for Llama training as recently as April 2024, the amended complaint alleges.
Z-Library, or Z-Lib, has been the subject of several legal actions brought by publishers, including domain seizures and takedowns. In 2022, Russian nationals who allegedly maintained it were charged with copyright infringement, wire fraud, and money laundering.