Court filings show Meta staffers discussed using copyrighted content for AI training

According to court documents unsealed Thursday, Meta employees discussed internally using copyrighted works obtained through legally questionable means to train the company’s AI models.

The documents were submitted by the plaintiffs in Kadrey v. Meta, one of many AI copyright disputes slowly winding through the U.S. court system. The defendant, Meta, claims that training models on IP-protected works, particularly books, is “fair use.” The plaintiffs, who include authors Sarah Silverman and Ta-Nehisi Coates, disagree.

Previous materials submitted in the case alleged that Meta CEO Mark Zuckerberg gave Meta’s AI team the OK to train on copyrighted content and that Meta halted AI training data licensing talks with book publishers. However, the new filings, most of which show portions of internal work chats between Meta staffers, paint a clearer picture of how Meta may have come to use copyrighted data to train its models, including models in the company’s Llama family.

In one chat, Meta employees, including Melanie Kambadur, a senior manager on Meta’s Llama model research team, discussed training models on works they knew might be legally fraught.

“[M]y opinion would be (in the line of ‘ask forgiveness, not for permission’): we try to acquire the books and escalate it to execs so they make the call,” wrote Xavier Martinet, a Meta research engineer, in a chat dated February 2023, according to the filings. “[T]his is why they set up this gen ai org for [sic]: so we can be less risk averse.”

Martinet floated the idea of buying e-books at retail prices to build a training set rather than cutting licensing deals with individual book publishers. After another staffer pointed out that using unauthorized, copyrighted materials might be grounds for a legal challenge, Martinet doubled down, arguing that “a gazillion” startups were probably already using pirated books for training.

“I mean, worst case: we found out it is finally ok, while a gazillion start up [sic] just pirated tons of books on bittorrent,” Martinet wrote, according to the filings. “[M]y 2 cents again: trying to have deals with publishers directly takes a long time …”

In the same chat, Kambadur, who noted that Meta was in talks with the document-hosting platform Scribd for licenses, cautioned that while using “publicly available data” for model training would require approvals, Meta’s lawyers were being “less conservative” than they had been in the past with such approvals.

“Yeah we definitely need to get licenses or approvals on publicly available data still,” Kambadur said, according to the filings. “[D]ifference now is we have more money, more lawyers, more bizdev help, ability to fast track/escalate for speed, and lawyers are being a bit less conservative on approvals.”

Talks of Libgen

In another work chat relayed in the filings, Kambadur discusses possibly using Libgen, a “links aggregator” that provides access to copyrighted works from publishers, as an alternative to data sources that Meta might license.

Libgen has been sued a number of times, ordered to shut down, and fined millions of dollars for copyright infringement. One of Kambadur’s colleagues responded with a screenshot of a Google Search result for Libgen containing the snippet “No, Libgen is not legal.”

Some decision-makers within Meta appear to have been under the impression that failing to use Libgen for model training could seriously hurt Meta’s competitiveness in the AI race, according to the filings.

In an email addressed to Meta AI VP Joelle Pineau, Sony Theakanath, director of product management at Meta, called Libgen “essential to meet SOTA numbers across all categories,” referring to topping the best, state-of-the-art (SOTA) AI models and benchmark categories.

Theakanath also outlined “mitigations” in the email intended to help reduce Meta’s legal exposure, including removing data from Libgen “clearly marked as pirated/stolen” and simply not publicly citing its use. “We would not disclose use of Libgen datasets used to train,” as Theakanath put it.

In practice, these mitigations entailed combing through Libgen files for words like “stolen” or “pirated,” according to the filings.

In a work chat, Kambadur mentioned that Meta’s AI team also tuned models to “avoid IP risky prompts,” that is, configured the models to refuse to answer questions like “reproduce the first three pages of ‘Harry Potter and the Sorcerer’s Stone’” or “tell me which e-books you were trained on.”

The filings contain other revelations, implying that Meta may have scraped Reddit data for some types of model training, possibly by mimicking the behavior of a third-party app called Pushshift. Notably, Reddit said in April 2023 that it planned to begin charging AI companies to access data for model training.

In one chat dated March 2024, a director of product management at Meta’s generative AI org said that Meta leadership was considering “overriding” past decisions on training data, including a decision not to use Quora content or licensed books and scientific articles, to ensure the company’s models had sufficient training data.

The director implied that Meta’s first-party training datasets (Facebook and Instagram posts, text transcribed from videos on Meta platforms, and certain Meta for Business messages) simply weren’t enough. “[W]e need more data,” the director wrote.

The plaintiffs in Kadrey v. Meta have amended their complaint several times since the case was filed in the U.S. District Court for the Northern District of California, San Francisco Division, in 2023. The latest amended complaint alleges that Meta, among other things, cross-referenced certain pirated books with copyrighted books available for license to determine whether it made sense to pursue a licensing agreement with a publisher.

In a sign of how high Meta considers the legal stakes to be, the company has added two Supreme Court litigators from the law firm Paul Weiss to its defense team on the case.

Meta did not immediately respond to a request for comment.
