MLCommons and Hugging Face team up to release massive speech data set for AI research

MLCommons, a nonprofit AI safety working group, has teamed up with AI dev platform Hugging Face to release one of the world’s largest collections of public domain voice recordings for AI research.

The data set, called Unsupervised People’s Speech, contains more than a million hours of audio spanning at least 89 different languages. MLCommons says it was motivated to create it by a desire to support research and development in “various areas of speech technology.”

“Supporting broader natural language processing research for languages other than English helps bring communication technologies to more people globally,” the organization wrote in a blog post Thursday. “We anticipate several avenues for the research community to continue to build and develop, especially in the areas of improving low-resource language speech models, enhanced speech recognition across different accents and dialects, and novel applications in speech synthesis.”

This is a commendable goal, to be sure. However, AI data sets like this one can carry risks for the researchers who choose to use them.

Biased data is one of those risks. The recordings in Unsupervised People’s Speech come from Archive.org, the nonprofit perhaps best known for its Wayback Machine web archival tool. Because many of Archive.org’s contributors are English-speaking, and American, almost all of the recordings in Unsupervised People’s Speech are in American-accented English, according to the readme on the official project page.

This means that, without careful filtering, AI systems such as speech recognition and voice synthesizer models trained on Unsupervised People’s Speech could exhibit some of the same biases. They might, for example, struggle to transcribe English spoken by a non-native speaker, or have trouble generating synthetic voices in languages other than English.

Unsupervised People’s Speech might also contain recordings from people who are unaware that their voices are being used for AI research, including commercial applications. While MLCommons says that all the recordings in the data set are in the public domain or available under Creative Commons licenses, it is possible that mistakes were made.

According to an MIT analysis, hundreds of publicly available AI training data sets lack licensing information and contain errors. Creator advocates, including Ed Newton-Rex of the AI ethics-focused nonprofit Fairly Trained, have made the case that creators shouldn’t be required to “opt out” of AI data sets because of the onerous burden that opting out imposes on them.

“Many creators (e.g. Squarespace users) have no meaningful way of opting out,” Newton-Rex wrote in a post on X last June. “For creators who can opt out, there are multiple overlapping opt-out methods, which are (1) incredibly confusing and (2) woefully incomplete in their coverage. Even if a perfect universal opt-out existed, it would be hugely unfair to put the opt-out burden on creators, given that generative AI uses their work to compete with them; many would simply not realize they could opt out.”

MLCommons says it is committed to updating, maintaining, and improving the quality of Unsupervised People’s Speech. But given the potential flaws, developers would do well to exercise serious caution.
