A team of researchers led by experts from the Massachusetts Institute of Technology (MIT) examined ten of the datasets most commonly used to test machine learning systems. The scientists found that about 3.4% of the data was inaccurate or mislabeled. This could cause problems in artificial intelligence systems that rely on these datasets.
The datasets, each linked to over 100,000 works, include text, images, and videos from newsgroups, the Amazon store, YouTube, and the IMDb movie database. The errors include negative product reviews mistakenly marked as positive, incorrect descriptions of what images show, and inaccurate descriptions of the content of sound recordings.
Notably, the researchers themselves used machine learning methods and related software tools to detect the possible errors.
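To give a sense of how a model can help find bad labels, here is a minimal Python sketch. It is an illustrative heuristic, not the researchers' actual tooling: it assumes a classifier has already produced class probabilities for each example, and it flags an example when the model confidently prefers a class other than the recorded label. The function name `flag_suspect_labels` and the toy review data are hypothetical.

```python
from typing import List, Sequence


def flag_suspect_labels(
    probs: Sequence[Sequence[float]], labels: Sequence[int]
) -> List[int]:
    """Return indices of examples whose recorded label looks doubtful.

    probs[i][k] is a model's (assumed, precomputed) probability that
    example i belongs to class k; labels[i] is the recorded label.
    """
    n_classes = len(probs[0])

    # Per-class threshold: the average probability the model assigns to
    # class k over the examples that are recorded as class k.
    thresholds = []
    for k in range(n_classes):
        conf = [p[k] for p, y in zip(probs, labels) if y == k]
        thresholds.append(sum(conf) / len(conf) if conf else 1.0)

    suspects = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        predicted = max(range(n_classes), key=lambda k: p[k])
        # Flag only when the model prefers another class AND is at least
        # as confident as it typically is for that class.
        if predicted != y and p[predicted] >= thresholds[predicted]:
            suspects.append(i)
    return suspects


# Toy data (hypothetical): class 0 = negative review, class 1 = positive.
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.95, 0.05]]
labels = [0, 0, 1, 1, 1]  # the last review is recorded as positive

print(flag_suspect_labels(probs, labels))  # [4]
```

Flagged items are only candidates. As with any automated check of this kind, a person would still need to review them before treating them as real errors.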
It should be added that some of the errors can be considered insignificant, and in some cases it is more accurate to speak of ambiguity in the input data. However, in one of the datasets, the QuickDraw test, about 10% of the data contains errors. What AI can learn from such data is anyone’s guess.