Yahoo releases machine learning dataset
Yahoo has released a dataset based on anonymized user interactions and various Yahoo feeds, intended for research into artificial intelligence and machine learning.
The dataset, which Yahoo calls the Yahoo News Feed Dataset, contains approximately 110 billion events and amounts to 13.5 TB of data. Yahoo collected the interactions of approximately 20 million users between February and May 2015. The interactions come from the Yahoo homepage, News, Sports, Finance, Movies and Real Estate.
The set is available as part of the Yahoo Labs Webscope data sharing program. Webscope is a library of anonymized data for scientific research. The anonymized user data is categorized by age, gender and geographic information. The set also covers the items themselves, including the title, summary and key phrases of the news articles. For part of the data, it also shows on which device the items were viewed.
By releasing the set, Yahoo Labs hopes that the machine learning community and data scientists will put the data to good use to validate models with “real world data sets.” Yahoo Labs also hopes the set can become a benchmark for large systems.