2023 Kaggle AI Report

What we have learned about Machine Learning and Data Science from Kaggle over the past two years

Oct 10, 2023

Kaggle is the world’s largest Data Science and Machine Learning community. It has grown from a site primarily focused on hosting ML competitions into a one-stop shop for almost everything you need to jumpstart your Data Science career: courses and educational materials, online discussions, computational resources (including free GPU and TPU compute), one of the largest collections of freely available high-quality datasets, and a place to host some of the best pretrained deep learning models. When I first came across Kaggle many years ago I immediately grasped its potential for my own career development, and I feel blessed to have been able to take full advantage of everything the site has to offer.

One aspect of Kaggle that I feel is still under-appreciated and underutilized is its value in pushing the boundaries of applied research in Machine Learning and Data Science. IMHO, the insights in most Kaggle solutions are years ahead of what the research community and current cutting-edge industry work come up with. Every few weeks, it seems, some big AI startup announces a new SOTA model based on a trick that Kaggle practitioners have been using for years (yes, we know about pseudolabeling). Nonetheless, many good research papers have come out of Kaggle competitions over the years, and I count myself fortunate to have been part of a few of them.

This year Kaggle organized a new and unique challenge: an essay competition, held a few months ago, in which participants were tasked with writing a report on the most important insights from the past two years of contributions on the Kaggle platform. The essays were divided into seven categories: Generative AI, Text Data, Image/Video Data, Tabular/Time Series Data, Kaggle Competitions, AI Ethics, and Other. Essays were initially scored by other members of the Kaggle community, while the final scoring was done by the seven area chairs. It was my privilege and honor to be the area chair for the Tabular/Time Series category, and to contribute my thoughts and evaluations on that category to the final report.

Machine Learning on tabular data has its own set of quirks and challenges. For many years, from the very start of Kaggle, that form of ML was the norm for most Kaggle competitions, and it was what most people had in mind when thinking of Kaggle. In recent years Kaggle has largely moved away from tabular data ML, but it was nonetheless gratifying to see that even now Kaggle remains a place where you can find most of the important insights into what makes this rich and challenging subfield of Machine Learning so special.

You can find the pdf version of the 2023 Kaggle AI report here.

The slides for the report can be found here.

Bojan’s Newsletter
