Why Criticism of Kaggle Often Misses the Point
Ok, here we go, I'll stick my head above the parapet. There's a debate that has been gathering pace for a while now: a backlash against Kaggle from those arguing that it isn't worth doing, that we should perhaps not hold winners in such high esteem, and that experience gained in competitions will not transfer to real life. Some of the more common criticisms include:
- You can only do well in Kaggle if you have access to expensive hardware
- Kaggle favours overfitting and finding leaks
- Kaggle is not representative of actually working in data science
These points do have some basis in truth; they have been correct on occasion, but I believe they are ultimately either disproportionately critical or miss the point of Kaggle.
Some competitions have been won by those training on GPU clusters for days, and we have seen leaks and hacks affect the legitimacy of final results. However, these have always been exceptions rather than the general experience of most Kagglers, and have become less frequent as the platform has matured.
Many competitions have been won by people with modest setups, and Kaggle now hosts many competitions that must be run in Kaggle notebooks under fixed limits, such as training time, levelling the playing field for competitors. In addition, leaks and hacks are often called out by the community, allowing organisers to make modifications to deal with these issues.
It's the point about Kaggle not being reflective of working in data science, however, that I want to discuss further, and explain why I believe that criticism misses the point of Kaggle.
To do this we’ll break this point down into two themes:
1. Kaggle Doesn’t Represent the Full Life Cycle of a Project
Again, there is some truth to this argument. Kaggle competitions don't really touch several aspects of a data science project.
Doing a Kaggle competition won't teach you how to analyse a business to see where a model could increase revenue, give you practice dealing with stakeholders, or show you how to identify and gather the appropriate data. It also won't help you develop best practices for deploying models in production. All of these are important and sometimes undervalued parts of a project.
However, this isn't really the point of Kaggle, and I don't think most Kagglers would argue that doing well on the platform means you suddenly have all the skills necessary to cover every aspect of working in business or research.
What Kaggle does provide is the chance to solve real machine learning problems that require detailed working through exploratory data analysis, feature engineering, model training and selection, ensembling and parameter tuning.
It will give you exposure to problems across a wide range of domains, from classic problems involving tabular data from finance, advertising and retail to finding innovative solutions to problems involving text, image and audio data. In addition, the community is very transparent, with code and analysis frequently shared during competitions, giving you the opportunity to learn from the work of world-class data scientists.
Could working in a data science job or on personal projects give you this same level of exposure?
Kaggle competitions also reinforce the importance of a solid cross-validation strategy. Anyone who has been burned by a leaderboard shake-up will forever understand the importance of never taking this step lightly.
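To make the cross-validation point concrete, here is a minimal, hypothetical sketch of the kind of stratified k-fold split many Kagglers build their validation around. It is written in plain Python (no scikit-learn) purely for illustration; the function name and toy labels are my own assumptions, not from the original article.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs where each validation fold
    preserves the overall class balance. Hypothetical minimal sketch."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_splits)]
    # deal each class's indices round-robin so every fold gets its share
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % n_splits].append(i)
    for k in range(n_splits):
        val = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, val

# toy imbalanced labels: 80% class 0, 20% class 1
labels = [0] * 80 + [1] * 20
for train, val in stratified_kfold(labels, n_splits=5):
    # every 20-item validation fold keeps the 4:1 class ratio
    assert len(val) == 20 and sum(labels[i] for i in val) == 4
```

The design choice this illustrates is the one a leaderboard shake-up punishes you for skipping: a random holdout on imbalanced data can leave validation folds that look nothing like the test distribution, while stratification keeps each fold's class ratio honest.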
I would also argue that finding and calling out leaks and hacks often requires a depth of technical understanding and analysis beyond that of the average data scientist. How valuable to a business is someone with the attention to detail and data exploration skills to discover flaws in data that most miss?
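As a hypothetical sketch of the kind of flaw-hunting described above, the check below flags features whose values map one-to-one onto the target, a common fingerprint of target leakage. The function name and toy rows are assumptions of mine for illustration only, not a method from the article.

```python
def find_suspect_features(rows, target_key):
    """Flag features whose value maps one-to-one onto the target --
    a common sign of target leakage. Hypothetical minimal sketch."""
    keys = [k for k in rows[0] if k != target_key]
    suspects = []
    for k in keys:
        seen = {}
        leaky = True
        for row in rows:
            value, target = row[k], row[target_key]
            # if the same feature value ever carries two different
            # targets, the feature cannot perfectly encode the label
            if seen.setdefault(value, target) != target:
                leaky = False
                break
        if leaky:
            suspects.append(k)
    return suspects

# toy data: "session_id" is unique per row, so it trivially (and
# suspiciously) determines the label, while "spend" does not
rows = [
    {"session_id": "a1", "spend": 10, "label": 1},
    {"session_id": "a2", "spend": 12, "label": 1},
    {"session_id": "b1", "spend": 11, "label": 0},
    {"session_id": "b2", "spend": 10, "label": 0},
]
print(find_suspect_features(rows, "label"))  # → ['session_id']
```

High-cardinality identifiers that perfectly predict the target are exactly the sort of thing a careless pipeline silently exploits and a careful competitor calls out on the forums.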
2. To Do Well on Kaggle You Build Solutions That Are Over-engineered
Once again, this argument has some basis in truth. Any of the top 20% of solutions in some competitions would probably be good enough to generate the value many businesses need. In the real world, additional time spent tweaking a model might be better spent on other parts of the data science process.
To start with, I would challenge the assumption underlying this argument: that every domain derives the same value from the same level of model performance. In reality, the diminishing returns from squeezing extra accuracy out of a model in retail or manufacturing might be hugely valuable in finance or medicine.
Ultimately, I believe that Kaggle is about pushing the envelope of data science, both by breaking through previous limitations and by applying these methods to new problems and domains for the first time. Image classification competitions hosted on Kaggle, for example, have played a huge part in the evolution of that field, and we now see competitors solving problems as diverse as estimating NFL player actions and detecting deepfake videos.
When using data science in a commercial environment we obviously want to be efficient and balance the trade-off between development time and the value returned, but we should also try to push things forward and be innovative, so that the next generation of tools and solutions transcends what we have today. In technology, what is cutting edge today is usually the status quo of tomorrow, and Kaggle can play a significant role in pushing what can be done with data science.
Kaggle isn't perfect. There are many important and often undervalued data science skills that you will not get exposure to while doing competitions, and Kaggle solutions will at times be overkill. However, I believe this criticism misses the point of what makes Kaggle such a great platform.
What you learn from Kaggle competitions is only part of the data science puzzle, and while you won't necessarily be a complete data scientist after doing well on the platform, you will likely have developed an above-average competency in many aspects of machine learning and data analysis.
Kaggle, in my opinion, is still very hard to beat for developing those kinds of technical skills.