One of the exciting things about Web 2.0 is the many ways in which it can cut through the rigidities and plain dysfunctional aspects of existing institutions. In this post on the Kaggle website, Anthony Goldbloom draws attention to the many ways in which Web 2.0 ‘marketplaces’ can ‘turbocharge’ the way scientific work gets done and communicated. Pointing out how Kaggle managed to get to, and slightly beyond, the frontier of the existing literature within a week and a half, Anthony goes on to discuss how this can be. (The resulting text below includes my edits to and interpolations into Anthony’s text – his original is on the Kaggle website).
Scientific literature tends to evolve slowly (somebody writes a paper, somebody else tweaks that paper, and so on). Each step follows the last, with months or even years interposed between them. A competition inspires rapid innovation by introducing the problem to a wide audience. There are an infinite number of approaches that can be applied to any modelling task and it is impossible to know at the outset which technique will be most effective. By exposing a problem to a wide audience, competitions expose it to a wide range of techniques. This maximises the chances of finding a solution, and gets the most out of any particular dataset – given its inherent noise and richness. Not only that, but competitions generate a lot of discussion – and even collaboration – between people who are notionally competing.
Competitions help correct a coordination problem in the wider research community. Data is being collected in greater volumes and at greater speeds than ever before – think of the Human Genome Project, high-resolution camera-clad telescopes and any number of other projects. Yet how do those collecting the data work out how to analyse it? Often they’re restricted to in-house knowledge, talent and bandwidth. A single researcher, or even a research unit, is unlikely to know the most advanced machine learning, statistical and other techniques that would allow them to get the most out of their datasets. At the same time, many data mining and statistics researchers find it difficult to access real-world datasets, and develop their techniques on whatever data they have access to.
Kaggle addresses this coordination problem. Data-rich researchers can post their datasets and have them scrutinised by analytics-rich researchers. This gives data-rich researchers access to cutting-edge techniques and analytics-rich researchers access to new datasets and current problems.
Data modelling competitions facilitate real-time science. Consider this week’s announcement of the discovery of genetic markers that correlate with extreme longevity. Work on the study began in 1995, with results published in 2010. Had the study been run as a data modelling competition, the results would have been generated in real time and the insights available much sooner (and with a higher level of precision).
Data modelling competitions also benchmark, in real time, new techniques against old ones, turbocharging the process by which the new and better drives out the old and obsolescent. This helps to avoid situations in which a valuable technique is overlooked by the scientific establishment. This aspect of the case for competitions is best illustrated by Ruslan Salakhutdinov, now a postdoctoral fellow at the Massachusetts Institute of Technology, who had a new algorithm rejected by the NIPS conference. According to Ruslan, the reviewer ‘basically said “it’s junk and I am very confident it’s junk”’. It later turned out that his algorithm was good enough to make him an early leader in the Netflix Prize, finishing 135th overall – a remarkable achievement when you consider that many of the top teams used ensemble models, making his one of the better-performing single algorithms.
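The remark about ensembles is worth a moment’s unpacking: blending the predictions of many models tends to cancel their individual biases and noise, which is why a lone algorithm placing that high is so striking. Here is a minimal sketch – with invented numbers, nothing drawn from the Netflix data itself – showing averaged predictions beating each contributing model:

```python
import random

random.seed(0)

true_value = 3.5  # the rating we are trying to predict

def noisy_model(bias, noise, n=1000):
    """Simulate one model's predictions: truth plus its own bias plus noise."""
    return [true_value + bias + random.gauss(0, noise) for _ in range(n)]

# Three single models, each with a different systematic bias
predictions = [
    noisy_model(+0.4, 0.5),
    noisy_model(-0.3, 0.5),
    noisy_model(+0.1, 0.5),
]

def rmse(preds):
    """Root mean squared error against the true value."""
    return (sum((p - true_value) ** 2 for p in preds) / len(preds)) ** 0.5

# Ensemble: average the three models' predictions case by case
ensemble = [sum(ps) / len(ps) for ps in zip(*predictions)]

for i, preds in enumerate(predictions, 1):
    print(f"model {i} RMSE: {rmse(preds):.3f}")
print(f"ensemble RMSE: {rmse(ensemble):.3f}")
```

The biases partly cancel and the independent noise shrinks when averaged, so the blend scores a lower error than any single model – which is why a solo algorithm holding its own against blended teams was such a notable result.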
That example also brings out the human element. The peer reviewer could have just rejected the modelling and explained his reasons for doing so, or (more honestly) declined to play the expert and said “this uses a technique I don’t fully understand and so I’m not qualified to judge”. But he didn’t say that, did he? He said that something highly creditworthy was “junk”. Competitions can help cut through our unfortunate habit of experts all being on their own panel of experts, big-noting themselves. These are the kinds of psychological drivers that Philip E. Tetlock so devastatingly anatomises in his exposés of expertise.
Data modelling competitions are also a great interface between academia and industry. There is generally a long lag before new techniques are adopted by industry. Data modelling competitions can help close the gap by bringing commercial problems directly to the attention of the world’s best researchers and their cutting-edge techniques.