Life In The Data Salt Mines

No, I am not a book or article reviewer, but this time I cannot resist.

You know that I am a denizen of Wilmott.com. Two other denizens, Thijs van den Berg ("outrun" at Wilmott.com) and Dana Meyer ("Traden4Alpha" at Wilmott.com), documented their quest to win a data mining contest: estimating the probability of a binary response variable given 3,751 training samples described on 1,776 dimensions, scored by a log-loss metric. Their account appears in the latest issue of Wilmott magazine - unfortunately there is no online version to link to yet.

A typical high-dimension, low-sample-count problem (from the pharmaceutical industry) - one of the Kaggle competitions. The contest's log-loss score metric provided a pay-off structure also used in strategic investment decisions.
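For readers unfamiliar with the scoring rule, here is a minimal sketch of the binary log-loss metric; the function name and clipping constant are mine, not from the contest or the article:

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Average binary log-loss: lower is better.

    Predicted probabilities are clipped to (eps, 1 - eps)
    so that a confident wrong answer is penalized heavily
    but never yields log(0).
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip away 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# An uninformative constant 0.5 prediction scores ln(2) ~ 0.693;
# beating that baseline is the first hurdle in such a contest.
print(log_loss([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))
```

Note the asymmetry this creates: a model that is confidently wrong on a few samples can score worse than one that hedges everywhere, which shapes strategy much like a pay-off structure in investment decisions.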

The result: they ranked 420th among 702 teams. I know both from Wilmott, where they demonstrate deep knowledge and original thinking (Dana is a senior partner at WorkingKnowledge, Thijs the managing director of sitmo) ... not surprisingly, they were a little shocked.

BTW, I'll meet them in Amsterdam soon - celebrating our group win at the Dutch Science Quiz of 2012.

But back to the article. It is written as a kind of journey report along the timeline of their problem-solving effort, a tale of knowledge gains and log-loss.

And this is the most exciting part for me. I summarize:

Wheels aren't hard to reinvent, are they?

Being two smart thinkers, they looked for adventure and fun, not for ready-made solutions. So they eschewed all the algorithmic fruits of the internet ... they started thinking, not reading.

(Lesson learned: this intentional choice increased learning but reduced the score.)

Code is the core

After having played with Excel, they wrote the first block of code, in R, just 10 days before the delivery date.

(Lesson learned: tool choice is not easy; if you are late, you compromise.)
I add: in such an effort you should try evolutionary prototyping on a feature-rich platform - see also the next point.

You need cross-model validation

You have a new idea, turn the crank on a few more samples and, shockingly, the new model works badly on other partitions ...

(Lesson learned: don't judge a method by a few data samples; get something working early, because you need to cross-validate.)
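The lesson above can be sketched as plain k-fold cross-validation; the function names and the mean-predictor baseline are illustrative, not taken from the article:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(fit, score, X, y, k=5):
    """For each fold: train on the other k-1 folds, score on the held-out one.

    Returning all k scores (rather than one average) exposes exactly the
    problem described above: a model that looks great on one partition
    and falls apart on the others.
    """
    folds = k_fold_indices(len(X), k)
    scores = []
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([X[j] for j in train], [y[j] for j in train])
        scores.append(score(model, [X[j] for j in test], [y[j] for j in test]))
    return scores
```

A high variance across the returned scores is the warning sign: the model is reacting to the particular samples it saw, not to the underlying structure.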

There's always more to do

After the deadline and the result, they discussed a vast variety of new ideas and approaches, now knowing that most of the top-ranked contestants used decision-tree methods. But they found that estimating the properties of jittery molecules has much in common with estimating the properties of jittery financial markets: both seek to understand a complex stochastic system from the limited data available - say, predicting real-time bond prices.
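To make "decision-tree methods" concrete, here is the smallest possible tree, a one-split decision stump, written from scratch; the toy data and function names are mine, purely for illustration, and real contest entries would use ensembles of much deeper trees:

```python
def fit_stump(X, y):
    """Search every (feature, threshold) pair for the single split
    that misclassifies the fewest binary labels, labelling each side
    by majority vote."""
    best_err, best = len(y) + 1, None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            for l_lab, r_lab in ((0, 1), (1, 0)):
                err = left.count(1 - l_lab) + right.count(1 - r_lab)
                if err < best_err:
                    best_err, best = err, (f, t, l_lab, r_lab)
    return best

def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab

# Toy data: the label is 1 exactly when the second feature exceeds 0.5.
X = [[0.1, 0.2], [0.4, 0.9], [0.8, 0.1], [0.3, 0.7]]
y = [0, 1, 0, 1]
stump = fit_stump(X, y)
print([predict_stump(stump, row) for row in X])  # -> [0, 1, 0, 1]
```

Boosted or bagged collections of such trees handle the high-dimension, low-sample regime well because each split considers only one feature at a time, which is presumably why they dominated the leaderboard.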

I really enjoyed reading this article. It is an exciting example of constructive learning.

(The big lesson learned: a possible methodology transfer to financial markets)

Amazing!