Wednesday 27 November 2013

Did the kaggle Titanic competition with vw. I just threw in all the features: the name as a bag of words, and the rest as attribute==X tokens. One pass gave the best result (111 on the leaderboard); 2, 5, and 10 passes were progressively worse, but only slightly (all landed at 172 on the leaderboard).
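
Roughly the kind of vw line I mean, as a minimal Python sketch - the column names, namespace letters and label convention here are just illustrative, not exactly what I submitted:

```python
# Minimal sketch of turning a Titanic-style row into a vw example line.
# Column names, namespace letters and the -1/+1 label convention are illustrative.
def to_vw_line(row):
    label = '1' if row['Survived'] == '1' else '-1'    # -1/+1 for logistic loss
    name_words = row['Name'].replace(',', ' ').replace('.', ' ').lower().split()
    name_ns = 'n ' + ' '.join(name_words)              # name as raw bag of words
    cat_ns = 'c ' + ' '.join('%s==%s' % (k, row[k])    # everything else as attribute==X
                             for k in ('Pclass', 'Sex', 'Embarked') if row[k])
    return '%s |%s |%s' % (label, name_ns, cat_ns)

row = {'Survived': '1', 'Name': 'Braund, Mr. Owen Harris',
       'Pclass': '3', 'Sex': 'male', 'Embarked': 'S'}
print(to_vw_line(row))
# 1 |n braund mr owen harris |c Pclass==3 Sex==male Embarked==S
```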

My new github where I publish kaggle code

https://github.com/umarnawaz

Tuesday 26 November 2013

Entered a few Kaggle Competitions

I entered a few Kaggle competitions for fun and so I could put them on my resume.

On all of them I scored somewhat below the middle of the leaderboard.

I did very little feature engineering. I simply loaded the data into postgres (except for digit recognition, which I just formatted for vw with python), output it to text files in vw format, and used some shell scripts to unix paste the columns together and run vw on them. VW automatically treats text as a bag of words, so it gave reasonable results as is. Overfitting is real - running 1000 passes appeared to give better results at the vw console but gave worse results on the kaggle submission.
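
The vw half of that pipeline boils down to a couple of commands. Here is a rough Python sketch of the train/predict loop over different pass counts - the file names, loss function and pass counts are illustrative, not exactly what I ran:

```python
# Rough sketch of the vw train/predict step (file names, flags and pass
# counts are illustrative). --passes > 1 needs a cache file (-c); the
# progressive loss on the console kept improving with more passes even
# when the kaggle submission got worse.
import subprocess

def run(cmd):
    print(' '.join(cmd))
    subprocess.check_call(cmd)

for passes in (1, 10, 1000):
    model = 'model_%d.vw' % passes
    run(['vw', '-d', 'train.vw', '-c', '-k', '--passes', str(passes),
         '--loss_function', 'logistic', '-f', model])
    run(['vw', '-d', 'test.vw', '-t', '-i', model,
         '-p', 'preds_%d.txt' % passes])
```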

For see click predict I first did multiclass --oaa and did some feature engineering on the timestamps in postgres (date_trunc and date_part) and ceilinged the lat/longs. The competition had only 4 days left and it was my first one, so some time was spent learning the kaggle site and the (log-based) evaluation metric. I then redid it as regression on log-transformed outputs, but got worse results than the multiclass approach.
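
The kind of feature engineering I mean, sketched in Python rather than the postgres SQL I actually ran - the function names, the ceil granularity and the log(1+y) choice are illustrative:

```python
# Sketch (Python, not the actual postgres SQL) of the kind of features used:
# coarse date parts from the timestamp, ceilinged lat/long, and a
# log-transformed target for the log-based evaluation metric.
import math
from datetime import datetime

def date_and_geo_features(created_time, lat, lon):
    t = datetime.strptime(created_time, '%Y-%m-%d %H:%M:%S')
    return {
        'year': t.year, 'month': t.month,   # like date_part in postgres
        'dow': t.weekday(), 'hour': t.hour,
        'lat_cell': math.ceil(lat),         # "ceilinged" lat/long
        'lon_cell': math.ceil(lon),
    }

def log_target(y):
    return math.log(1 + y)   # train on log(1+y), invert with exp(pred) - 1

print(date_and_geo_features('2013-04-25 18:30:00', 41.9067, -87.6369))
print(log_target(3))
```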

The competitions: Partly Sunny with a Chance of Hashtags, digit recognition, and See Click Predict Fix.

Monday 18 November 2013

A pretty good vowpal wabbit tutorial

Someone wrote a pretty good vowpal wabbit tutorial: http://spiderspace.wordpress.com/2013/08/22/vowpal-wabbit-tutorial-for-the-uninitiated/

Vowpal Wabbit on MNIST

Running vowpal wabbit on kaggle's digit recognition challenge with 1-100 passes and logistic loss gives 88-90%. Using quadratic and cubic features appears to give no improvement. There was a thread on kaggle saying 1000 passes and quadratic features gives 97%, so I'll be checking that out later. I'll also try vw's lda and see if that gives an improvement.
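
For reference, here's roughly how I format the digit data and where the quadratic/cubic flags come in - a sketch only, the namespace letter, scaling and pass counts are illustrative:

```python
# Sketch of the vw formatting for the digit data: label shifted to 1-10 for
# --oaa, nonzero pixels as index:value features in one namespace 'p' so that
# -q pp (quadratic) and --cubic ppp can cross pixels with pixels.
def digit_to_vw(label, pixels):
    feats = ' '.join('%d:%g' % (i, v / 255.0) for i, v in enumerate(pixels) if v)
    return '%d |p %s' % (label + 1, feats)

# train, roughly:
#   vw -d train.vw -c -k --passes 100 --oaa 10 --loss_function logistic -f digits.vw
# add interactions with:  -q pp              (quadratic)
#                         -q pp --cubic ppp  (cubic)
```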

Saturday 16 November 2013

Competitions Kaggle should run

A list of competitions kaggle should run. They could be financed by crowdfunding (on Kickstarter or Indiegogo).

Wikipedia - given some number of words, 100 let's say, predict the title

Jeopardy - same as IBM Watson ran

Image recognition - use something like Kaggle or 1,000,000 categories of images from Flickr, predict the category

Conversation Turing test - take Wikipedia articles or chat conversations, chop them up, and predict the next word or the remainder of the sentence.

Law - given text from both sides of a trial, predict the verdict




Thursday 14 November 2013

How I approach machine learning

I have a simplified conception of machine learning, organized around a few basic algorithms:

decision trees and ensembles of them (boosting/random forests) - I haven't looked much into these but might test them in the future (probably just use waffles)

perceptron (linear regression etc.) - the loss is convex, so there is only one global minimum and the derivative is useful. Just vowpal wabbit for it (get as much data and engineer as many features as possible and let vw sort it out)

neural nets - perceptrons wired together - many local minima, so the derivative may or may not be useful. Academic researchers try out different optimizers (sgd is the most common). I would think simulated annealing would do well on them. I tried vw's nnet but it took too long to run and gave poor results (researchers use gpus to get performance).

autoencoders, topic models - unsupervised learning. I just use gensim's implementations. It has tfidf, lsi, rp, lda, and hdp. I've only tested tfidf, lsi, and rp, and might only use rp in the future (see the sketch after this list).

naive bayes - count stuff up.
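
For the gensim side, usage is roughly this - a minimal sketch on toy documents, where the model classes are gensim's but the documents and number of topics are made up:

```python
# Minimal gensim sketch on toy documents: tfidf, then lsi and random
# projections on top of it. Corpus and num_topics are illustrative.
from gensim import corpora, models

texts = [['cat', 'sat', 'mat'], ['dog', 'sat', 'log'], ['cat', 'dog', 'play']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

tfidf = models.TfidfModel(corpus)
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)
rp = models.RpModel(tfidf[corpus], num_topics=2)

for vec in lsi[tfidf[corpus]]:
    print(vec)
```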

For my recommender system projects I plan on sticking to just the vector space models (gensim), the perceptron (vw), and naive bayes (probably done in sql or awk). I don't have much computing power, so things like nnets are too much for now.
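
And naive bayes as "count stuff up", sketched in Python - the real version would probably be GROUP BY counts in sql or an awk script, and the example data here is made up:

```python
# Naive bayes as pure counting (Python sketch; the plan above is sql or awk).
# Counts per label and per (label, feature), with +1 Laplace smoothing.
from collections import defaultdict
import math

def train(rows):  # rows: (label, [feature, ...])
    label_counts = defaultdict(int)
    feat_counts = defaultdict(lambda: defaultdict(int))
    for label, feats in rows:
        label_counts[label] += 1
        for f in feats:
            feat_counts[label][f] += 1
    return label_counts, feat_counts

def predict(label_counts, feat_counts, feats):
    total = sum(label_counts.values())
    vocab = {f for counts in feat_counts.values() for f in counts}
    best, best_score = None, float('-inf')
    for label, n in label_counts.items():
        score = math.log(n / float(total))                 # log prior
        denom = sum(feat_counts[label].values()) + len(vocab)
        for f in feats:
            score += math.log((feat_counts[label][f] + 1.0) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

rows = [('spam', ['buy', 'now']), ('ham', ['meeting', 'now'])]
print(predict(*train(rows), feats=['buy']))   # prints 'spam'
```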