Wednesday, July 11, 2012

Survival of the Fittest - Which variable in your analysis passes the test?


One of the greatest scientific discoveries ever made is evolutionary theory, or survival of the fittest. I have always been a complete believer in it and have, intentionally or unintentionally, used it in many different ways. But before I write down my own thoughts on it, let’s first take a quick look at what survival of the fittest actually means.

"Survival of the fittest" is a phrase originating in evolutionary theory, as an alternative description of natural selection. An interpretation of the phrase "survival of the fittest" to mean "only the fittest organisms will prevail"

I have used this phrase many times to describe anyone living in Mumbai, India. Mumbai, one of the largest cities in the world and the economic capital of India, receives thousands of people from different parts of the country, and of course the world, who come to test their skills and luck. Everyone who walks the streets of this city has one common dream: making it big, bigger than anyone else. Yet not all, or should I say very few, make any significant mark. Most settle into the routine of life, and the rest head back home or to a different city. But what makes some go back and some stay on?

I have often observed that those who do not survive in this city of gold are uncomfortable with the crush of people, the fast-paced life and, above all, the Mumbai local trains. In my view, Mumbai's local trains are the best test of survival of the fittest: those who successfully overcome them, or even enjoy them, usually survive.

But why am I writing about survival of the fittest on an analytics blog? Well, as I said earlier, I use this phrase in many different ways. If the Mumbai local trains test who is fit enough to survive in the city, they also reveal the weak links, the ones who exit early. That realization points to how we should approach the analysis of data.

Most analysts working with big data always keep in mind the end objective, or the larger requirements of the analysis, aiming to build that perfect model everyone wants and yet no one has built. As an analyst, I confess that I too try to build that perfect one every time I start building a model.

In the process of developing an analytical model, we come across many experts who describe it as a tough job, or as a process of multiple iterations before the desired results appear. But what changes between the first trial and the last one? Most of the time the answer is the weak links in the data or in the model-building process: too many missing values in a variable left untreated or treated with a generic method, variable distributions not checked, binning not done optimally, sampling error, policy changes not incorporated; the list is a long one.
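To make this concrete, here is a minimal sketch of how such weak links might be flagged before any modelling begins. It assumes the data sits in a pandas DataFrame, and the missing-value and skewness thresholds are purely illustrative, not hard rules:

```python
import pandas as pd

def flag_weak_links(df, missing_threshold=0.3, skew_threshold=2.0):
    """Flag variables that look like weak links: heavy missingness
    or strongly skewed distributions (numeric columns only for skew)."""
    report = []
    for col in df.columns:
        missing_share = df[col].isna().mean()
        skew = df[col].skew() if pd.api.types.is_numeric_dtype(df[col]) else None
        report.append({
            "variable": col,
            "missing_share": round(missing_share, 3),
            "skew": None if skew is None else round(skew, 2),
            "weak_link": missing_share > missing_threshold
                         or (skew is not None and abs(skew) > skew_threshold),
        })
    return pd.DataFrame(report)

# Hypothetical usage on a modelling dataset:
# weak_links = flag_weak_links(pd.read_csv("model_data.csv"))
# print(weak_links[weak_links["weak_link"]])
```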

So what am I proposing here? Next time, before aiming at the end objective of your analysis, first break the task into its weak links and kill them first; they will never survive for long, but if included in the model they will also not let your model survive for long. Work out what data and method you have to follow to get the results. Check the basic requirements for the analysis in your data properly before you start analyzing it: the data sanity checks. Check distributions and derived variables, impute missing values, and look at clustering, factoring, multicollinearity, sampling errors and optimal binning, all while incorporating business knowledge.
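As one example of these sanity checks, a quick multicollinearity screen can be sketched as below. The column names are hypothetical, and the usual VIF cut-off of around 5-10 is only a rule of thumb:

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """Variance inflation factors from the diagonal of the inverse
    correlation matrix (equal to 1 / (1 - R^2) for each predictor)."""
    corr = df.corr().values
    vifs = np.diag(np.linalg.inv(corr))
    return pd.Series(vifs, index=df.columns, name="VIF").sort_values(ascending=False)

# Hypothetical usage: pass numeric candidate predictors only
# predictors = model_data[["income", "age", "balance", "tenure"]]
# print(vif_table(predictors))   # values above ~5-10 suggest multicollinearity
```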

The next step is to identify the modelling technique to be used. The choice depends on the distribution of the dependent variable and the kind of output required. If the dependent variable is binary, with only two possible outcomes, say win or lose, and the desired output is the probability of winning or losing, logistic regression or probit regression suits well. But if the desired output is instead a classification of the data into the two categories, a decision tree is a good choice.
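A rough scikit-learn sketch of this choice might look like the following; the synthetic dataset is only a stand-in for data that has already passed the sanity checks above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a prepared win/lose dataset (1 = win, 0 = lose)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Desired output is a probability of winning -> logistic regression
logit = LogisticRegression().fit(X_train, y_train)
win_probability = logit.predict_proba(X_test)[:, 1]   # P(win) for each record

# Desired output is a hard win/lose label -> decision tree
tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)
win_or_lose = tree.predict(X_test)                     # 0/1 class labels
```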

Once the model is built with acceptable model statistics (R-square, p-value, chi-square, etc.), the next step is to validate the model on a different dataset to check whether it survives variations in the independent variables. Again, many analysts wait for multiple statistics to fail or go insignificant before deciding that the model is not working. My view is that a fit model will survive all the tests; it will be the fittest to survive, and if it is not, check which of the above checks you missed.
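A minimal validation sketch along these lines, again on synthetic data and with the stability threshold chosen purely for illustration, could be:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in: a "development" sample used to build the model and a
# hold-out sample used only for validation.
X, y = make_classification(n_samples=2000, n_features=10, random_state=7)
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.4, random_state=7)

model = LogisticRegression().fit(X_dev, y_dev)

auc_dev = roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1])
auc_val = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# A fit model should hold up on the validation data; a large drop in AUC
# (the 10% figure is only an illustrative rule of thumb) suggests one of
# the earlier checks was missed.
if auc_val < 0.9 * auc_dev:
    print("Performance degrades on validation data - revisit the checks above.")
print(f"Development AUC: {auc_dev:.3f}, Validation AUC: {auc_val:.3f}")
```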

#Happymodeling

