One of the greatest scientific discoveries ever made is the theory of evolution, often summed up as "survival of the fittest". I have always been a firm believer in it and, intentionally or not, have applied it in many different ways. But before I write down my own thoughts on it, let's first take a quick look at what "survival of the fittest" means.
"Survival of the fittest" is a phrase originating in evolutionary theory, as an alternative description of natural selection. One interpretation takes the phrase to mean that "only the fittest organisms will prevail".
I have used this phrase many times to describe anyone living in Mumbai, India. Mumbai, one of the largest cities in the world and the economic capital of India, receives thousands of people from different parts of the country, and of course the world, who come to test their skills and luck. Everyone who walks the streets of this city shares one dream: making it big, bigger than anyone else. Yet very few make any significant mark. Most settle into the routine of life, and the rest head back home or move to a different city. But what makes some go back and some stay on?
I have often observed that those who do not survive in this city of gold are uncomfortable with the crowds, the fast-paced life and, above all, Mumbai's local trains. In my view, Mumbai's local trains are the best test of survival of the fittest: those who overcome them, or even enjoy them, often survive.
But why am I writing about survival of the fittest on an analytics blog? Well, as I said earlier, I use this phrase in many different ways. If Mumbai's local trains test who is fit enough to survive in the city, they also reveal the weak links, the ones who exit early. That knowledge shapes how we should approach the analysis of data.
Most analysts working with big data keep in mind the end objective or the larger requirements of the analysis, aiming to build that perfect model everyone wants and no one has yet built. As an analyst, I confess that I too try to build that perfect model every time I start.
In analytical model development, we come across many experts who describe the process as a tough job, or as one that takes multiple iterations to reach the desired results. But what changes between the first trial and the last one? Most of the time the answer is the weak links in the data or in the model-building process: too many missing values in a variable, left untreated or treated with a generic method; variable distributions not checked; binning not done optimally; sampling error; policy changes not incorporated. The list is a long one.
So what am I proposing here? Next time, before aiming at the end objective of your analysis, first break the task into its weak links and kill them first. They will never survive for long, and if included in the model they will not let your model survive long either. Work out what data and method you need to get the results. Before you start analyzing, check that your data meets the basic requirements of the analysis: the data sanity checks. Check distributions and derived variables, impute missing values, consider clustering and factoring, test for multicollinearity and sampling errors, optimize bins and incorporate business knowledge.
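To make two of these sanity checks concrete, here is a minimal Python sketch of spotting variables with too many missing values and imputing the rest. The function names, the toy rows and the 30% cut-off are illustrative assumptions, not a standard; a real project would pick the cut-off and the imputation method per variable.

```python
from statistics import median

def missing_rate(rows, column):
    """Fraction of rows in which `column` is missing (None or absent)."""
    values = [row.get(column) for row in rows]
    return sum(v is None for v in values) / len(values)

def find_weak_links(rows, columns, max_missing=0.3):
    """Columns whose missing rate exceeds `max_missing` (an assumed cut-off)."""
    return [c for c in columns if missing_rate(rows, c) > max_missing]

def impute_median(rows, column):
    """Copy of `rows` with missing values in `column` replaced by the observed median."""
    observed = [row[column] for row in rows if row.get(column) is not None]
    fill = median(observed)
    return [
        {**row, column: fill if row.get(column) is None else row[column]}
        for row in rows
    ]

# Toy data, made up for this example.
rows = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 31, "income": None},
    {"age": 40, "income": None},
    {"age": 35, "income": None},
]

weak = find_weak_links(rows, ["age", "income"])  # income is 60% missing
cleaned = impute_median(rows, "age")             # age is only 20% missing
print("Weak links:", weak)
print("Imputed age for row 2:", cleaned[1]["age"])
```

Here "income" is flagged as a weak link to investigate before modelling, while "age" has few enough gaps that a simple median fill may be acceptable.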
The next step is to identify the modelling technique to be used. This depends on the distribution of the dependent variable and the kind of output required. If the dependent variable is binary, with only two possible outcomes, say win or lose, and the desired output is the probability of a win, logistic regression or probit regression suits well. But if the desired output is instead a classification of the data into the two categories, a decision tree is a good choice.
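The difference between the two kinds of output can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: a one-variable logistic regression fitted by simple gradient descent, and a depth-one decision tree (a "stump") standing in for a full tree. The data, function names, learning rate and epoch count are all made up for the example; in practice you would use a statistics package.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(win | x) = sigmoid(w*x + b) by batch gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

def fit_stump(xs, ys):
    """Depth-one tree: the split with the fewest errors; predicts 1 when x > split."""
    points = sorted(set(xs))
    best_split, best_err = None, None
    for lo, hi in zip(points, points[1:]):
        split = (lo + hi) / 2
        err = sum((x > split) != bool(y) for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_split, best_err = split, err
    return best_split

# Toy data: x is some score, y is 1 for a win and 0 for a loss.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
p_low = sigmoid(w * 2 + b)    # probability of a win at x = 2
p_high = sigmoid(w * 7 + b)   # probability of a win at x = 7

split = fit_stump(xs, ys)
label_low = int(2 > split)    # hard class label at x = 2
label_high = int(7 > split)   # hard class label at x = 7

print(f"logistic: P(win | x=2) = {p_low:.2f}, P(win | x=7) = {p_high:.2f}")
print(f"stump: split at {split}, labels {label_low} and {label_high}")
```

The logistic model answers "how likely is a win?" with a number between 0 and 1, while the stump answers "win or lose?" with a label.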
Once the model is built with acceptable fit statistics (R-square, p-value, chi-square, etc.), the next step is to validate it on a different dataset to check whether the model survives variations in the independent variables. Many analysts wait for multiple statistics to fail or turn insignificant before deciding that the model is not working. My view is that a fit model will survive all the tests; it will be the fittest to survive. If it does not, check which of the above checks you missed.
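A survival test of this kind can be sketched as a hold-out check: score the model on data it has not seen and compare with its score on the training data. The sketch below is a simplified assumption of how that might look. The helper names, the 30% hold-out share, the 5-point tolerance and the two toy "models" are all hypothetical, and accuracy stands in for whatever statistic suits your model.

```python
import random

def split_holdout(rows, holdout_fraction=0.3, seed=42):
    """Shuffle a copy of `rows` and set aside a share of it as a hold-out set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def accuracy(predict, rows):
    """Share of (x, y) rows for which predict(x) equals y."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

def survives(predict, train, holdout, max_drop=0.05):
    """True if hold-out accuracy is at most `max_drop` below training accuracy."""
    return accuracy(predict, train) - accuracy(predict, holdout) <= max_drop

# Toy data: the outcome is 1 whenever x is above 50.
data = [(x, int(x > 50)) for x in range(100)]
train, holdout = split_holdout(data)

# A model that captured the real pattern.
def rule_model(x):
    return int(x > 50)

# A model that only memorised the training rows and knows nothing else.
memory = dict(train)
def lookup_model(x):
    return memory.get(x)  # None for any x it has not seen

print("rule model survives:", survives(rule_model, train, holdout))
print("lookup model survives:", survives(lookup_model, train, holdout))
```

Both models look perfect on the training rows; only the one that learned the actual pattern survives the hold-out set.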
#Happymodeling
Nice comparison between Mumbai Local trains and DataModelling :)