Tuesday, March 21, 2017

Selection Problem - Data Products

Life would be so simple if we didn't have the opportunity to choose. One would just take what's available, live with it, and work to improve it. It can't get simpler than that, can it?

But the problem is that it isn't true. Life is never simple, and no matter where we are, we want choices in everything. With multiple choices available in almost everything, personal or business, we run into two basic selection problems:

1. The problem of plenty
2. The probability that the selection made is correct

Whom/what to choose - Problem of plenty
Talk to a student applying to college and ask him about the problem he is facing, and he will respond with which course to take and which college to attend.

Ask an Indian cricket team selector or the captain about Rahane's future in the T20 playing eleven, and they will respond with where to fit him in the current line-up.

Ask a fund manager and he will show you thousands of stocks and mutual funds to choose from, overlaid with investment plans.

Ask a company's human resources manager about his biggest problem at work, and he will come back saying it is getting the right candidate for the job.

One can go on and on with examples from different walks of life, and we will still get the same response: whom to choose.

Well, only when you have solved the jigsaw of whom does a much bigger problem arrive. Is the choice correct? What is the probability of failure, and who should be held responsible for it?

The only way to get answers to the above questions is to wait and watch. Not much can be done if you have given a loan to a customer and he goes bad. The only actions that remain in the lender's hands are recovery and legal proceedings, both of which are cost-inefficient and time-consuming.

An employee selected for a role may or may not be good at the work. And even if he is, there is always a chance he will quit the job and join the competition.

Even when a fund manager has selected a list of stocks and funds to the best of his knowledge, there are no guaranteed returns on them.

To take up Rahane again: if selected, what should his role be - to open the innings or to come down the order and finish the match?

Clearly, selection is a major problem for organisations to crack.

Here data can be a life saver. A statistically proven model, built on evolving information from multiple sources, can become a handy tool and the deciding factor.

It may well be that one model is insufficient to reach the goal and a multiple-model approach needs to be taken to achieve the grand design.

This brings us to a point where a global database or data bureau is needed, in which multiple such data products can be developed and used.

Imagine a global employment database; or a players database covering not just those who play or have played internationally but players at every level, even school; or a global human information bureau where all possible individual information is captured, stored and analyzed.

Imagine a world of data products.

Saturday, November 14, 2015

Flip a Coin


Once I enrolled as a Statistics major for my graduation, my life's decisions became highly influenced by statistical distributions. The more I studied them, the larger their impact on my decision making became.

One of the distributions I use most for binary decisions - to buy or not, to go or not - is the Bernoulli distribution. It just makes life so simple. Toss a coin and you know what to do. All those decisions that have zero impact on your life but still need to be taken get made just by the flip of a coin. Simple, easy, proven and efficient.
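
For what it's worth, a coin flip is just a single Bernoulli trial with p = 0.5, and it takes only a few lines to simulate. Here is a minimal Python sketch; the option names are made up purely for illustration:

```python
import random

def flip_a_coin(p_heads=0.5):
    """A single Bernoulli trial: returns 'heads' with probability p_heads."""
    return "heads" if random.random() < p_heads else "tails"

# Map the two equally good options to the two faces and let the coin decide.
options = {"heads": "pasta", "tails": "salad"}
print("The coin says:", options[flip_a_coin()])
```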

Sport recognized its efficiency and implemented it long ago as the toss to start a game, since the result of the match matters far more to the opponents than who starts first. Otherwise, flipping a coin as a decision-making tool has generally been given the cold shoulder, and a lot of man-hours have been spent deciding this way or that when both options lead to the same destination with no significant difference.

So if you are stuck on what color to wear, what to eat (pasta or salad), where to go on holiday, which share to buy when both are performing similarly, stairs or elevator, highway or subway - all done, just by the flip of a coin.


But remember: the decision is always yours, not the coin's. And no decision is right or wrong; what you do with that decision makes it right or wrong. So start flipping.

Wednesday, August 1, 2012

Side Effects


Last month I was unwell for a couple of days. It turned out to (probably) be related to seasonal variation. Unlike women, who sensibly go to the doctor when they feel ill, men do not. However, I did eventually do the un-manly thing and go to my doctor. He prescribed some pills. On this blog I have often talked about key statistical skills that we should try to teach undergraduates, and as I read the instructions for these pills it occurred to me that this is a good example of where the world would be a better place if people left university understanding statistics a bit better, and providing useful statistical information therefore became the norm.

Like a diligent patient, I read the instruction leaflet with the pills. Like most instruction leaflets with pills, it had an un-amusing list of possible side effects. These side effects were helpfully listed as common, uncommon and rare. Common ones included headache, stomach aches and feeling sick (OK, I can handle that); uncommon ones were dizziness, liver disease which might make my eyes yellow, rash, and sleepiness or trouble sleeping (but not both). The rare ones included liver failure resulting in brain damage, bleeding at the lips, eyes, mouth, nose and genitals, and development of breasts in men.

Excuse me? Did it say ‘development of breasts in men’?
Yes, it did.
Here’s a photo to prove it.
[Photo: side effects]

I’ll admit that I don’t know much about human anatomy, but based on the little I do know, it seems intuitive that my immune system, if reacting badly to something like a drug, might overload my liver and make it explode, or give me kidney failure. I also know that feeling sick and having flu-like symptoms is part and parcel of your immune system kicking into action. But why on earth would my body respond to a nasty drug by sprouting breasts? Perhaps because having them would make me more likely to visit my doctor.

Anyway, back to the tenuous link to stats. Whenever I read this sort of thing (which fortunately isn't often) I usually feel that I'd rather put up with whatever it is that's bothering me than run the risk of, for example, bleeding from my lips, eyes, mouth, nose and genitals, or getting brain damage. I might feel differently if I had enough information to assess the risk. What do they mean by 'uncommon' or 'rare': 1/100, 1/1,000, 1/billion? Wouldn't it be nice if we could have a bit more information, maybe even an odds ratio - that way I could know, for example, that if I take the pill I'd be 1.2 times more likely to grow breasts than if I don't. That way we could better assess the likelihood of these adverse events, which, if you're as neurotic as me, would be very helpful.
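
To make the odds-ratio idea concrete, here is a tiny sketch in Python. The 2x2 counts are invented purely for illustration; no leaflet I have seen actually publishes them:

```python
# Hypothetical 2x2 table: rows = took the pill / did not, columns = side effect yes / no.
# All counts are made up for illustration only.
a, b = 12, 9988   # pill group: had the side effect, did not
c, d = 10, 9990   # no-pill group: had the side effect, did not

odds_pill = a / b
odds_no_pill = c / d
odds_ratio = odds_pill / odds_no_pill
print(f"Odds ratio = {odds_ratio:.2f}")   # about 1.2: the odds are ~1.2 times higher with the pill
```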

The campaign for more stats on drug instruction leaflets starts here.

Monday, July 23, 2012

Web Analytics Consulting: A Simple Framework For Smarter Decisions




As I've gotten older I've come to appreciate the value of frameworks a lot more.
When we are young, the answers to everything are simpler because, of course, we know everything.
What metrics should I use? Use BR & CV. What digital marketing works? Definitely Y, do that. How can I improve my business? Simple, do A then B and you're done. So on and so forth.
One upside (or is it a downside?) of age is the wisdom of realizing how much you don't know. Suddenly you don't have concrete answers because you realize: 1. You usually lack all the information you need and 2. Even the most mundane and obvious situations are incredibly complex and unique.
So you start answering questions like "What is two plus two?" with "Tell me a little bit more about what you are adding" or "It really depends on the process you use to add them" or … you get my point.
This is the main reason I love frameworks. They don't contain answers; rather, they help place a situation or a process or steps and encourage you to think a certain way. They force you to step back and think. They make you go talk to other people. They force you to say “hmmm …” And if you can make a person think, if you can encourage them to cover all the bases, if you can get them to ask themselves some tough questions, then you have given them the greatest gift of all. Not the pat answers, but rather the way to figure out the best answers for themselves all by themselves.
So, whenever possible, don't ask for perfect answers, ask how to think. You'll thank me.
Two of the frameworks I've built and shared on this blog are the Digital Marketing & Measurement Model (how to pick the best KPIs for your business that guarantee success, using a powerful five-step process) and the Clear Line of Sight Model (to ensure every bit of Marketing and Analytics you are doing is tied to the Net Income of the company).
The DMMM and CLoS are strategic frameworks (you should embrace them right away!), and in this post I want to share a really, really simple framework for structuring web analytics consulting contracts.
The Web Analytics Consultant Quandary
BB sent this query:
If I take on a consulting project then what could be expectations out of me?
From what I understand, I would be creating a Web Analytic Report and giving my recommendations. That would be one deliverable from my end.
What could be the other deliverable for a web analytic project? What could be their expectation beyond submitting the report?
Would I be required to set up various A/B and multivariate tests for that company?
And what if they are at initial stage and have just set up Google Analytics with no goals, events or internal search tracking. Would I be required to implement goals, events or set up internal search tracking as well as exit survey?
What is the timeline of a web analytic project in the above case where there is no tracking and as a consultant I set up tracking for them? When should I start creating reports?
When does this project end? I mean where I put a stop.
When I get this type of open-ended query, my instinct is to figure out how to create a framework that would encourage structured thinking and force assumptions, flaws and opportunities to rise to the fore.
And it does not have to be complicated, even for something as open and expansive as the query above.
For any web analytics consulting contract, the beginning, middle and end really depend on the contract you've signed, and – you'll be surprised – not the actual amount of work that needs to be done. The contract, and the hourly rate it provides for, will motivate the consultant to do as much or as little as is required to meet the contractual terms.
So, what's the fix?
The Optimal Web Analytics Consulting Framework: DC – DR – DA
Before jumping into any engagement (and signing a contract) I recommend using this simple framework for web analytics consulting contracts: Data Capture. Data Reporting. Data Analysis.
Ask your client: "What is it that you would like to accomplish in these three simple buckets: DC, DR, DA?"
This will force them to think about what they really want to get done, and their reply will be a really huge gift to you because you'll know:
1. If what they want is a fit with the skills you/your company possess,
2. How long the contract will be, and
3. How much you should charge for the work required.
So, what type of work falls into each of these three buckets?
Data Capture:
The work that falls into this bucket is to perform an audit and/or update current data capture mechanisms.
The work that falls into this bucket could cover current or new javascript tag implementation (which has to be both correct and complete). It could mean implementing new, updated code (both to fix their current problems and to s.prop and eVar the code to collect new data). It could also mean getting into the tool's admin area, as in the case of Google Analytics, to configure internal site search data capture and set up goals and goal values (if you don't have these last two things set up you are not doing web analytics, you are doing web letswasteeveryonestimedatapukingforthesakeofdatapukinglytics).
If you are a Web Analyst who is really an Implementation Specialist, this is work that you'll enjoy because it is right up your area of expertise. If you are a Web Analyst who is really a data processor (bucket two, below), then you'll find this a little frustrating. If you are a true Web Analyst, you'll find this work utterly frustrating. It is important that you know who you are, and what the contract/client requires.
Life is too short to spend doing things you hate, so sweat the details here. Always match skills with the work required, for the sake of world peace.
Data Capture consulting work is also quite thankless work because there is always someone who is willing to do this work for less (the web analytics consulting world is brimming with Web Analysts who are essentially Implementation Specialists, not that there's anything wrong with that).
Even for a very smart Implementation Specialist such as yourself, a unique individual with extremely valuable skills, these types of contracts are a lot less fun, because all you are responsible for is javascript tag hacking and begging the right people at the client to implement your hacking.
Just be aware of this. Talk to your client. Get specifics. Figure out if you want to do it (or if someone at your consulting company does).
There are lots and lots of pure Data Capture consulting contracts, and sometimes they'll also include our next bucket…
Data Reporting:
Essentially, this work is the client saying: "I want someone to send me my paid search performance every week" or "We have Google Analytics, we need a package of reports each week" or "Our Finance team needs their reports set up."
You'll get access to SiteCatalyst or CoreMetrics and you'll scrape the standard reports into PowerPoint and send it out each week. Or you'll set up some custom reports to give the client exactly what they want. You might have some back and forth with the clients that will help you pull the right metrics into the reports, but for the most part you'll be told what they need and you'll do that for them.
In some cases you'll use your license for Nextanalytics to completely bypass the web analytics tool, Google Analytics in this case, and create the reports and dashboards inside Excel using the tool's free API.
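As a rough illustration of that API route, here is a minimal, hedged sketch that pulls a few metrics and drops them into Excel. It uses the (newer) Google Analytics Reporting API v4 with a service account; the key file name, view ID and choice of metrics are assumptions for illustration, not anything prescribed in this post:

```python
import pandas as pd
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Assumed service-account key and view ID -- replace with your client's own.
SCOPES = ["https://www.googleapis.com/auth/analytics.readonly"]
creds = service_account.Credentials.from_service_account_file("key.json", scopes=SCOPES)
analytics = build("analyticsreporting", "v4", credentials=creds)

# Sessions and bounce rate by traffic source, last seven days.
response = analytics.reports().batchGet(body={
    "reportRequests": [{
        "viewId": "XXXXXXXX",
        "dateRanges": [{"startDate": "7daysAgo", "endDate": "today"}],
        "metrics": [{"expression": "ga:sessions"}, {"expression": "ga:bounceRate"}],
        "dimensions": [{"name": "ga:source"}],
    }]
}).execute()

rows = response["reports"][0]["data"].get("rows", [])
df = pd.DataFrame(
    [(r["dimensions"][0], *r["metrics"][0]["values"]) for r in rows],
    columns=["source", "sessions", "bounce_rate"],
)
df.to_excel("weekly_report.xlsx", index=False)  # the 'dashboard' the client receives
```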
There is less thinking required in this work; you don't even have to be a real Analyst - pass the Adobe certification, the GAIQ test or other tool front-end exams and you might be able to do this work. It is also a little less thankless than data capture, simply because meeting the client's needs and actually seeing their numbers come together is rewarding.
But there is a lot of competition for this type of work because it requires less experience and analytical sophistication to be successful, hence many Consultants enter the field with this work (then graduate to Capture and if they are really, really good move to Analysis).
Bonus Pro Tip: If you are going to take a lot of Data Reporting contracts, then you should create for yourself (and your company) a massive bank of the best of breed custom reports for various purposes (types of companies and types of reports requested). Then when you sign a Data Reporting contract you can pick the best custom reports from your bank, simply import them into your client's account, and boom (!) you're already in business. Don't forget to ask for a bonus for finishing early. :)
Bonus Custom Reports: You can download my favourite Paid Search Custom Reports and my Content Efficiency, Visitor Acquisition Efficiency and Search Micro-Ecosystem reports and get a head start with your own reports bank!
Data Analysis:
This is the type of work that happens when the client gives you an open-ended assignment to really look at the data.
The client will not usually know what they want, they don't have specific guidance ("give me bounce rates!"), and they really want you to tell them:
1. What to measure,
2. What the data is saying, and
3. What they should do based on what the data is saying.
These are the most gratifying contracts, with a painful amount of work, because you have to really go in and create a Digital Marketing & Measurement Model (and how amazingly fun that is because you get to root causes, you get to work with an expansive set of company Sr. Leadership, you get to really, really nail down what's important for the client).
You then get to create really cool custom reports and dedicated unique advanced segments (to deliver on the DMMM identified priorities). You can often force someone else to do the implementation right (let the cheaper Implementation Specialists take care of this important but repeatable work) – either a resource with your client, or someone inside your consulting company. You can focus deeply on data analysis and helping drive the recommended actions at your client.
This does mean that you must possess specialized skills for this type of contract: you have to be a real Web Analyst and not a Web Analyst who is essentially an Implementation Specialist or Report Creator (both very important jobs, but ones that don't require analytical skills). You have to know statistics 201. You have to know analytical techniques. You don't compare percent differences (they hide more valuable insights); instead you have your own cluster of techniques like Weighted Sort. You know 19,000 ways to get optimal context for your KPIs and insert it into the dashboards. You have a superb amount of business experience in your industry/line of business, and that understanding means you ask nuanced questions when it comes to people and data (killer!). So on and so forth.
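To give one concrete flavour of what moving beyond raw percent comparisons can look like, here is a simplified sketch of a weighted-sort-style adjustment: a plain shrinkage of each row's rate toward the site-wide rate, weighted by traffic. This illustrates the idea only - it is not Google Analytics' actual Weighted Sort algorithm - and the numbers are invented:

```python
import pandas as pd

# Invented example: conversion rate by landing page, with very different traffic volumes.
df = pd.DataFrame({
    "page": ["A", "B", "C"],
    "visits": [20, 5000, 800],
    "conversions": [3, 250, 56],
})
df["raw_rate"] = df["conversions"] / df["visits"]

# Shrink each page's rate toward the site-wide rate in proportion to how little data it has.
# 'prior_weight' is roughly "how many visits of site-average behaviour we assume up front".
site_rate = df["conversions"].sum() / df["visits"].sum()
prior_weight = 500
df["weighted_rate"] = (df["conversions"] + prior_weight * site_rate) / (df["visits"] + prior_weight)

# Page A's flashy 15% raw rate gets pulled back toward reality; B and C barely move.
print(df.sort_values("weighted_rate", ascending=False))
```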
This does mean that you'll be able to charge a lot for contracts that are heavy on, or all about, Data Analysis. In my experience I've seen people charge, depending on the client and the consultant's skill, anywhere from $500 a day to $5,000 a day.
Not even 3% of web analytics consulting companies have people with optimal skills to be called an Analyst, so you can see how easy it is to charge a lot for this resource.
Astonishingly, pure Data Analysis contracts are hard to come by, because companies are still so obsessed with Data Reporting ("if we just data puke we'll automatically be data driven, because everyone in our company is a data analyst"). And since most web analytics consulting companies are Implementation Specialists, there are also lots of Data Capture contracts. Neither reflects optimally on our industry, but both do explain why, despite our ecosystem having more data than God should allow anyone to have, we are still mostly gut-driven.
But if you do get a contract with a large component of Data Analysis ("come in and really help us figure out our DMMM and take it from there to delivering pure insights and actions we should take"), then grab it (if you or your company has the skills). They are deeply satisfying. They are high paying. And you do get a chance to change the world.
So when does the work of a web analytics consultant start or end? How much can they charge for it? Are they required to fix the code or set up experiments? What about customized data dumps?
It all depends. Is it a Data Capture, Data Reporting or Data Analysis contract?
You would be right to state that there are probably no pure DC, DR or DA contracts. They are rare, mostly because when you start doing analysis you'll notice you can't get away from meeting some reporting needs at your client. And when you do reporting and analysis you'll discover implementation problems, and then someone (you?) has to go fix them.
There is most certainly a symbiotic relationship between the DC, DR and DA.
But it is not uncommon for a contract to be heavily weighted in only one of these three areas. If you use this web analytics consulting framework then you'll be able to identify that upfront and set optimal scope for your contract, charge an appropriate lump sum or hourly rate, and go about working like crazy to become super rich!
A Client Perspective:
If your company is looking to hire a consultant then you should go through this exercise upfront as well. Before you call the blogger you're impressed with, before you sign on the dotted line from a consulting company that's "certified," before you extend a contract to the speaker at an industry conference.
What work do you actually have for the consultant/consulting company?
Is it majorly Data Capture? Data Reporting? Data Analysis?
What is your core weakness in terms of skills inside the company?
Why is it that your organization is HiPPO- or gut-driven, rather than you providing cogent insights to your HiPPOs so that they can mix data and their experience (or gut) to make optimal decisions?
It is never obvious.
But if you take our simple framework, ask the right questions and do some root cause analysis (or just soul searching or at least sleep on it for one night) then you'll be able to better understand what you need, you'll pay optimally for that need to be fulfilled (both contract amount and contract duration) and, I cannot tell you how brilliantly important this is, you'll find the optimal consultant who has the optimal skills you need.
It is not unusual for a million dollars to have been spent and the company to have progressed to zero percent data driven. That's because they thought they were getting a real analyst; instead they got a superb implementation specialist who has done data reporting but possesses zero actual analytical skills. This person (or group of people, if a consulting company) then spent a year, charging a million dollars, doing the world's most sophisticated implementation of SiteCatalyst / WebTrends / Google Analytics. The company now has 900x more data than it needs and 25x more reports than it needs. They just don't have any analysis.
That's a big company story.
But if you are a small business, you don't have that kind of money. Hence it's even more critical that you go through the DC, DR, DA framework, even as a rough exercise. You likely need all three. Know that it is very, very hard to find the Purple Elephant who will be good at all three, so figure out where you have the greatest need. Hire her. When she's done with her core competence, go out and get the next person to take you to the next level. (And then the next.)
The Data Capture, Data Reporting and Data Analysis framework helps both clients and consultants have an immense amount of clarity on what the needs are (client), what skills are required to meet those needs (consultant) and how much time and money will be required (from the client to the consultant) to deliver glory.
I've created a helpful summary, based on my humble experience, along four key dimensions that I think you'll find to be of value (regardless of whether you are the client or the consultant):
[Image: web analytics consulting framework - dimensional summary]
So use our delightful framework. Spread happiness in the world, happiness that only actions based on great data analysis can deliver.
Ok, as always it is your turn.
Do you have an alternative approach to sizing up the opportunity with a client? As a client, do you have a specific set of instructions you send out when looking for consultants? What kinds of contracts are most common out there? Why can't we find more fantastic analysts in our ecosystem? What are your secrets to delivering joy to your clients? If you are a client, what secret ingredients did your last DC, DR or DA consultant possess?



Sunday, July 15, 2012

Game theory


Over the last week, cricket blogs have been abuzz with a new cricket ranking system devised by Satyam Mukherjee, a scientist at Northwestern University, Illinois. His paper, "Identifying the greatest team and captain - a complex network approach to cricket matches", names Steve Waugh as the best skipper in Test history. What's new about the approach is its way of valuing a team's victory through relative contests. This could be done iteratively, but Mukherjee uses Google PageRank to process large amounts of data and compare the performance of captains and teams from 1877 to 2010. The 31-year-old scientist, born in Durgapur (West Bengal), is now a postdoctoral fellow at the Kellogg School of Management and the Northwestern Institute. His interests lie in complex systems, social networks and statistical physics. And his favourite cricketers: Brian Lara, Sourav Ganguly and Mohammad Azharuddin.

What is the complex network approach? How different is it from the conventional methods of calculating individual or team success? 

In simple language, a network is a set of 'nodes' (vertices) connected by a set of edges or links. For example, in friendship networks like Facebook or Orkut, a group of users (nodes) are connected if they know each other. The subject is almost 15 years old. In recent years, complex network tools have been applied to soccer, baseball and tennis; two soccer players are linked, for instance, if one passes the ball to the other. I have applied these tools to cricket teams and skippers. Conventional methods of ranking teams are based on the number of wins. I have applied the Google PageRank algorithm - used for ranking web pages - to rank the quality of wins. So if a weak team wins against a relatively stronger team, it gains points, but if it loses to a strong team it is not penalised much. Traditional ranking schemes are more biased towards the number of wins.
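
To make the mechanics tangible, here is a toy sketch of ranking teams by "quality of wins" with networkx's PageRank. The teams, results and edge weights are invented purely for illustration; the paper's actual construction of the network is more involved:

```python
import networkx as nx

# Toy head-to-head results: an edge from loser to winner, weighted by the fraction
# of the games between them that the winner took (all numbers invented).
results = [
    ("India", "Australia", 0.60),
    ("Australia", "India", 0.40),
    ("England", "Australia", 0.70),
    ("England", "India", 0.55),
    ("India", "England", 0.45),
]

G = nx.DiGraph()
for loser, winner, frac in results:
    G.add_edge(loser, winner, weight=frac)

# PageRank rewards beating opponents who themselves beat strong opponents,
# so a win over a strong team counts for more than a win over a weak one.
ranking = nx.pagerank(G, alpha=0.85, weight="weight")
for team, score in sorted(ranking.items(), key=lambda kv: -kv[1]):
    print(f"{team}: {score:.3f}")
```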

What's wrong with the methodology adopted by the International Cricket Council? 


I'm not saying the ICC ranking scheme is wrong. It's just opaque. The methodology is not available anywhere, and the ranking is based on points. Such points are prone to bias. If we were asked to rate Sachin and Bradman in terms of votes, Sachin would emerge the winner purely on votes given by modern-age viewers. Network theory doesn't include these 'external factors' and yet provides a consistent ranking.

Many might have a bone to pick with you about your Test captains list. Even if one accepts Steve Waugh was the best in history, how is M S Dhoni placed at No 9 and why doesn't Mike Brearley find a place? 


The skippers and teams are connected based on the fraction of wins against each other. Unlike in the early days, when we would see more draws, Test matches now mostly end in results. One of the reasons why Steve Waugh is the greatest is that he was consistently successful over a long period of time. Dhoni's rank is bound to change when you take into account the recent back-to-back series defeats. This study was conducted in mid-2011, so I had considered matches played up to the end of 2010. Dhoni is high up the order because, until then, he had never lost to Ponting, who in turn is a successful skipper. Dhoni suffers a big fall in his rank as skipper if 2011 data is applied - remember, he lost to comparatively weaker skippers in England and Australia. Mike Brearley falls short of the success of Waugh, Dhoni or Clive Lloyd. Hence, even though he has been widely praised as an inspiring leader, particularly in the 1981 Ashes, his name doesn't feature in the Top 20.

Indian captains like Pataudi, Kapil Dev and Sourav Ganguly have done a lot too in inspiring their teams ... 


Pataudi, Kapil Dev, Ganguly and Dhoni are charismatic leaders and known for their leadership skills. But to quantify the 'influence' of a captain and modify the algorithm is not easy. So this research is still open to further analysis which I would like to pursue in the long run. 

Why is Clive Lloyd placed lower? Could it be that his team was too good? 


Lloyd isn't placed very low; he's placed sixth. But Steve Waugh's quality of wins makes him more successful. Let's also note that the ranking scheme depends on the notion that 'a skipper/team is successful if it defeats a successful skipper/team'. I believe this wasn't the case with Lloyd.

What about a captain who saves a match by fighting for a draw? 


Initially I did consider including these cases, but from the scorecard database it's not possible to determine how a team or its captain salvaged a draw from the jaws of defeat. Also, imagine a skipper who drew matches and never lost a game, but never won a game either. Here there could be two explanations: a) he's an attacking skipper who pulled his team away from the verge of defeat, or b) he's a defensive skipper who looks to 'save' a match rather than win it. But it would only be logical to consider the wins...

What about relative individual strengths of teams? 


I don't know if anyone has tried that. In the present paper, I haven't taken into account the team's individual relative strength. That again isn't a straightforward parameter but it's not impossible to handle. I may try that in future. 

Can the algorithm determine who is better - Sachin or Bradman? 


There is no direct application of this algorithm to determine the greatness of Bradman versus Sachin. PageRank gives a score based on winning or losing in direct contests. It can be applied to tennis, where Nadal faces Federer, or to soccer, where, say, France plays Spain. In cricket, one can't have a direct Sachin-Lara or Bradman-Sobers encounter, since batsmen are pitted against bowlers.

#GameTheory #Cricket #GooglePageRank #CricketRanking

Wednesday, July 11, 2012

Survival of the Fittest - Which variable in your analysis passes the test?


One of the greatest scientific discoveries ever made is evolutionary theory, or survival of the fittest. I have always been a complete believer in it and have, intentionally or unintentionally, used it in many different ways. But before I start writing down my own thoughts on it, let's first take a quick look at what survival of the fittest means.

"Survival of the fittest" is a phrase originating in evolutionary theory, as an alternative description of natural selection. An interpretation of the phrase "survival of the fittest" to mean "only the fittest organisms will prevail"

I have used this phrase many times to describe anyone staying in Mumbai, India. Mumbai, one of the largest cities in the world and the economic capital of India, receives thousands of people from different parts of the country, and of course the world, who come to test their skills and luck. Everyone who walks the streets of this city has one common dream: making it big, bigger than anyone else. Yet not all, or should I say very few, make any significant mark. Most settle into the routine of life, and the rest head back home or to a different city. But what makes some go back and some stay on?

I have often observed that those who do not survive in this city of gold are uncomfortable with the crush of people, the fast-paced life and, above all, the Mumbai local trains. In my view, Mumbai's local trains are the best test of survival of the fittest: those who successfully overcome them, or even enjoy them, usually survive.
But why am I writing about survival of the fittest on an analytics blog? Well, as I said earlier, I use this phrase in many different ways. If the Mumbai local trains test who is fittest to survive in the city, they also reveal the weak links, the ones who exit early. This realization shapes how we should approach the analysis of data.

Most analysts working with big data keep in mind the end objective, or the larger requirements of the analysis, aiming to build that perfect model everyone wants and yet no one has got. As an analyst, I confess that I too try to build that perfect one every time I start building a model.

In the process of analytical model development, we come across many experts who describe the process as a tough job, or as a multiple-iteration exercise to get the desired results. But what changes between the first trial and the last one? Most of the time the answer is the weak links in the data or in the model-building process: too many missing values in a variable, either ignored or treated with a generic method; variable distributions not checked; binning not done optimally; sampling error; policy changes not incorporated - the list is a long one.

So what am I proposing here? Next time, before aiming at the end objective of your analysis, first break the task into its weak links and kill them off, because they will never survive for long and, if included in the model, will not let your model survive long either. Work out what data and methods you have to follow to get the results. Check the basic requirements for the analysis in your data properly before you start analyzing it - the data sanity checks: check distributions and derived variables, impute missing values, consider clustering and factoring, check for multicollinearity and sampling errors, optimize bins, and incorporate business knowledge.
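
As a minimal illustration, here is a short pandas sketch of a few of these sanity checks; the file and column handling are hypothetical and would need adapting to your own data:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("model_data.csv")  # hypothetical input file

# 1. Missing values: see how bad each variable is, then impute the simple numeric cases.
print(df.isna().mean().sort_values(ascending=False).head(10))
num_cols = df.select_dtypes(include=np.number).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# 2. Distributions: spot skew, outliers and near-constant columns before modelling.
print(df[num_cols].describe().T)

# 3. Multicollinearity: flag highly correlated variable pairs for review.
corr = df[num_cols].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
print("Highly correlated pairs:\n", upper.stack()[upper.stack() > 0.8])
```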

The next step is to identify the modelling technique to be used. The choice depends on the distribution of the dependent variable and the kind of output required. If the dependent variable is binary, with only two possible outcomes, say win or lose, and the desired output is the probability of a win or a loss, logistic regression or probit regression suits well. But if the desired output is a classification of the data into the two categories, a decision tree is a good choice.
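
A hedged sketch of that choice with scikit-learn; the file name, the assumption of numeric features and the binary "win" target are all illustrative, not prescriptive:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("matches.csv")                 # hypothetical dataset with numeric features
X, y = df.drop(columns=["win"]), df["win"]      # binary target: 1 = win, 0 = lose

# Need the probability of winning? Logistic regression outputs probabilities directly.
logit = LogisticRegression(max_iter=1000).fit(X, y)
win_probability = logit.predict_proba(X)[:, 1]

# Need a hard classification into the two categories? A shallow decision tree works well.
tree = DecisionTreeClassifier(max_depth=4).fit(X, y)
predicted_class = tree.predict(X)
```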

Once the model is built with acceptable fit statistics (R-squared, p-values, chi-square, etc.), the next step is to validate the model on a different dataset to check whether it will survive variations in the independent variables. Many analysts wait for multiple statistics to fail or go insignificant before deciding that the model isn't working. My recommendation: a fit model will survive all the tests - it will be the fittest to survive - and if it doesn't, check which of the above checks you missed.
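
And a small follow-on sketch of validating the fitted model on data it has never seen, continuing the same hypothetical dataset as above:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score

df = pd.read_csv("matches.csv")                 # same hypothetical dataset as above
X, y = df.drop(columns=["win"]), df["win"]

# Hold out a validation set that the model never sees during fitting.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# A fit model should survive the holdout too, not just the training data.
print("Train AUC:", roc_auc_score(y_train, logit.predict_proba(X_train)[:, 1]))
print("Valid AUC:", roc_auc_score(y_valid, logit.predict_proba(X_valid)[:, 1]))
print("Valid accuracy:", accuracy_score(y_valid, logit.predict(X_valid)))
```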

#Happymodeling


Saturday, June 9, 2012

History of Statistics


A fresh Saturday morning started with a relatively odd topic: a discussion on Mughal history. I strongly believe that it too could contribute to this blog on analysis, but not today. The thought of our history kept rolling around in my mind, and I sat down to search for the history of what I do: statistics.

With great effort on www.google.com I got about 1,690,000,000 results (0.18 seconds) - worth mentioning, I thought, since I am writing about statistics. The very first link was from among the most trusted and widely read free online encyclopedias, www.wikipedia.org.

Though there is a lot written at the link (given at the end), I thought a part of it should be shared, especially on the origin, the meaning and the current state of statistics.

The history of statistics can be said to start around 1749, although, over time, there have been changes to the interpretation of what the word statistics means. In early times, the meaning was restricted to information about states. This was later extended to include all collections of information of all types, and later still it was extended to include the analysis and interpretation of such data. In modern terms, "statistics" means both sets of collected information, as in national accounts and temperature records, and analytical work which requires statistical inference.

Statistical activities are often associated with models expressed using probabilities, and require probability theory for them to be put on a firm theoretical basis.

A number of statistical concepts have had an important impact on a wide range of sciences. These include the design of experiments and approaches to statistical inference such as Bayesian inference, each of which can be considered to have their own sequence in the development of the ideas underlying modern statistics.

Introduction

By the 18th century, the term "statistics" designated the systematic collection of demographic and economic data by states. In the early 19th century, the meaning of "statistics" broadened to include the discipline concerned with the collection, summary, and analysis of data. Today statistics is widely employed in government, business, and all the sciences. Electronic computers have expedited statistical computation and have allowed statisticians to develop "computer-intensive" methods.

The term "mathematical statistics" designates the mathematical theories of probability and statistical inference, which are used in statistical practice. The relation between statistics and probability theory developed rather late, however. In the 19th century, statistics increasingly used probability theory, whose initial results were found in the 17th and 18th centuries, particularly in the analysis of games of chance (gambling). By 1800, astronomy used probability models and statistical theories, particularly the method of least squares, which was invented by Legendre and Gauss. Early probability theory and statistics was systematized and extended by Laplace; following Laplace, probability and statistics have been in continual development. In the 19th century, social scientists used statistical reasoning and probability models to advance the new sciences of experimental psychology and sociology; physical scientists used statistical reasoning and probability models to advance the new sciences of thermodynamics and statistical mechanics. The development of statistical reasoning was closely associated with the development of inductive logic and the scientific method.

Statistics is not a field of mathematics but an autonomous mathematical science, like computer science or operations research. Unlike mathematics, statistics had its origins in public administration and maintains a special concern with demography and economics. Being concerned with the scientific method and inductive logic, statistical theory has a close association with the philosophy of science; with its emphasis on learning from data and making the best predictions, statistics has great overlap with decision science and microeconomics. With its concern with data, statistics overlaps with information science and computer science.

Etymology

The term statistics is ultimately derived from the New Latin statisticum collegium ("council of state") and the Italian word statista ("statesman" or "politician"). The German Statistik, first introduced by Gottfried Achenwall (1749), originally designated the analysis of data about the state, signifying the "science of state" (then called political arithmetic in English). It acquired the meaning of the collection and classification of data generally in the early 19th century. It was introduced into English in 1791 by Sir John Sinclair when he published the first of 21 volumes titled Statistical Account of Scotland.

Thus, the original principal purpose of Statistik was data to be used by governmental and (often centralized) administrative bodies. The collection of data about states and localities continues, largely through national and international statistical services. In particular, censuses provide regular information about the population.

The first book to have 'statistics' in its title was "Contributions to Vital Statistics" by Francis GP Neison, actuary to the Medical Invalid and General Life Office (1st ed., 1845; 2nd ed., 1846; 3rd ed., 1857).

Statistics Today

During the 20th century, the creation of precise instruments for agricultural research, public health concerns (epidemiology, biostatistics, etc.), industrial quality control, and economic and social purposes (unemployment rate, econometrics, etc.) necessitated substantial advances in statistical practice.

Today the use of statistics has broadened far beyond its origins. Individuals and organizations use statistics to understand data and make informed decisions throughout the natural and social sciences, medicine, business, and other areas.

Statistics is generally regarded not as a subfield of mathematics but rather as a distinct, albeit allied, field. Many universities maintain separate mathematics and statistics departments. Statistics is also taught in departments as diverse as psychology, education, and public health.


Also see for scope of statistical applications: http://numberstory.blogspot.in/2010/12/number-story.html