Wednesday, December 15, 2010

Get Your Data Ready - Nine Data Preparation Lessons for Advanced Analytics

You may be ready for advanced analytics, but is your data? Follow this checklist to get started.

According to a recent survey from TDWI Research, 38 percent oforganizations are practicing advanced analytics today, whereas 85 percent say they will be within three years. Why the dramatic upsurge in advanced analytics? It’s driven by organizations’ need to understand constantly changing business environments (as seen in the recent recession and its aftermath), as well as to discover opportunities for cost reductions and new sales targets (keys to surviving a down economy).

Organizations will face challenges as they move into advanced analytics. Many don’t understand that reporting and analytics are different practices, often with different data requirements. This checklist is designed to help by listing and explaining data requirements that are unique to advanced analytics.

Determine what you want to accomplish with advanced analytics
Here are just a few of the possibilities:

Discover relationships – Advanced analytics can help organizations discover and quantify important relationships to help in such things as reducing fraud, optimizing marketing campaigns and improving product quality.

Anticipate the future – Predictive analytics can produce scores and statistics to help predict the outcomes of certain situations, such as when a customer might churn, and predictive models can assist with forecasting.

Understand and adapt to change – Advanced analytics can help you under-stand change in the form of rising costs or new customer behaviors. Discoveries made through analytics can lead to positive changes that help businesses adapt to an evolving world, but this level of discovery requires specialized analytic tools and analytic databases.

Scale up data integration to handle large analytical data volume
Many analytics databases begin with multiple terabytes, and data loading must scale up to handle large data volumes and load quickly. Large data extracts from operational systems must be as non-intrusive as possible. Over time, analytics should help fine-tune data collection so that only the right data is collected.

Extract, transform and load (ETL) is the preferred method of data integration for data warehousing. With data sets for advanced analytics, it makes sense to rearrange the acronym to ELT to make data available for preliminary analysis immediately. Incremental transformations, when available, can be done as needed within the enterprise data warehouse (EDW) or analytics database.

Realize that reporting and analytics have different purposes and needs
Many people confuse reporting and analytics. Predictive analytics are more exploratory and forward-looking than reporting and OLAP. The value of predictive analytics is discovery achieved through data-sampling best practices that ensure predictive models generalize to new data – future data that is yet to be collected – giving the models broad usage over a long life cycle. Reporting is based on well-known summarized information. OLAP is implemented as a form of parameterized reporting, where the parameters represent dimensions with limited analysis breadth.

Distinguish between data warehouses, data marts and analytical databases
An enterprise data warehouse (EDW) is a data warehouse that manages data for multiple lines of business, supports multiple data processing workloads and manages data in multiple data structures and models. The EDW is optimal to support any BI and advanced analytics practice. An EDW can handle both query-intense and predictive-scoring workloads and manage the low-level, detailed data that advanced analytics often require.

Organizations with a warehouse focused on reporting and OLAP need to extend it with a separate analytics database to support an analytics workload and appropriate data.

Design a data warehouse architecture that accommodates analytics
Analytics processed within the EDW – This is the ideal, but analytical tools require users to dump data into flat files – the data structure for which the tool is optimized. Today, in-database analytics allow users to manage data within a data warehouse for analysis. In-database predictive analytics require the data warehouse to support the scoring of predictive models deployed to it.

Analytics sandboxes – Data warehouses that support in-database analytics should also support analytics sandboxes (analytic databases set up within the EDW sourced with data requested by a user), so that users can work within the sandbox without unpredictable performance hits on the EDW.

Analytics databases outside the EDW – Rogue data marts and spreadsheets often proliferate outside the EDW. A new best practice is to selectively isolate disruptive analytics workloads on data marts and other analytics databases outside the EDW.

Prepare data to meet the needs of the analytics method you’ve chosen
Online analytical processing (OLAP) – OLAP’s purpose is to quickly answer multidimensional queries of summarized data. This provides good query performance, but limits queries to the data, dimensions and simple summary statistics of one or more cubes.

Query-based analytics – Query-based analytical methods depend heavily on structured query language (SQL). Users gather large volumes of operational data and load it into a data warehouse, or analytics database, to analyze it quickly in reaction to sudden business changes. This works well when the SQL processing is executed on an MPP (massively parallel processing) database platform that’s built for complex queries.

Predictive analytics – Predictive analytics demand specific data structures. Some tools have multiple algorithms, each with a unique data requirement. Most algorithms are optimized to run fast and accurately with a flat record structure of hundreds of fields, and perhaps millions of records.

Preserve analytical data’s rich details because they enable discovery
Analytical discovery depends on data nuggets: Discovery-oriented analytical applications depend on large amounts of raw source data. Data set size, and the details within the raw source data, are integral to clustering and relationship definitions produced by advanced analytics. If you strip out the details during data preparation, analytical discovery is hamstrung from the beginning.

Analytics data can also be unstructured: There is usually a need for text mining to discover useful facts that can be converted to structured data. Text mining output commonly feeds into predictive analytics, enhancing the accuracy of predictive models.

Data from an EDW can be analytical, too, of course: Advanced analytics require new data sets, but they can also tap a data warehouse to provide new discoveries with historical context. Discovered facts can be added back to the warehouse.

Improve data after working with it, not before
The perils of improving analytics data: As analysts understand large data sets, they may develop some data integration or data quality routines to improve the data. Improvements to data should occur only after business analysts have worked with it so that details that discovery-oriented analytics depend upon are not lost.

Data quality for analytics databases: Advanced analytics has the potential to identify useful information from data perceived to be poor. Advanced analytics should assess outliers prior to the application of standard data quality routines. Filling in missing field values may mask useful information. Some anomalies should not be fixed, but documented in metadata, so anyone using the data set knows of their presence and how to compensate for it.

Data modeling for analytics databases: Remodeling data can speed up queries and enable multidimensional views, but modeling can also remove data details and limit the scope of queries. Data sets intended for analytical discovery should be treated to data modeling if it is truly required and adds value.

Apply the products of advanced analytics to BI and data warehouse activities
Ultimately, the information and insights inferred from the application of advanced analytics should be applied to standard EDW data to enhance BI activities. Standard reports and ad hoc queries are necessary to run an organization, but not sufficient to propel the organization forward. Advanced analytics enable continuous learning and improvement, as they make the most of organizational data assets.

Wednesday, December 1, 2010

Number Story

Once upon a time, there was a student of statistics who, after completing his first year, went for industrial training to a near by leather goods manufacturing company. A day before joining the company, he was wondering of kind of statistics application that company will be using, which he will be required to learn and how he as individual can contribute in the process he is going to join, giving his first exposure to cooperate the right start.

Wondering on what kind of analysis he can apply in a leather manufacturing company, he went to his college and expressed his dilemma, with his professor. The professor thought for a while on his student's problem and then asked him “what do you require to apply any statistical analysis?” The boy quickly answered “data”. “But what is data?” asked professor. “Data is systematically arranged information for a set of parameters/ variables represented in numeric, alpha numeric or string format” came the response from student. The professor then said “the parameters or variable are just notation used as by user’s requirements, right?” The student in agreement said “yes”. “So what remains in data is the information represented in numeric, alpha numeric or string.” The student again agreed and said “yes that what forms data”. The professor continued “but then numeric, alpha numeric or string are only different format of data representation , and I believe that we can transformed and present everything only in numeric format, what do you think?” asked the professor again. The student was now pushed to think of something he has not yet thought about, but after some thinking he again agreed with his professor and said “yes we can transform all information in numbers”. The professor then asked his last question “tell me one thing now which you can not represent in numbers?” The student was further pushed in his thoughts on imagining numbers. He stood there and thought for the ten minutes, he thought from the distance between stars to the time taken by light to reach earth, from the oceans to countries, from human body to bacteria, from money to weapons, from desires to emotions, everything formed the picture of number and just numbers in his mind. He replied “I don’t know. It seems that I can transform all the information in numbers today”.

The professor then gave his last teaching for the day, he said “you told me that you need only data, which we have seen is nothing but some numbers, to apply any statistical analysis and now you are telling me that you can transform and represent everything in numbers. So are you telling me that you can apply statistical analysis on everything?

The student was once again pushed in his thoughts but this time he also thought that is it a trap and he is going to hear something unexpected from the professor. But he didn’t had much options, whatever his mind was thinking were now showing them as numbers and he was able to think of random analysis techniques, from basic to advance, he can use on them. Trusting his limitation of knowledge and his believe in learning, which brought him to his professor, he prepared himself for the worst and said “yes I can apply statistics on everything.

There was a sudden moment of silence in the room. The student was feeling a millions of particles jumping inside his stomach and to make thing worse for him; his mind was actually able to count the jumps, their timings, etc. and has already thought of a time series model which can fit on it.

The professor then asked the student again “so what were you asking me, what analysis you can apply at a manufacturing company?” He was looking into the eyes of students as if expecting the answer from him for his own question.

The student, by the time, was hit by the lighting which professor wanted to pass on to him. He was imaging all kinds of data which he will get at the company and what all analysis he will do. With a big smile he replied “no I know what analysis I will do there”.

The student came out of his professor’s room happily. But the lesson which professor was teaching him was not finished yet. As he was walking down towards his hostel, dreaming of what all he can represent in numbers and thus analyze them, he was struck by the same question asked by his professor “tell me one thing which can not be transformed and represented in numbers”.

The student went on for his training, competed his masters in statistics and is been working on statistical analysis implementation since last four years. But his search for the answer to his professor is still on.