The Data Scientist at Work
(ED NOTE: We can use this post to apply to data mining for medicine, and not business. We feel the same principles outlined apply. When you see the word “business”, you can insert “medicine”)
OCTOBER 24, 2013
Harvard Business Review calls it the sexiest job of the 21st century,1 it has become a popular concept (Google returns over 200 million hits), and in the business intelligence (BI) world everyone has an opinion on it: the data scientist. This article describes what data scientists do, the relationship with business intelligence, the specific characteristics of a data science exercise, the process a data scientist uses to discover business insights, and the relationship between the data scientist and the business analyst.
Note: This article contains extracts from the white paper2 “Discovering Business Insights in Big Data Using SQL-MapReduce.”
Business Intelligence and Discovery
Discovery of new business insights is the key task of a data scientist. Therefore, let’s start with explaining the concept of discovery and how it relates to business intelligence. Boris Evelson of Forrester Research3 defines business intelligence as follows:
Business intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision making.
It follows from this definition that business intelligence is not a tool, a technology, or a design technique. It is everything needed to transform and present the right data in a form that leads to business insights and improves the decision-making processes of an enterprise.
The tools used by decision makers to study and analyze data can be classified in two main categories: reporting tools and analytical tools. In principle, reporting tools show what has happened. Although the presented data may have been transformed, processed and aggregated, the data still shows the past and current situation. Typical examples of questions answered with reporting tools are: “Show the total revenue per sales region of the last two weeks” and “Present a 360˚ report of a particular customer.” Dashboards are also examples of reports. OLAP tools with which users can study data from every angle and at every level of detail belong to this category as do batch reports.
Analytical tools, on the other hand, are used to find out what may or can happen. They use techniques such as predictive modeling, simulation and forecasting. The result of analytics is usually not (aggregated) data, but a set of rules. Examples of such rules are: “When a customer buys cola and chips, there is a 75% chance he buys dipping sauce as well” and “The most efficient route to deliver goods to a particular set of shops is the following.”
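To make the notion of such a rule concrete, here is a minimal sketch of how the "75% chance" in the cola-and-chips example could be computed from purchase data. It is an illustration only: the function name `rule_confidence` and the sample baskets are hypothetical and do not come from the white paper.

```python
def rule_confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    antecedent = set(antecedent)
    consequent = set(consequent)
    # Keep only the baskets in which every antecedent item was bought.
    matching = [set(b) for b in baskets if antecedent <= set(b)]
    if not matching:
        return 0.0
    hits = sum(1 for b in matching if consequent <= b)
    return hits / len(matching)


# Hypothetical purchase data, made up for this illustration.
baskets = [
    {"cola", "chips", "dipping sauce"},
    {"cola", "chips", "dipping sauce"},
    {"cola", "chips", "dipping sauce", "bread"},
    {"cola", "chips"},
    {"cola", "milk"},
    {"bread", "milk"},
]

confidence = rule_confidence(baskets, {"cola", "chips"}, {"dipping sauce"})
print(f"{confidence:.0%}")  # prints 75%
```

Four of the sample baskets contain both cola and chips, and three of those also contain dipping sauce, which gives the 75% figure. Real analytical tools use far more data and additional measures, but the resulting rule has the same shape.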
In addition to the users using reporting tools and analytical tools, a third group of users can be identified. These users use anything they can find to discover new insights that can lead to business benefits. Surely they can benefit from reporting and analytical tools, but they need more. They need discovery capabilities.
Discovery is about searching and analyzing data to get new business insights that may lead to business opportunities. The questions these users ask are not that straightforward. They can’t be answered by simply starting a particular report or by firing up a pre-defined analysis. Examples of their questions are:
- What is a possible behavioral pattern of credit card usage that signifies a fraudulent action?
- What are other forms of data that can help us locate deeply buried oil fields more easily?
- How high is the financial risk if a person 21 years old with no job is granted a mortgage?
The challenge for discoverers is that they don’t always know what exactly they are looking for, although they probably have a feeling or an inkling.
The Data Scientist
Nowadays, we use the term data scientist to identify these discoverers who try to gain knowledge or awareness of patterns or rules not known before. But what is a data scientist and what does he do? For example, in an oil company, the ones responsible for analyzing soil test results to locate new oil fields or for analyzing new techniques to find new oil fields faster can be classified as data scientists. Another clear example of a data scientist is an actuary working for an insurance company. Actuaries deploy mathematics, statistics and financial theory to analyze the financial consequences of risk. Professors looking for cures for specific diseases by doing DNA research can also be classified as data scientists.
Usually, data scientists use all the data and all the tools they can get their hands on. They use the more popular analytical and reporting tools, but they don’t stop there. In other words, discovery is not a fancy new term for analytics. Analytics is just one of the many instruments used by data scientists to get new insights.
Although the term data scientist may be relatively new, this profession has existed for a long time. For example, Napoleon Bonaparte used mathematical models to help make decisions on battlefields. These models were developed by mathematicians, Napoleon’s own data scientists. Another (famous) example relating to that same time period is the Minard Map.4 This is a good example of a data scientist using geo-visualization to analyze data. The map depicts the advance into and retreat from Russia by Napoleon’s army in 1812-1813. This army was practically destroyed during the campaign; it set out with 422,000 troops and came back with a mere 10,000. Charles Joseph Minard was clearly a data scientist. Many more examples like this can be found.
Data scientists are smart people. They need business knowledge; they need to understand the enterprise data; they need to know how to deploy technology; they have to understand statistical and visualization techniques; and, most importantly, they need to know how to interpret the results. For example, if a discovery exercise shows that the number of storks born has a strong correlation with the number of babies born one year later, data scientists should have sufficient knowledge to conclude that these variables do not have a direct relation, but that they are both dependent on a third variable, one that probably hasn’t been included in the study yet.
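The stork example can be illustrated with a small sketch. It assumes the third variable has been identified and measured, here a hypothetical `region_size`, and it uses the standard first-order partial correlation formula to check how much of the stork-baby correlation remains once that variable is taken into account. All numbers are invented for illustration and were constructed so that both series depend on `region_size` only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical data: both counts grow with region_size plus unrelated noise.
region_size = [1, 2, 3, 4, 5, 6, 7, 8]
storks_born = [3, 3, 5, 9, 11, 11, 13, 17]
babies_born = [4, 7, 8, 11, 14, 17, 22, 25]

raw = pearson(storks_born, babies_born)
controlled = partial_correlation(storks_born, babies_born, region_size)
print(f"raw correlation: {raw:.2f}")
print(f"after controlling for region_size: {controlled:.2f}")
```

On this constructed data the raw correlation is high, while the partial correlation is practically zero. A result like that would suggest the two variables are linked only through the third one, which is exactly the kind of interpretation a data scientist has to supply.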
The Data Scientist Versus the Business Analyst
A traditional user of business intelligence systems is the business analyst. A business analyst assists end users in making informed business decisions. He exploits a data warehouse to uncover important facts and statistics that show an organization’s performance. He helps transform business needs into reports, analyzes data structures and defines business concepts. Quite often, he operates on the frontier between the IT department and the business departments.
Data scientists and business analysts may be using the same data, but they use that data differently. As indicated, the discovery work of a data scientist is about searching and analyzing data to produce new business insights that can lead to business opportunities. The work of the business analyst is more concrete. He creates reports for himself and for end users, helps end users to develop their own reports, and so on.
The boundary between these two jobs is not as clear cut as one may expect. Business analysts may be doing data scientist work occasionally, and vice versa. In fact, the person working as data scientist today may have the role of business analyst tomorrow.
The Data Scientist’s Four-Step Discovery Process
The discovery process used by data scientists commonly consists of four steps (see also Figure 1):
- Data acquisition: In this first step, data is collected from various data sources. Data scientists select the data sources that may be useful and relevant for their study.
- Data preparation: In this step, data is transformed, aggregated, integrated and cleansed until it has the form that data scientists need for their study. For example, for many data mining algorithms, it can be useful to transform real-life values to binary values.
- Data analysis: In this step, data is analyzed using various types of techniques, including simple reporting techniques; classic statistical techniques, such as forecasting, predictive modeling and clustering; advanced data mining techniques; data visualization techniques, such as affinity visualization, path visualization, scatter clouds and geo-visualization; and time-series analysis.
- Data interpretation: When the techniques and tools present results and insights, it’s still the responsibility of the data scientist to determine whether the results make sense. This requires in-depth knowledge of the business and the data, and it demands common sense.
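As an illustration of the data preparation step, the following minimal sketch shows one common way to transform real-life values to binary values: each distinct value of a field becomes its own 0/1 column. The function name `binarize`, the field names and the customer records are hypothetical and serve only as an example.

```python
def binarize(records, field):
    """Replace `field` with one 0/1 column per distinct value of that field."""
    categories = sorted({record[field] for record in records})
    prepared = []
    for record in records:
        # Copy every other field unchanged.
        row = {key: value for key, value in record.items() if key != field}
        for category in categories:
            row[f"{field}={category}"] = 1 if record[field] == category else 0
        prepared.append(row)
    return prepared


# Hypothetical customer records, made up for this illustration.
customers = [
    {"id": 1, "segment": "retail"},
    {"id": 2, "segment": "wholesale"},
    {"id": 3, "segment": "retail"},
]

prepared = binarize(customers, "segment")
for row in prepared:
    print(row)
```

Whether this transformation is needed depends on the algorithm applied in the analysis step. It is one of many possible preparation operations, alongside cleansing, aggregation and integration.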
Characteristics of the Data Scientist’s Discovery Process
The discovery process deployed by data scientists has the following characteristics:
- The discovery result consists of rules. The result of a discovery process is in most situations insights, and these insights are formulated as a set of rules. These rules can be simple if-then rules. For example, if two payments are done with the same credit card within 10 seconds, they are probably fraudulent. Rules can also be advanced statistical formulas indicating the relationship between specific variables. For example, a 10 degree rise in temperature increases sales of barbecue meat by 300%. Sometimes rules are sophisticated, self-learning data mining models that can predict customer behavior by combining historical and new incoming data.
- The discovery process is an iterative process. Figure 1 suggests that the discovery process is a serial process: when one step is finished, the next one starts, and we never return to a previous step. In practice, that is far from the truth. The discovery process is highly iterative. For example, when a data analysis step has been finished, the conclusion may be to collect more data and start all over again. Even a data preparation step may lead to a return to the data acquisition step. In fact, this entire four-step process may have to be repeated several times before the right insights rise to the surface.
- Discovery results should be actionable. When a discovery process is finished, the organization has experienced no advantages yet – no money has been made, no ROI. The discovery process has to be followed up by a step called Act. In this step, the gained insights have to be used or implemented. Examples of implementing insights are: organization policies are changed, decision rules are embedded in operational applications, business processes are optimized, customers are offered special discounts and so on. Without the Act step, the entire discovery exercise has been for nothing. In other words, it’s important that discovery results are actionable. Note that the data scientist is not always involved in the Act step.
- No clear goal. Another characteristic that shows that data scientists are different from most other BI users is that their analysis work doesn’t always have a clear goal. The work they do is much more free format, much more research-like. Because the goal is not always that clear, classifying this process as “finding a needle in a haystack” doesn’t always make sense. If you’re looking for a needle in a haystack, the goal is very clear, and with a powerful magnet it’s not even that difficult. Discovery is much more a stepwise refinement process. With each step, the data scientist may get closer to useful insights.
- Discovery may return spinoff results. It’s not uncommon that during the discovery process unexpected insights and rules are found. These spinoffs can be as useful as the rules intended to be found. Remember Alexander Fleming, who discovered penicillin by accident. There are more well-known examples like this. For example, chemist William Perkin wanted to invent a cure for malaria. His experiments accidentally led to the first-ever synthetic dye. And don’t forget John Pemberton, who created Coca-Cola while searching for a cure for headaches.
- Deployment of a wide range of analysis techniques. As indicated, data scientists use a wide range of analysis techniques to discover new insights. Many well-known statistical techniques can be used to find rules. A data scientist should have access to all the tools and techniques he needs. He should also be able to mix and match them. For example, he may want to apply a time-series analysis first, followed by a geo-visualization of the result. Data scientists should not be restricted in discovering valuable insights due to the lack of tools and techniques.
- Data overload doesn’t exist. The more data a data scientist has access to, the more discovery options he has. In this context, more means three things. First, it means more detailed data – no aggregate data. Aggregation of data can hide potential insights. Dealing with detailed data is a typical aspect of the big data trend. Nowadays, the technology exists to process massive amounts of data fast. Second, more means more data sources. Having access to a data warehouse is probably not enough for data scientists. They may also need access to large files with sensor data, spreadsheet data, external data sources and so on. It wouldn’t be the first time that rules are discovered by enriching internal business data with external data. Third, more means more types of data. Giving data scientists access to structured data is very useful, but not all the data has a very rigid structure. Data scientists may also require access to what some call unstructured, multi-structured, semi-structured or poly-structured data.
- Data scientists create new data. Usually, users of reporting tools don’t create their own data. They access data stored in a data warehouse or data mart. In some situations, the data that data scientists need doesn’t even exist yet. The consequence can be that dedicated projects must be initiated to create and collect the required data. An interesting example of such a project is the Amsterdam Born Children and their Development (ABCD) project. This project started in 2001 and still continues. The project tracks the health of 8,000 children. Every so many years, these children have a checkup. The goal of this long-lasting study is to discover the relationship between early growth and development and overall health later in life. This study is a good example of where the right data has to be created first.
- Discovery projects may be long lasting. Some discovery processes are completed in one day, but they can also last for weeks, months and even years. For example, in April 2013, researchers working at the academic hospital in the city of Utrecht in The Netherlands discovered a formula that predicts the risk of new health problems ten years later for patients who have had a heart attack or stroke. The formula looks at fourteen variables, including age, gender, smoking habits and blood pressure. This study started in January 1996 and ended in 2013. This is a good example of a long-lasting discovery project.
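The simple if-then rule mentioned in the first characteristic above (two payments with the same credit card within 10 seconds are probably fraudulent) can be sketched as follows. This is a minimal illustration, not a production fraud detector: the function name `flag_rapid_payments`, the card identifiers and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def flag_rapid_payments(payments, window=timedelta(seconds=10)):
    """Return (card_id, earlier, later) for consecutive payments on one card
    that are at most `window` apart."""
    last_seen = {}
    flagged = []
    # Process payments in chronological order.
    for card_id, timestamp in sorted(payments, key=lambda p: p[1]):
        previous = last_seen.get(card_id)
        if previous is not None and timestamp - previous <= window:
            flagged.append((card_id, previous, timestamp))
        last_seen[card_id] = timestamp
    return flagged


# Hypothetical payment events, made up for this illustration.
t0 = datetime(2013, 10, 24, 12, 0, 0)
payments = [
    ("card-A", t0),
    ("card-B", t0 + timedelta(seconds=3)),
    ("card-A", t0 + timedelta(seconds=4)),
    ("card-B", t0 + timedelta(seconds=60)),
]

flagged = flag_rapid_payments(payments)
for card_id, earlier, later in flagged:
    print(card_id, earlier.time(), later.time())
```

Here only the two payments on card-A are flagged, because they are 4 seconds apart, while the payments on card-B are 57 seconds apart. Embedding a rule like this in an operational application is an example of the Act step described above.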
With the introduction of highly scalable, low-cost data storage technology and fast in-memory analytical processing capabilities, the toolset of the data scientists has been enriched dramatically. Huge amounts of data can be stored and analyzed that were unthinkable a few years ago, and the techniques to analyze all that data have evolved enormously. If they weren’t already, data scientists may well become the key persons for organizations to survive in this increasingly competitive world.