Organizations of all sizes, operating in different sectors of the economy, generate large amounts of data and have an interest in obtaining insights from their data stores. The discipline that helps organizations extract actionable insights from their data has come to be known as data science, and efforts to get insights from data can be collectively referred to as data science projects. For a data science project to succeed, the right skills and tools need to be applied together.
A data science project involves several key roles and responsibilities that are critical to its success. Depending on the size of the organization, each role may be filled by a different person, or one individual may take up several roles. The first role is the project sponsor. The project sponsor is responsible for bringing forward the benefits to the business and evangelizing the value of the data science project. The second role is the client, who is responsible for representing the interests of the end users of the project's output. The third role is the data scientist, who is responsible for planning and executing the data analysis and for communicating with the client and the sponsor. The data scientist's responsibilities include identifying the statistical techniques and machine learning algorithms that are appropriate for the problem.
The fourth role is the data architect, who is responsible for data management. Data management responsibilities include data extraction, data transformation according to business rules, data loading into data warehouses and data quality management. In short, the data architect lays the data infrastructure that will be used by the data scientist. The final role is operations, which is responsible for managing how the results from the data science team will be incorporated by other teams. For example, operations will be central to incorporating a recommendation engine developed by the data science team into a shopping cart maintained by the IT team.
A data science project has several stages, which are undertaken in an iterative manner. Although our discussion here identifies several stages, in practice the project is not cleanly delineated into distinct stages. Activities from different stages often overlap, and there is no guarantee the project will proceed sequentially.
- The first step in a data science project is establishing a clear and measurable objective for the project. In this first step, it is very important to identify the following.
1) The requirements of the sponsor and what the sponsor currently lacks.
2) The efforts already in place to meet the needs of the sponsor, and the shortcomings of those efforts.
3) The resources that are needed in terms of data, skills and computing infrastructure.
4) How the results will be used and any hurdles that will be faced in using the results.
- The second step in a data science project is identifying and preparing the data that will be used. In this step, you examine the data thoroughly to establish whether it will help in achieving the project objective. The data architect plays a major role in this step. In this step it is very important to establish the following.
1) Availability of data.
2) Whether the data will satisfactorily meet project objectives.
3) Quality aspects of the data such as completeness and accuracy.
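A completeness check like the one mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `completeness` helper, the field names and the sample records are all hypothetical, and a real project would run such checks against the data sources the data architect provides.

```python
def completeness(records, fields):
    """Return, per field, the fraction of records with a non-missing value."""
    total = len(records)
    report = {}
    for field in fields:
        # Treat None and the empty string as missing (an assumption of this sketch).
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        report[field] = present / total if total else 0.0
    return report


# Hypothetical sample records, invented purely for illustration.
transactions = [
    {"id": 1, "amount": 25.0, "country": "US"},
    {"id": 2, "amount": None, "country": "KE"},
    {"id": 3, "amount": 310.5, "country": ""},
    {"id": 4, "amount": 12.0, "country": "DE"},
]

report = completeness(transactions, ["id", "amount", "country"])
for field, share in report.items():
    print(f"{field}: {share:.0%} complete")
```

A field with low completeness is a signal to go back to the data source before modeling. Accuracy of the values themselves usually needs domain-specific checks that this sketch does not cover.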
- The third step is building the models that will identify patterns that exist in the data. In this step you apply statistical techniques and machine learning algorithms that will extract insights from data. The model building step often overlaps with the data identification and preparation step, because during modeling you may uncover data quality problems or realize that the available data will not meet the project objective. Some common modeling objectives are listed below.
1) Predicting categories. This is referred to as the classification problem. For example, you can predict which customers are likely to churn, which customers are likely to make a purchase or which transactions are fraudulent.
2) Creating groups based on similarity. This is referred to as clustering. For example, a clustering algorithm can be used to group customers who have similar purchasing patterns.
3) Estimating the probability of events. This is referred to as scoring. For example, a credit card company can use scoring to identify high risk and low risk customers.
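To make the relationship between scoring and classification concrete, the following Python sketch turns customer attributes into a probability-like score with a logistic function and then derives a category by thresholding that score. The weights, the bias, the attribute names and the 0.5 threshold are hypothetical values chosen by hand for illustration. In a real project they would be learned from data.

```python
import math

# Hypothetical, hand-picked parameters; a real model would learn these from data.
WEIGHTS = {"missed_payments": 0.9, "utilization": 2.0}
BIAS = -3.0


def risk_score(customer):
    """Scoring: map customer attributes to a value between 0 and 1."""
    z = BIAS + sum(WEIGHTS[name] * customer[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def classify(customer, threshold=0.5):
    """Classification: turn the score into a category by thresholding it."""
    return "high risk" if risk_score(customer) >= threshold else "low risk"


# Hypothetical customers, invented purely for illustration.
careful = {"missed_payments": 0, "utilization": 0.2}
stretched = {"missed_payments": 3, "utilization": 0.9}

for name, customer in [("careful", careful), ("stretched", stretched)]:
    print(name, round(risk_score(customer), 3), classify(customer))
```

Keeping the score and the threshold separate is a deliberate choice in this sketch: the business can move the threshold to trade missed cases against false alarms without rebuilding the model.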
- The fourth step in a data science project is assessing model performance. When assessing a model, you need to look at the following aspects.
1) The accuracy of the model relative to your project objectives
2) Plausibility of the model
3) Performance relative to the approaches in use
Whenever a model does not meet the set criteria, you need to reassess the data or your project objectives. For example, if you are classifying financial transactions, you need to decide in advance what level of accuracy is acceptable, including how many false alarms can be tolerated. If a model cannot meet that level, you might, for example, add more data.
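The assessment criteria above can be sketched for the financial transactions example as follows. The labels, the predictions, the `REQUIRED_ACCURACY` target and the "flag nothing" baseline are all hypothetical stand-ins: the baseline represents whatever approach is currently in use, and the numbers are invented for illustration only.

```python
def evaluate(actual, predicted):
    """Compare 0/1 labels (1 = fraud) and return basic assessment metrics."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # fraud caught
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # legitimate, not flagged
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarms
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # fraud missed
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "false_alarm_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels and predictions, invented purely for illustration.
actual = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
model_predictions = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
baseline_predictions = [0] * len(actual)  # stand-in for the current approach

model_metrics = evaluate(actual, model_predictions)
baseline_metrics = evaluate(actual, baseline_predictions)

REQUIRED_ACCURACY = 0.75  # hypothetical target agreed with the sponsor
meets_target = model_metrics["accuracy"] >= REQUIRED_ACCURACY
beats_baseline = model_metrics["recall"] > baseline_metrics["recall"]
print(model_metrics, baseline_metrics, meets_target, beats_baseline)
```

On this made-up sample the baseline reaches an accuracy of 0.7 while catching no fraud at all, which is why the sketch reports the false alarm rate and recall alongside accuracy instead of relying on accuracy alone.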
After identifying a model that meets your project objectives, the next step is presenting the model to stakeholders. The model also needs to be documented so that the documentation can serve as a reference. When preparing the documentation, keep in mind the needs of different stakeholders. For example, management would be interested in knowing the impact of the model on business outcomes. In the financial transactions example, management would be interested in knowing the dollar savings from implementing the classification model.
The last step in a data science project is operationalizing the model. The key activities here are deploying and maintaining the model. It is advisable to pilot the model before a full scale deployment, as this will help identify any shortcomings. After deployment, the model moves from the data science team to the team that will use it in day to day operations. Maintenance is needed to adapt the model to data and business changes. For example, the transaction classification model may only be able to identify fraudulent transactions involving small amounts of money, in which case it needs to be updated.
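One way to notice the kind of shortcoming described above is to monitor, after deployment, how often confirmed fraud was flagged in different amount ranges. The Python sketch below is a minimal illustration under stated assumptions: the record fields (`amount`, `is_fraud`, `flagged`), the 1000 boundary and the sample data are hypothetical.

```python
def detection_rate_by_band(transactions, boundary):
    """Share of fraudulent transactions that were flagged, split at `boundary`."""
    counts = {"small": [0, 0], "large": [0, 0]}  # [flagged frauds, all frauds]
    for t in transactions:
        if not t["is_fraud"]:
            continue  # only confirmed fraud counts toward the detection rate
        band = "small" if t["amount"] < boundary else "large"
        counts[band][1] += 1
        if t["flagged"]:
            counts[band][0] += 1
    # None marks a band with no fraud cases, where the rate is undefined.
    return {
        band: (hits / total if total else None)
        for band, (hits, total) in counts.items()
    }


# Hypothetical monitoring records, invented purely for illustration.
history = [
    {"amount": 40, "is_fraud": True, "flagged": True},
    {"amount": 75, "is_fraud": True, "flagged": True},
    {"amount": 60, "is_fraud": False, "flagged": False},
    {"amount": 5000, "is_fraud": True, "flagged": False},
    {"amount": 12000, "is_fraud": True, "flagged": False},
    {"amount": 8000, "is_fraud": True, "flagged": True},
]

rates = detection_rate_by_band(history, boundary=1000)
print(rates)
```

A large gap between the two bands, as in this made-up sample, would suggest that the model needs to be updated with more examples of large fraudulent transactions.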
In this article, we introduced data science projects and noted that a successful data science project requires different roles, skills and tools. We discussed the roles and responsibilities that are critical for project success. We also discussed the different steps of a data science project and what is important in each step.