Handling risk management of data in Hadoop

Defining risk management
Risk management is the process of identifying, assessing, and prioritizing problems that can occur in a process. Risk is the possibility of a system failure that can happen if preventive measures are neglected. In big data analytics, risk management means minimizing, monitoring, and controlling the probability of failure or of losing critical application data. A sound risk management policy leads to a better system architecture and better data management. Hadoop carries a high possibility of data theft (owing to the limited security mechanisms built into it) because a large amount of data is accessed by a large number of users; to maintain and manage quality data, we need a system that secures the data in order to avoid system failure and data theft.

Risk management for Hadoop
Many traditional IT practices provide security and data control techniques that form the basis for securing Hadoop, for example, using standard perimeter protection around the Hadoop computing environment and monitoring each user's activity, with log management to track users accessing the various layers of the Hadoop framework. One user might be using Pig for data analysis while another is solving complex queries with HBase, and the data of both users must be kept intact and secure. Hadoop, however, is a vulnerable environment, and it is too vast for these measures to cover fully; they lack the ability to protect the complete framework.

Traditional techniques in Hadoop
Let's discuss some recent and robust steps taken to secure the Hadoop environment and to manage and lower the risk factor.

  • Kerberos (just as in network security): a traditional authentication method that is also implemented in Apache Hadoop.
  • Apache Knox, which is used for perimeter security.
  • Enterprise-scale security (ESS), which is used for securing the Apache Hadoop framework.
  • Argus (an Apache Foundation project, since renamed Apache Ranger), which is used for monitoring and managing security across the framework.
  • Last but not least, existing security measures such as network firewalls and logging, monitoring, and auditing schemes for data and configuration management.
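As an illustration of the first item, enabling Kerberos in Hadoop is largely a configuration exercise. The fragment below is a minimal sketch of the two standard core-site.xml properties that switch a cluster from simple to Kerberos authentication; the per-daemon principal and keytab settings, which vary by deployment, are omitted.

```xml
<!-- core-site.xml: minimal sketch of enabling Kerberos authentication.
     Property names are the standard Hadoop ones; a full setup also needs
     per-daemon principals and keytabs, which depend on your KDC and realm. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>  <!-- default is "simple" (no authentication) -->
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>      <!-- enable service-level authorization checks -->
</property>
```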

Are these security measures enough?
Data masking is one of the prominent technologies contributing to the Hadoop architecture. It is mainly used to obfuscate large amounts of data and to derive test and development data from live production information systems. The disadvantage of this technology is that once data is created using a masking scheme, the transformation is irreversible, which makes it highly unsuitable for many analytical applications and data processing techniques.
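To make the irreversibility point concrete, here is a minimal sketch in plain Python (not any specific masking product) of a one-way masking function based on salted hashing: the same input always yields the same token, so joins across datasets still work, but no function exists that recovers the original value.

```python
import hashlib

def mask_value(value: str, salt: str = "static-salt") -> str:
    """One-way mask: hash the value so the original cannot be recovered."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

ssn = "123-45-6789"                  # hypothetical sensitive field
masked = mask_value(ssn)
assert mask_value(ssn) == masked     # deterministic: joins still work
assert masked != ssn                 # original value is hidden
# There is no inverse of mask_value: analytics that need the real values
# (e.g. contacting flagged customers) cannot run on masked data.
```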

Another technique used in conjunction with the existing security measures is storage-level encryption (similar to traditional data encryption schemes), where all the data stored in HDFS (which physically resides in Linux directories on the cluster nodes) is encrypted at the disk level and marked as data at rest. Although this protects the data from theft, prevents anonymous users from accessing it, and helps when moving data from one cluster to another, it is helpless in protecting the data while the disk is in use (as many users access the data at the same time).
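Hadoop's own implementation of this "data at rest" idea is HDFS transparent encryption, where files inside a designated encryption zone are encrypted and decrypted transparently for authorized clients. The commands below are a sketch of creating such a zone on a cluster that already has a Hadoop KMS configured; the key name and path are hypothetical.

```shell
# Sketch: HDFS "data at rest" encryption (requires a running cluster + KMS).
hadoop key create masterKey                 # create a key in the Hadoop KMS
hdfs dfs -mkdir -p /secure/zone             # zone directory must exist and be empty
hdfs crypto -createZone -keyName masterKey -path /secure/zone
hdfs crypto -listZones                      # verify the encryption zone
```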

Efficiently handling the risk factor in Apache Hadoop
Implementing a data-centric strategy is well suited to data management and to avoiding the risk factor in the Hadoop framework. In this approach, some of the sensitive data field elements are replaced with usable substitute data, while a controlled path for re-identifying them is retained. This simply means that we de-identify the sensitive data values so that they are no longer the real values.
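A minimal sketch of this idea in plain Python (the class and names are illustrative, not a specific product API): sensitive values are replaced with random tokens, a repeated value always gets the same token so analytics joins keep working, and a protected mapping table allows authorized re-identification, unlike irreversible masking.

```python
import secrets

class Tokenizer:
    """De-identify values by replacing them with random tokens, keeping the
    token-to-value mapping in a vault so authorized users can re-identify."""

    def __init__(self):
        self._vault = {}    # token -> original (store securely in practice)
        self._issued = {}   # original -> token (so repeated values match)

    def tokenize(self, value: str) -> str:
        if value in self._issued:
            return self._issued[value]
        token = "TKN-" + secrets.token_hex(6)
        self._issued[value] = token
        self._vault[token] = value
        return token

    def detokenize(self, token: str) -> str:
        """Authorized re-identification: look the original value back up."""
        return self._vault[token]

tk = Tokenizer()
t = tk.tokenize("john.doe@example.com")           # hypothetical PII value
assert tk.tokenize("john.doe@example.com") == t   # stable across records
assert tk.detokenize(t) == "john.doe@example.com" # reversible when authorized
```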

One of the key benefits of this approach is that domain experts can run analytics on data protected with the data-centric method. Data scientists can perform their analysis activities without needing access to personally identifiable information to achieve the key aspects of business intelligence.

The data-centric approach is a widely accepted approach for big data analysis, especially in Hadoop and the frameworks that support it. It is used by many market leaders in the banking, healthcare, airline, and insurance sectors.

Salient features of this approach for better risk handling are:
Fraud detection: A sample test case: fraudsters collect prescriptions from many doctors (say, five doctors) and get them filled at several pharmacies. Using traditional, manual techniques, it would take several weeks to analyze the track record; Hadoop enables tracking this instantly and very efficiently.
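The logic of that test case can be sketched locally as follows; in production this would run as a MapReduce or Hive job over the full prescription dataset, and the record layout and threshold here are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical record layout: (patient_id, doctor_id, drug)
records = [
    ("p1", "d1", "oxycodone"), ("p1", "d2", "oxycodone"),
    ("p1", "d3", "oxycodone"), ("p1", "d4", "oxycodone"),
    ("p1", "d5", "oxycodone"), ("p2", "d1", "amoxicillin"),
]

def flag_doctor_shopping(records, threshold=5):
    """Flag (patient, drug) pairs prescribed by >= threshold distinct doctors."""
    doctors = defaultdict(set)
    for patient, doctor, drug in records:
        doctors[(patient, drug)].add(doctor)
    return [pair for pair, docs in doctors.items() if len(docs) >= threshold]

assert flag_doctor_shopping(records) == [("p1", "oxycodone")]  # 5 doctors
assert flag_doctor_shopping(records, threshold=6) == []        # none at 6
```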

Efficient health analysis: Hadoop is used worldwide in the healthcare sector, where its prime job is to generate health records for hundreds of thousands of patients and provide health status reports using data visualization techniques.

De-identifying data: The data-centric approach also helps in de-identifying the data (especially for the healthcare and insurance sectors) so that it looks as if it were copied from traditional database systems.

Ease of scalability: This approach scales efficiently from 10 to 1,000 nodes while keeping the risk factor in mind, so that the whole system stays stable and robust.

Enabling innovation: With risk management in mind, this approach enables innovation through data access schemes and well-managed data stored in an HDFS (Hadoop Distributed File System) cluster.

Augmenting infrastructure protection (by implementing risk analysis) and de-identifying the data before moving critical information into or out of the cluster, while retaining its format and behavior, may well be the best practice for risk management in Hadoop.

Conclusion
This article is an overview of Apache Hadoop's limited ability to secure data and to manage a huge framework while configuring concurrent access for users with security in mind. We have discussed some techniques and useful measures that can be taken to secure the data, avoid failure, and implement an efficient risk handling mechanism.
