Data Science has attracted enormous attention in recent years and has become one of the most desirable fields in which to build a career. Everyone is talking about it, or at least knows about the usefulness of data science in both the present and the future. Data Science has even been hailed as the sexiest job of the 21st century, with huge numbers of people aspiring to become data scientists. Yet for all the buzz, very few people understand this field to its core or grasp it in its true sense.
With so many people showing interest in data science, it is important to understand all sides of it. While its benefits, advantages, and capabilities are discussed in many forums, blogs, and communities, it also has some dark sides you should know and understand before taking your next step. Beyond all the success stories, this article brings you some of those darker sides, to paint a clearer picture for data science aspirants and anyone seeking to master this trending field.
But before going directly to the darker side, let's first take a brief look at data science itself.
What is Data Science?
Data Science is one of the most popular computational fields and has single-handedly revolutionized numerous other fields. At its core, it provides a foundation that helps computers solve a given problem. From drug design to banking to real estate to Android applications, data science gives practitioners an extra edge. It may be easier to get your fundamentals clear if you check out an illustrated explanation of data science.
In very simple words, as you have probably assumed or heard from others, data science is the study of data. To your surprise, however, data science is not just about designing and training the most advanced Artificial Intelligence and Machine Learning algorithms on data; it is also about finding the right data. Data Science involves extracting, visualizing, analyzing, managing, and storing data in order to generate insights from it. These insights help individuals, companies, and organizations make powerful, precise, and efficient data-driven decisions.
It is a multidisciplinary field with roots in computer science, math, and statistics, and it involves both structured and unstructured data. Rather than simply deriving insights from data, according to one study, a typical data scientist spends 79% of their time collecting, organizing, and cleaning data. Of that 79%, about 60% of total time goes to cleaning and organizing, while the other 19% goes to collecting.
Hopefully that is enough to give you the basics of data science. Now let's get to the most awaited section of this article: the dark secrets of data science.
Dark Secrets of Data Science
Like every coin, data science has two sides. Beyond all the lucrative salaries and benefits, it raises certain eyebrow-raising questions, or, let's say, has some not-so-good sides. Below are some of the dark sides that are enough to temper the hype about its usefulness for a better world.
Obvious data science discoveries
Currently, many data science discoveries are too obvious. For instance, when a hospital used data to look for causes of doctors' errors, it found lack of sleep to be the main culprit; when a bank tried to predict loan defaults, it found that people with little or no savings are more likely to miss their installments. These and other predictions, like tall people being more likely to hit their heads, are too obvious to be worth predicting.
The majority of the predictions data scientists are currently making are too obvious. If we can already make certain predictions, why spend extra effort, time, and money to reach them by collecting and visualizing data? Is it worth it?
That said, some scientists and big organizations do evaluate these predictions in order to study more specific or subtle details, such as predicting the future onset of a particular disease. But this usually requires a lot of data and analysis, and therefore demands more time, money, and effort.
Finding nothing at all
Sometimes, despite doing all the hard work, data scientists find nothing: no meaningful insights or patterns emerge after visualizing the data. The human mind is biologically wired to find patterns even where none exist. In data science, many of the questions that pop into a data scientist's head are attempts to validate connections noticed by the human brain. Sometimes they find something, and sometimes they don't.
Although a null or negative result is still a valuable result, it is often unsatisfying for the people who did all the hard work. They frequently end up concluding that they must have missed something, left skeptical about a "victory" that yielded nothing.
Harder to find answers
Sometimes, statistics-based answers can be trickier and harder to find than we generally think. This happens mostly when data scientists use overly sensitive statistical methods or small sample sizes, or are biased about the data, ultimately making the findings more likely to be wrong and unreliable.
The solution sounds simple: use significantly larger data sets and reliable statistical tools and methods. Large data sets make it possible to detect subtle effects that cannot be found in a normal scenario. Such subtle findings can be game changers in areas where even a small edge in understanding matters, such as medical diagnosis or equity trading. But the cost of gathering and analyzing large data sets can be very high, which puts them out of reach for smaller and even mid-sized organizations. Because of this, many organizations are reluctant to spend money and resources unless they are quite sure about the payoff.
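As a toy illustration of why small samples mislead (plain Python, made-up numbers, not any real study), the sketch below repeatedly draws small and large samples from the same pure-noise process and compares how far each sample's mean strays from the true mean of zero:

```python
import random
import statistics

random.seed(42)

# A "population" of pure noise: the true mean is 0, so any nonzero
# sample mean is an illusion created by sampling.
population = [random.gauss(0, 1) for _ in range(10_000)]

def mean_error(sample_size, repeats=200):
    """Average distance between a sample mean and the true mean (0)."""
    errors = []
    for _ in range(repeats):
        sample = random.sample(population, sample_size)
        errors.append(abs(statistics.mean(sample)))
    return statistics.mean(errors)

small_error = mean_error(10)     # tiny sample: means swing wildly
large_error = mean_error(2_000)  # large sample: means hug the truth
```

With a sample of 10 the apparent mean routinely drifts well away from zero, while the large sample stays close, which is exactly why small-sample "findings" so often fail to replicate.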
Algorithms imitating the past
All data science algorithms are built on data gathered from the past. This lets an algorithm imitate the past, but unfortunately not the future. Several fields today are changing at such a rapid pace that predicting their future is difficult; all the algorithms can really do is summarize what has already happened.
Take the fashion industry as an example. No matter how much past data you collect, the ties and suits of previous decades have nothing to do with upcoming trends. To paint a clearer picture: in the 1960s people wore skinny ties, by the 1970s the trend had shifted to much broader ties, around six inches wide, and today the scenario is completely different again. We can't use that data to predict the next trend of the 21st century, right?
In such cases, no data scientist can predict the exact future; they can only reveal what happened before, and it is up to us to guess whether it will happen again.
Data is simple and consistent? No!
No data is simple and consistent; in practice, data scientists often receive data that is messy, inconsistent, and corrupt. Take banking data: most of us assume financial data must be a great fit for analysis because transactions are so simple, but in reality it is messy too. One bank may record withdrawals as negative values while another records the same transactions as positive values. Add in the numerous fees and monthly charges that banks levy, and it becomes even harder to turn the data into a consistent set of columns.
And that is banking, which is comparatively simple because it deals only in money. In many other settings, such as stock markets and sensor networks, measurement errors creep into data collection and organization, and transforming the data into a consistent form becomes a real headache for any data scientist. The good news is that stronger effects are easier to trace even through all the inconsistency and noise.
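The sign-convention mismatch described above can be sketched in a few lines. Everything here is hypothetical: the two "banks" and their records are invented for illustration, not drawn from any real format.

```python
# Hypothetical conventions: Bank A records withdrawals as negative
# amounts, while Bank B records every amount as positive and marks
# the transaction type in a separate field.

bank_a = [-50.0, 200.0, -20.0]  # Bank A: the sign carries the meaning

bank_b = [  # Bank B: sign-less amounts plus an explicit type
    {"amount": 50.0, "type": "withdrawal"},
    {"amount": 200.0, "type": "deposit"},
    {"amount": 20.0, "type": "withdrawal"},
]

def to_signed(records):
    """Convert Bank B's convention to Bank A's: withdrawals go negative."""
    return [
        -r["amount"] if r["type"] == "withdrawal" else r["amount"]
        for r in records
    ]

# Only after normalization can both banks share one consistent column.
combined = bank_a + to_signed(bank_b)
```

Trivial as it looks, every source added to an analysis brings its own quirks like this, and reconciling them is where much of that data-cleaning time goes.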
Cheap data, costly filtering
Where data is cheap, filtering it becomes expensive. An uncountable amount of data exists today: terabytes of information about which GIF people use in a given situation, security cameras full of high-resolution images, the daily routines of individuals. The internet is overflowing with data. So when a problem arises, the challenge is not getting data but finding the right piece of data that leads to a solution.
Searching a large pool of data is something computers with a good model do well, but building that model is usually the data scientist's responsibility. This raises a chicken-and-egg question: do you first build a model that can distinguish a needle from hay, or do you first find the needle itself?
Expensive human filters
Humans are costly to employ as intelligent filters when building training sets for artificial intelligence and machine learning algorithms. They can classify images, read documents, or listen to audio tapes and fill out forms, ticking the right boxes in a consistent way. Many people across numerous countries do this work to build AI training sets, and it is an essential process: without this preliminary labeling, data science can't begin. It may cost a good amount of money, but it is generally completed in a manageable amount of time.
Certain data are impossible to get
Despite the large amount of data currently available, a lot of data remains annoyingly difficult to get. Look up the total inhabitants of a particular city and you will find a staggering amount of data, but within a week's time you will notice changes in that very same database, changes that become crucial if you are planning over a decade. Data like this shifts so rapidly that data scientists and statisticians struggle to keep a clear track of it.
The problem is compounded by incorrect or inaccurate details given by individuals during surveys, which makes it impossible for analysis teams to draw accurate conclusions and leaves data collectors with piles of unwanted, inconsistent data. Much survey and field data is simply never good enough to yield correct insights and can point in the wrong direction. And since data science, beyond its tools and its practitioners, relies heavily on the data that is available, data scientists often end up spending most of their time just gathering accurate data in the first place.
We can’t learn much from many algorithms
Although it is the algorithms that help us draw probable insights from data by finding patterns, in reality many of these algorithms teach us nothing. Yes, they can produce dramatic results with stunning precision, but if you try to find out how they actually did it, you will fail again and again. These algorithms tweak their responses through thousands of filters, and understanding that process would require analyzing millions of numbers, which is next to impossible for the human brain.
Moreover, even when these algorithms are well trained and very useful, they are often unstable and brittle. Understanding how an algorithm actually makes its decisions can help us predict when it might fail; lacking that knowledge leads to unpredictable failures.
Hidden biases are everywhere
In the world of data science, hidden biases are present everywhere, and the field is filled with anecdotes of data sets being biased despite data scientists' best efforts. For instance, if one collection of photos is taken in the morning and another in the afternoon, the artificial intelligence and machine learning algorithms may end up distinguishing the morning sun from the afternoon sun and the shadows they cast, rather than the subjects themselves.
Finding and eliminating biases like these is difficult, almost an art for the data scientist; if it were easy, everyone would have already found and removed them. There are statistical techniques that can help detect and remove such biases, but none of them promise a hundred percent success, and they are not nearly as automatic as we would like.
There’s always an answer
In data science, there is always an answer, even if it is the wrong one. Suppose you spot a particular car with an xyz number plate: among so many cars, what was the chance that you would see that exact plate at that exact moment? Data science algorithms will always hand you answers to questions like "what is the maximum, the average, the minimum?", whether or not those answers mean anything.
This leads to another challenge for data scientists, called "p-hacking": combing through a data set in search of results that look statistically significant. Thanks to randomness, there is always an answer present somewhere in the data. The real question is whether that answer will stand up over time.
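A toy sketch of the idea (plain Python; the group sizes and the gap threshold are arbitrary stand-ins for a real significance test, not any standard procedure): split pure noise into two groups enough times, and some split will inevitably look "significant".

```python
import random

random.seed(0)

# Pure noise: 100 observations with no structure whatsoever.
data = [random.gauss(0, 1) for _ in range(100)]

def mean(xs):
    return sum(xs) / len(xs)

significant = 0
trials = 200
for _ in range(trials):
    # Randomly partition the same data into two "groups" and compare
    # their means -- any difference found is chance, not signal.
    shuffled = data[:]
    random.shuffle(shuffled)
    group_a, group_b = shuffled[:50], shuffled[50:]
    # Crude stand-in for a significance test: a "large" gap in means.
    if abs(mean(group_a) - mean(group_b)) > 0.25:
        significant += 1

# Search long enough and chance alone produces "findings".
```

Some fraction of the 200 random splits clears the threshold even though the data contains nothing at all, which is exactly why results mined this way so rarely survive a fresh data set.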
The majority of data science projects produce thousands of graphs, patterns, and charts after examining every combination and sub-combination. Sometimes this results in correct and precise predictions; sometimes it is of no great help to businesses and organizations looking for patterns to guide their decisions. Most of them invest in data science hoping to find an answer that will grow their business, but in certain cases they end up exploring something that was never needed in the first place.
Privacy and data piracy
We saved the darkest secret of data science for last. With the advancement of technology, the risk of data piracy has increased severalfold. It is true that data scientists help businesses and organizations grow by enabling data-driven decisions, but the data used in the process may not have been obtained with the user's consent, or may have been obtained in a way that breaches the customer's privacy. A related debate concerns the very real possibility that data submitted by users to a company is sold to others, or leaked through poor security. Such data can include purchases, transactions, subscriptions, and much more. This has been one of the biggest concerns across several industries since the very beginning of data science.
These were some of the darkest and least discussed secrets of data science, which I believe every data science enthusiast, and anyone who wants to learn the field, should know. It is an ever-changing world where data is present in abundance, but the question is how much of it is useful, relevant, and efficient for making correct and precise data-driven decisions. If you are looking to step into this trending field or build a career in it, then beyond the advantages, salaries, usefulness, and applications of data science, I believe you should also prepare yourself for its disadvantages, its not-so-good sides, its dark secrets.
But all is not so dark and gloomy in this sphere. With the popularity of data science, both jobs and salaries are on the rise. If this article has awakened your curiosity about the subject and you are ready to step into its waters, try out the "Introduction To Data Science Using R Programming" online tutorial. The course comes with 7 hours of video covering an equal number of sections that are imperative for understanding data science thoroughly. The topics included are basic data visualization, Leaflet maps, popups and labels, linear and multiple regression, time series, decision trees, and much more!