
Big Data and the Big Problems It Brings


As the saying goes: first the facts, and then you can spin them however you like. The famous line "There are lies, damned lies, and statistics" is no longer a joke but a daily norm. The thing is, an array of data, no matter how large, is by itself just an array of data. To extract information from it, you have to perform operations on it, and then comes the most important step: analyzing the results. That is something only people can do, and human judgment is subject to inaccuracies and distortions.

This holds even when the data comes from correct measurements. Many areas of science and business are now undergoing fundamental changes brought about by systems for mass data collection and analysis; the Internet and other means of mass communication have made this work easier than ever. We live in a time when data is simple to get but hard to understand. Many companies, and not only commercial ones, sit on huge deposits of data running into hundreds of terabytes, and the ability to collect more is unprecedented: APIs, research instruments, and other tools are at your disposal. But carried away by the pursuit of terabytes of data and the gigahertz of processors to crunch them, we forget the purpose of such studies.

After all, the point of Big Data is to find dependencies in large data sets that no human analyst could detect on their own. Yet some important questions remain unanswered, despite the abundance of Big Data news on the web.

Here are five questions that, I think, are worth raising for anyone who is going to work in the field of Big Data.

  1. Large Amounts Do Not Guarantee Quality
  2. All The Data Are Not The Same
  3. “What?” And “Why?” Are Different Questions
  4. Everything Can Be Explained In Many Ways
  5. Use Of Public Data Is Not Always Ethical

Let's go through each of these points in more detail.

More Does Not Mean Better

Despite the "Big", quality matters more than quantity. And to judge quality, you have to understand the limitations of the data, one of which is the way it was sampled. The accuracy of the sampling method matters across the social sciences and in economic research: it determines what conclusions can be drawn and which methods of analysis and extrapolation may be applied. For a sample to be representative, it must be random; if you are studying topological properties, diversity plays the key role.
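
To make this concrete, here is a minimal sketch (with an invented population and metric) comparing an estimate from a random sample with one from a "convenience" sample of whatever records happen to be on top:

    import random

    # Invented population: 100,000 "users", 20% of whom show the behavior
    # we are trying to measure, so the true rate is known by construction.
    random.seed(42)
    population = [1.0 if random.random() < 0.2 else 0.0 for _ in range(100_000)]
    true_rate = sum(population) / len(population)

    # Random sample: every member has an equal chance of selection,
    # so even a modest sample tracks the truth closely.
    random_sample = random.sample(population, 1_000)
    print(f"true rate:     {true_rate:.3f}")
    print(f"random sample: {sum(random_sample) / 1_000:.3f}")  # close to truth

    # "Convenience" sample: just the first records we happen to reach,
    # analogous to a researcher studying only their own feed.
    biased_sample = sorted(population, reverse=True)[:1_000]
    print(f"biased sample: {sum(biased_sample) / 1_000:.3f}")  # wildly off

No amount of extra volume in the convenience sample fixes the bias; only the selection method does.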

Big Data algorithms can find statistical regularities in large volumes of data even when, in fact, there are none. Under such conditions, the appearance of false predictions is only a matter of time, and since they share the same roots and methods as the true ones, telling them apart is extremely difficult. In proper research, the sample is constructed according to scientific requirements: the type of sampling is planned in advance and the data is collected to specification. That is not easy, but it allows plausible estimates from incomplete (and they are always incomplete) data. Big Data changes the rules of the game here by allowing research, in theory, over the entire available data set. It is impossible to interview everyone on Earth, but it is possible to collect data on every Facebook user, and Facebook itself certainly does.
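
Here is a minimal sketch of how chance alone produces such "regularities", using purely synthetic data: scan enough noise columns against a noise target, and some of them will look impressively correlated:

    import random
    import statistics

    random.seed(0)
    n_rows, n_cols = 50, 500

    # 500 columns of pure noise plus a noise "target": by construction,
    # no real relationship exists anywhere in this data set.
    target = [random.gauss(0, 1) for _ in range(n_rows)]
    columns = [[random.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_cols)]

    def pearson(xs, ys):
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs)
        vy = sum((y - my) ** 2 for y in ys)
        return cov / (vx * vy) ** 0.5

    # Scan every column against the target and keep the "impressive" ones.
    # With enough comparisons, chance alone yields strong-looking signals.
    strong = [i for i, col in enumerate(columns) if abs(pearson(col, target)) > 0.3]
    print(f"noise columns with |r| > 0.3: {len(strong)} of {n_cols}")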

Researchers, however, usually have no access to such samples. When studying Twitter users, for example, they usually cannot see data for all users; they see their own feed and assume it supports general judgments, but such a sample is neither complete nor random. Many believe that, simply because the volumes of data they hold are so great, any judgment can rest on them, but that is wrong. Without understanding the structure and nature of the data's source, it is impossible to choose the right methods of analysis and processing, and the conclusions drawn from the analysis will be incorrect.

Not All Data Is Equally Useful

Because of the sheer volume, many researchers take Big Data algorithms to be the best research tool and attach little importance to the data's "purity". I was very surprised by the view, held in some quarters, that the further development of Big Data technologies will make other approaches to mass research unnecessary. This view pops up most often in connection with research on social networks. Indeed, why spend money on costly opinion polls, phone interviews, and profile processing if you can just take a sample of data from social networks? But the opinion that data from, say, Facebook is more accurate than data obtained through sociological surveys is wrong.

First of all, because those who make such claims do not see the difference between the sources these data come from. I am not even contrasting surveying people with collecting data on their behavior in computer networks. The point is that there are many types of social networks which we lump together under one general term, and each of them requires its own methods of research and data collection. Similar distinctions exist in other areas of statistics and analytics. Moreover, electronic network data often reflects real relationships only loosely: a person's actual social network may extend well beyond what is marked on Facebook while not including many of the people who are marked there.

Do not forget that today's computer networks give us a rather primitive representation of our relationships. In reality, everything is far more complicated than the beautiful pictures of social graphs suggest, and in many cases corrections must be made for inaccurate or misleading data. Universal data does not exist, and the ability to run calculations or build models on top of it does not change that. You must understand very well what information can, and cannot, be extracted from any given data.

“What?” And “Why?” Are Different Questions

Marketers love Big Data, mainly because they do not understand how it works or what the data can actually give them. For example, they confuse facts with causes: the number of "likes" on a social network page with people's recognition of the brand. Analyzing people's behavior and interactions, financial transactions, and so on is a very important task, but it is only the first step toward understanding what will happen. To predict future behavior, it is not enough to answer the question "what is happening?"; you also need to understand why it is happening. The answer to the second question does not follow directly from the first, and conflating the two is even more dangerous. Drawing conclusions from the collected data is a difficult task that requires considerable domain knowledge and well-developed intuition. In short, even with properly collected and analyzed data, you cannot do without a qualified expert if you want to understand what all these figures and graphs actually mean and what implications can be drawn from them. And this brings us to the next problem.
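
A minimal sketch of the gap between "what" and "why", with invented numbers: likes and sales correlate strongly, yet neither causes the other, because a hidden variable drives both:

    import random
    import statistics

    random.seed(1)
    n = 1_000

    # Invented confounder: total marketing spend drives both the number
    # of "likes" a page collects and the actual sales.
    spend = [random.uniform(0, 100) for _ in range(n)]
    likes = [2 * s + random.gauss(0, 10) for s in spend]
    sales = [5 * s + random.gauss(0, 25) for s in spend]

    def pearson(xs, ys):
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs)
        vy = sum((y - my) ** 2 for y in ys)
        return cov / (vx * vy) ** 0.5

    # The "what" is plainly visible in the data...
    print(f"corr(likes, sales) = {pearson(likes, sales):.2f}")  # high
    # ...but the "why" is not: buying likes would not move sales, because
    # the real driver is the hidden variable `spend`.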

Interpretation Of Information

However sophisticated the analysis algorithms may be, their results still have to be interpreted by humans, and it hardly matters who that human is: you, a marketer, or a specially hired analyst. Interpretation, as the rationale behind the analysis and its integration into some larger system, is inextricably linked not only with the analysis itself but also with the personality of the analyst. From the same data, five different people can draw five different conclusions, and further steps will be planned precisely on the basis of those conclusions. If they were wrong, the consequences of the resulting actions can be disastrous.

An example: Friendster, Facebook's less successful predecessor, studied the work of sociologists before launching its network. In particular, one study concluded that a person can effectively maintain only about 150 social relationships with other people. Unfortunately, Friendster took this conclusion as a guide to action and capped the number of "friends" at 150. As the example of Facebook shows, this was a mistake: the error lay in interpreting the concept of a "social tie" as equivalent to a "friend" on a social network, and Facebook demonstrates that these concepts are not the same. Errors of interpretation also arise when the analyst has to reconcile the data with the theory he holds and through which he reads the results. When the facts collide with a bad theory, there are two options: either "correct" the facts (in other words, reject all experimental data except what fits the theory's framework), or admit the theory is wrong and build a new one (which not everyone can do, and few bosses, asking "what do the results of the analysis show?", want to hear "I don't know"). Often this choice is made unconsciously: we are all susceptible to cognitive biases that make us automatically dismiss as unimportant or false any information that fits poorly with our views.

What Is Good And What Is Bad

The ethics of Big Data research is still a "gray zone" with no established rules or patterns of behavior to follow. The apparent impersonality of data collected by automatic algorithms plays into researchers' hands: we are getting used to manipulating arrays of personal data as if they were just ones and zeros rather than quantified pieces of the lives of hundreds of people, people who, as a rule, were never asked whether they wanted to participate in such studies. Privacy, like many other concepts, is tied to context. So far the dominant view is that if data is publicly available, it is available for use, but there is a difference between data that has been shared and data that is available at any time for any purpose. For now, using data found in the public domain is allowed, but this will change soon, or else moral norms will change. It is hard to say with certainty which will happen first.

What To Do?

I have tried to identify the main ways of overcoming the problems described above.

An end to siloed analysts. Too few of the experts who make important decisions have analytical skills, so the personal opinion of one person can become critical for the course of an entire company. Instead of hiring outside experts and paying them, cultivate in-house analytical staff, ideally drawing on people who have analytical skills and thinking but are not part of a dedicated team of analysts: they can voice an independent opinion without fear for their standing, and can often look at a problem from an unexpected angle.

IT is more I and less T. The technological side of IT in an enterprise is important, but it should not obscure the information. To this day, IT depends and rests on specific people and personalities. Understanding (and often guessing) the needs of the other departments served by the IT department is not an easy task, and it should be handled by experts with a clear picture of the enterprise's entire IT structure. Systems thinking and teamwork skills are the traits most often in short supply here.

For information to be analyzed, it must be well structured, yet many organizations that collect data do nothing to structure it. It is as if the books in a library lay side by side, uncatalogued. Structured data allows efficient analysis and quick retrieval of the desired information.
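
A minimal sketch of the catalogue analogy (the record fields are invented): the same records, once indexed, answer a question by a direct lookup instead of a scan of the whole pile:

    # The record fields here are invented for illustration.
    records = [
        {"id": 1, "region": "EU", "revenue": 120},
        {"id": 2, "region": "US", "revenue": 300},
        {"id": 3, "region": "EU", "revenue": 80},
    ]

    # Unstructured approach: every question means scanning the whole pile.
    eu_total = sum(r["revenue"] for r in records if r["region"] == "EU")

    # Structured approach: build the catalogue (an index) once...
    by_region = {}
    for r in records:
        by_region.setdefault(r["region"], []).append(r)

    # ...then any region lookup goes straight to the right shelf.
    eu_total_indexed = sum(r["revenue"] for r in by_region["EU"])
    assert eu_total == eu_total_indexed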

Analysis should be used in conjunction with modeling. As practice shows, the capabilities of purely analytical algorithms often do not suffice, and it is easy to see why: in analyzing, we inevitably look to the past and then try to extrapolate the results into the future. Practice shows that this approach is of limited effectiveness on its own. Systems theory lets us understand the general laws governing a system's behavior at any moment, and on the basis of these laws models are built and then corrected using the extrapolated data. This combined method is far more effective than either approach alone.
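
A minimal sketch of this point, assuming an invented scenario of user growth saturating at a market ceiling: the pure trend extrapolation sails past the ceiling, while a structural model saturates:

    CAPACITY = 10_000  # invented market ceiling

    def actual_users(month):
        # The "true" system: logistic-style growth saturating at CAPACITY.
        return CAPACITY / (1 + 99 * 0.7 ** month)

    history = [(m, actual_users(m)) for m in range(8)]

    # Pure analysis: take the recent trend and push it into the future.
    (m1, u1), (m2, u2) = history[-2], history[-1]
    slope = (u2 - u1) / (m2 - m1)
    trend_forecast = u2 + slope * (40 - m2)

    # Model-based forecast: a structural model that knows about the
    # capacity, with its parameters checked against the same history.
    model_forecast = CAPACITY / (1 + 99 * 0.7 ** 40)

    print(f"trend extrapolation, month 40: {trend_forecast:,.0f}")  # > 10,000
    print(f"model forecast, month 40:      {model_forecast:,.0f}")  # saturates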

ESDS
