The term "Big Data" was born on September 4, 2008, when an issue of Nature was released that collected "features and opinion pieces on one of the most daunting challenges facing modern science: how to cope with the flood of data now being generated". This flood of data, by analogy with "Big Oil", was called "Big Data".
There are many definitions of the term "Big Data".
McKinsey, 2011: "Big data" refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.1
Chris Preimesberger, eWEEK, August 15, 2011: "Big data refers to the volume, variety and velocity of structured and unstructured data pouring through networks into processors and storage devices, along with the conversion of such data into business advice for enterprises". These elements break down into three distinct characteristics: volume, variety, and velocity.
"Big data is a term used to refer to data sets that are too large or complex for traditional data-processing application software to adequately deal with. Data with many cases (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate".
When does average-size data become big data? What's the tipping point?
That depends on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."4
Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many exabytes of data. To get a sense of scale, analysts at IBS estimated the total amount of data accumulated by mankind as follows:
2003 — 5 exabytes (1 EB = 1 billion gigabytes)
2008 — 0.18 zettabyte (1 ZB = 1024 exabytes)
2015 — more than 6.5 zettabytes
2020 — 40-44 zettabytes (forecast)
2025 — forecast to grow another 10 times.
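To put these units in perspective, here is a short sketch of the arithmetic behind the figures above, using decimal (SI) prefixes throughout for simplicity (the figures themselves are the IBS estimates quoted above):

```python
# Decimal (SI) data-unit sizes in bytes: 1 GB = 10**9, 1 EB = 10**18, 1 ZB = 10**21.
GB = 10**9
EB = 10**18
ZB = 10**21

# IBS estimates quoted above, expressed in bytes.
data_2003 = 5 * EB      # 2003: 5 exabytes
data_2008 = 0.18 * ZB   # 2008: 0.18 zettabytes
data_2020 = 44 * ZB     # 2020 forecast (upper bound): 44 zettabytes

growth_03_08 = data_2008 / data_2003   # about 36x in five years
growth_08_20 = data_2020 / data_2008   # about 244x in twelve years

print(growth_03_08, growth_08_20)
```

Even by these rough numbers, the accumulated volume grows by orders of magnitude per decade, which is exactly why "big" keeps being redefined.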
The report also notes that most of the data will not be generated by ordinary consumers, but by enterprises (recall Industrial Internet of Things).
It is also worth mentioning that, of the 4.4 ZB of data collected in the world by 2013, only 1.5% had informational value.
A simpler definition, fully in line with the well-established usage of journalists and marketing specialists, can also be given.
Big data is a set of technologies that are designed to perform three operations:
- Processing data in volumes larger than "standard" scenarios allow.
- Working with data that arrives quickly and in very large amounts; that is, there is not just a lot of data, but the amount keeps growing.
- Working with structured and weakly structured data in parallel and in different aspects.
It is believed that these "skills" make it possible to reveal hidden patterns that elude limited human perception. This provides unprecedented opportunities to optimize many areas of our life: government, medicine, telecommunications, finance, transport, manufacturing, and so on. It is not surprising that journalists and marketers used the phrase Big Data so often that many experts consider the term discredited and suggest abandoning it.
Moreover, in October 2015, Gartner removed Big Data from its list of popular trends. The company's analysts explained the decision by noting that the concept of "big data" covers a large number of technologies already actively used in enterprises; these partially belong to other popular areas and trends and have become everyday working tools.
Be that as it may, the term Big Data is still widely used, as evidenced by our article.
Three "V"s (or four, five, seven) and three principles of working with big data
The defining characteristics of big data include, in addition to physical volume, others that emphasize the complexity of processing and analyzing these data. The VVV feature set (volume, velocity, variety: physical volume, the rate of data growth and the need for fast processing, and the ability to process various types of data simultaneously) was developed by the Meta Group in 2001 to indicate the equal importance of data management in all three aspects.
Later, interpretations appeared with four Vs (veracity was added), with five Vs (viability and value), and with seven Vs (variability and visualization). IDC, for example, interprets the fourth V as value, emphasizing the economic expediency of processing large amounts of data under appropriate conditions.10
Based on the above definitions, we can say that the basic principles of working with big data are:
- Horizontal scalability. This is the basic principle of big data processing. As already mentioned, the volume of big data grows every day; accordingly, the number of compute nodes over which these data are distributed must also grow, and processing should occur without degrading performance.
- Fault tolerance. This principle follows from the previous one. Since a cluster can contain many compute nodes (sometimes tens of thousands), and their number may well increase, the probability of machine failure also grows. Methods of working with big data should take such situations into account and include preventive measures.
- Data locality. Since data are distributed across a large number of compute nodes, the cost of transmission can become unreasonably large if data physically located on one server are processed on another. Therefore, it is desirable that data be processed on the same machine on which they are stored.
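The three principles above can be illustrated with a minimal sketch of hash partitioning with replication; this is a generic illustration, not the mechanism of any specific product, and the node count, replica count, and record keys are made up:

```python
import hashlib

NUM_NODES = 4   # compute nodes in the (toy) cluster; add more to scale horizontally
REPLICAS = 2    # each record is stored on two nodes, so one node can fail

def node_for_key(key: str, num_nodes: int = NUM_NODES) -> int:
    """Deterministically map a record key to a node index."""
    # A stable hash (unlike Python's per-process randomized hash()) keeps
    # placement consistent across restarts, which enables data locality:
    # the node that stores a record is the one that processes it.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

records = ["sensor-1", "sensor-2", "sensor-3", "sensor-4", "sensor-5"]

# "Store" each record on its primary node and on the next node as a replica.
nodes = {i: [] for i in range(NUM_NODES)}
for key in records:
    primary = node_for_key(key)
    for r in range(REPLICAS):
        nodes[(primary + r) % NUM_NODES].append(key)

# Every record exists on exactly REPLICAS nodes, and placement is repeatable.
total = sum(len(stored) for stored in nodes.values())
print(total)  # 10 = 5 records x 2 replicas
```

Adding nodes only changes the modulus (horizontal scalability), replication covers single-node failure (fault tolerance), and deterministic placement lets work be scheduled where the data already sit (locality).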
These principles differ from those typical of traditional, centralized, vertical models of storing well-structured data. Accordingly, new approaches and technologies are being developed for working with big data.
Technologies and trends of working with Big Data
Initially, the set of approaches and technologies included tools for massively parallel processing of loosely structured data, such as NoSQL DBMSs, MapReduce algorithms, and Hadoop project tools. Later, other solutions, and some hardware, began to be classed as Big Data technologies if they provided similar capabilities for processing extremely large data arrays.
It is important to distinguish between big data techniques and big data technologies. Techniques are generally defined as types of analysis used to process and extract information from raw data, while technologies mostly refer to the methods, software, and platforms used for strategic data management.12
- MapReduce: a model of distributed parallel computing on computer clusters. According to this model, developed by Google, an application is divided into a large number of identical elementary tasks that are executed on the cluster nodes and then reduced to a result.
- NoSQL: a generic term for various non-relational databases and stores; it does not denote any particular technology or product. Conventional relational databases handle fairly fast and homogeneous queries well, but on the complex and flexible queries typical of big data the load exceeds reasonable limits and use of such a DBMS becomes ineffective.
- Hadoop: a freely distributed set of utilities, libraries, and a framework for developing and executing distributed programs running on clusters with hundreds or thousands of nodes. It is considered one of the foundational big data technologies.
- R: a programming language for statistical data processing and working with graphics. It is widely used for data analysis and has become a de facto standard for statistical software.
- Hardware solutions. Teradata, EMC, and other corporations offer hardware-software complexes designed for processing big data. These complexes are supplied as ready-to-install telecommunication cabinets containing a cluster of servers and control software for massively parallel processing. Hardware for in-memory analytical processing is also sometimes counted among these solutions, in particular SAP's HANA hardware-software systems and Oracle's Exalytics complex, even though such processing is not initially massively parallel and the amount of RAM in a single node is limited to a few terabytes.
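The MapReduce model from the list above can be sketched in a single process: map each input line to (word, 1) pairs, group the pairs by key (the "shuffle"), then reduce each group by summing. Real frameworks such as Hadoop run the same three phases distributed across a cluster; the input lines here are made up:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's values into one result, here by summing."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big clusters", "data locality"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'clusters': 1, 'locality': 1}
```

Because every map task and every reduce group is independent, the framework can run them on different nodes in parallel, which is what makes the model scale.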
Big data analytics techniques
In its 2011 report, the McKinsey Global Institute listed 26 common data analytics techniques and claimed that, although this is not a complete list, all the techniques presented are applicable to big data analysis.
Some of these techniques are provided below.
- Data Mining methods: a set of methods for discovering previously unknown, non-trivial, practically useful knowledge in data, as needed for decision-making. Such methods include, in particular, association rule learning, classification (categorizing new data points and assigning them to the proper groups), cluster analysis, regression analysis, etc.
- Crowdsourcing: the classification and enrichment of data by a wide, indefinite circle of people performing this work without entering into an employment relationship.
- Data fusion and integration: a set of techniques for integrating heterogeneous data from a variety of sources for in-depth analysis (e.g. signal processing, natural language processing, etc.).
- Machine learning, including supervised and unsupervised learning: designing algorithms that automatically learn and identify patterns from historical/empirical data (using models built on statistical analysis or machine learning to produce complex predictions from basic models).
- Neural networks, network analysis, optimization, including genetic algorithms (heuristic search algorithms used to solve optimization and modeling problems by randomly selecting, combining, and varying the desired parameters using mechanisms similar to natural selection in nature).
- Pattern recognition: methods used to identify patterns and regularities in the data.
- Predictive analytics: techniques that are used to predict an outcome or its probability.
- Simulation: techniques that imitate a real-life process or system behaviors.
- Spatial analysis: a class of methods that use topological, geometric, and geographic information extracted from data.
- Statistics, time series analysis, A/B testing (also known as split testing or bucket testing: comparing a control set with other test sets to determine which strategy improves a given objective variable).
- Visualization of analytical data: presentation of information in the form of figures and diagrams, using interactive features and animation, both for presenting results and as input for further analysis. This is a very important step in big data analysis, allowing the most important results to be presented in the form most convenient for perception.
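As one concrete instance of the techniques listed above, here is a minimal sketch of A/B testing: a two-proportion z-test comparing the conversion rates of a control group (A) and a test group (B). The group sizes and conversion counts are invented for illustration:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Control: 120 conversions out of 1000; test variant: 160 out of 1000.
z = two_proportion_z(conv_a=120, n_a=1000, conv_b=160, n_b=1000)

# |z| > 1.96 corresponds to p < 0.05 for a two-sided test;
# here z is about 2.58, so the difference is significant at the 5% level.
print(z, abs(z) > 1.96)
```

In a big data setting the same test is simply run over far larger samples, where even small differences in the objective variable become detectable.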
Big data in the industry
According to the McKinsey Global Institute report "Big Data: The Next Frontier for Innovation, Competition, and Productivity", data has become as important a factor of production as labor and production assets. Through the use of big data, companies can gain tangible competitive advantages. Big Data technologies can be useful for the following tasks:
- forecasting the market situation;
- marketing and sales optimization;
- product improvement;
- management decision-making;
- increasing labor productivity;
- efficient logistics;
- monitoring the state of fixed assets.
In manufacturing enterprises, big data is also generated through the implementation of Industrial Internet of Things technologies. In this process, the main components and parts of machines and mechanisms are fitted with sensors, actuators, controllers, and sometimes inexpensive processors capable of performing edge (fog) computing. During production, data is collected continuously and possibly pre-processed (for example, filtered). Analytical platforms process these arrays of information in real time, present the results in the form most convenient for perception, and save them for further use. Based on analysis of the obtained data, conclusions are drawn about the state of the equipment, its efficiency, the quality of the products produced, the need to make changes to technological processes, etc.
Thanks to real-time monitoring of information, enterprise personnel can:
- reduce downtime
- increase equipment productivity
- reduce equipment maintenance costs
- prevent accidents.
The last point is especially important. For example, operators in the petrochemical industry receive on average about 1,500 emergency messages per day, that is, more than one message per minute. This leads to increased fatigue among operators, who must constantly make instant decisions on how to respond to a particular signal. An analytical platform, however, can filter out secondary information, giving operators the opportunity to focus primarily on critical situations. This allows them to identify and prevent breakdowns and possibly accidents more effectively. As a result, production reliability, industrial safety, process equipment readiness, and compliance with regulatory requirements all improve.
In addition, the results of big data analysis make it possible to calculate the payback period of equipment and assess the prospects of changing technological regimes or reducing or redistributing staff, that is, to make strategic decisions about the further development of the enterprise.