Data Science
Data science (sometimes called datalogy) is a branch of computer science that studies the problems of analyzing, processing, and presenting data in digital form. It combines methods for processing large volumes of data with a high degree of parallelism, statistical methods, data mining methods, and artificial intelligence applications for working with data, as well as methods for designing and developing databases.
However, some experts believe that this definition is erroneous, because data science is not "the science of data", as the Russian-language Wikipedia renders it. Data is not the subject of this science, so it is wrong to treat data science as a synonym for datalogy, the science proposed by Peter Naur. In Russian, the term data science should perhaps be translated as "the science of working with data" or "scientific methods of working with data." Accordingly, the challenge for data scientists is to extract knowledge using methods collectively called data mining, combining statistics and other methods of data analysis in order to understand what the data contains.
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data presented in various forms, both structured and unstructured; it is closely related to data mining and big data. Data science is "the concept of combining statistics, data analysis, machine learning, and related techniques" to "understand and analyze real-world phenomena." It uses methods and theories drawn from many areas in the context of mathematics, statistics, computer science, and information science. Turing Award winner Jim Gray presented data science as the "fourth paradigm" of science and argued that "everything in science is changing due to the impact of information technology" and the explosive increase in the amount of data (the data deluge).
The fourth paradigm of science
Turing Award winner Jim Gray and astronomer and futurist Alex Szalay divided humanity's scientific past into three periods of data use and added a modern fourth.
- Ancient times: description of observed phenomena and logical conclusions drawn from observations.
- 17th century: creation of theories, with analytical models used to prove them.
- 20th century: numerical modeling, made possible by the advent of computers.
- 21st century: methods based on data analysis, applying statistical and other techniques for extracting useful information from huge amounts of data.
In this sense, data science is the science of the 21st century. It is viewed as an academic discipline and, since the early 2010s, largely due to the popularization of the concept of "big data", also as a practical cross-industry field of activity. Since the early 2010s, the profession of data scientist has been considered one of the most attractive, highly paid, and promising.
Today, the term data science is often used interchangeably with earlier concepts such as business analytics, business intelligence, predictive modeling, and statistics. In many cases, earlier approaches and solutions are now simply renamed "data science" to make them more compelling. This can lead to the term becoming "blurry", as has already happened with the term "big data".
The main differences between data science and business intelligence (BI)
Completeness of the data used:
- BI: structured digital data that gives a very limited picture of the surrounding world.
- Data science: any data sufficient to reflect the picture of the surrounding world with any required completeness.
The main objectives of the analysis:
- BI: analysis of previous data to identify business trends and assess the impact of certain events on the near future.
- Data science: predicting future results in order to make informed decisions, getting answers to the "what" and "how" questions.
Final result:
- BI: information.
- Data science: knowledge.
In both cases, specialists play a decisive role. The main difference between the two specialties is that the BI expert is able to provide an objective picture from the past to the present, while the data scientist must understand how and what to do.
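The contrast between the two objectives can be illustrated with a minimal Python sketch. The sales figures and the function names `describe_past`, `fit_line`, and `predict_next` are hypothetical, chosen only for illustration. A straight-line least-squares fit stands in for the much richer predictive models used in practice.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative data, not from any real source).
monthly_sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]


def describe_past(values):
    """BI-style summary: what has happened so far."""
    return {
        "average": mean(values),
        "change_first_to_last": values[-1] - values[0],
    }


def fit_line(values):
    """Ordinary least-squares fit of value = intercept + slope * index."""
    xs = range(len(values))
    x_mean = mean(xs)
    y_mean = mean(values)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return intercept, slope


def predict_next(values):
    """Data-science-style step: extrapolate the fitted line one period ahead."""
    intercept, slope = fit_line(values)
    return intercept + slope * len(values)


summary = describe_past(monthly_sales)
forecast = predict_next(monthly_sales)
print(summary)
print(forecast)
```

The first function only reports on the past, while the second produces a value that can inform a decision about the next period, under the strong assumption that the trend stays linear.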
Data, information, knowledge
As noted above, the end result of business intelligence (BI) is information and the end result of data science is knowledge, so the DIKW concept deserves a mention.
DIKW (data, information, knowledge, wisdom) is an information hierarchy in which each level adds certain properties to the previous one.
- At the bottom is the data layer.
- Information adds context.
- Knowledge adds "how" (mechanism of use).
- Wisdom adds "when" (terms of use).
Information is data that is significant to the observer. Knowledge consists of information supported by intention or direction. We can say that knowledge is what turns information into instructions (recipes). Critics of the DIKW concept suggest that this notion of knowledge can be useful (and effective) in a business context, but does not agree well with what has been thought of as knowledge for thousands of years. According to DIKW, knowledge is the result of filtering information, while "traditional" knowledge and related processes, not to mention wisdom, are the result of more complex processes: social, cultural, and so on. That is, DIKW gives a distorted and simplified representation of knowledge and wisdom. However, "the distinctive characteristics of knowledge are still subject to ambiguity in philosophy," and very few specialists, even in IT, can answer the question "Where do you see the difference between data and information?" Therefore, the introduction of the data-information-knowledge relationship, albeit in a simplified form, is undoubtedly useful.
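The hierarchy described above can be made concrete with a small Python sketch. Everything in it is a hypothetical illustration and not part of the DIKW concept itself: the temperature readings, the 30-degree rule, and the function names `knowledge` and `wisdom`.

```python
# Hypothetical example: the readings and the watering rule are illustrative
# assumptions, not part of the DIKW concept.

# Data: raw values with no context.
data = [31.5, 29.0, 33.2]

# Information: the same values with context (which day, which unit).
information = [
    {"day": day, "temperature_c": value}
    for day, value in zip(["Mon", "Tue", "Wed"], data)
]


def knowledge(record, threshold_c=30.0):
    """Knowledge adds "how": a rule that turns information into an instruction."""
    if record["temperature_c"] > threshold_c:
        return "water the plants"
    return "do nothing"


def wisdom(record, hour):
    """Wisdom adds "when": the conditions under which the rule should be applied."""
    action = knowledge(record)
    if action == "water the plants" and not (6 <= hour <= 9):
        return "wait until early morning"
    return action


actions = [knowledge(record) for record in information]
print(actions)
print(wisdom(information[0], hour=14))
print(wisdom(information[0], hour=7))
```

Each level reuses the one below it, which mirrors the idea that every step of the hierarchy adds a property to the previous one. The critics' point also shows here: the "knowledge" in this sketch is only a filtering rule.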
Data Science Life Cycle
The figure below illustrates the five stages of the data science life cycle:
- Capture: data collection, data input, signal reception, data extraction.
- Maintain: data storage, data cleansing, data preparation, data processing, data architecture.
- Process: data mining, clustering / classification, data modeling, data synthesis.
- Analyze: exploratory / confirmatory analysis, predictive analysis, regression, text analysis, qualitative analysis.
- Communicate: data transfer, data visualization, business intelligence, decision making.
Accordingly, a data scientist must be able not only to mine and analyze data, but also to process large amounts of it and use a variety of tools. There is no unambiguous description of this profession yet, and one is unlikely to appear in the near future: too much depends on where the data skills are applied.
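The five life-cycle stages can be sketched as a toy pipeline in Python, with one function per stage. The in-memory CSV text, the column names, and the function bodies are hypothetical assumptions made for illustration. Real projects would use far richer tooling at every stage.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical raw input standing in for a real data source.
RAW_CSV = """region,sales
north,100
south,
north,140
south,90
east,abc
"""


def capture(text):
    """Capture: read raw records from a source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def maintain(records):
    """Maintain: clean the data by dropping rows with missing or invalid sales."""
    cleaned = []
    for row in records:
        try:
            cleaned.append({"region": row["region"], "sales": float(row["sales"])})
        except (TypeError, ValueError):
            continue
    return cleaned


def process(records):
    """Process: group sales values by region."""
    groups = defaultdict(list)
    for row in records:
        groups[row["region"]].append(row["sales"])
    return dict(groups)


def analyze(groups):
    """Analyze: compute the average sales per region."""
    return {region: mean(values) for region, values in groups.items()}


def communicate(averages):
    """Communicate: format the result as a short text report."""
    return [f"{region}: {avg:.1f}" for region, avg in sorted(averages.items())]


report = communicate(analyze(process(maintain(capture(RAW_CSV)))))
print("\n".join(report))
```

Chaining the functions in this order makes the dependency between stages explicit: the analysis only ever sees data that has already been captured, cleaned, and grouped.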
The main tasks of a Data Scientist
A data scientist should be able to:
- extract the necessary information from a variety of sources;
- use information flows in real time;
- establish hidden patterns in data sets;
- statistically analyze them to make smart business decisions.
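As one small illustration of establishing a pattern and analyzing it statistically, the sketch below computes the Pearson correlation coefficient between two series. The advertising and sales numbers and the function name `pearson` are hypothetical. Correlation is only one of many possible techniques, and a high value does not by itself establish a causal link.

```python
from math import sqrt
from statistics import mean

# Hypothetical observations (illustrative only): advertising spend and sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.0, 4.0, 6.0, 8.0, 10.0]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_mean = mean(xs)
    y_mean = mean(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_spread = sqrt(sum((x - x_mean) ** 2 for x in xs))
    y_spread = sqrt(sum((y - y_mean) ** 2 for y in ys))
    if x_spread == 0 or y_spread == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (x_spread * y_spread)


r = pearson(ad_spend, sales)
print(round(r, 3))
```

A coefficient close to 1 or -1 suggests a strong linear relationship worth investigating further, while a value near 0 suggests none.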
A data scientist must be curious and result-oriented, know the industry they work in well, and have good communication skills for explaining technical results to "non-technical" colleagues. They must also have significant experience in statistics and linear algebra, as well as knowledge of programming, data warehousing, data mining, and modeling to build and analyze algorithms.