1.3 Reading

Recently, while people were still unclear about "Internet of Things", "Cloud Computing", "Mobile Internet" and other buzzwords, "Big Data" emerged and spread like wildfire. The biggest difference between the 2014 Brazil World Cup and previous World Cups is that it integrated many technological elements such as "Cloud Computing" and "Big Data". IBM research in 2013 showed that 90% of all the data obtained by human civilization had been generated in the preceding two years. By 2020, the data generated in the world is expected to be 44 times that of 2009. According to IDC monitoring, the amount of data produced by human beings is growing exponentially, roughly doubling every two years, and the global amount is expected to reach 35 ZB in 2020. According to statistics, on average, 2 million users are using Google search every second. Facebook has more than 1 billion registered users and generates more than 300 TB of log data every day. At the same time, the rapid development of sensor networks, the Internet of Things, social networks and other technologies has led to explosive growth in the scale of data. Video monitoring and sensing devices of all kinds continuously generate huge amounts of streaming media data. Energy, transportation, health care, finance, retail and other industries have also generated large amounts of data, accumulating terabytes and petabytes of Big Data. All of this shows that we have entered the era of Big Data, which has begun to benefit mankind and has become a valuable asset of the information society. A decade of digital universe growth is shown in Figure 1-1.

Figure 1-1 A decade of digital universe growth.
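
The growth figures quoted above can be cross-checked with simple arithmetic. The following is a minimal Python sketch, assuming data volume doubles exactly every two years between 2009 and 2020; the function name growth_factor and the constant names are illustrative and do not come from IDC or IBM.

```python
def growth_factor(years, doubling_period_years=2.0):
    """Multiplicative growth after `years` if volume doubles every
    `doubling_period_years` (an idealized, assumed growth model)."""
    return 2.0 ** (years / doubling_period_years)


# Figures quoted in the text.
PROJECTED_2020_ZB = 35.0    # global data volume projected for 2020, in ZB
RATIO_2020_TO_2009 = 44.0   # 2020 volume as a multiple of the 2009 volume

# Growth implied by "doubling every two years" over the 11 years 2009-2020.
implied_factor = growth_factor(2020 - 2009)

# 2009 volume implied by combining the two quoted figures.
implied_2009_zb = PROJECTED_2020_ZB / RATIO_2020_TO_2009

if __name__ == "__main__":
    print(f"Growth factor over 11 years: {implied_factor:.1f}x")
    print(f"Implied 2009 volume: {implied_2009_zb:.2f} ZB")
```

Under these assumptions, the doubling rule implies roughly 45-fold growth over 11 years, which is close to the quoted factor of 44, and the two quoted figures together imply a 2009 volume of about 0.8 ZB.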

1.3.1 What is Big Data?

According to McKinsey, Big Data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyse. There is no explicit definition of how big a data set should be. New technology needs to be in place to manage this Big Data phenomenon. IDC defines Big Data technologies as a new generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high-velocity capture, discovery and analysis. According to O'Reilly, "Big Data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or does not fit the structures of existing database architectures. To gain value from these data, there must be an alternative way to process it."

1.3.2 Characteristics of Big Data

For a data set to be considered Big Data, it must possess one or more characteristics that require accommodation in the solution design and architecture of the analytic environment. Most of these characteristics were initially identified by Doug Laney in 2001 when he published an article describing the impact of the volume, velocity and variety of e-commerce data on enterprise data warehouses. To this list, veracity has been added to account for the lower signal-to-noise ratio of unstructured data as compared to structured data sources. Ultimately, the goal is to conduct analysis of the data in such a manner that high-quality results are delivered in a timely manner, which provides optimal value to the enterprise.

This section explores the five characteristics of Big Data that can be used to help differentiate data categorized as "Big" from other forms of data. The five Big Data characteristics shown in Figure 1-2 are commonly referred to as the "5V".

Figure 1-2 The "5V" of Big Data.

1. Volume

The anticipated volume of data that is processed by Big Data solutions is substantial and ever-growing. High data volumes impose distinct data storage and processing demands, as well as additional data preparation, curation and management processes. Figure 1-3 provides a visual representation of the large volume of data created by organizations and users worldwide every day.

Figure 1-3 The large volume of data created by organizations and users worldwide every day.

Typical data sources that are responsible for generating high data volumes include the following.

(1) Online transactions, such as point-of-sale and banking.

(2) Scientific and research experiments, such as the Large Hadron Collider and the Atacama Large Millimeter/submillimeter Array (ALMA) telescope.

(3) Sensors, such as GPS sensors, RFIDs, smart meters and telematics.

(4) Social media, such as Facebook and Twitter.

2. Velocity

In Big Data environments, data can arrive at high speed, and enormous data sets can accumulate within very short periods of time. From an enterprise's point of view, the velocity of data translates into the amount of time it takes for the data to be processed once it enters the enterprise's perimeter. Coping with the fast inflow of data requires the enterprise to design highly elastic and available data processing solutions and corresponding data storage capabilities.

Depending on the data source, the velocity may be different. For example, MRI scan images are not generated as frequently as log entries from a high-traffic web server. As illustrated in Figure 1-4, data velocity is put into perspective when considering that the following data volumes can easily be generated in a given minute: 350,000 tweets, 300 hours of video footage uploaded to YouTube, 171 million emails and 330 GB of sensor data from a jet engine.

Figure 1-4 Examples of high-velocity Big Data sets produced every minute include tweets, video, emails and gigabytes of sensor data generated by a jet engine.
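
To make these per-minute figures easier to compare with storage capacities, they can be scaled to a full day. The following is a minimal Python sketch, assuming each rate is sustained for all 1,440 minutes of a day (a simplification: a jet engine, for instance, does not run around the clock). The dictionary keys and the helper per_day are illustrative names.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 minutes

# Per-minute figures quoted in the text.
per_minute = {
    "tweets": 350_000,
    "hours of YouTube video": 300,
    "emails": 171_000_000,
    "GB of jet-engine sensor data": 330,
}


def per_day(rate_per_minute):
    """Scale a per-minute rate to a per-day total, assuming a constant rate."""
    return rate_per_minute * MINUTES_PER_DAY


daily = {name: per_day(rate) for name, rate in per_minute.items()}

if __name__ == "__main__":
    for name, total in daily.items():
        print(f"{total:,} {name} per day")
```

At a constant rate, 350,000 tweets per minute amounts to 504 million tweets per day, and 330 GB per minute amounts to 475,200 GB (about 475 TB) per day.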

3. Variety

Data variety refers to the multiple formats and types of data that need to be supported by Big Data solutions. Data variety brings challenges for enterprises in terms of data integration, transformation, processing, and storage. Figure 1-5 provides a visual representation of data variety, which includes structured data in the form of financial transactions, semi-structured data in the form of emails and unstructured data in the form of images.

Figure 1-5 A visual representation of data variety.
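
The three kinds of data differ mainly in how much structure a program can rely on when reading them. The following Python sketch illustrates this with small, invented samples: the transaction rows, the email text and the image bytes are all hypothetical and serve only to show the contrast.

```python
import csv
import io
from email import message_from_string

# Structured: every row follows the same fixed schema, so fields can be
# addressed by column name and aggregated directly.
transactions_csv = (
    "txn_id,account,amount\n"
    "1001,ACC-1,25.50\n"
    "1002,ACC-2,110.00\n"
)
rows = list(csv.DictReader(io.StringIO(transactions_csv)))
total_amount = sum(float(row["amount"]) for row in rows)

# Semi-structured: the headers are well defined, but the body is free text.
raw_email = (
    "From: alice@example.com\n"
    "Subject: Invoice query\n"
    "\n"
    "Hi, could you resend invoice 1002?\n"
)
message = message_from_string(raw_email)
subject = message["Subject"]
body = message.get_payload()

# Unstructured: an image is just a sequence of bytes with no fields to query.
# Here: the 8-byte PNG signature followed by 16 placeholder zero bytes.
image_bytes = b"\x89PNG\r\n\x1a\n" + bytes(16)

if __name__ == "__main__":
    print("Total transaction amount:", total_amount)
    print("Email subject:", subject)
    print("Image size in bytes:", len(image_bytes))
```

The structured rows can be summed immediately, the email yields its headers but leaves the body for further text processing, and the image offers nothing beyond its raw bytes without dedicated analysis.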

4. Veracity

Veracity refers to the quality or fidelity of data. Data that enters Big Data environments needs to be assessed for quality, which can lead to data processing activities to resolve invalid data and remove noise. In relation to veracity, data can be part of the signal or noise of a data set. Noise is data that cannot be converted into information and thus has no value, whereas signals have value and lead to meaningful information. Data with a high signal-to-noise ratio has more veracity than data with a lower ratio. Data that is acquired in a controlled manner, for example via online customer registrations, usually contains less noise than data acquired via uncontrolled sources, such as blog postings. Thus the signal-to-noise ratio of data is dependent upon the source of the data and its type.
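
One simple way to picture the signal-to-noise ratio is to count usable records against unusable ones. The following Python sketch assumes a hypothetical validity rule (a record is signal if it has a non-empty customer_id and an email containing "@") and two invented samples; real quality assessment depends on the data set and the questions being asked.

```python
def is_signal(record):
    """Hypothetical validity rule, assumed for illustration only."""
    customer_id = record.get("customer_id")
    email = record.get("email") or ""
    return bool(customer_id) and "@" in email


def signal_to_noise(records):
    """Ratio of valid (signal) records to invalid (noise) records."""
    signal = sum(1 for record in records if is_signal(record))
    noise = len(records) - signal
    return float("inf") if noise == 0 else signal / noise


# Invented sample from a controlled source (online registration form).
registrations = [
    {"customer_id": "C1", "email": "a@example.com"},
    {"customer_id": "C2", "email": "b@example.com"},
    {"customer_id": "C3", "email": "c@example.com"},
    {"customer_id": "C4", "email": ""},
]

# Invented sample from an uncontrolled source (scraped blog comments).
blog_comments = [
    {"customer_id": None, "email": "anon"},
    {"customer_id": "C9", "email": "z@example.com"},
    {"customer_id": "", "email": None},
    {"customer_id": None, "email": None},
]

if __name__ == "__main__":
    print("Registrations:", signal_to_noise(registrations))
    print("Blog comments:", signal_to_noise(blog_comments))
```

With these samples, the registration data has three valid records for every invalid one, whereas the blog data has one valid record for every three invalid ones, mirroring the contrast between controlled and uncontrolled sources.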

5. Value

Value is defined as the usefulness of data for an enterprise. The value characteristic is intuitively related to the veracity characteristic in that the higher the data fidelity, the more value it holds for the business. Value is also dependent on how long data processing takes, because analytics results have a shelf life. For example, a stock quote that is delayed by 20 minutes has little to no value for making a trade compared to a quote that is 20 milliseconds old. Data that has high veracity and can be analysed quickly has more value to a business, as shown in Figure 1-6. As demonstrated, value and time are inversely related: the longer it takes for data to be turned into meaningful information, the less value it has for a business. Stale results inhibit the quality and speed of decision making.

Apart from veracity and time, value is also impacted by the following lifecycle-related concerns.

(1) How well has the data been stored?

(2) Were valuable attributes of the data removed during data cleansing?

(3) Are the right types of questions being asked during data analysis?

(4) Are the results of the analysis being accurately communicated to the appropriate decision-makers?

Figure 1-6 Data that has high veracity and can be analysed quickly has more value to a business.

1.3.3 Why is Big Data Important?

The convergence across business domains has ushered in a new economic system that is redefining relationships among producers, distributors, and consumers of goods and services. In an increasingly complex world, business verticals are intertwined, and what happens in one vertical has a direct impact on other verticals. Within a business, this complexity makes it difficult for business leaders to rely solely on experience (or pure intuition) to make decisions. They need to rely on good data services for their decisions. By placing data at the heart of business operations to provide access to new insights, organizations will be able to compete more effectively.

Three things have come together to drive attention to Big Data.

(1) The technologies to combine and interrogate Big Data have matured to a point where their deployments are practical.

(2) The underlying cost of the infrastructure to power the analysis has fallen dramatically, making it economical to mine the information.

(3) The competitive pressure on businesses has increased to the point where most traditional strategies are offering only marginal benefits. Big Data has the potential to provide new forms of competitive advantage for businesses.

For years, organizations have captured structured transactional data and used batch processing to place summaries of the data into traditional relational databases. The analysis of such data is retrospective, and the investigations done on the data sets concern past patterns of business operations. In recent years, new technologies with lower costs have enabled improvements in data capture, data storage, and data analysis. Organizations can now capture more data from many more sources and types (blogs, social media, audio and video files). The options to optimally store and process the data have expanded dramatically, and technologies such as MapReduce and in-memory computing (discussed in later sections) provide highly optimized capabilities for different business purposes. The analysis of data can be done in real time, acting on full data sets rather than summarised elements. In addition, the number of options to interpret and analyse the data has also increased, with the use of various visualization technologies.
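
As a first impression of the MapReduce idea, the following Python sketch counts events per source in three steps: map, shuffle and reduce. It is a single-process illustration under simplifying assumptions, not the API of Hadoop or any other framework; the function names and the sample events are invented for this example.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (key, value) pair per event, keyed by its source."""
    yield record["source"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def run_mapreduce(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Invented sample events from several source types.
events = [
    {"source": "blog"},
    {"source": "social"},
    {"source": "video"},
    {"source": "social"},
]
counts = run_mapreduce(events)

if __name__ == "__main__":
    print(counts)
```

In a real cluster, the map and reduce steps run in parallel on many machines and the shuffle moves data between them; the sketch keeps everything in one process to show only the flow of keys and values.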