Cover Page
Title Page
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
Introduction to Data
Exploring data
What is Enterprise Data?
Enterprise Data Management
Big data concepts
Big data and 4Vs
Relevance of data
Quality of data
Where does this data live in an enterprise?
Intranet (within enterprise)
Internet (external to enterprise)
Business applications hosted in the cloud
Third-party cloud solutions
Social data (structured and unstructured)
Data stores or persistent stores (RDBMS or NoSQL)
Traditional data warehouse
File stores
Enterprise's current state
Enterprise digital transformation
Enterprises embarking on this journey
Some examples
Data Lake use case enlightenment
Summary
Comprehensive Concepts of a Data Lake
What is a Data Lake?
Relevance to enterprises
How does a Data Lake help enterprises?
Data Lake benefits
How does a Data Lake work?
Differences between Data Lake and Data Warehouse
Approaches to building a Data Lake
Lambda Architecture-driven Data Lake
Data ingestion layer - ingest for processing and storage
Batch layer - batch processing of ingested data
Speed layer - near real-time data processing
Data storage layer - store all data
Serving layer - data delivery and exports
Data acquisition layer - get data from source systems
Messaging layer - guaranteed data delivery
Exploring the Data Ingestion Layer
Exploring the Lambda layer
Batch layer
Speed layer
Serving layer
Data push
Data pull
Data storage layer
Batch process layer
Speed layer
Serving layer
Relational data stores
Distributed data stores
Summary
Lambda Architecture as a Pattern for Data Lake
What is Lambda Architecture?
History of Lambda Architecture
Principles of Lambda Architecture
Fault-tolerant principle
Immutable Data principle
Re-computation principle
Components of a Lambda Architecture
Batch layer
Speed layer
CAP Theorem
Eventual consistency
Serving layer
Complete working of a Lambda Architecture
Advantages of Lambda Architecture
Disadvantages of Lambda Architecture
Technology overview for Lambda Architecture
Applied Lambda
Enterprise-level log analysis
Capturing and analyzing sensor data
Real-time mailing platform statistics
Real-time sports analysis
Recommendation engines
Analyzing security threats
Multi-channel consumer behaviour
Working examples of Lambda Architecture
Kappa architecture
Summary
Applied Lambda for Data Lake
Knowing Hadoop distributions
Selection factors for a big data stack for enterprises
Technical capabilities
Ease of deployment and maintenance
Integration readiness
Batch layer for data processing
The NameNode server
The secondary NameNode server
Yet Another Resource Negotiator (YARN)
Data storage nodes (DataNode)
Speed layer
Flume for data acquisition
Source for event sourcing
Interceptors for event interception
Channels for event flow
Sink as an event destination
Spark Streaming
DStreams
DataFrames
Checkpointing
Apache Flink
Serving layer
Data repository layer
Relational databases
Big data tables/views
Data services with data indexes
NoSQL databases
Data access layer
Data exports
Data publishing
Summary
Data Acquisition of Batch Data using Apache Sqoop
Context in the Data Lake - data acquisition
Data acquisition layer
Data acquisition of batch data - technology mapping
Why Apache Sqoop?
History of Sqoop
Advantages of Sqoop
Disadvantages of Sqoop
Working of Sqoop
Sqoop 2 architecture
Sqoop 1 versus Sqoop 2
Ease of use
Ease of extension
Security
When to use Sqoop 1 and Sqoop 2
Functioning of Sqoop
Data import using Sqoop
Data export using Sqoop
Sqoop connectors
Types of Sqoop connectors
Sqoop support for HDFS
Sqoop working example
Installation and Configuration
Step 1 - Installing and verifying Java
Step 2 - Installing and verifying Hadoop
Step 3 - Installing and verifying Hue
Step 4 - Installing and verifying Sqoop
Step 5 - Installing and verifying PostgreSQL (RDBMS)
Step 6 - Installing and verifying HBase (NoSQL)
Configuring the data source (ingestion)
Sqoop configuration (database drivers)
Configuring HDFS as the destination
Sqoop Import
Import complete database
Import selected tables
Import selected columns from a table
Import into HBase
Sqoop Export
Sqoop Job
Job command
Create Job
List Job
Run Job
Sqoop 2
Sqoop in the purview of the SCV use case
When to use Sqoop
When not to use Sqoop
Real-time Sqooping: a possibility?
Other options
Native big data connectors
Talend
Pentaho's Kettle (PDI - Pentaho Data Integration)
Summary
Data Acquisition of Stream Data using Apache Flume
Context in the Data Lake - data acquisition
What is Stream Data?
Batch and stream data
Data acquisition of stream data - technology mapping
What is Flume?
Sqoop and Flume
Why Flume?
History of Flume
Advantages of Flume
Disadvantages of Flume
Flume architecture principles
The Flume architecture
Distributed pipeline - Flume architecture
Fan out - Flume architecture
Fan in - Flume architecture
Three-tier design - Flume architecture
Advanced Flume architecture
Flume reliability level
Flume event - Stream Data
Flume agent
Flume agent configurations
Flume source
Custom source
Flume channel
Custom channel
Flume sink
Custom sink
Flume configuration
Flume transaction management
Other Flume components
Channel processor
Interceptor
Channel selector
Sink groups
Sink processor
Event serializers
Context routing
Flume working example
Installation and Configuration
Step 1 - Installing and verifying Flume
Step 2 - Configuring Flume
Step 3 - Starting Flume
Flume in the purview of the SCV use case
Kafka Installation
Example 1 - RDBMS to Kafka
Example 2 - Spooling messages to Kafka
Example 3 - Interceptors
Example 4 - Memory channel, file channel, and Kafka channel
When to use Flume
When not to use Flume
Other options
Apache Flink
Apache NiFi
Summary
Messaging Layer using Apache Kafka
Context in the Data Lake - messaging layer
Messaging layer
Messaging layer - technology mapping
What is Apache Kafka?
Why Apache Kafka?
History of Kafka
Advantages of Kafka
Disadvantages of Kafka
Kafka architecture
Core architecture principles of Kafka
Data stream life cycle
Working of Kafka
Kafka message
Kafka producer
Persistence of data in Kafka using topics
Partitions - Kafka topic division
Kafka message broker
Kafka consumer
Consumer groups
Other Kafka components
ZooKeeper
MirrorMaker
Kafka programming interface
Kafka core APIs
Kafka REST interface
Producer and consumer reliability
Kafka security
Kafka as message-oriented middleware
Scale-out architecture with Kafka
Kafka Connect
Kafka working example
Installation
Producer - putting messages into Kafka
Kafka Connect
Consumer - getting messages from Kafka
Setting up a multi-broker cluster
Kafka in the purview of the SCV use case
When to use Kafka
When not to use Kafka
Other options
RabbitMQ
ZeroMQ
Apache ActiveMQ
Summary
Data Processing using Apache Flink
Context in the Data Lake - Data Ingestion Layer
Data Ingestion Layer
Data Ingestion Layer - technology mapping
What is Apache Flink?
Why Apache Flink?
History of Flink
Advantages of Flink
Disadvantages of Flink
Working of Flink
Flink architecture
Client
Job Manager
Task Manager
Flink execution model
Core architecture principles of Flink
Flink Component Stack
Checkpointing in Flink
Savepoints in Flink
Streaming window options in Flink
Time window
Count window
Tumbling window configuration
Sliding window configuration
Memory management
Flink APIs
DataStream API
Flink DataStream API example
Streaming connectors
DataSet API
Flink DataSet API example
Table API
Flink domain-specific libraries
Gelly - Flink Graph API
FlinkML
FlinkCEP
Flink working example
Installation
Example - data processing with Flink
Data generation
Step 1 - Preparing streams
Step 2 - Consuming streams via Flink
Step 3 - Streaming data into HDFS
Flink in the purview of the SCV use case
User Log Data Generation
Flume Setup
Flink Processors
When to use Flink
When not to use Flink
Other options
Apache Spark
Apache Storm
Apache Tez
Summary
Data Store Using Apache Hadoop
Context in the Data Lake - Data Storage and Lambda Batch Layer
Data Storage and the Lambda Batch Layer
Data Storage and Lambda Batch Layer - technology mapping
What is Apache Hadoop?
Why Hadoop?
History of Hadoop
Advantages of Hadoop
Disadvantages of Hadoop
Working of Hadoop
Hadoop core architecture principles
Hadoop architecture
Hadoop architecture 1.x
Hadoop architecture 2.x
Hadoop architecture components
HDFS
YARN
MapReduce
Hadoop ecosystem
Hadoop architecture in detail
Hadoop ecosystem
Data access/processing components
Apache Pig
Apache Hive
Data storage components
Apache HBase
Monitoring, management, and orchestration components
Apache ZooKeeper
Apache Oozie
Apache Ambari
Data integration components
Apache Sqoop
Apache Flume
Hadoop distributions
HDFS and formats
Hadoop for near real-time applications
Hadoop deployment modes
Hadoop working examples
Installation
Data preparation
Hive installation
Example - Bulk Data Load
File Data Load
RDBMS Data Load
Example - MapReduce processing
Text Data as Hive Tables
Avro Data as Hive Table
Hadoop in the purview of the SCV use case
Initial directory setup
Data loads
Data visualization with Hive tables
When not to use Hadoop
Other Hadoop processing options
Summary
Indexed Data Store using Elasticsearch
Context in the Data Lake - Data Storage and Lambda Speed Layer
Data Storage and Lambda Speed Layer
Data Storage and Lambda Speed Layer: technology mapping
What is Elasticsearch?
Why Elasticsearch?
History of Elasticsearch
Advantages of Elasticsearch
Disadvantages of Elasticsearch
Working of Elasticsearch
Elasticsearch core architecture principles
Elasticsearch terminologies
Document in Elasticsearch
Index in Elasticsearch
What is an Inverted Index?
Shard in Elasticsearch
Nodes in Elasticsearch
Cluster in Elasticsearch
Elastic Stack
Elastic Stack - Kibana
Elastic Stack - Elasticsearch
Elastic Stack - Logstash
Elastic Stack - Beats
Elastic Stack - X-Pack
Elastic Cloud
Apache Lucene
How Lucene works
Elasticsearch DSL (Query DSL)
Important queries in Query DSL
Nodes in Elasticsearch
Elasticsearch - master node
Elasticsearch - data node
Elasticsearch - client node
Elasticsearch and relational database
Elasticsearch ecosystem
Elasticsearch analyzers
Built-in analyzers
Custom analyzers
Elasticsearch plugins
Elasticsearch deployment options
Clients for Elasticsearch
Elasticsearch for fast streaming layer
Elasticsearch as a data source
Elasticsearch for content indexing
Elasticsearch and Hadoop
Elasticsearch working example
Installation
Creating and Deleting Indexes
Indexing Documents
Getting an Indexed Document
Searching Documents
Updating Documents
Deleting a Document
Elasticsearch in the purview of the SCV use case
Data preparation
Initial Cleanup
Data Generation
Customer data import into Hive using Sqoop
Data acquisition via Flume into Kafka channel
Data ingestion via Flink to HDFS and Elasticsearch
Packaging via POM file
Avro schema definitions
Schema deserialization class
Writing to HDFS as Parquet files
Writing into Elasticsearch
Command line arguments
Flink deployment
Parquet data visualization as Hive tables
Data indexing from Hive
Query data from ES (customer address and contacts)
When to use Elasticsearch
When not to use Elasticsearch
Other options
Apache Solr
Summary
Data Lake Components Working Together
Where we stand with Data Lake
Core architecture principles of Data Lake
Challenges faced by an enterprise Data Lake
Expectations from a Data Lake
Data Lake for other activities
Knowing more about data storage
Zones in Data Storage
Data Schema and Model
Storage options
Apache HCatalog (Hive Metastore)
Compression methodologies
Data partitioning
Knowing more about data processing
Data validation and cleansing
Machine learning
Scheduler/Workflow
Apache Oozie
Database setup and configuration
Build from Source
Oozie Workflows
Oozie coordinator
Complex event processing
Thoughts on data security
Apache Knox
Apache Ranger
Apache Sentry
Thoughts on data encryption
Hadoop key management server
Metadata management and governance
Metadata
Data governance
Data lineage
How can we achieve this?
Apache Atlas
WhereHows
Thoughts on data auditing
Thoughts on data traceability
Knowing more about Serving Layer
Principles of Serving Layer
Service Types
GraphQL
Data Lake with REST API
Business services
Serving Layer components
Data Services
Elasticsearch and HBase
Apache Hive and Impala
RDBMS
Data exports
Polyglot data access
Example - serving layer
Summary
Data Lake Use Case Suggestions
Establishing cybersecurity practices in an enterprise
Knowing the customers dealing with your enterprise
Bringing efficiency to warehouse management
Developing a brand and marketing of the enterprise
Achieving a higher degree of personalization with customers
Bringing IoT data analysis to your fingertips
More practical and useful data archival
Complementing the existing data warehouse infrastructure
Achieving telecom security and regulatory compliance
Summary