Hadoop: The Definitive Guide, 3rd Edition
Book description
Ready to unlock the power of your data? With this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.
You’ll find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This third edition covers recent changes to Hadoop, including material on the new MapReduce API, as well as MapReduce 2 and its more flexible execution model (YARN).
- Store large datasets with the Hadoop Distributed File System (HDFS)
- Run distributed computations with MapReduce (an illustrative sketch follows this list)
- Use Hadoop’s data and I/O building blocks for compression, data integrity, serialization (including Avro), and persistence
- Discover common pitfalls and advanced features for writing real-world MapReduce programs
- Design, build, and administer a dedicated Hadoop cluster—or run Hadoop in the cloud
- Load data from relational databases into HDFS, using Sqoop
- Perform large-scale data processing with the Pig query language
- Analyze datasets with Hive, Hadoop’s data warehousing system
- Take advantage of HBase for structured and semi-structured data, and ZooKeeper for building distributed systems
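To give a flavor of the programming model the early chapters teach, below is a minimal word-count sketch written against the "new" Java MapReduce API (org.apache.hadoop.mapreduce) that this edition emphasizes. It is an illustrative sketch only: the WordCount, TokenMapper, and SumReducer class names are hypothetical and do not come from the book, whose running example computes maximum temperatures from a weather dataset.

```java
// Illustrative sketch: word count with the "new" MapReduce API
// (org.apache.hadoop.mapreduce). Class names are hypothetical, not the
// book's; the book's running example computes max temperatures instead.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for each whitespace-separated token in a line.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sum the per-word counts. It can double as a combiner
  // because addition is associative and commutative.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count"); // Job.getInstance(conf) in later APIs
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged as a JAR, a job like this is typically launched with `hadoop jar wordcount.jar WordCount <input> <output>`. The default TextInputFormat supplies each line's byte offset as the key and the line itself as the value, which is why the mapper's input key type is LongWritable.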
Table of contents
- Hadoop: The Definitive Guide
- Dedication
- Foreword
- Preface
- Administrative Notes
- What’s in This Book?
- What’s New in the Second Edition?
- What’s New in the Third Edition?
- Conventions Used in This Book
- Using Code Examples
- Safari® Books Online
- How to Contact Us
- Acknowledgments
- Data!
- Data Storage and Analysis
- Comparison with Other Systems
- Relational Database Management System
- Grid Computing
- Volunteer Computing
- What’s Covered in This Book
- Configuration names
- MapReduce APIs
- A Weather Dataset
- Data Format
- Map and Reduce
- Java MapReduce
- A test run
- The old and the new Java MapReduce APIs
- Data Flow
- Combiner Functions
- Specifying a combiner function
- Ruby
- Python
- Compiling and Running
- The Design of HDFS
- HDFS Concepts
- Blocks
- Namenodes and Datanodes
- HDFS Federation
- HDFS High-Availability
- Failover and fencing
- Basic Filesystem Operations
- Interfaces
- HTTP
- C
- FUSE
- Reading Data from a Hadoop URL
- Reading Data Using the FileSystem API
- FSDataInputStream
- FSDataOutputStream
- File metadata: FileStatus
- Listing files
- File patterns
- PathFilter
- Anatomy of a File Read
- Anatomy of a File Write
- Coherency Model
- Consequences for application design
- Keeping an HDFS Cluster Balanced
- Using Hadoop Archives
- Limitations
- Data Integrity
- Data Integrity in HDFS
- LocalFileSystem
- ChecksumFileSystem
- Codecs
- Compressing and decompressing streams with CompressionCodec
- Inferring CompressionCodecs using CompressionCodecFactory
- Native libraries
- CodecPool
- Compressing map output
- The Writable Interface
- WritableComparable and comparators
- Writable wrappers for Java primitives
- Text
- Indexing
- Unicode
- Iteration
- Mutability
- Resorting to String
- Implementing a RawComparator for speed
- Custom comparators
- Serialization IDL
- Avro Data Types and Schemas
- In-Memory Serialization and Deserialization
- The specific API
- Python API
- C API
- SequenceFile
- Writing a SequenceFile
- Reading a SequenceFile
- Displaying a SequenceFile with the command-line interface
- Sorting and merging SequenceFiles
- The SequenceFile format
- Writing a MapFile
- Reading a MapFile
- MapFile variants
- Converting a SequenceFile to a MapFile
- The Configuration API
- Combining Resources
- Variable Expansion
- Managing Configuration
- GenericOptionsParser, Tool, and ToolRunner
- Mapper
- Reducer
- Running a Job in a Local Job Runner
- Fixing the mapper
- Packaging a Job
- The client classpath
- The task classpath
- Packaging dependencies
- Task classpath precedence
- The jobtracker page
- The job page
- The tasks page
- The task details page
- Handling malformed data
- Profiling Tasks
- The HPROF profiler
- Other profilers
- Decomposing a Problem into MapReduce Jobs
- JobControl
- Apache Oozie
- Defining an Oozie workflow
- Packaging and deploying an Oozie workflow application
- Running an Oozie workflow job
- Anatomy of a MapReduce Job Run
- Classic MapReduce (MapReduce 1)
- Job submission
- Job initialization
- Task assignment
- Task execution
- Streaming and pipes
- YARN (MapReduce 2)
- Job submission
- Job initialization
- Task assignment
- Task execution
- Progress and status updates
- Job completion
- Failures in Classic MapReduce
- Task failure
- Tasktracker failure
- Jobtracker failure
- Failures in YARN
- Task failure
- Application master failure
- Node manager failure
- Resource manager failure
- The Fair Scheduler
- The Capacity Scheduler
- The Map Side
- The Reduce Side
- Configuration Tuning
- The Task Execution Environment
- Streaming environment variables
- Task side-effect files
- MapReduce Types
- The Default MapReduce Job
- The default Streaming job
- Keys and values in Streaming
- Input Splits and Records
- FileInputFormat
- FileInputFormat input paths
- FileInputFormat input splits
- Small files and CombineFileInputFormat
- Preventing splitting
- File information in the mapper
- Processing a whole file as a record
- TextInputFormat
- KeyValueTextInputFormat
- NLineInputFormat
- XML
- SequenceFileInputFormat
- SequenceFileAsTextInputFormat
- SequenceFileAsBinaryInputFormat
- Text Output
- Binary Output
- SequenceFileOutputFormat
- SequenceFileAsBinaryOutputFormat
- MapFileOutputFormat
- An example: Partitioning data
- MultipleOutputs
- Counters
- Built-in Counters
- Task counters
- Job counters
- Dynamic counters
- Readable counter names
- Retrieving counters
- Using the new MapReduce API
- Preparation
- Partial Sort
- An application: Partitioned MapFile lookups
- Java code
- Streaming
- Map-Side Joins
- Reduce-Side Joins
- Using the Job Configuration
- Distributed Cache
- Usage
- How it works
- The distributed cache API
- Cluster Specification
- Network Topology
- Rack awareness
- Installing Java
- Creating a Hadoop User
- Installing Hadoop
- Testing the Installation
- Configuration Management
- Control scripts
- Master node scenarios
- Memory
- Java
- System logfiles
- SSH settings
- HDFS
- MapReduce
- Cluster membership
- Buffer size
- HDFS block size
- Reserved storage space
- Trash
- Job scheduler
- Reduce slow start
- Task memory limits
- Important YARN Daemon Properties
- Memory
- Kerberos and Hadoop
- An example
- Hadoop Benchmarks
- Benchmarking HDFS with TestDFSIO
- Benchmarking MapReduce with Sort
- Other benchmarks
- Apache Whirr
- Setup
- Launching a cluster
- Configuration
- Running a proxy
- Running a MapReduce job
- Shutting down a cluster
- HDFS
- Persistent Data Structures
- Namenode directory structure
- The filesystem image and edit log
- Secondary namenode directory structure
- Datanode directory structure
- Entering and leaving safe mode
- dfsadmin
- Filesystem check (fsck)
- Finding the blocks for a file
- Logging
- Setting log levels
- Getting stack traces
- FileContext
- GangliaContext
- NullContextWithUpdateThread
- CompositeContext
- Routine Administration Procedures
- Metadata backups
- Data backups
- Filesystem check (fsck)
- Filesystem balancer
- Commissioning new nodes
- Decommissioning old nodes
- HDFS data and metadata upgrades
- Start the upgrade
- Wait until the upgrade is complete
- Check the upgrade
- Roll back the upgrade (optional)
- Finalize the upgrade (optional)
- Installing and Running Pig
- Execution Types
- Local mode
- MapReduce mode
- Generating Examples
- Structure
- Statements
- Expressions
- Types
- Schemas
- Validation and nulls
- Schema merging
- A Filter UDF
- Leveraging types
- Dynamic invokers
- Using a schema
- Loading and Storing Data
- Filtering Data
- FOREACH...GENERATE
- STREAM
- JOIN
- COGROUP
- CROSS
- GROUP
- Parallelism
- Parameter Substitution
- Dynamic parameters
- Parameter substitution processing
- Installing Hive
- The Hive Shell
- Configuring Hive
- Logging
- Hive clients
- Schema on Read Versus Schema on Write
- Updates, Transactions, and Indexes
- Data Types
- Primitive types
- Complex types
- Conversions
- Managed Tables and External Tables
- Partitions and Buckets
- Partitions
- Buckets
- The default storage format: Delimited text
- Binary storage formats: Sequence files, Avro datafiles, and RCFiles
- An example: RegexSerDe
- Inserts
- Multitable insert
- CREATE TABLE...AS SELECT
- Sorting and Aggregating
- MapReduce Scripts
- Joins
- Inner joins
- Outer joins
- Semi joins
- Map joins
- Writing a UDF
- Writing a UDAF
- A more complex UDAF
- HBasics
- Backdrop
- Whirlwind Tour of the Data Model
- Regions
- Locking
- HBase in operation
- Test Drive
- Java
- MapReduce
- REST
- Thrift
- Avro
- Schemas
- Loading Data
- Optimization notes
- Successful Service
- HBase
- Use Case: HBase at Streamy.com
- Very large items tables
- Very large sort merges
- Life with HBase
- Versions
- HDFS
- UI
- Metrics
- Schema Design
- Joins
- Row keys
- Installing and Running ZooKeeper
- An Example
- Group Membership in ZooKeeper
- Creating the Group
- Joining a Group
- Listing Members in a Group
- ZooKeeper command-line tools
- Data Model
- Ephemeral znodes
- Sequence numbers
- Watches
- Multiupdate
- APIs
- Watch triggers
- ACLs
- Time
- A Configuration Service
- The Resilient ZooKeeper Application
- InterruptedException
- KeeperException
- State exceptions
- Recoverable exceptions
- Unrecoverable exceptions
- The herd effect
- Recoverable exceptions
- Unrecoverable exceptions
- Implementation
- BookKeeper and Hedwig
- Resilience and Performance
- Configuration
- Getting Sqoop
- Sqoop Connectors
- A Sample Import
- Text and Binary File Formats
- Additional Serialization Systems
- Controlling the Import
- Imports and Consistency
- Direct-mode Imports
- Imported Data and Hive
- Exports and Transactionality
- Exports and SequenceFiles
- Hadoop Usage at Last.fm
- Last.fm: The Social Music Revolution
- Hadoop at Last.fm
- Generating Charts with Hadoop
- The Track Statistics Program
- Calculating the number of unique listeners
- UniqueListenersMapper
- UniqueListenersReducer
- SumMapper
- SumReducer
- MergeListenersMapper
- IdentityMapper
- SumReducer
- Hadoop at Facebook
- History
- Use cases
- Data architecture
- Hadoop configuration
- Advertiser insights and performance
- Ad hoc analysis and product feedback
- Data analysis
- Data organization
- Query language
- Data pipelines using Hive
- Fair sharing
- Space management
- Scribe-HDFS integration
- Improvements to Hive
- Data Structures
- CrawlDb
- LinkDb
- Segments
- Link inversion
- Generation of fetchlists
- Step 1: Select, sort by score, limit by URL count per host
- Step 2: Invert, partition by host, sort randomly
- Requirements/The Problem
- Logs
- Log collection
- Log storage
- Processing
- Phase 1: Map
- Phase 1: Reduce
- Phase 2: Map
- Phase 2: Reduce
- Sharding
- Search results
- Fields, Tuples, and Pipes
- Operations
- Taps, Schemes, and Flows
- Cascading in Practice
- Flexibility
- Hadoop and Cascading at ShareThis
- Summary
- Measuring Community
- Everybody’s Talkin’ at Me: The Twitter Reply Graph
- Edge pairs versus adjacency list
- Degree
- Get neighbors
- Community metrics and the 1 million × 1 million problem
- Local properties at global scale
- Prerequisites
- Installation
- Configuration
- Standalone Mode
- Pseudodistributed Mode
- Configuring SSH
- Formatting the HDFS filesystem
- Starting and stopping the daemons (MapReduce 1)
- Starting and stopping the daemons (MapReduce 2)
Product information
- Title: Hadoop: The Definitive Guide, 3rd Edition
- Author(s): Tom White
- Release date: May 2012
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781449311520