Hadoop: The Definitive Guide, 3rd Edition

Book description

Ready to unlock the power of your data? With this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

You’ll find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This third edition covers recent changes to Hadoop, including material on the new MapReduce API, as well as MapReduce 2 and its more flexible execution model (YARN).

Store large datasets with the Hadoop Distributed File System (HDFS)
Run distributed computations with MapReduce
Use Hadoop’s data and I/O building blocks for compression, data integrity, serialization (including Avro), and persistence
Discover common pitfalls and advanced features for writing real-world MapReduce programs
Design, build, and administer a dedicated Hadoop cluster—or run Hadoop in the cloud
Load data from relational databases into HDFS, using Sqoop
Perform large-scale data processing with the Pig query language
Analyze datasets with Hive, Hadoop’s data warehousing system
Take advantage of HBase for structured and semi-structured data, and ZooKeeper for building distributed systems

Table of contents

Hadoop: The Definitive Guide
Dedication
Foreword
Preface
1. Administrative Notes
2. What’s in This Book?
3. What’s New in the Second Edition?
4. What’s New in the Third Edition?
5. Conventions Used in This Book
6. Using Code Examples
7. Safari® Books Online
8. How to Contact Us
9. Acknowledgments
1. Data!
2. Data Storage and Analysis
3. Comparison with Other Systems
  1. Relational Database Management System
  2. Grid Computing
  3. Volunteer Computing
  1. What’s Covered in This Book
    1. Configuration names
    2. MapReduce APIs
    1. A Weather Dataset
      1. Data Format
      1. Map and Reduce
      2. Java MapReduce
        
        A test run
        
        The old and the new Java MapReduce APIs
        
        Data Flow
        
        Combiner Functions
        
        Specifying a combiner function
        
        Ruby
        
        Python
        
        Compiling and Running
        
        The Design of HDFS
        
        HDFS Concepts
        
        Blocks
        
        Namenodes and Datanodes
        
        HDFS Federation
        
        HDFS High-Availability
        
        Failover and fencing
        
        Basic Filesystem Operations
        
        Interfaces
        
        HTTP
        
        C
        
        FUSE
        
        Reading Data from a Hadoop URL
        
        Reading Data Using the FileSystem API
        
        FSDataInputStream
        
        FSDataOutputStream
        
        File metadata: FileStatus
        
        Listing files
        
        File patterns
        
        PathFilter
        
        Anatomy of a File Read
        
        Anatomy of a File Write
        
        Coherency Model
        
        Consequences for application design
        
        Keeping an HDFS Cluster Balanced
        
        Using Hadoop Archives
        
        Limitations
        
        Data Integrity
        
        Data Integrity in HDFS
        
        LocalFileSystem
        
        ChecksumFileSystem
        
        Codecs
        
        Compressing and decompressing streams with CompressionCodec
        
        Inferring CompressionCodecs using CompressionCodecFactory
        
        Native libraries
        
        CodecPool
        
        Compressing map output
        
        The Writable Interface
        
        WritableComparable and comparators
        
        Writable wrappers for Java primitives
        
        Text
        
        Indexing
        
        Unicode
        
        Iteration
        
        Mutability
        
        Resorting to String
        
        Implementing a RawComparator for speed
        
        Custom comparators
        
        Serialization IDL
        
        Avro Data Types and Schemas
        
        In-Memory Serialization and Deserialization
        
        The specific API
        
        Python API
        
        C API
        
        SequenceFile
        
        Writing a SequenceFile
        
        Reading a SequenceFile
        
        Displaying a SequenceFile with the command-line interface
        
        Sorting and merging SequenceFiles
        
        The SequenceFile format
        
        Writing a MapFile
        
        Reading a MapFile
        
        MapFile variants
        
        Converting a SequenceFile to a MapFile
        
        The Configuration API
        
        Combining Resources
        
        Variable Expansion
        
        Managing Configuration
        
        GenericOptionsParser, Tool, and ToolRunner
        
        Mapper
        
        Reducer
        
        Running a Job in a Local Job Runner
        
        Fixing the mapper
        
        Packaging a Job
        
        The client classpath
        
        The task classpath
        
        Packaging dependencies
        
        Task classpath precedence
        
        The jobtracker page
        
        The job page
        
        The tasks page
        
        The task details page
        
        Handling malformed data
        
        Profiling Tasks
        
        The HPROF profiler
        
        Other profilers
        
        Decomposing a Problem into MapReduce Jobs
        
        JobControl
        
        Apache Oozie
        
        Defining an Oozie workflow
        
        Packaging and deploying an Oozie workflow application
        
        Running an Oozie workflow job
        
        Anatomy of a MapReduce Job Run
        
        Classic MapReduce (MapReduce 1)
        
        Job submission
        
        Job initialization
        
        Task assignment
        
        Task execution
        
        Streaming and pipes
        
        Job submission
        
        Job initialization
        
        Task assignment
        
        Task execution
        
        Progress and status updates
        
        Job completion
        
        Failures in Classic MapReduce
        
        Task failure
        
        Tasktracker failure
        
        Jobtracker failure
        
        Task failure
        
        Application master failure
        
        Node manager failure
        
        Resource manager failure
        
        The Fair Scheduler
        
        The Capacity Scheduler
        
        The Map Side
        
        The Reduce Side
        
        Configuration Tuning
        
        The Task Execution Environment
        
        Streaming environment variables
        
        Task side-effect files
        
        MapReduce Types
        
        The Default MapReduce Job
        
        The default Streaming job
        
        Keys and values in Streaming
        
        Input Splits and Records
        
        FileInputFormat
        
        FileInputFormat input paths
        
        FileInputFormat input splits
        
        Small files and CombineFileInputFormat
        
        Preventing splitting
        
        File information in the mapper
        
        Processing a whole file as a record
        
        TextInputFormat
        
        KeyValueTextInputFormat
        
        NLineInputFormat
        
        XML
        
        SequenceFileInputFormat
        
        SequenceFileAsTextInputFormat
        
        SequenceFileAsBinaryInputFormat
        
        Text Output
        
        Binary Output
        
        SequenceFileOutputFormat
        
        SequenceFileAsBinaryOutputFormat
        
        MapFileOutputFormat
        
        An example: Partitioning data
        
        MultipleOutputs
        
        Counters
        
        Built-in Counters
        
        Task counters
        
        Job counters
        
        Dynamic counters
        
        Readable counter names
        
        Retrieving counters
        
        Using the new MapReduce API
        
        Preparation
        
        Partial Sort
        
        An application: Partitioned MapFile lookups
        
        Java code
        
        Streaming
        
        Map-Side Joins
        
        Reduce-Side Joins
        
        Using the Job Configuration
        
        Distributed Cache
        
        Usage
        
        How it works
        
        The distributed cache API
        
        Cluster Specification
        
        Network Topology
        
        Rack awareness
        
        Installing Java
        
        Creating a Hadoop User
        
        Installing Hadoop
        
        Testing the Installation
        
        Configuration Management
        
        Control scripts
        
        Master node scenarios
        
        Memory
        
        Java
        
        System logfiles
        
        SSH settings
        
        HDFS
        
        MapReduce
        
        Cluster membership
        
        Buffer size
        
        HDFS block size
        
        Reserved storage space
        
        Trash
        
        Job scheduler
        
        Reduce slow start
        
        Task memory limits
        
        Important YARN Daemon Properties
        
        Memory
        
        Kerberos and Hadoop
        
        An example
        
        Hadoop Benchmarks
        
        Benchmarking HDFS with TestDFSIO
        
        Benchmarking MapReduce with Sort
        
        Other benchmarks
        
        Apache Whirr
        
        Setup
        
        Launching a cluster
        
        Configuration
        
        Running a proxy
        
        Running a MapReduce job
        
        Shutting down a cluster
        
        HDFS
        
        Persistent Data Structures
        
        Namenode directory structure
        
        The filesystem image and edit log
        
        Secondary namenode directory structure
        
        Datanode directory structure
        
        Entering and leaving safe mode
        
        dfsadmin
        
        Filesystem check (fsck)
        
        Finding the blocks for a file
        
        Logging
        
        Setting log levels
        
        Getting stack traces
        
        FileContext
        
        GangliaContext
        
        NullContextWithUpdateThread
        
        CompositeContext
        
        Routine Administration Procedures
        
        Metadata backups
        
        Data backups
        
        Filesystem check (fsck)
        
        Filesystem balancer
        
        Commissioning new nodes
        
        Decommissioning old nodes
        
        HDFS data and metadata upgrades
        
        Start the upgrade
        
        Wait until the upgrade is complete
        
        Check the upgrade
        
        Roll back the upgrade (optional)
        
        Finalize the upgrade (optional)
        
        Installing and Running Pig
        
        Execution Types
        
        Local mode
        
        MapReduce mode
        
        Generating Examples
        
        Structure
        
        Statements
        
        Expressions
        
        Types
        
        Schemas
        
        Validation and nulls
        
        Schema merging
        
        A Filter UDF
        
        Leveraging types
        
        Dynamic invokers
        
        Using a schema
        
        Loading and Storing Data
        
        Filtering Data
        
        FOREACH...GENERATE
        
        STREAM
        
        JOIN
        
        COGROUP
        
        CROSS
        
        GROUP
        
        Parallelism
        
        Parameter Substitution
        
        Dynamic parameters
        
        Parameter substitution processing
        
        Installing Hive
        
        The Hive Shell
        
        Configuring Hive
        
        Logging
        
        Hive clients
        
        Schema on Read Versus Schema on Write
        
        Updates, Transactions, and Indexes
        
        Data Types
        
        Primitive types
        
        Complex types
        
        Conversions
        
        Managed Tables and External Tables
        
        Partitions and Buckets
        
        Partitions
        
        Buckets
        
        The default storage format: Delimited text
        
        Binary storage formats: Sequence files, Avro datafiles and RCFiles
        
        An example: RegexSerDe
        
        Inserts
        
        Multitable insert
        
        CREATE TABLE...AS SELECT
        
        Sorting and Aggregating
        
        MapReduce Scripts
        
        Joins
        
        Inner joins
        
        Outer joins
        
        Semi joins
        
        Map joins
        
        Writing a UDF
        
        Writing a UDAF
        
        A more complex UDAF
        
        HBasics
        
        Backdrop
        
        Whirlwind Tour of the Data Model
        
        Regions
        
        Locking
        
        HBase in operation
        
        Test Drive
        
        Java
        
        MapReduce
        
        REST
        
        Thrift
        
        Avro
        
        Schemas
        
        Loading Data
        
        Optimization notes
        
        Successful Service
        
        HBase
        
        Use Case: HBase at Streamy.com
        
        Very large items tables
        
        Very large sort merges
        
        Life with HBase
        
        Versions
        
        HDFS
        
        UI
        
        Metrics
        
        Schema Design
        
        Joins
        
        Row keys
        
        Installing and Running ZooKeeper
        
        An Example
        
        Group Membership in ZooKeeper
        
        Creating the Group
        
        Joining a Group
        
        Listing Members in a Group
        
        ZooKeeper command-line tools
        
        Data Model
        
        Ephemeral znodes
        
        Sequence numbers
        
        Watches
        
        Multiupdate
        
        APIs
        
        Watch triggers
        
        ACLs
        
        Time
        
        A Configuration Service
        
        The Resilient ZooKeeper Application
        
        InterruptedException
        
        KeeperException
        
        State exceptions
        
        Recoverable exceptions
        
        Unrecoverable exceptions
        
        The herd effect
        
        Recoverable exceptions
        
        Unrecoverable exceptions
        
        Implementation
        
        BookKeeper and Hedwig
        
        Resilience and Performance
        
        Configuration
        
        Getting Sqoop
        
        Sqoop Connectors
        
        A Sample Import
        
        Text and Binary File Formats
        
        Additional Serialization Systems
        
        Controlling the Import
        
        Imports and Consistency
        
        Direct-mode Imports
        
        Imported Data and Hive
        
        Exports and Transactionality
        
        Exports and SequenceFiles
        
        Hadoop Usage at Last.fm
        
        Last.fm: The Social Music Revolution
        
        Hadoop at Last.fm
        
        Generating Charts with Hadoop
        
        The Track Statistics Program
        
        Calculating the number of unique listeners
        
        UniqueListenersMapper
        
        UniqueListenersReducer
        
        SumMapper
        
        SumReducer
        
        MergeListenersMapper
        
        IdentityMapper
        
        SumReducer
        
        Hadoop at Facebook
        
        History
        
        Use cases
        
        Data architecture
        
        Hadoop configuration
        
        Advertiser insights and performance
        
        Ad hoc analysis and product feedback
        
        Data analysis
        
        Data organization
        
        Query language
        
        Data pipelines using Hive
        
        Fair sharing
        
        Space management
        
        Scribe-HDFS integration
        
        Improvements to Hive
        
        Data Structures
        
        CrawlDb
        
        LinkDb
        
        Segments
        
        Link inversion
        
        Generation of fetchlists
        
        Step 1: Select, sort by score, limit by URL count per host
        
        Step 2: Invert, partition by host, sort randomly
        
        Requirements/The Problem
        
        Logs
        
        Log collection
        
        Log storage
        
        Processing
        
        Phase 1: Map
        
        Phase 1: Reduce
        
        Phase 2: Map
        
        Phase 2: Reduce
        
        Sharding
        
        Search results
        
        Fields, Tuples, and Pipes
        
        Operations
        
        Taps, Schemes, and Flows
        
        Cascading in Practice
        
        Flexibility
        
        Hadoop and Cascading at ShareThis
        
        Summary
        
        Measuring Community
        
        Everybody’s Talkin’ at Me: The Twitter Reply Graph
        
        Edge pairs versus adjacency list
        
        Degree
        
        Get neighbors
        
        Community metrics and the 1 million × 1 million problem
        
        Local properties at global scale
        
        Prerequisites
        
        Installation
        
        Configuration
        
        Standalone Mode
        
        Pseudodistributed Mode
        
        Configuring SSH
        
        Formatting the HDFS filesystem
        
        Starting and stopping the daemons (MapReduce 1)
        
        Starting and stopping the daemons (MapReduce 2)
        
        Product information
        
        Title: Hadoop: The Definitive Guide, 3rd Edition
        
        Author(s): Tom White
        
        Release date: May 2012
        
        Publisher(s): O'Reilly Media, Inc.
        
        ISBN: 9781449311520