The Apache Hadoop software library has come into its own. It is the basis for advanced distributed development at a host of companies, government institutions, and scientific research facilities. The Hadoop ecosystem now contains dozens of components for everything from search, databases, and data warehousing to image processing, deep learning, and natural language processing. With the advent of Hadoop 2, different resource managers may be used to provide an even greater level of sophistication and control than was previously possible. Competitors, replacements, successors, and mutations of the Hadoop technologies and architectures abound. These include Apache Flink, Apache Spark, and many others. The “death of Hadoop” has been announced many times by software experts and commentators. We have to face the question squarely: is Hadoop dead? The answer depends on where we draw the boundaries of Hadoop itself. Do we consider Apache Spark, the in-memory successor to Hadoop’s batch file approach, a part of the Hadoop family simply because it also uses HDFS, the Hadoop Distributed File System? Many other “gray areas” exist in which newer technologies replace or enhance the original “Hadoop classic” features. Distributed computing is a moving target, and the boundaries of Hadoop and its ecosystem have changed remarkably over a few short years. In this book, we attempt to show some of the diverse and dynamic aspects of Hadoop and its associated ecosystem, and to convince you that, although changing, Hadoop is still very much alive, relevant to current software development, and particularly interesting to data analytics programmers.