Apache Hadoop

Overview#

Apache Hadoop is an open-source software framework used for distributed storage and processing of big data sets using the MapReduce programming model.

Apache Hadoop is generally considered to be a data lake.

Apache Hadoop consists of computer clusters built from commodity hardware. All the modules in Apache Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

The core of Apache Hadoop consists of two major pieces: a storage part, the Hadoop Distributed File System (HDFS), and a processing part, the MapReduce programming model.

Hadoop splits files into large blocks and distributes them across nodes in a cluster.
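To illustrate how those distributed blocks are visible to applications, here is a minimal sketch using the HDFS Java client API to list the block locations of a file; the path /data/example.txt and the cluster configuration are assumptions for the example, not part of this page.

{{{
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connects to the configured HDFS
        Path file = new Path("/data/example.txt");  // hypothetical file already stored in HDFS

        FileStatus status = fs.getFileStatus(file);
        // Each BlockLocation describes one block and the datanodes holding its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
}}}

Run against a multi-node cluster, this prints one line per block, showing which datanodes hold the replicas of that block.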

Apache Hadoop then transfers packaged code to the nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
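To make the processing side concrete, below is a minimal MapReduce word-count job, closely following the classic example from the Hadoop MapReduce tutorial; the input and output paths are supplied on the command line and are placeholders here.

{{{
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map tasks run, where possible, on the nodes that hold the input split (data locality).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    // Reduce tasks sum the counts for each word after the shuffle phase.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
}}}

The map tasks are scheduled next to the input blocks they read, which is the data locality described above; only the much smaller (word, count) pairs travel across the network to the reducers.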

The genesis of Apache Hadoop came from the Google File System paper that was published in October 2003.

More Information#

There might be more information on this subject on one of the following: