Hello all,
This post describes the architecture of HDFS. The Hadoop Distributed File System (HDFS) provides high-throughput data access, is well suited to processing large data sets, and exposes POSIX-like APIs.
An HA HDFS cluster consists of an active NameNode, a standby NameNode, and multiple DataNodes, as shown in Figure 1.
HDFS works in master/slave mode: the NameNodes run on the master nodes, and the DataNodes run on the slave nodes. A ZKFC process runs alongside each NameNode. The NameNodes and DataNodes communicate with each other over TCP/IP. The NameNode, DataNode, ZKFC, and JournalNode processes can all be deployed on Linux servers.
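As a rough illustration of how these roles are wired together, an HA deployment is declared in `hdfs-site.xml` along the lines of the following sketch. The nameservice name `mycluster` and the host names `nn1`/`nn2`/`jn1`-`jn3` are placeholders chosen for this example, not values from this post:

```xml
<!-- hdfs-site.xml (sketch; nameservice and host names are placeholders) -->
<configuration>
  <!-- Logical name for the HA NameNode pair -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>nn1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>nn2.example.com:8020</value>
  </property>
  <!-- JournalNode quorum through which the NameNodes share edit logs -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
  </property>
  <!-- Lets the ZKFC processes perform automatic failover -->
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>
```

The ZooKeeper ensemble used by the ZKFCs is configured separately (`ha.zookeeper.quorum` in `core-site.xml`).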
Figure 1 HA HDFS Architecture
Table 1 describes the functions of each module shown in Figure 1.
Table 1 HDFS modules
| Module | Function |
|---|---|
| NameNode | Manages the file system namespace, directory structure, and metadata, and provides a backup mechanism. |
| DataNode | Stores the data blocks of each file and periodically reports its stored blocks to the NameNode. |
| JournalNode | Synchronizes metadata between the active and standby NameNodes in a High Availability (HA) cluster. |
| ZKFC | One ZKFC must be deployed for each NameNode. It monitors the health of its NameNode, writes status information to ZooKeeper, and participates in electing the active NameNode. |
| ZK Cluster | The ZooKeeper cluster is a coordination service that helps the ZKFC processes perform leader election. |
| HttpFS gateway | HttpFS is a single, stateless gateway process that exposes the WebHDFS REST API to external clients and the FileSystem API to HDFS. It enables data transfer between different Hadoop versions and can serve as a gateway for accessing HDFS from behind a firewall. |
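To make the WebHDFS REST API mentioned in the table concrete, here is a small sketch that parses a `LISTSTATUS` response of the shape WebHDFS/HttpFS returns. The sample payload is hand-written for illustration, not captured from a live cluster:

```python
import json

# Hand-written sample of a WebHDFS LISTSTATUS response body.
# The real call would be an HTTP GET to the HttpFS gateway, e.g.
#   GET /webhdfs/v1/<path>?op=LISTSTATUS
sample = """
{
  "FileStatuses": {
    "FileStatus": [
      {"pathSuffix": "data.csv", "type": "FILE", "length": 24930, "replication": 3},
      {"pathSuffix": "logs", "type": "DIRECTORY", "length": 0, "replication": 0}
    ]
  }
}
"""

def list_entries(payload: str):
    """Return (name, type, length) tuples from a LISTSTATUS response body."""
    statuses = json.loads(payload)["FileStatuses"]["FileStatus"]
    return [(s["pathSuffix"], s["type"], s["length"]) for s in statuses]

print(list_entries(sample))
```

Because HttpFS speaks plain HTTP and JSON, any client that can issue REST calls can use it, which is what makes it useful across Hadoop versions and firewalls.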
Thanks.