Hello, friend!
In this post, I will give you an overview of HDFS.
HDFS Overview
Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
HDFS provides high-throughput access to application data and is suitable for applications with large data sets.
HDFS relaxes some Portable Operating System Interface (POSIX) requirements to enable streaming access to file system data.
HDFS was originally built as the foundation for the Apache Nutch Web search engine project.
HDFS is a part of the Apache Hadoop Core project.
HDFS Architecture Overview
For the HDFS working principle, see [FI Components] HDFS working principle.
HDFS Namespace Management
The HDFS namespace contains directories, files, and blocks.
HDFS uses a traditional hierarchical file organization. Users can create and delete directories and files, move files between directories, and rename files in the same way as in a common file system.
The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode.
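As a rough illustration of the namespace operations described above, here is a toy sketch of a hierarchical namespace held in memory by a single object, which loosely stands in for the NameNode. The class ToyNamespace and all of its methods are hypothetical names made up for this post. They are not part of any HDFS API, and the real NameNode is far more involved.

```python
class ToyNamespace:
    """Hypothetical in-memory model of a hierarchical namespace (not an HDFS API)."""

    def __init__(self):
        self.entries = {"/": "dir"}   # path -> "dir" or "file"
        self.changes = []             # every namespace change is recorded here

    def _parent(self, path):
        parent = path.rstrip("/").rsplit("/", 1)[0]
        return parent or "/"

    def _subtree(self, path):
        # The entry itself plus everything stored beneath it.
        return [p for p in self.entries if p == path or p.startswith(path + "/")]

    def _add(self, path, kind):
        if path in self.entries:
            raise FileExistsError(path)
        if self.entries.get(self._parent(path)) != "dir":
            raise FileNotFoundError(self._parent(path))
        self.entries[path] = kind
        self.changes.append(("create", kind, path))

    def mkdir(self, path):
        self._add(path, "dir")

    def create_file(self, path):
        self._add(path, "file")

    def delete(self, path):
        if path == "/" or path not in self.entries:
            raise FileNotFoundError(path)
        for p in self._subtree(path):
            del self.entries[p]
        self.changes.append(("delete", path))

    def rename(self, src, dst):
        # Moving between directories and renaming are the same operation here.
        if src == "/" or src not in self.entries:
            raise FileNotFoundError(src)
        if dst in self.entries:
            raise FileExistsError(dst)
        if dst.startswith(src + "/"):
            raise ValueError("cannot move a directory into itself")
        if self.entries.get(self._parent(dst)) != "dir":
            raise FileNotFoundError(self._parent(dst))
        for p in self._subtree(src):
            self.entries[dst + p[len(src):]] = self.entries.pop(p)
        self.changes.append(("rename", src, dst))


ns = ToyNamespace()
ns.mkdir("/logs")
ns.mkdir("/archive")
ns.create_file("/logs/app.log")
ns.rename("/logs/app.log", "/archive/app-2024.log")  # move and rename in one step
ns.delete("/logs")
print(sorted(ns.entries))
print(ns.changes)
```

Keeping the whole tree and the change record in one object mirrors the idea that a single NameNode owns the namespace, which is also why its memory and availability matter so much.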
Communication Protocol
HDFS is a distributed file system deployed on a cluster. Therefore, a large amount of data needs to be transmitted over the network.
All HDFS communication protocols are based on the TCP/IP protocol.
The client initiates a TCP connection to the NameNode through a configurable port and uses the client protocol to interact with the NameNode.
The NameNode and the DataNode interact with each other by using the DataNode protocol.
Both the client protocol and the DataNode protocol are wrapped in a Remote Procedure Call (RPC) abstraction. By design, the NameNode never initiates an RPC request. It only responds to RPC requests from clients and DataNodes.
Client
The client is the most common way for users to operate HDFS. HDFS provides a client during deployment.
The HDFS client is a library that exposes the HDFS file system interfaces and hides most of the complexity of the HDFS implementation.
Strictly speaking, the client is not a part of HDFS.
The client supports common operations such as opening, reading, and writing, and provides a command line mode similar to Shell to access data in HDFS.
HDFS also provides Java APIs as client programming interfaces for applications to access the file system.
For more information about HDFS modules, see [FI Components] HA HDFS Architecture.
Disadvantages of the HDFS Single-NameNode Architecture
HDFS is configured with only one NameNode, which greatly simplifies the system design but also brings some obvious limitations. The details are as follows:
Namespace limitation: The NameNode keeps the namespace in memory. Therefore, the number of objects (files and blocks) that a NameNode can hold is limited by its memory size.
Performance bottleneck: The throughput of the entire distributed file system is limited by the throughput of a single NameNode.
Isolation: Because there is only one NameNode and one namespace in the cluster, different applications cannot be isolated.
Cluster availability: If the only NameNode fails, the entire cluster becomes unavailable.
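To get a feel for the namespace limitation listed above, here is a back-of-the-envelope sketch. The figure of 150 bytes per namespace object is an assumption (a commonly quoted rule of thumb, not a measured value for any particular Hadoop version), and both function names are hypothetical.

```python
GIB = 1024 ** 3

# Assumption: rough rule-of-thumb memory cost per namespace object.
BYTES_PER_OBJECT = 150


def max_namespace_objects(heap_bytes, bytes_per_object=BYTES_PER_OBJECT):
    """Upper bound on objects (files and blocks) that fit in the given memory."""
    return heap_bytes // bytes_per_object


def estimated_heap_bytes(num_files, blocks_per_file=1,
                         bytes_per_object=BYTES_PER_OBJECT):
    """Memory needed if each file costs one object plus one object per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


# How many objects would a 64 GiB NameNode heap hold under this assumption?
print(max_namespace_objects(64 * GIB))

# How much memory would 100 million single-block files need, in GiB?
print(round(estimated_heap_bytes(100_000_000) / GIB, 1))
```

The exact numbers depend heavily on the assumed per-object cost. The point is only that the object count grows with the number of files and blocks while the memory of one machine does not.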
HDFS-related Concepts
Computer Cluster Structure
The distributed file system stores files on multiple computer nodes. Thousands of computer nodes form a computer cluster.
Currently, the computer cluster used by the distributed file system consists of commodity hardware, which greatly reduces hardware costs.
Basic System Architecture
Block
The default size of an HDFS block is 128 MB. A file is divided into multiple blocks, which are used as storage units.
The block size is much larger than that of a common file system, minimizing the addressing overhead.
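The block arithmetic can be sketched in a few lines. The first function below splits a file of a given size into blocks using the 128 MB default. The second one estimates how much of a read is spent on addressing (seeking) rather than transferring data. The 10 ms seek time and 100 MB/s transfer rate are illustrative assumptions, not measurements, and both function names are hypothetical.

```python
MIB = 1024 ** 2
DEFAULT_BLOCK_SIZE = 128 * MIB


def split_into_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Return (offset, length) pairs; only the last block may be shorter."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def seek_overhead(block_size, seek_seconds=0.010, bytes_per_second=100 * 10**6):
    """Fraction of the time spent seeking when reading one block.

    seek_seconds and bytes_per_second are assumed, illustrative values.
    """
    transfer_seconds = block_size / bytes_per_second
    return seek_seconds / (seek_seconds + transfer_seconds)


# A 300 MiB file becomes two full blocks and one 44 MiB block.
for offset, length in split_into_blocks(300 * MIB):
    print(offset // MIB, length // MIB)

# Compare the seek share for a 4 KiB block and a 128 MiB block.
print(seek_overhead(4 * 1024))
print(seek_overhead(DEFAULT_BLOCK_SIZE))
```

Under these assumed numbers, almost all of the time for a 4 KiB block goes to seeking, while for a 128 MiB block the seek share drops below one percent.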
The abstract block concept brings the following obvious benefits:
Supporting large-scale file storage.
Simplifying system design.
Facilitating data backup.
Differences Between NameNode and DataNode
NameNode | DataNode |
Stores metadata. | Stores file content. |
Metadata is stored in memory. | File content is stored on disk. |
Saves the mapping between files, blocks, and DataNodes. | Maintains the mapping between block IDs and local files on DataNodes. |
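The two kinds of mappings in the table can be pictured with plain dictionaries. This is only a sketch: the block IDs, DataNode names, and local paths are made-up examples, and the function locate is hypothetical rather than an HDFS call.

```python
# NameNode side (kept in memory): file -> ordered block IDs, block ID -> DataNodes.
file_to_blocks = {"/data/events.log": ["blk_1", "blk_2"]}
block_to_datanodes = {
    "blk_1": ["dn1", "dn2"],
    "blk_2": ["dn2", "dn3"],
}

# DataNode side (kept on disk): each DataNode maps block IDs to its local files.
datanode_local_files = {
    "dn1": {"blk_1": "/hadoop/dn1/blk_1"},
    "dn2": {"blk_1": "/hadoop/dn2/blk_1", "blk_2": "/hadoop/dn2/blk_2"},
    "dn3": {"blk_2": "/hadoop/dn3/blk_2"},
}


def locate(path):
    """Resolve a file to (block ID, DataNode, local file), using the first replica."""
    result = []
    for block_id in file_to_blocks[path]:            # metadata lookup
        datanode = block_to_datanodes[block_id][0]   # metadata lookup
        local_file = datanode_local_files[datanode][block_id]  # DataNode lookup
        result.append((block_id, datanode, local_file))
    return result


for entry in locate("/data/events.log"):
    print(entry)
```

The first two lookups touch only metadata, and the file content itself is reached only in the last step, on a DataNode.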
For more information about NameNode and DataNode, see HDFS Architecture and Functionality.
Summary of HDFS-related posts
That's all, thanks!