Hadoop basic concepts
Hadoop building blocks
• NameNode
• DataNode
• Secondary NameNode
• JobTracker
• TaskTracker
1. Characteristics of the HDFS file system
• Stores enormous amounts of information (terabytes or petabytes) by spreading data across a large number of nodes.
Very large single files are supported.
• Provides high data reliability: if one or more nodes fail, the system is not affected and the data remains available.
• Provides fast access to this information and a scalable way to serve it.
You can serve more clients simply by adding more servers.
• Is designed to work with MapReduce, so that computation runs as close as possible to where the data is stored (data locality).
2. Limitations of HDFS
• Low-latency data access
HDFS is optimized for high throughput rather than low latency, so it is a poor fit for applications that need responses at, for example, the millisecond level.
• Small file access
A large number of small files occupies a large amount of NameNode memory, and the seek time can exceed the read time.
• Concurrent writes and random file modification
A file can have only one writer at a time, and only appends are supported.
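To see why small files are costly for the NameNode, here is a rough back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (one per file plus one per block) and a 128 MB block size. Both numbers are assumptions for illustration, not values from these notes, and real usage varies by Hadoop version.

```python
import math

# Assumed rule of thumb (not a measured value): ~150 bytes of NameNode heap
# per namespace object, where each file and each block counts as one object.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB


def namenode_memory_bytes(num_files: int, file_size: int) -> int:
    """Estimate the NameNode heap needed to track num_files files of file_size bytes."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT


# The same 1 TiB of data stored as large files vs. as many small files.
one_tib = 1024 ** 4
large = namenode_memory_bytes(num_files=1024, file_size=one_tib // 1024)  # 1 GiB files
small = namenode_memory_bytes(num_files=one_tib // (100 * 1024), file_size=100 * 1024)  # 100 KiB files

print(f"1 GiB files:   {large / 1024:.0f} KiB of NameNode memory")
print(f"100 KiB files: {small / 1024 ** 2:.0f} MiB of NameNode memory")
```

Under these assumptions the data volume is identical in both cases, but the metadata cost differs by roughly three orders of magnitude.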
3. Working principle of MapReduce system
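As a minimal illustration of the map, shuffle, and reduce phases that the MapReduce model is built on, the following sketch counts words in memory. It is plain Python rather than the Hadoop API, the function names are hypothetical, and everything runs in a single process instead of across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


counts = word_count(["hadoop stores data", "hadoop processes data"])
print(counts)
```

In a real cluster the map and reduce calls would run as separate tasks on different nodes, and the shuffle would move data between them over the network.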

4. The NameNode
The NameNode is the most important of the Hadoop daemons. Hadoop adopts a master/slave architecture for both distributed computing and distributed storage. The distributed storage system is called the Hadoop Distributed File System, or HDFS for short.
Running the NameNode consumes a large amount of memory and I/O resources. Therefore, to reduce the load on the machine, the server hosting the NameNode usually does not store user data or run the computation tasks of MapReduce programs. This means that the NameNode server does not also act as a DataNode or a TaskTracker.
However, the importance of the NameNode has a downside: it is a single point of failure for the Hadoop cluster.
5. The DataNode
Each slave node in the cluster hosts a DataNode daemon, which does the hard work of the distributed file system: reading and writing HDFS data blocks to actual files on the local file system. When you want to read or write an HDFS file, the file is split into multiple blocks, and the NameNode tells the client which DataNode each block resides on. The client then communicates directly with the DataNode daemons to process the local files that correspond to the blocks. In addition, a DataNode can communicate with other DataNodes to replicate its data blocks for redundancy.
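The read and write path described above can be sketched with a toy model. The block size, replication factor, node names, and round-robin placement are all assumptions for illustration; real HDFS uses a rack-aware placement policy and much larger blocks.

```python
BLOCK_SIZE = 4                     # toy block size in bytes (assumed; real blocks are far larger)
REPLICATION = 2                    # assumed replication factor
DATANODES = ["dn1", "dn2", "dn3"]  # hypothetical DataNode names


class ToyNameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        self.block_map = {}  # filename -> list of (block_id, [datanode, ...])

    def allocate(self, filename, num_blocks):
        placements = []
        for i in range(num_blocks):
            # Round-robin placement, a simplification of the real policy.
            nodes = [DATANODES[(i + r) % len(DATANODES)] for r in range(REPLICATION)]
            placements.append((f"{filename}#blk{i}", nodes))
        self.block_map[filename] = placements
        return placements

    def locate(self, filename):
        return self.block_map[filename]


def write_file(namenode, storage, filename, data):
    """Client side: split data into blocks and store each block on its DataNodes."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placements = namenode.allocate(filename, len(blocks))
    for (block_id, nodes), block in zip(placements, blocks):
        for node in nodes:  # in real HDFS, DataNodes forward replicas to each other
            storage[node][block_id] = block


def read_file(namenode, storage, filename):
    """Client side: ask the NameNode for locations, then read from the DataNodes directly."""
    return b"".join(storage[nodes[0]][block_id]
                    for block_id, nodes in namenode.locate(filename))


namenode = ToyNameNode()
storage = {node: {} for node in DATANODES}  # each DataNode's local block store
write_file(namenode, storage, "log.txt", b"hello hdfs!")
print(namenode.locate("log.txt"))
print(read_file(namenode, storage, "log.txt"))
```

Note that the toy NameNode never touches the file contents: it only answers the question of where each block lives, which mirrors the division of labor in the text.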
6. Secondary NameNode
The Secondary NameNode (SNN) is an auxiliary daemon that monitors the state of the HDFS cluster. It usually has a server to itself, on which no DataNode or TaskTracker daemon runs. Unlike the NameNode, the SNN does not receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.
While the NameNode is a single point of failure for a Hadoop cluster, SNN snapshots can help reduce downtime and reduce the risk of data loss. However, failure handling of the NameNode requires manual intervention, i.e. manually reconfiguring the cluster to use SNN as the primary NameNode.
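One way to picture an SNN snapshot: the NameNode keeps a saved namespace image plus a log of later edits, and a checkpoint replays the log onto the image to produce a fresh image. The sketch below models that idea with plain dictionaries. The operation names and data layout are hypothetical and much simpler than the real on-disk formats.

```python
def checkpoint(image, edit_log):
    """Replay logged edits onto a copy of the image and return the new image."""
    new_image = dict(image)  # leave the old image untouched
    for op, path, value in edit_log:
        if op == "create":
            new_image[path] = value
        elif op == "delete":
            new_image.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image


# Hypothetical metadata: path -> number of blocks.
image = {"/data/a.txt": 2}
edit_log = [
    ("create", "/data/b.txt", 5),
    ("delete", "/data/a.txt", None),
    ("create", "/data/c.txt", 1),
]

snapshot = checkpoint(image, edit_log)
print(snapshot)
```

Any edits made after the last checkpoint are not in the snapshot, which is why recovering from a snapshot reduces the risk of data loss rather than eliminating it.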
7. The JobTracker
The JobTracker daemon is the link between your application and Hadoop. Once you submit code to the cluster, the JobTracker determines the execution plan: it decides which files to process, assigns nodes to the different tasks, and monitors all running tasks. If a task fails, the JobTracker automatically restarts it, possibly on a different node, up to a predefined retry limit.
Each Hadoop cluster has only one JobTracker daemon, which typically runs on a master node of the cluster.
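The retry behavior can be sketched as a loop that moves a failed task to the next node until it succeeds or the retry limit is reached. The limit of four attempts, the node names, and the `execute` callback are assumptions for illustration, not Hadoop APIs.

```python
import itertools

MAX_ATTEMPTS = 4  # assumed retry limit, for illustration only


def run_with_retries(task, nodes, execute, max_attempts=MAX_ATTEMPTS):
    """Try a task on successive nodes until it succeeds or the limit is reached.

    Returns the node that completed the task, or None if every attempt failed.
    """
    for attempt, node in enumerate(itertools.cycle(nodes), start=1):
        if attempt > max_attempts:
            return None
        if execute(task, node):
            return node
    return None  # reached only when the node list is empty


def execute(task, node):
    """Hypothetical cluster in which node "tt1" always fails."""
    return node != "tt1"


first = run_with_retries("map-0", ["tt1", "tt2", "tt3"], execute)
second = run_with_retries("map-1", ["tt1"], execute)
print(first, second)
```

The first task fails on `tt1` and succeeds when rescheduled on `tt2`; the second has nowhere else to go and is abandoned once the retry limit is exhausted.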
8. TaskTracker
Like the storage daemons, the computing daemons follow a master/slave architecture: the JobTracker acts as the master, overseeing the overall execution of a MapReduce job, while the TaskTrackers manage the execution of individual tasks on each slave node.
If the JobTracker does not receive a heartbeat from a TaskTracker within the specified time, it assumes that the TaskTracker has crashed and resubmits its tasks to other nodes in the cluster.
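The heartbeat check can be sketched as a small monitor that records when each TaskTracker last reported in and hands back the tasks of any tracker that has been silent too long. The 30-second timeout, the class, and the tracker and task names are assumptions for illustration; timestamps are passed in explicitly so the example is deterministic.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; an assumed value for illustration


class HeartbeatMonitor:
    """Tracks the last heartbeat time of each TaskTracker and its assigned tasks."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}  # tracker name -> timestamp of last heartbeat
        self.assigned = {}   # tracker name -> list of task ids

    def heartbeat(self, tracker, now):
        self.last_seen[tracker] = now
        self.assigned.setdefault(tracker, [])

    def assign(self, tracker, task):
        self.assigned.setdefault(tracker, []).append(task)

    def expire(self, now):
        """Drop trackers that have been silent too long; return their tasks for resubmission."""
        orphaned = []
        for tracker, seen in list(self.last_seen.items()):
            if now - seen > self.timeout:
                orphaned.extend(self.assigned.pop(tracker, []))
                del self.last_seen[tracker]
        return orphaned


monitor = HeartbeatMonitor()
monitor.heartbeat("tt1", now=0.0)
monitor.heartbeat("tt2", now=0.0)
monitor.assign("tt1", "map-0")
monitor.assign("tt2", "map-1")
monitor.heartbeat("tt2", now=25.0)  # tt2 keeps reporting in, tt1 goes silent
to_resubmit = monitor.expire(now=40.0)
print(to_resubmit)
```

The returned task list is what a scheduler would then hand to the surviving trackers.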