Overview
Huawei FusionInsight is a distributed data processing system that provides large-capacity data storage, query, and analysis capabilities. FusionInsight wraps another layer on top of a Hadoop cluster, similar to open-source big data platforms such as CDH and HDP.
HDFS Principle - Distributed file system
When configuring an HBase cluster to connect HDFS to other mirrored disks, many perplexing problems arise. The three cornerstones of big data's underlying technology originated from three Google papers published before 2006: GFS, MapReduce, and Bigtable. GFS and MapReduce directly underpinned the Apache Hadoop project, while Bigtable gave birth to a new database domain called NoSQL.
Because of the high latency of the MapReduce processing framework, Google launched Dremel after 2009 to promote the rise of real-time computing systems, which triggered the second wave of big data technology. Several big data companies then launched their own query and analysis products: Cloudera open-sourced the query engine Impala, Hortonworks open-sourced Stinger, Facebook open-sourced Presto, and UC Berkeley's AMPLab developed the Spark computing framework. All of these technologies use HDFS as the data source, and the most basic operations on it are reads and writes.
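As a minimal sketch of those basic read and write operations (using the standard Hadoop FileSystem API; the NameNode URI and file path below are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "hdfs://namenode:8020" is a placeholder for the cluster's NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/tmp/example.txt");  // placeholder path

        // Write: the client streams bytes; HDFS splits them into blocks and replicates them.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client opens the file and reads it back as a stream.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}
```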
Data storage redundancy
To ensure the fault tolerance and availability of the system, HDFS adopts a multi-copy mode for redundant data storage. Multiple copies of a data block are usually distributed to different slave nodes, which has the following advantages:
Speeding up data transfer; making it easy to detect data errors; and ensuring data reliability.
Strategy for data access
Storage of data
In order to improve data reliability and system availability, and to make full use of network bandwidth, HDFS adopts a rack-aware data storage policy. An HDFS cluster usually contains multiple racks, and data exchanged between different racks has to pass through switches or routers, while nodes in the same rack do not need them. The default redundancy replication factor of HDFS is 3, so each file block is stored in three places: two copies on different machines in the same rack, and the third copy on a machine in a different rack.
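As a rough illustration (the path and values are placeholders, and fs.defaultFS is assumed to point at the cluster), the cluster-wide redundancy factor is the dfs.replication setting, and it can also be overridden for a single file through the FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default (normally set in hdfs-site.xml); 3 is the HDFS default.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);

        // Override the replication factor for a single existing file.
        Path file = new Path("/tmp/example.txt");  // placeholder path
        fs.setReplication(file, (short) 2);

        fs.close();
    }
}
```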
Data reading
HDFS provides an API for determining the ID of the rack to which a node belongs, and clients can invoke this API to obtain their own rack IDs. When a client reads data, it obtains from the primary node a list of the locations where the different copies of the block are stored, including the slave nodes holding those copies. The client can then call the API to determine the rack IDs of itself and of those slave nodes; if a copy is on the same rack as the client, that copy is read first.
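The HDFS client applies this rack preference automatically, but replica locations can also be inspected from application code. A minimal sketch (the file path is a placeholder) that prints the hosts and rack paths holding each block:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/example.txt");  // placeholder path

        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block; each lists the slave nodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("hosts: " + String.join(", ", block.getHosts()));
            // Topology paths look like /rack1/host1, so the rack ID is visible here.
            System.out.println("racks: " + String.join(", ", block.getTopologyPaths()));
        }
        fs.close();
    }
}
```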
Data replication
HDFS data replication adopts a pipeline replication strategy, which greatly improves the efficiency of the replication process. When a client writes a file to HDFS, the file is first written locally and divided into several blocks, the size of each block being determined by the value configured in HDFS. For each block, the client sends a write request to the primary node of the HDFS cluster, and the primary node returns a list of writable slave nodes. The client then writes the block to the first slave node in the list, which forwards the data to the next one, so the copies are written in a pipeline.
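The pipeline itself is handled inside the HDFS client library; from application code, the per-file block size and replication factor can simply be passed when the file is created. A minimal sketch (the path, buffer size, and sizes below are illustrative, and fs.defaultFS is assumed to point at the cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class PipelineWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/tmp/pipeline-example.txt");   // placeholder path
        int bufferSize = 4096;
        short replication = 3;                               // three copies per block
        long blockSize = 128L * 1024 * 1024;                 // 128 MB blocks

        // The returned stream buffers data; the client library splits it into blocks
        // and pushes each block through the slave-node pipeline behind the scenes.
        try (FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize)) {
            out.write("pipelined write".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```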
Data Errors and Recovery
Primary node error
Synchronously store the primary node's metadata in other file systems as well;
Run a second primary node; after the primary node goes down, it can be used to recover the data.
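On a vanilla Apache HDFS deployment (FusionInsight packages Hadoop, so the same keys are assumed here, though the platform normally manages them for you), these two mechanisms roughly correspond to the settings sketched below, usually placed in hdfs-site.xml and shown through the Configuration API only for illustration (all paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;

public class NameNodeRecoveryConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Mechanism 1: write the primary node's metadata to several directories
        // (e.g. a local disk and an NFS mount), so a copy survives a disk failure.
        conf.set("dfs.namenode.name.dir",
                 "/data/1/dfs/name,/mnt/nfs/dfs/name");

        // Mechanism 2: a second (checkpoint) node periodically merges the edit log
        // into a new metadata image; here every 3600 seconds.
        conf.set("dfs.namenode.checkpoint.period", "3600");
        conf.set("dfs.namenode.checkpoint.dir", "/data/1/dfs/namesecondary");

        System.out.println(conf.get("dfs.namenode.name.dir"));
    }
}
```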
Slave node error
Each slave node periodically sends a heartbeat message to the primary node to report its own status. When a slave node fails to report, it is marked as down, and the primary node no longer sends I/O requests to it. If, as a result, the number of copies of some data blocks falls below the redundancy factor, redundant replication is started to generate new copies of them.
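In Apache HDFS (assumed here to apply to the packaged Hadoop as well), the heartbeat interval and the window after which a slave node is declared dead are configurable; the sketch below uses the stock defaults to show how that timeout is commonly derived:

```java
public class HeartbeatTimeoutExample {
    public static void main(String[] args) {
        // Stock Apache HDFS defaults (may be tuned differently in a FusionInsight cluster):
        long heartbeatIntervalSec = 3;        // dfs.heartbeat.interval
        long recheckIntervalMs = 300_000;     // dfs.namenode.heartbeat.recheck-interval

        // The primary node considers a slave node dead after roughly
        // 2 * recheck-interval + 10 * heartbeat-interval.
        long timeoutSec = 2 * (recheckIntervalMs / 1000) + 10 * heartbeatIntervalSec;

        System.out.println("dead-node timeout ~ " + timeoutSec + " s");  // 630 s, about 10.5 minutes
    }
}
```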
Data Error
After reading the data, the client verifies it with MD5 and SHA-1 checks to ensure that the correct data was read. If an error is found, another copy of the data block is read instead.
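Block-level verification happens inside the client library; an application can additionally ask HDFS for a whole-file checksum. A minimal sketch (the path is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/example.txt");  // placeholder path

        // Ask for the file-level checksum (computed from the stored block checksums).
        // May be null on file systems that do not expose checksums.
        FileChecksum checksum = fs.getFileChecksum(file);
        if (checksum != null) {
            System.out.println(checksum.getAlgorithmName() + ": " + checksum);
        }

        fs.close();
    }
}
```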