Hello, friend!
This post walks through the key processes and highlights of HBase.
HBase Key Processes
Data Read and Write Process
When you write data, the request is routed to the corresponding HRegionServer for execution.
Your data is first written to MemStore and HLog.
The commit() call returns to the client only after the operation has been written to HLog.
When you read data, the HRegionServer first checks the MemStore cache. If the data is not found there, the HRegionServer searches the StoreFiles on disk.
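To make the write and read path above more concrete, here is a minimal toy model in Python. It is only a sketch and not the HBase API: the ToyRegionServer class, its attributes, and the put/get method names are all hypothetical, and plain lists and dictionaries stand in for HLog, MemStore, and StoreFiles.

```python
class ToyRegionServer:
    """Toy model of the write and read path (hypothetical, not the HBase API)."""

    def __init__(self):
        self.hlog = []        # append-only log entries (stands in for HLog)
        self.memstore = {}    # in-memory cache of recent writes
        self.storefiles = []  # flushed, immutable files, oldest first

    def put(self, row, value):
        # Record the operation in the log, then update the cache.
        # The caller is acknowledged only after the log append.
        self.hlog.append((row, value))
        self.memstore[row] = value
        return True

    def get(self, row):
        # Check the MemStore first, then the StoreFiles from newest to oldest.
        if row in self.memstore:
            return self.memstore[row]
        for storefile in reversed(self.storefiles):
            if row in storefile:
                return storefile[row]
        return None


server = ToyRegionServer()
server.storefiles.append({"row1": "old", "row2": "on-disk"})
server.put("row1", "new")
print(server.get("row1"))  # found in the MemStore
print(server.get("row2"))  # not in the MemStore, so a StoreFile is searched
```

The point of the sketch is the ordering: the log entry exists before the write is acknowledged, and a read only goes to the StoreFiles when the MemStore has no answer.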
Cache Refreshing
The system periodically writes the content of the MemStore cache to a StoreFile on disk, clears the cache, and writes a marker in the HLog.
A new StoreFile is generated each time the cache is flushed. Therefore, each Store contains multiple StoreFiles.
Each HRegionServer has its own HLog file. Each time the HRegionServer starts, it checks the HLog file to see whether any write operation was performed after the most recent cache flush. If such an update is found, the data is written to MemStore and then flushed to a StoreFile. Finally, the old HLog file is deleted and the HRegionServer starts serving requests.
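The flush and startup check described above can be sketched as follows. This is a simplified illustration under my own assumptions: the FLUSH_MARKER value and the flush/recover function names are hypothetical, and the log is just a Python list.

```python
FLUSH_MARKER = "FLUSH"  # hypothetical marker written to the log after a flush


def flush(memstore, storefiles, hlog):
    """Write the MemStore to a new StoreFile, clear it, and mark the log."""
    if memstore:
        storefiles.append(dict(memstore))
        memstore.clear()
    hlog.append(FLUSH_MARKER)


def recover(hlog):
    """Rebuild a MemStore from the writes logged after the last flush marker."""
    last_flush = -1
    for index, entry in enumerate(hlog):
        if entry == FLUSH_MARKER:
            last_flush = index
    memstore = {}
    for row, value in hlog[last_flush + 1:]:
        memstore[row] = value
    return memstore


# One write is flushed, a second write is logged, then the server "crashes".
hlog = [("a", "1")]
memstore = {"a": "1"}
storefiles = []
flush(memstore, storefiles, hlog)
hlog.append(("b", "2"))

recovered = recover(hlog)  # only the write made after the flush is replayed
print(recovered)
```

Writes made before the marker are already safe in a StoreFile, so only the entries after it need to be replayed at startup.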
Merging StoreFiles
A new StoreFile is generated each time data is flushed, so the number of StoreFiles keeps growing and lookups become slower.
The Store.compact() function combines multiple StoreFiles into one.
Because merging is resource-intensive, the merge operation is started only when the number of StoreFiles reaches a threshold.
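Here is a small sketch of threshold-triggered merging. It assumes each StoreFile can be modeled as a dictionary and that newer files override older ones. The COMPACTION_THRESHOLD value and both function names are hypothetical and are not taken from HBase.

```python
COMPACTION_THRESHOLD = 3  # hypothetical value chosen for the example


def compact(storefiles):
    """Merge all StoreFiles (ordered oldest first) into one sorted file."""
    merged = {}
    for storefile in storefiles:
        merged.update(storefile)  # values from newer files overwrite older ones
    return [dict(sorted(merged.items()))]


def maybe_compact(storefiles, threshold=COMPACTION_THRESHOLD):
    """Merge only when the number of StoreFiles reaches the threshold."""
    if len(storefiles) < threshold:
        return storefiles
    return compact(storefiles)


two_files = [{"a": 1}, {"b": 2}]
three_files = two_files + [{"a": 3}]
print(maybe_compact(two_files))    # below the threshold, left untouched
print(maybe_compact(three_files))  # threshold reached, merged into one file
```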
Store Implementation
Store is the core of an HRegionServer.
A Store holds multiple StoreFiles, which are merged into one over time.
When a single StoreFile grows too large, a split is triggered: one parent region is split into two child regions.
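The split step can be illustrated with a toy example. Note the assumptions: real splits are driven by file size, while this sketch uses a row count as a stand-in, and MAX_STOREFILE_ROWS, split_region, and maybe_split are hypothetical names.

```python
MAX_STOREFILE_ROWS = 4  # hypothetical limit; row count stands in for file size


def split_region(rows):
    """Split the sorted row keys of a parent region at the middle key."""
    ordered = sorted(rows)
    middle = len(ordered) // 2
    return ordered[:middle], ordered[middle:]


def maybe_split(rows, max_rows=MAX_STOREFILE_ROWS):
    """Return one region, or two child regions if the parent is too large."""
    if len(rows) <= max_rows:
        return [sorted(rows)]
    return list(split_region(rows))


print(maybe_split(["a", "b"]))                 # small enough, one region
print(maybe_split(["e", "a", "c", "b", "d"]))  # too large, two child regions
```

Each child region covers a contiguous key range, so every row still belongs to exactly one region after the split.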
HLog Implementation
In a distributed environment, system failures must be taken into account. HBase uses HLog to make recovery possible.
HBase configures one HLog file, which is a write-ahead log (WAL), for each HRegionServer.
Updated data can be written to the MemStore cache only after it has been written to the log. In addition, cached data can be flushed to disk only after the log entries that correspond to the data in the MemStore have been written to disk.
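The two ordering rules above can be expressed with sequence numbers. The sketch below is only an illustration: the ToyWal class, its synced_upto field, and the can_flush helper are hypothetical and do not mirror the real HLog implementation.

```python
class ToyWal:
    """Toy write-ahead log that tracks which entries are durable."""

    def __init__(self):
        self.entries = []     # appended entries, possibly only buffered
        self.synced_upto = 0  # highest sequence number known to be on disk

    def append(self, row, value):
        # Each entry gets the next sequence number, starting at 1.
        sequence = len(self.entries) + 1
        self.entries.append((sequence, row, value))
        return sequence

    def sync(self):
        # Pretend that everything appended so far is now on disk.
        self.synced_upto = len(self.entries)


def can_flush(memstore_max_sequence, wal):
    """A MemStore may be flushed only if all of its log entries are durable."""
    return memstore_max_sequence <= wal.synced_upto


wal = ToyWal()
sequence = wal.append("row1", "value1")  # rule 1: log before updating MemStore
before_sync = can_flush(sequence, wal)   # log entry not yet on disk
wal.sync()
after_sync = can_flush(sequence, wal)    # rule 2 is now satisfied
print(before_sync, after_sync)
```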
HBase Highlights
Impact of Multiple HFiles
Read latency increases as the number of HFiles grows.
HBase Compaction
Compaction reduces the number of small files (HFiles) in the same column family of the same region, which improves read performance.
There are two types of compaction: minor compaction and major compaction.
Minor: a small-scale compaction. It is bounded by a minimum and a maximum number of files and generally merges small files from a contiguous time range.
Major: a compaction of all HFiles under the column family of the region.
Minor compaction does not pick files arbitrarily. It selects them according to a file-selection algorithm.
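To give a feel for what a file-selection rule for minor compaction could look like, here is one possible sketch. Please treat it as an assumption-laden illustration and not as the algorithm HBase actually uses: the function name, the min_files, max_files, and ratio parameters, and the ratio rule itself are all made up for this example.

```python
def select_for_minor_compaction(sizes, min_files=3, max_files=10, ratio=1.2):
    """Pick a contiguous run of files from sizes ordered oldest to newest.

    Hypothetical rule: a run qualifies when no file in it is larger than
    `ratio` times the combined size of the other files in the run.
    Returns (start, end) with end exclusive, or None if nothing qualifies.
    """
    for start in range(len(sizes)):
        # Try the longest allowed run from this start, then shorter ones.
        longest_end = min(len(sizes), start + max_files)
        for end in range(longest_end, start + min_files - 1, -1):
            window = sizes[start:end]
            total = sum(window)
            if all(size <= ratio * (total - size) for size in window):
                return start, end
    return None


# One large old file followed by four small, recently flushed files.
selection = select_for_minor_compaction([1000, 10, 12, 11, 9])
print(selection)  # the large file is skipped, the small ones are selected
```

The idea is that merging a very large file together with a few tiny ones rewrites a lot of data for little benefit, so the sketch skips it and picks the run of similarly sized small files.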
OpenScanner
In the OpenScanner process, two different scanners are created to read HFile and MemStore data.
The scanner corresponding to HFile is StoreFileScanner.
The scanner corresponding to MemStore is MemStoreScanner.
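A scan has to combine the output of these scanners into one sorted stream. The sketch below shows the idea with Python generators and heapq.merge. The function names and the "rank" trick for preferring newer data are my own assumptions for illustration and do not reflect how StoreFileScanner or MemStoreScanner are implemented.

```python
import heapq


def memstore_scanner(memstore):
    """Yield (row, rank, value) in row order; rank 0 marks the newest data."""
    for row in sorted(memstore):
        yield row, 0, memstore[row]


def storefile_scanner(storefile, rank):
    """Yield (row, rank, value) in row order for one StoreFile."""
    for row in sorted(storefile):
        yield row, rank, storefile[row]


def scan(memstore, storefiles):
    """Merge all scanners; for duplicate rows the newest source wins."""
    scanners = [memstore_scanner(memstore)]
    # storefiles is ordered oldest first, so the newest file gets rank 1.
    for rank, storefile in enumerate(reversed(storefiles), start=1):
        scanners.append(storefile_scanner(storefile, rank))
    last_row = None
    for row, _rank, value in heapq.merge(*scanners):
        if row != last_row:  # the first hit for a row has the lowest rank
            yield row, value
            last_row = row


result = list(scan({"b": "mem"}, [{"a": "old", "b": "old"}, {"a": "newer"}]))
print(result)
```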
BloomFilter
BloomFilter optimizes some random read scenarios, namely Get operations. It quickly determines whether a piece of user data exists in a large data set, most of which cannot be loaded into memory.
A BloomFilter can give false positives: when it reports that a piece of data exists, the data may in fact be absent. However, a result of "The data xxxx does not exist" is always reliable.
In HBase, BloomFilter data is stored in HFiles.
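The behavior described above (possible false positives, reliable "does not exist" answers) follows directly from how a Bloom filter works. Here is a minimal generic sketch. The ToyBloomFilter class, its default sizes, and the SHA-256 based hashing are assumptions made for this example and are unrelated to the format HBase stores in HFiles.

```python
import hashlib


class ToyBloomFilter:
    """Minimal Bloom filter: a bit array plus several hash positions per key."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive one bit position per seed from a seeded SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        # False means "definitely absent"; True only means "possibly present".
        return all(self.bits[position] for position in self._positions(key))


bloom = ToyBloomFilter()
bloom.add("row1")
print(bloom.might_contain("row1"))  # an added key is never reported as absent
```

If might_contain returns False for a row key, the file can be skipped without reading it, which is what makes the filter useful for Get operations.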
That's all, thanks!