[HBase] HBase Scan & Filter: Principles and Workflow in Detail (2)

Posted: 2015-1-23 17:20:27  Last reply: 2015-03-12 18:33:03
Jieshan    

2.3.4 Get the expected scanning results

To make the explanation below easier to follow, I first need to explain how a KeyValue is stored in an HFile.

■ How a KeyValue is stored

Figure 4 KeyValue Storage format

Pay special attention to the Key length field shown in the first part of the figure above. It stores an integer value that represents the length of the key, calculated with the formula below:

Ø Key length = Key infrastructure size + Actual RowKey length + Actual Column family length + Actual Qualifier length

Ø Key infrastructure size = Size of the RowKey length field (2) + Size of the Column family length field (1) + Size of the TimeStamp field (8) + Size of the Datatype field (1) = 2 + 1 + 8 + 1 = 12 bytes

As we can see, this structure does not store the qualifier length. So how can the qualifier length be calculated from the stored values?

Figure 5 Calculate qualifier length
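
As a rough illustration of the formula above, here is a minimal Java sketch that serializes only the key part and then recovers the qualifier length by subtraction. The field order (row length, row, family length, family, qualifier, timestamp, type) is an assumption based on the standard KeyValue layout, since Figure 4 is not reproduced here. The class and method names are hypothetical and are not HBase APIs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class QualifierLengthSketch {
    // Fixed-size fields inside the key:
    // row length (2) + family length (1) + timestamp (8) + type (1) = 12.
    static final int KEY_INFRASTRUCTURE_SIZE = 2 + 1 + 8 + 1;

    /** Serializes only the key part, using the assumed field order. */
    static byte[] buildKey(byte[] row, byte[] family, byte[] qualifier,
                           long timestamp, byte type) {
        ByteBuffer buf = ByteBuffer.allocate(
            KEY_INFRASTRUCTURE_SIZE + row.length + family.length + qualifier.length);
        buf.putShort((short) row.length);   // 2 bytes: row length
        buf.put(row);
        buf.put((byte) family.length);      // 1 byte: column family length
        buf.put(family);
        buf.put(qualifier);                 // no length prefix for the qualifier
        buf.putLong(timestamp);             // 8 bytes
        buf.put(type);                      // 1 byte
        return buf.array();
    }

    /** Recovers the qualifier length, which is never stored explicitly. */
    static int qualifierLength(byte[] key) {
        ByteBuffer buf = ByteBuffer.wrap(key);
        int rowLength = buf.getShort(0);                  // first 2 bytes
        int familyLength = buf.get(2 + rowLength) & 0xFF; // byte right after the row
        // key.length plays the role of the stored "Key length" integer.
        return key.length - KEY_INFRASTRUCTURE_SIZE - rowLength - familyLength;
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The type byte here is an arbitrary value chosen for the example.
        byte[] key = buildKey(bytes("row-001"), bytes("cf"), bytes("name"),
                              1422004827000L, (byte) 4);
        System.out.println("key length = " + key.length);
        System.out.println("qualifier length = " + qualifierLength(key));
    }
}
```

With a 7-byte row, a 2-byte family and a 4-byte qualifier, the key length is 12 + 7 + 2 + 4 = 25, and subtracting the known parts leaves 4 for the qualifier.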

■ Introduction to all types of Scanners

Figure 6 Scanning flow

Ø InternalScanner returns a list of KeyValues per call, while KeyValueScanner returns only one KeyValue per call. The implementation of InternalScanner delegates to KeyValueScanner.

Ø RegionScanner and StoreScanner implement the InternalScanner interface.

Ø StoreScanner, MemStoreScanner and StoreFileScanner implement the KeyValueScanner interface.

Definition: If a scanner A is made up of Scanner A-1, Scanner A-2 and Scanner A-3, we call scanner A the parent-Scanner, and we call Scanner A-1, Scanner A-2 and Scanner A-3 child-Scanners. This definition is used throughout this document.

◆ How to organize a group of scanners of the same or similar type

A RegionScanner is made up of a group of StoreScanners, and a StoreScanner is made up of a MemStoreScanner and a group of StoreFileScanners. All of these scanners are merged in a class named KeyValueHeap:

Figure 7 KeyValueHeap

It has one member variable, defined as “PriorityQueue<KeyValueScanner> heap”, which is used to store all the child-Scanners.

◆ How to switch from one scanner to another scanner

Figure 8 Switching between StoreScanners

The critical data structures are the PriorityQueue and the Comparator (its exact name is KVScannerComparator):

Ø PriorityQueue: Each time, one Scanner is polled from this queue and put back after it has been used. (If peeking at the next KeyValue of this scanner returns null, this scanner is finished, so it can be closed and does not need to be put back.) Which Scanner is polled is determined by the Comparator below.

Ø Comparator: This is how the comparator works:

Figure 9 KVScannerComparator

Ø Regarding peek: it only looks at the next KeyValue in the Scanner and does not advance the scanner.

Switching from one StoreFileScanner to another is much easier than the flow above: once one StoreFileScanner is finished, it is closed and scanning switches to the next StoreFileScanner.
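
The poll-and-put-back loop on the PriorityQueue described above can be sketched roughly as follows. This is a simplified model, not the HBase implementation: KeyValues are reduced to plain strings, ListScanner and MergingHeap are hypothetical stand-ins for KeyValueScanner and KeyValueHeap, and the comparator only compares the peeked keys. Whatever tie-breaking KVScannerComparator applies in Figure 9 is not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class KeyValueHeapSketch {
    /** Hypothetical child-Scanner over an already sorted list of keys. */
    static class ListScanner {
        private final Iterator<String> it;
        private String current;
        boolean closed;

        ListScanner(List<String> sortedKeys) {
            it = sortedKeys.iterator();
            current = it.hasNext() ? it.next() : null;
        }

        /** Looks at the next key without advancing. */
        String peek() { return current; }

        /** Returns the next key and advances. */
        String next() {
            String result = current;
            current = it.hasNext() ? it.next() : null;
            return result;
        }

        void close() { closed = true; }
    }

    /** Hypothetical parent-Scanner that merges its child-Scanners in key order. */
    static class MergingHeap {
        // Ordered by each scanner's peeked key, smallest first.
        private final PriorityQueue<ListScanner> heap =
            new PriorityQueue<>((a, b) -> a.peek().compareTo(b.peek()));

        MergingHeap(List<ListScanner> children) {
            for (ListScanner child : children) {
                if (child.peek() != null) heap.add(child); else child.close();
            }
        }

        String next() {
            ListScanner top = heap.poll();        // scanner with the smallest key
            if (top == null) return null;         // all child-Scanners are finished
            String result = top.next();
            if (top.peek() != null) heap.add(top); // still has data: put it back
            else top.close();                      // finished: close, do not put back
            return result;
        }
    }

    static List<String> mergeAll(List<List<String>> sources) {
        List<ListScanner> children = new ArrayList<>();
        for (List<String> source : sources) children.add(new ListScanner(source));
        MergingHeap heap = new MergingHeap(children);
        List<String> out = new ArrayList<>();
        for (String kv = heap.next(); kv != null; kv = heap.next()) out.add(kv);
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> sources = new ArrayList<>();
        sources.add(Arrays.asList("row-1", "row-4"));
        sources.add(Arrays.asList("row-2", "row-3"));
        System.out.println(mergeAll(sources));
    }
}
```

Re-inserting a scanner after each call is what makes the switch between child-Scanners automatic: the scanner whose next key is smallest always rises to the head of the queue.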

◆ How to seek to a given KeyValue in a StoreFileScanner

Seeking is widely used in scanning and filtering, for example in the scenarios below:

Ø Start a scan with a specified startKey. We need to seek to the nearest KeyValue that is greater than or equal to the given startKey.

Ø One row or one column family is filtered out entirely, so we seek to the next valid KeyValue.

This is the flow of seeking:

Figure 10 How to seek a KeyValue
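
The "nearest KeyValue greater than or equal to the given key" semantics can be illustrated with a minimal sketch. It does not reproduce the flow in Figure 10 and does not model HFile blocks or indexes. It only shows the positioning rule on a sorted in-memory array, assuming unique keys. SortedArrayScanner is a hypothetical class, and keys are reduced to plain strings.

```java
import java.util.Arrays;

class SeekSketch {
    /** Hypothetical scanner over a sorted array of unique keys. */
    static class SortedArrayScanner {
        private final String[] keys; // sorted ascending
        private int pos = 0;

        SortedArrayScanner(String[] sortedKeys) {
            this.keys = sortedKeys;
        }

        /**
         * Positions the scanner at the first key >= target.
         * Returns false if no such key exists.
         */
        boolean seek(String target) {
            int idx = Arrays.binarySearch(keys, target);
            // A negative result encodes the insertion point as -(point) - 1.
            pos = idx >= 0 ? idx : -(idx + 1);
            return pos < keys.length;
        }

        /** Looks at the current key without advancing. */
        String peek() {
            return pos < keys.length ? keys[pos] : null;
        }

        /** Returns the current key and advances. */
        String next() {
            return pos < keys.length ? keys[pos++] : null;
        }
    }

    public static void main(String[] args) {
        SortedArrayScanner scanner =
            new SortedArrayScanner(new String[] {"row-10", "row-20", "row-30"});
        // startKey falls between two stored keys: land on the next larger one.
        System.out.println(scanner.seek("row-15") + " " + scanner.peek());
        // startKey is past the last key: nothing left to scan.
        System.out.println(scanner.seek("row-99") + " " + scanner.peek());
    }
}
```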

◆ Summary of all scanners

Let’s summarize all the scanners, and see the relationships between them:

Figure 11 Relationship between all scanners

Ø KeyValueHeap: This class is used at the Region level to merge across Stores and at the Store level to merge across the memstore and StoreFiles. It is a member variable of RegionScannerImpl and StoreScanner.

· RegionScannerImpl uses KeyValueHeap to merge across Stores.

· StoreScanner uses KeyValueHeap to merge across the MemStore and StoreFiles.

So we can say that KeyValueHeap is a parent-Scanner with one or more child-Scanners. When this class is instantiated, it loads all of its child-Scanners.

Ø KeyValueScanner: It is used to get the next KeyValue. It can also be used to peek at the next KeyValue (without advancing the scanner) and to seek to a KeyValue.

Ø InternalScanner: It is used to get a number of KeyValues. (We can specify how many KeyValues to get by setting the batch parameter. By default, it gets all the KeyValues of one row.)
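
To make the difference between the two scanner styles concrete, here is a simplified sketch of an InternalScanner-like call built on top of a KeyValueScanner-like one. Cell, CellScanner and the static next method are hypothetical names for illustration only; they are not the HBase interfaces. The convention that a non-positive batch value means "no limit" is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class BatchSketch {
    /** Simplified stand-in for a KeyValue: only row and qualifier. */
    static class Cell {
        final String row;
        final String qualifier;

        Cell(String row, String qualifier) {
            this.row = row;
            this.qualifier = qualifier;
        }

        @Override
        public String toString() { return row + "/" + qualifier; }
    }

    /** KeyValueScanner-like view: one cell at a time, with peek. */
    static class CellScanner {
        private final Iterator<Cell> it;
        private Cell current;

        CellScanner(List<Cell> sortedCells) {
            it = sortedCells.iterator();
            current = it.hasNext() ? it.next() : null;
        }

        Cell peek() { return current; }

        Cell next() {
            Cell result = current;
            current = it.hasNext() ? it.next() : null;
            return result;
        }
    }

    /**
     * InternalScanner-like call: fills 'out' (expected to be empty) with cells
     * of the current row, at most 'batch' of them. A batch <= 0 means no limit,
     * i.e. the whole row. Returns true if more cells remain afterwards.
     */
    static boolean next(CellScanner scanner, List<Cell> out, int batch) {
        Cell first = scanner.peek();
        if (first == null) return false;
        while (scanner.peek() != null
                && scanner.peek().row.equals(first.row)
                && (batch <= 0 || out.size() < batch)) {
            out.add(scanner.next());
        }
        return scanner.peek() != null;
    }

    public static void main(String[] args) {
        CellScanner scanner = new CellScanner(Arrays.asList(
            new Cell("r1", "a"), new Cell("r1", "b"), new Cell("r1", "c"),
            new Cell("r2", "a")));
        boolean more = true;
        while (more) {
            List<Cell> batch = new ArrayList<>();
            more = next(scanner, batch, 2); // at most 2 cells per call
            System.out.println(batch);
        }
    }
}
```

Each call stops either at the batch limit or at the row boundary, whichever comes first, so a batch never mixes cells from two rows.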

1.1.1 Close RegionScanner

The closure flow is simple:

Figure 12 Closure flow

DDR  Elite  Posted: 2015-3-12 18:33:03

Bookmarking this for now; it is too advanced for me.
