
FusionInsight HBase cluster faults and cause analysis

Latest reply: Aug 15, 2019 00:49:26

Hello, everyone!

This is a case in which HBase became faulty and all RegionServers restarted, affecting customer services.

Problem description

HBase became faulty and all RegionServers restarted, affecting customer services.

Problem analysis

1. The HBase service status was normal and all instances were in the normal state. However, the HBase Web UI showed that all RegionServer processes had restarted between 18:10 and 18:15, causing all regions (330,000+) to be reassigned. The regions were observed gradually coming back online.

2. Analyzed why the RegionServer processes restarted and found that the cause was a heartbeat timeout on the connection to ZooKeeper.
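A quick way to confirm this cause from the logs could look like the following (the log path and file pattern are assumptions for a FusionInsight install, not taken from this case; adjust them for your environment):

```shell
# Hypothetical check: look for ZooKeeper session loss in RegionServer logs.
# The path below is an assumption; verify the actual log directory on
# your own cluster before running this.
grep -iE 'session (expired|timed out)|connection to zookeeper.*(lost|timeout)' \
  /var/log/Bigdata/hbase/rs/*.log | tail -n 20
```

A RegionServer that aborts after losing its ZooKeeper session typically logs the session loss immediately before shutting down, which is why this pattern is a reasonable first filter.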


3. Checked the cluster alarms and found a "ZooKeeper service unavailable" alarm reported for the cluster on 2019-03-13 at 10:12.


4. Analyzed why the ZooKeeper service was unavailable and found that the node HDCDDSJAPP18106, which hosted the ZooKeeper leader instance, was stuck system-wide from 18:10 to 18:18: even basic OS commands could not be executed. For example:


Even the system command vmstat failed to respond.

5. Analyzed why the HDCDDSJAPP18106 system was stuck. The OS messages log showed a RAID controller card problem during the period when the issue occurred.
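Checks along these lines could reproduce the analysis on a suspect node (a sketch assuming a Linux host; the driver names in the grep pattern are illustrative examples, not taken from this case, and log paths vary by distribution):

```shell
# Bound a basic command so a hung system does not also hang the check;
# on the faulty node in this case, even vmstat would not return.
timeout 5 vmstat 1 2 || echo "vmstat did not finish within 5 s"

# Scan the kernel/messages log for RAID-controller errors around the
# incident window (megaraid/mptsas are common example driver names).
grep -iE 'raid|megaraid|mpt[23]?sas' /var/log/messages | tail -n 20
```

The `timeout` wrapper matters here: on a node whose I/O stack is wedged, a diagnostic command can hang exactly like the workload did.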


6. Analyzed the ZooKeeper logs for that period. The stuck node was the one hosting the leader. After the leader node got stuck, the other nodes could no longer maintain their session heartbeats with it. Once the follower nodes sensed the leader abnormality, a new leader was elected and the ZooKeeper service was restored.

1) On the HDCDDSJAPP18106 node, the instance health check runs once every 30 seconds; the last check to complete normally was at 18:09:49.


2) The logs of the other four ZooKeeper instances all show that at 18:13:04 they hit a "Read time out" while reading data from the leader, disconnected from it, and triggered a re-election.


3) The followers disconnected from the leader at 18:13:04 and launched a new election to select a new leader.

2019-03-13 18:13:04,641 | INFO  | WorkerReceiver[myid=793] | Notification: 2 (message format version), 789 (n.leader), 0x1004967222 (n.zxid), 0x2 (n.round), LOOKING (n.state), 789 (n.sid), 0x10 (n.peerEPoch), FOLLOWING (my state)1000000000 (n.config version)
2019-03-13 18:13:04,641 | INFO  | NIOServerCxnFactory.AcceptThread:hqcddsjapp181014/10.180.181.14:24002 | Accepted socket connection from /10.180.181.196:50426

2019-03-13 18:13:04,641 | WARN  | QuorumPeer[myid=793](plain=/10.180.181.14:24002)(secure=disabled) | PeerState set to LOOKING

4) At 2019-03-13 18:13:04 the election completed; HQCDDSJAPP181014 became the new leader and the ZooKeeper service was restored.

2019-03-13 18:13:04,890 | INFO  | QuorumPeer[myid=793](plain=/10.180.181.14:24002)(secure=disabled) | Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id793,name1=replica.793,name2=LeaderElection]
2019-03-13 18:13:04,893 | INFO  | QuorumPeer[myid=793](plain=/10.180.181.14:24002)(secure=disabled) | zookeeper.leader.maxConcurrentSnapshots = 10
2019-03-13 18:13:04,893 | INFO  | QuorumPeer[myid=793](plain=/10.180.181.14:24002)(secure=disabled) | zookeeper.leader.maxConcurrentSnapshotTimeout = 5
2019-03-13 18:13:04,894 | INFO  | QuorumPeer[myid=793](plain=/10.180.181.14:24002)(secure=disabled) | LEADING - LEADER ELECTION TOOK - 5

7. ZooKeeper principle description

The ZooKeeper component runs distributed across a group of machines; a leader is elected at startup, and the leader atomically broadcasts state changes to all other nodes. Each client component writes its data to ZooKeeper and maintains a heartbeat session with the ZooKeeper server. Within the session timeout period the client sends ZooKeeper a heartbeat message about once per second; if ZooKeeper fails to respond to these requests within the session timeout period, the session expires.
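The role of each instance, and hence whether a leader currently exists, can be probed with ZooKeeper's built-in four-letter-word commands. A sketch, assuming the client port 24002 seen in the logs above, hypothetical hostnames, and that `srvr` is enabled (newer ZooKeeper releases require listing it in `4lw.commands.whitelist`):

```shell
# zk1..zk5 are placeholder hostnames; replace with the five ZooKeeper
# instance nodes of your cluster.
for host in zk1 zk2 zk3 zk4 zk5; do
  printf '%s: ' "$host"
  # 'srvr' reports "Mode: leader" or "Mode: follower" for each instance.
  echo srvr | nc -w 2 "$host" 24002 | grep -E '^Mode:' || echo 'no response'
done
```

During the fault window in this case, such a probe would have shown the stuck node not responding and, until re-election finished, no instance reporting "Mode: leader".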

ZooKeeper considers the leader failed when follower pings to it go unanswered: once a majority of followers can no longer get responses from the leader, they enter the LOOKING state and start an election. The ping timeout is controlled by ZooKeeper server parameters; the default here is 60 seconds (syncLimit of 15 multiplied by tickTime of 4000 ms). When the leader's network link drops, the leader node fails outright, or the leader process exits abnormally, the followers can elect a new leader promptly. But when the leader node or instance merely hangs, as in this case, the followers detect the abnormality only after the timeout expires, and only then start a re-election. While ZooKeeper has no instance in the leader state, it cannot respond to client heartbeats or read/write requests.
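The 60-second figure follows directly from the two parameters; a worked check using the values quoted above:

```shell
# Follower-to-leader failure-detection timeout = syncLimit * tickTime.
tick_time_ms=4000   # tickTime used by this cluster, in milliseconds
sync_limit=15       # default syncLimit, counted in ticks
timeout_ms=$((sync_limit * tick_time_ms))
echo "detection timeout: ${timeout_ms} ms"   # prints: detection timeout: 60000 ms
```

This is why all RegionServers restarted: their own ZooKeeper session timeouts expired while the followers were still waiting out this 60-second window.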






Root cause

Summary: During HBase operation, each RegionServer maintains a heartbeat session with ZooKeeper. The ZooKeeper leader node got stuck, and while the followers were detecting the leader failure and re-electing, the ZooKeeper service was unavailable: it could neither maintain heartbeats with clients such as RegionServer nor respond to client read/write requests, so the HBase service ultimately became unavailable. After ZooKeeper elected a new leader, the RegionServers came back online.


Workaround

Adjust the ZooKeeper server parameters to shorten the time a follower takes to detect a stuck leader. If the leader node hangs again, the followers will then detect the failure and restore the service before the session timeouts of components such as HBase expire. The adjustment: change the ZooKeeper syncLimit parameter from the default value of 15 down to 1.
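In a vanilla Apache ZooKeeper deployment, the corresponding zoo.cfg entries would look like this (a sketch; on FusionInsight the parameter is normally changed through the cluster management interface rather than by editing the file by hand):

```properties
# zoo.cfg (fragment)
tickTime=4000   # one tick = 4000 ms, matching this cluster
syncLimit=1     # lowered from the default 15, per the workaround above;
                # a stuck leader is now detected after ~4 s instead of ~60 s
```

Note the trade-off: a very small syncLimit makes the quorum quicker to abandon a stuck leader, but also less tolerant of transient network or disk slowness on the leader.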


Solution

Work with the OS and hardware teams to analyze the abnormalities on the HDCDDSJAPP18106 node, such as the RAID controller card fault, and resolve the root cause of the system hang.

That's all, thanks!

When you encounter a problem, first check whether the cluster has reported any alarms.
