Upgrade Failure Due to RabbitMQ Service Exceptions


Hi, everyone! Today I'm going to walk you through a case of upgrade failure caused by RabbitMQ service exceptions.

Problem Description

After FusionSphere OpenStack is upgraded from V100R006C00SPC103 to V100R006C10SPC500, an alarm indicating that the RabbitMQ service is faulty is reported repeatedly on the FusionSphere OpenStack OM web client.


Alarm

RabbitMQ service is faulty.


Problem Analysis

1.     Log in to the FusionSphere OpenStack controller node, switch to user root, and import the environment variables. When prompted, enter 1 and then the password (FusionSphere123 in this case).

source set_env


2.     Run the following command to check the RabbitMQ service status:

cps template-instance-list --service rabbitmq rabbitmq

[Screenshot: output of the cps template-instance-list command]

The output shows that the RabbitMQ service repeatedly switches between active and standby.
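To put a number on "repeatedly", one option is to run the status command at regular intervals, note which node is active each time, and count how often it changes. The following is only a minimal Python sketch of that idea. It assumes the samples have already been collected as (timestamp, active node) pairs; the function names, the 10-minute window, and the threshold are all hypothetical and not part of any FusionSphere tool.

```python
from typing import List, Tuple

Sample = Tuple[float, str]  # (timestamp in seconds, name of the active node)


def count_switchovers(samples: List[Sample]) -> int:
    """Count how often the active node changes between consecutive samples."""
    switches = 0
    for (_, prev_node), (_, cur_node) in zip(samples, samples[1:]):
        if prev_node != cur_node:
            switches += 1
    return switches


def is_flapping(samples: List[Sample],
                window_s: float = 600.0,
                max_switches: int = 2) -> bool:
    """Return True if the recent window contains more than max_switches switchovers.

    window_s and max_switches are illustrative defaults, not product values.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    recent = [s for s in samples if latest - s[0] <= window_s]
    return count_switchovers(recent) > max_switches


# Hypothetical samples taken once a minute
observed = [(0.0, "nodeA"), (60.0, "nodeB"), (120.0, "nodeA"), (180.0, "nodeB")]
print(count_switchovers(observed), is_flapping(observed))
```

A healthy service should show zero switchovers over the same period, so even a small count is worth investigating.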

(1). When the RabbitMQ service is abnormal, the components that depend on it become abnormal as well. The nova-compute service is one example, and its failure causes VM migration failures. In addition, the nova-compute log frequently prints "AMQP server on 172.28.120.8:5672 is unreachable.", indicating that the connection to RabbitMQ is abnormal.

[Screenshot: nova-compute log showing the AMQP errors]
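To see how often this message appears, the log lines can be scanned for it and counted per server address. The following is a minimal Python sketch. It matches only the message text quoted above; the rest of each sample line (timestamp, log level) is made up for illustration, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the message quoted above and captures the host and port
AMQP_UNREACHABLE = re.compile(r"AMQP server on (\S+?):(\d+) is unreachable")


def count_unreachable(log_lines):
    """Count 'AMQP server ... is unreachable' messages per (host, port)."""
    counts = Counter()
    for line in log_lines:
        match = AMQP_UNREACHABLE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Illustrative lines; only the quoted message text comes from the real log
sample_lines = [
    "2017-01-01 10:00:01 ERROR AMQP server on 172.28.120.8:5672 is unreachable.",
    "2017-01-01 10:00:02 INFO some unrelated message",
    "2017-01-01 10:00:05 ERROR AMQP server on 172.28.120.8:5672 is unreachable.",
]
print(count_unreachable(sample_lines))
```

In practice the lines would come from the nova-compute log file on the affected node rather than from a hard-coded list.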


(2). Log in to the node where the RabbitMQ service is deployed and view the RabbitMQ logs. The error information is as follows:

Log path: /var/log/fusionsphere/component/rabbitmq

[Screenshot: error information in the RabbitMQ log]


(3). Check whether the RabbitMQ memory watermark and host resource isolation are configured.

Run the following command:

cps template-params-show --service rabbitmq rabbitmq

If the returned value of memory_high_watermark is null, the default configuration is in use.

[Screenshot: output of the cps template-params-show command]
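The null check can also be scripted once the command output has been captured as text. The following is a minimal Python sketch. The exact output layout is visible only in the screenshot, so the two-column "| name | value |" table parsed here is an assumption, and both function names are hypothetical.

```python
def parse_param_table(text):
    """Parse rows of the ASSUMED form '| name | value |' into a dict."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders such as +-----+ and any other text
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) == 2:
            params[cells[0]] = cells[1]
    return params


def uses_default_watermark(params):
    """True if memory_high_watermark is missing, empty, or null."""
    value = params.get("memory_high_watermark", "")
    return value.strip().lower() in ("", "null", "none")


# Illustrative output in the assumed layout, not copied from a real system
sample_output = """
+-----------------------+-------+
| Property              | Value |
+-----------------------+-------+
| memory_high_watermark | null  |
+-----------------------+-------+
"""
print(uses_default_watermark(parse_param_table(sample_output)))
```

If the real output uses a different layout, only parse_param_table needs to change; the null check itself stays the same.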

 

We also found that resource isolation is not configured for the host.

 

It can be concluded that resource isolation is not configured for the host where RabbitMQ is deployed. As a result, the RabbitMQ service restarts repeatedly after the system is upgraded to V100R006C10.


Solution

Perform the following steps:

(1) If resource isolation is not configured for the host where RabbitMQ resides, configure it as described in the product documentation.

(2) If the number of VMs exceeds 500, configure the RabbitMQ memory watermark to match the deployment scale.

(3) To configure the memory watermark, log in to the FusionSphere OpenStack web client and choose Configuration > OpenStack > RabbitMQ.

(4) After the resource isolation and memory watermark are configured, wait for 2 to 3 minutes. The RabbitMQ service automatically recovers.
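The decision logic in the steps above can be summarized in a few lines of Python. This is only a sketch of the rules stated in this post: the 500-VM threshold comes from step (2), while the function name and its inputs are hypothetical and have to be determined manually or by other means.

```python
VM_THRESHOLD = 500  # from step (2): above this, the watermark must be configured


def required_actions(isolation_configured, vm_count, watermark_configured):
    """Return the list of configuration actions still needed for the RabbitMQ host."""
    actions = []
    if not isolation_configured:
        actions.append("configure resource isolation for the RabbitMQ host")
    if vm_count > VM_THRESHOLD and not watermark_configured:
        actions.append("configure the RabbitMQ memory watermark for the deployment scale")
    return actions


# Example: the situation in this case, with an assumed VM count of 800
for action in required_actions(isolation_configured=False,
                               vm_count=800,
                               watermark_configured=False):
    print(action)
```

An empty list means that neither of the two conditions described here applies, so the cause of the fault should be sought elsewhere.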

 

Summary and Suggestion

Resource isolation must be configured for the RabbitMQ host. If the number of VMs exceeds 500, the RabbitMQ memory watermark must also be adjusted. Otherwise, the RabbitMQ service may restart repeatedly because the resources isolated for it at the IaaS layer are insufficient.

 

Before the upgrade from V100R006C00SPCxxx to V100R006C10SPCxxx, check whether resource isolation has been configured for the host accommodating RabbitMQ.


Any other suggestions will be appreciated!

 
