[Insider sharing] Troubleshooting an iStack switchover problem

Created Feb 28, 2016 20:53:33 | Latest reply Feb 29, 2016 08:44:49

Hi Guys,

Recently I encountered a problem related to the iStack switchover function. iStack is a very common reliability feature in enterprise networking, so I would like to take this opportunity to share the case with you. To make sure we are on the same page, I will first present some basic iStack concepts and then move on to the case.

Switches that have joined a stack are member switches. Each member switch in a stack plays one of the following roles:
  • Master switch

    The master switch manages the entire stack. A stack has only one master switch.

  • Standby switch

    The standby switch is a backup of the master switch. When the master switch fails, the standby switch takes over all services from the master switch. A stack has only one standby switch.

  • Slave switch

    A slave switch forwards service traffic. The more slave switches in a stack, the higher forwarding performance the stack can provide. Apart from the master and standby switches, all the other switches in a stack are slave switches.

Now, back to the case. The main goal was to perform an upgrade without any impact on the network, so the customer planned to reboot each stack member separately. The stack consisted of two CE7850-32Q-EI switches, one master and one standby.

The first operation, rebooting the standby device, completed without any interruption. But the second step, rebooting the master, caused a total outage because the whole stack rebooted unexpectedly. What went wrong?

To understand what really happened, we need to backtrack and check the logs recorded at the time of the operation. Below is the best way I know to collect all the logs that a CloudEngine system can generate, so that a past problem can be traced.

<HUAWEI> save logfile                           // Collect the common user log file log.log.
<HUAWEI> system-view
[~HUAWEI] diagnose
[~HUAWEI-diagnose] save logfile diagnose-log    // Collect the diagnostic log file diag.log generated when the device is running.
[~HUAWEI-diagnose] collect diagnostic information

After running the above commands, download by FTP all the files in the logfile folder on the flash of both devices (on the master the path is flash:/logfile/ and on the slave the path is slave:/flash:/logfile/).

Example:
<R7_U18_CE6850>dir
Directory of flash:/

  Idx  Attr     Size(Byte)  Date        Time       FileName
    0  drwx              -  Oct 01 2015 19:52:00   $_checkpoint
   13  drwx              -  Oct 05 2015 03:17:30   logfile
<R7_U18_CE6850>cd logfile
<R7_U18_CE6850>dir
Directory of flash:/logfile/

  Idx  Attr     Size(Byte)  Date        Time       FileName
    0  -rw-      6,128,295  Oct 05 2015 03:17:30   diag.log
    1  -rw-        470,275  Jul 17 2015 14:39:48   diaglog_1_20150717153947.log.zip
    2  -rw-        563,056  Sep 05 2015 03:25:46   diaglog_1_20150905032545.log.zip
    3  -rw-        526,418  Aug 12 2015 21:28:27   diaglog_2_20150812212827.log.zip
    4  -rw-        167,785  Oct 05 2015 03:17:30   diagnostic_information.zip
    5  -rw-      2,420,941  Oct 05 2015 03:21:16   log.log
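The FTP download step can be scripted. Below is a minimal sketch using Python's standard ftplib, assuming the FTP server on the switch is enabled and accepts the logfile path in a directory change. The function name download_logfiles, the address 192.0.2.10 and the credentials are placeholders of mine, not values from the case.

```python
from ftplib import FTP
from pathlib import Path


def download_logfiles(ftp, remote_dir, local_dir):
    """Download every entry listed in remote_dir into local_dir.

    ftp is a logged-in ftplib.FTP object (or anything with the same
    cwd/nlst/retrbinary methods). Returns the list of names fetched.
    """
    local = Path(local_dir)
    local.mkdir(parents=True, exist_ok=True)
    ftp.cwd(remote_dir)
    fetched = []
    for name in ftp.nlst():
        # Each file is streamed to disk in binary mode.
        with open(local / name, "wb") as fh:
            ftp.retrbinary(f"RETR {name}", fh.write)
        fetched.append(name)
    return fetched


# Hypothetical usage (address, user and password are placeholders):
#
#     with FTP("192.0.2.10") as ftp:
#         ftp.login("user", "password")
#         download_logfiles(ftp, "flash:/logfile/", "logs/master")
```

Run it once per path so that the logs of both stack members are kept in separate local folders.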
 

The logs showed some useful information: the time between the slot 1 and slot 2 reset events was too short, less than 5 minutes, which was not enough for the switchover synchronization to complete.

Slot 1 reset time:
Jan  7 2016 18:57:35 xxxxx %%01CLI/5/CMDRECORD(s):CID=0x80ca2716;Recorded command information. (Task=VTY0, Ip=x.x.x.x, VpnName=_public_, User=xxxxx, AuthenticationMethod="Local-user", Command="reset slot 1".)

Slot 2 reset time:
Jan  7 2016 19:02:16 xxxxx %%01CLI/5/CMDRECORD(s):CID=0x80ca2713;Recorded command information. (Task=VTY0, Ip=x.x.x.x, VpnName=_public_, User=xxxxx, AuthenticationMethod="Local-user", Command="reset slot 2".)
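If log.log is large, the interval between the two resets can be extracted with a few lines of Python. This is only a sketch: the regular expression is written against the two CMDRECORD lines quoted above, and the names RESET_RE and reset_times are my own.

```python
import re
from datetime import datetime

# Matches CMDRECORD entries (not INTER_CMDRECORD) whose command is "reset slot N".
RESET_RE = re.compile(
    r'^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{4} \d{2}:\d{2}:\d{2})\s'
    r'.*/CMDRECORD\(.*Command="reset slot (?P<slot>\d+)"'
)


def reset_times(log_lines):
    """Return {slot number: datetime of the last 'reset slot' command}."""
    times = {}
    for line in log_lines:
        m = RESET_RE.match(line)
        if m:
            # Collapse the padding in "Jan  7" before parsing.
            ts = " ".join(m.group("ts").split())
            times[int(m.group("slot"))] = datetime.strptime(ts, "%b %d %Y %H:%M:%S")
    return times


sample = [
    'Jan  7 2016 18:57:35 xxxxx %%01CLI/5/CMDRECORD(s):CID=0x80ca2716;'
    'Recorded command information. (Task=VTY0, Ip=x.x.x.x, VpnName=_public_, '
    'User=xxxxx, AuthenticationMethod="Local-user", Command="reset slot 1".)',
    'Jan  7 2016 19:02:16 xxxxx %%01CLI/5/CMDRECORD(s):CID=0x80ca2713;'
    'Recorded command information. (Task=VTY0, Ip=x.x.x.x, VpnName=_public_, '
    'User=xxxxx, AuthenticationMethod="Local-user", Command="reset slot 2".)',
]

times = reset_times(sample)
gap = times[2] - times[1]
print(gap)  # 0:04:41
```

The result, 4 minutes 41 seconds, matches the "less than 5 minutes" observation.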

 

Moreover, the session log showed that the customer did not check the switchover status. In general, if the switchover state is not Ready, the switchover will fail and the reliability this feature is meant to provide is lost. This is how the status should look:

 <HUAWEI> display switchover state
   Switchover State  :  Ready
   Switchover Policy :  Board Switchover
   MainBoard         :  1
   SlaveBoard        :  2
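This check can be automated before the second reset is issued. The sketch below parses the output shown above and polls until the state is Ready. It assumes a helper run_command that sends a CLI command to the switch and returns the text output; that helper, the function names and the timeout values are hypothetical and must be adapted to your own tooling.

```python
import time

SAMPLE = """<HUAWEI> display switchover state
   Switchover State  :  Ready
   Switchover Policy :  Board Switchover
   MainBoard         :  1
   SlaveBoard        :  2
"""


def parse_switchover_state(output):
    """Turn the 'key : value' lines of the command output into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # lines without a colon (such as the prompt) are skipped
            fields[key.strip()] = value.strip()
    return fields


def wait_until_ready(run_command, timeout_s=1800, poll_s=30, sleep=time.sleep):
    """Poll until Switchover State is Ready; return False on timeout.

    run_command is a hypothetical callable: command string in, CLI text out.
    """
    waited = 0
    while True:
        state = parse_switchover_state(run_command("display switchover state"))
        if state.get("Switchover State") == "Ready":
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s


print(parse_switchover_state(SAMPLE)["Switchover State"])  # Ready
```

Only reset the master once wait_until_ready returns True.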

In fact, the system warns you before resetting the board:

 Jan  7 2016 18:57:37 xxxxx %%01CLI/5/INTER_CMDRECORD(s):CID=0x80ca2716;Recorded command information. (Task=VTY0, Ip=s.s.s.s, VpnName=_public_, User=xxxxx, Command="reset slot 1", PromptInfo="Warning: Resetting the board in slot 1 may cause system reboot while the switchover state is not ready. Continue?  [Y/N]:", UserInput="Y".)
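When reviewing a session after the fact, it helps to list every warning prompt that was answered with Y. The following sketch does that for INTER_CMDRECORD entries. The pattern is written against the single log line quoted above, so treat the field order as an assumption; the names INTER_RE and confirmed_warnings are mine.

```python
import re

# Field order (Command, PromptInfo, UserInput) follows the log line quoted above.
INTER_RE = re.compile(
    r'INTER_CMDRECORD.*Command="(?P<cmd>[^"]*)".*'
    r'PromptInfo="(?P<prompt>[^"]*)".*UserInput="(?P<answer>[^"]*)"'
)


def confirmed_warnings(log_lines):
    """Return (command, prompt) pairs for warnings the user answered with Y."""
    hits = []
    for line in log_lines:
        m = INTER_RE.search(line)
        if (
            m
            and m.group("prompt").startswith("Warning:")
            and m.group("answer").upper() == "Y"
        ):
            hits.append((m.group("cmd"), m.group("prompt")))
    return hits


sample_line = (
    'Jan  7 2016 18:57:37 xxxxx %%01CLI/5/INTER_CMDRECORD(s):CID=0x80ca2716;'
    'Recorded command information. (Task=VTY0, Ip=s.s.s.s, VpnName=_public_, '
    'User=xxxxx, Command="reset slot 1", PromptInfo="Warning: Resetting the '
    'board in slot 1 may cause system reboot while the switchover state is not '
    'ready. Continue?  [Y/N]:", UserInput="Y".)'
)

for cmd, prompt in confirmed_warnings([sample_line]):
    print(cmd, "->", prompt)
```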

The conclusion for this case is to always read the upgrade guide and documentation carefully before starting any operation. If you have trouble understanding an operation, please do not hesitate to contact TAC for support.

I hope you will find this document useful. Bye!


user_2790689  Expert   Created Feb 29, 2016 08:44:49

Thank you.