Hi Guys,
Recently I encountered a problem related to the iStack switchover functionality. As you know, iStack is a very common reliability feature in enterprise networking, so I would like to take this opportunity to share the case with you. To make sure we are on the same page, I will first present some basic iStack concepts and then move on to the case.
- Master switch
  The master switch manages the entire stack. A stack has only one master switch.
- Standby switch
  The standby switch is a backup of the master switch. When the master switch fails, the standby switch takes over all services from the master switch. A stack has only one standby switch.
- Slave switch
  A slave switch forwards service traffic. The more slave switches in a stack, the higher the forwarding performance the stack can provide. Apart from the master and standby switches, all the other switches in a stack are slave switches.
Now, back to the case. The main goal was to perform an upgrade without any impact on the network, so the customer planned to reboot each stack member separately. The stack consisted of two CE7850-32Q-EI switches, one master and one standby.
The first operation, rebooting the standby device, went well without any interruption. But the second step, rebooting the master, caused a total outage because the whole stack rebooted unexpectedly. What went wrong?
To understand what really happened, we need to backtrack and check the logs recorded at that moment. I will share with you the best way to collect all the logs that a CloudEngine system can generate in order to trace a past problem.
<HUAWEI> save logfile //Collect common user log file log.log.
<HUAWEI> system-view
[~HUAWEI] diagnose
[~HUAWEI-diagnose] save logfile diagnose-log //Collect diagnostic log file diag.log generated when the device is running.
[~HUAWEI-diagnose] collect diagnostic information
After running the above commands, download via FTP all the files found in the logfile folder on the flash of both devices (on the master the path is flash:/logfile/ and on the slave the path is slave:/flash:/logfile/).
Example:
<R7_U18_CE6850>dir
Directory of flash:/
Idx Attr Size(Byte) Date Time FileName
0 drwx - Oct 01 2015 19:52:00 $_checkpoint
13 drwx - Oct 05 2015 03:17:30 logfile
<R7_U18_CE6850>cd logfile
<R7_U18_CE6850>dir
Directory of flash:/logfile/
Idx Attr Size(Byte) Date Time FileName
0 -rw- 6,128,295 Oct 05 2015 03:17:30 diag.log
1 -rw- 470,275 Jul 17 2015 14:39:48 diaglog_1_20150717153947.log.zip
2 -rw- 563,056 Sep 05 2015 03:25:46 diaglog_1_20150905032545.log.zip
3 -rw- 526,418 Aug 12 2015 21:28:27 diaglog_2_20150812212827.log.zip
4 -rw- 167,785 Oct 05 2015 03:17:30 diagnostic_information.zip
5 -rw- 2,420,941 Oct 05 2015 03:21:16 log.log
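If you collect logs from several stacks, the download step can be scripted. The following is only a minimal sketch using Python's standard ftplib: it assumes the FTP server is enabled on the switch, and the function name download_logfiles, the remote directory name, the address and the credentials are all hypothetical placeholders, not values from this case.

```python
import os

def download_logfiles(ftp, remote_dir, local_dir):
    """Download every entry of remote_dir into local_dir.

    ftp is an already logged-in ftplib.FTP connection. The way the
    logfile folder is exposed over FTP is an assumption; adjust
    remote_dir to whatever your device actually shows.
    """
    os.makedirs(local_dir, exist_ok=True)
    ftp.cwd(remote_dir)
    saved = []
    for name in ftp.nlst():
        local_path = os.path.join(local_dir, os.path.basename(name))
        with open(local_path, "wb") as handle:
            # RETR streams the remote file in binary mode.
            ftp.retrbinary("RETR " + name, handle.write)
        saved.append(local_path)
    return saved

# Example usage (placeholder address and credentials):
# from ftplib import FTP
# with FTP("192.0.2.10") as ftp:
#     ftp.login("user", "password")
#     download_logfiles(ftp, "logfile", "master_logs")
```

Run it once per stack member so that the logs of both devices end up in separate local folders.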
Checking the logs revealed some useful information: the time between the slot 1 and slot 2 reset events was too short, less than 5 minutes, which did not leave enough time for the switchover synchronization to complete.
Slot 1 reset time:
Jan 7 2016 18:57:35 xxxxx %%01CLI/5/CMDRECORD(s):CID=0x80ca2716;Recorded command information. (Task=VTY0, Ip=x.x.x.x VpnName=_public_, User=xxxxx, AuthenticationMethod="Local-user", Command="reset slot 1".)
Slot 2 reset time:
Jan 7 2016 19:02:16 xxxxx %%01CLI/5/CMDRECORD(s):CID=0x80ca2713;Recorded command information. (Task=VTY0, Ip=x.x.x.x, VpnName=_public_, User=xxxxx AuthenticationMethod="Local-user", Command="reset slot 2".
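To make the interval between the two resets explicit, here is a small sketch that parses the two log records above and computes the gap. The regular expression and the function names (parse_record, reset_gap_seconds) are my own illustration and only assume the log line layout shown in this post.

```python
import re
from datetime import datetime

# Leading timestamp plus the quoted Command field of a CMDRECORD line.
RECORD_RE = re.compile(
    r'^(?P<ts>\w{3}\s+\d{1,2}\s+\d{4}\s+\d{2}:\d{2}:\d{2}).*?Command="(?P<cmd>[^"]+)"'
)

def parse_record(line):
    """Return (timestamp, command) for one CMDRECORD log line."""
    match = RECORD_RE.search(line.strip())
    if match is None:
        raise ValueError("not a command record: " + line[:40])
    # Collapse repeated spaces so strptime sees single separators.
    stamp = " ".join(match.group("ts").split())
    return datetime.strptime(stamp, "%b %d %Y %H:%M:%S"), match.group("cmd")

def reset_gap_seconds(first_line, second_line):
    """Seconds elapsed between two command records."""
    first_ts, _ = parse_record(first_line)
    second_ts, _ = parse_record(second_line)
    return (second_ts - first_ts).total_seconds()

SLOT1 = ('Jan 7 2016 18:57:35 xxxxx %%01CLI/5/CMDRECORD(s):CID=0x80ca2716;'
         'Recorded command information. (Task=VTY0, Ip=x.x.x.x VpnName=_public_, '
         'User=xxxxx, AuthenticationMethod="Local-user", Command="reset slot 1".)')
SLOT2 = ('Jan 7 2016 19:02:16 xxxxx %%01CLI/5/CMDRECORD(s):CID=0x80ca2713;'
         'Recorded command information. (Task=VTY0, Ip=x.x.x.x, VpnName=_public_, '
         'User=xxxxx AuthenticationMethod="Local-user", Command="reset slot 2".')

gap = reset_gap_seconds(SLOT1, SLOT2)
print(f"{gap:.0f} seconds between resets")  # 281 seconds, i.e. 4 min 41 s
```

So the master was reset only 4 minutes and 41 seconds after the standby.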
Moreover, in the session log we saw that the customer did not check the switchover status. Generally, if the switchover state is not Ready, the switchover will fail and the reliability this feature is meant to provide is lost. This is what the status should look like:
<HUAWEI> display switchover state
Switchover State : Ready
Switchover Policy : Board Switchover
MainBoard : 1
SlaveBoard : 2
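If the upgrade is driven by a script, this check can be automated before the second reset is issued. The sketch below only parses captured text in the "Key : Value" layout shown above; how the output is obtained (SSH, console, etc.) is left out, and the function names are hypothetical.

```python
def parse_switchover_state(output):
    """Turn the 'Key : Value' lines of the command output into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, such as the prompt line
            fields[key.strip()] = value.strip()
    return fields

def is_switchover_ready(output):
    """True only when the Switchover State field reads exactly 'Ready'."""
    return parse_switchover_state(output).get("Switchover State") == "Ready"

SAMPLE = """<HUAWEI> display switchover state
Switchover State : Ready
Switchover Policy : Board Switchover
MainBoard : 1
SlaveBoard : 2
"""

if is_switchover_ready(SAMPLE):
    print("Safe to reset the next stack member")
else:
    print("Wait: switchover state is not Ready")
```

Anything other than Ready, including a missing field, is treated as "do not proceed", which is the safer default for an upgrade script.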
In fact, the system warns you before resetting the board:
Jan 7 2016 18:57:37 xxxxx%%01CLI/5/INTER_CMDRECORD(s):CID=0x80ca2716;Recorded command information. (Task=VTY0, Ip=s.s.s.s, VpnName=_public_, User=xxxxx Command="reset slot 1", PromptInfo="Warning: Resetting the board in slot 1 may cause system reboot while the switchover state is not ready. Continue? [Y/N]:", UserInput="Y".)
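When reviewing a long session log after an incident, it helps to list every warning prompt that the operator confirmed. This is a rough sketch that assumes the INTER_CMDRECORD layout shown above (quoted Command, PromptInfo and UserInput fields); the function name confirmed_warnings is hypothetical.

```python
import re

# Any Name="value" pair inside a log record.
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')

def confirmed_warnings(lines):
    """Return (command, prompt) pairs for warnings answered with Y."""
    hits = []
    for line in lines:
        if "INTER_CMDRECORD" not in line:
            continue
        fields = dict(FIELD_RE.findall(line))
        prompt = fields.get("PromptInfo", "")
        answer = fields.get("UserInput", "")
        if prompt.startswith("Warning:") and answer.upper() == "Y":
            hits.append((fields.get("Command"), prompt))
    return hits

LOG = [
    'Jan 7 2016 18:57:37 xxxxx%%01CLI/5/INTER_CMDRECORD(s):CID=0x80ca2716;'
    'Recorded command information. (Task=VTY0, Ip=s.s.s.s, VpnName=_public_, '
    'User=xxxxx Command="reset slot 1", PromptInfo="Warning: Resetting the board '
    'in slot 1 may cause system reboot while the switchover state is not ready. '
    'Continue? [Y/N]:", UserInput="Y".)',
]

for command, prompt in confirmed_warnings(LOG):
    print(command, "->", prompt)
```

For the record above, the script reports that the "reset slot 1" warning was confirmed with Y.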
The conclusion for this case is to always read the upgrade guide and documentation carefully before starting any operation. If you have trouble understanding any of the operations, please do not hesitate to contact TAC for support.
I hope you will find this document useful. Bye!