Hello everybody!
Today I'd like to share an interesting network redundancy case involving a static VXLAN tunnel. Details are below.
ISSUE DESCRIPTION
Customer X has an MPLS network running IS-IS. All of the clients for system A are located on Site 1. Server Farm 1 and Server Farm 2 are the distributed virtual server resources for these clients.
Server Farm 2 is located on Site 2. Static VXLAN is used to extend the server network between the server farms so that the servers stay synchronized. The Layer 3 gateways for clients and servers are configured on PE3 & PE4, which form a dual-active system based on M-LAG.

The redundancy tests ran in the following order:
PE3 is powered off - there is no service interruption;
PE3 is powered off + PE1 is powered off - there is no service interruption;
PE3 is powered on while PE1 is powered off - there is no service interruption;
PE3 is powered on + PE1 is powered on - there is no service interruption.
A few minutes after PE1 is powered on, PE4 is powered off: service is interrupted for approximately 1.5 minutes. All of the servers reboot in order to elect masters, because the heartbeat over VXLAN and the L3 connection between the witness server and the server farms are both down.

The VXLAN tunnel is down and the L3 gateways are not reachable during this period.
HANDLING PROCESS
Traffic recovers on its own about 1.5 minutes later, without any configuration changes.
ROOT CAUSE
According to the IS-IS configuration on the PE devices, set-overload on-startup is configured with the default timer, that is, 10 minutes. The product documentation describes the overload state as follows:
'If an IS-IS device needs to be temporarily isolated, configure the IS-IS device to enter the overload state to prevent other devices from forwarding traffic to this IS-IS device and prevent blackhole routes.'
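The effect of the overload state on forwarding can be illustrated with a small reachability model. This is only a sketch: the link layout and the SITE2 node are assumptions made for illustration (the real topology is not given in this post), and the rule modeled is simply "an overloaded device is not used as a transit hop".

```python
from collections import deque

# Hypothetical topology: these links and the SITE2 node are assumptions,
# not the customer's actual network.
LINKS = [("PE3", "PE1"), ("PE4", "PE2"), ("PE3", "PE4"),
         ("PE1", "SITE2"), ("PE2", "SITE2")]


def reachable(src, dst, down=(), overloaded=()):
    """Return True if dst can be reached from src.

    Devices in `down` are powered off. Devices in `overloaded` can still be
    a source or a destination, but are never used as transit hops.
    """
    if src in down or dst in down:
        return False
    adj = {}
    for a, b in LINKS:
        if a in down or b in down:
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        # Do not forward through an overloaded device.
        if node in overloaded and node != src:
            continue
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# PE1 just started (overloaded) while PE4 is still up: a path via PE4 remains.
print(reachable("PE3", "SITE2", overloaded={"PE1"}))
# PE4 powered off while PE1 is still overloaded: no usable path.
print(reachable("PE3", "SITE2", down={"PE4"}, overloaded={"PE1"}))
```

In this model the first call finds a path and the second does not, which mirrors the difference between the earlier tests and the final one.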
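PE4 was powered off approximately 8.5 minutes after PE1 started up, that is, while PE1 was still inside its 10-minute overload window. During that window other devices do not forward traffic through PE1, so with PE4 down, traffic from the PE3 & PE4 gateway pair cannot be routed via PE1. Service recovers when the timer expires, about 1.5 minutes later. The PE1 logs below show the overload state changing at 12:38:11 and again at 12:48:00, roughly 10 minutes apart.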
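The length of the interruption follows directly from these two numbers. A minimal sketch of the arithmetic, using a hypothetical helper name and the approximate values from this case:

```python
def remaining_overload_seconds(timer_s, elapsed_s):
    """Seconds left before the on-startup overload state clears."""
    return max(0, timer_s - elapsed_s)


timer_s = 10 * 60          # default on-startup overload timer in this case
elapsed_s = int(8.5 * 60)  # approximate time from PE1 startup to PE4 power-off

print(remaining_overload_seconds(timer_s, elapsed_s))  # 90 seconds = 1.5 minutes
```

If PE4 had been powered off after the timer expired, the same calculation gives zero, which is why the timing of the test step matters.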

aa bb xxxx 12:38:11+03:00 PE1 %ISIS/3/isisDatabaseOverload(t):CID=0x8086055c-OID=1.3.6.1.3.37.2.0.1;The overload state of IS-IS LSDB changed. (isisSysInstance=1, isisSysLevelIndex=2, isisSysLevelOverloadState=2)
aa bb xxxx 12:48:00+03:00 PE1 %ISIS/3/isisDatabaseOverload(t):CID=0x8086055c-OID=1.3.6.1.3.37.2.0.1;The overload state of IS-IS LSDB changed. (isisSysInstance=1, isisSysLevelIndex=2, isisSysLevelOverloadState=1)
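The overload window can be read straight from these two log lines. The sketch below parses only the time of day (the date is masked in the logs, so it assumes both events occurred on the same day) and treats state 2 as "overload set" and state 1 as "overload cleared", an interpretation that matches the timeline of this case:

```python
import re
from datetime import datetime

LOG_LINES = [
    "aa bb xxxx 12:38:11+03:00 PE1 %ISIS/3/isisDatabaseOverload(t):CID=0x8086055c-OID=1.3.6.1.3.37.2.0.1;The overload state of IS-IS LSDB changed. (isisSysInstance=1, isisSysLevelIndex=2, isisSysLevelOverloadState=2)",
    "aa bb xxxx 12:48:00+03:00 PE1 %ISIS/3/isisDatabaseOverload(t):CID=0x8086055c-OID=1.3.6.1.3.37.2.0.1;The overload state of IS-IS LSDB changed. (isisSysInstance=1, isisSysLevelIndex=2, isisSysLevelOverloadState=1)",
]

# Captures: time of day, host name, overload state value.
PATTERN = re.compile(
    r"(\d{2}:\d{2}:\d{2})[+-]\d{2}:\d{2} (\S+) .*isisSysLevelOverloadState=(\d+)"
)


def parse(line):
    """Return (time, host, state) for one isisDatabaseOverload log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not an isisDatabaseOverload log line")
    time_of_day = datetime.strptime(match.group(1), "%H:%M:%S")
    return time_of_day, match.group(2), int(match.group(3))


events = [parse(line) for line in LOG_LINES]
window_s = (events[1][0] - events[0][0]).total_seconds()
print(window_s)  # 589.0 seconds, i.e. 9 min 49 s
```

The result is just under the 10-minute timer, which is consistent with the overload state being set on startup and cleared when the timer expired.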

SOLUTION
Before running redundancy tests, prepare a proper SOP document that anticipates all possible scenarios. An alternative design for this topology can also be suggested to the customer; see the suggestions below.
SUGGESTIONS AND SUMMARY
Do not use a distributed server architecture. Since all clients are located on Site 1, keep all active servers on Site 1.
Do not use only one gateway on Site 1.
Use BGP EVPN distributed VXLAN instead of static VXLAN, so that the same IP address can be configured on Site 1 and Site 2. This allows the witness server to reach at least one of the server farms and prevents all servers from rebooting. It also allows all clients and Server Farm 1 to communicate via PE3 & PE4 even when PE1 & PE2 are not reachable.


