During our last Cisco ACI upgrade, we encountered a vSAN Stretched Cluster partition with Cisco UCS. After the Cisco switches for the secondary site had been upgraded, all the VMs belonging to this site became inaccessible at the VMware cluster level.
Technical environment:
- Cisco UCS Infrastructure 4.2(1m)
- Cisco UCS C240-M5SX
- Cisco APIC 5.2(6e)
- Cisco Nexus 9000 15.2(6e)
- ESXi 7.0 U3g
More precisely, the vSAN health check reported a failure on the MTU check (ping with large packet size). I also saw a high pNIC error rate, and vSAN traffic had stopped for all the ESXi hosts located on the secondary site.
I immediately noticed an issue with the vSAN MTU size configured to 9000 (Jumbo Frames), which resulted in the vSAN partition. Each ESXi host ended up isolated in its own vSAN partition.
To confirm these 2 issues, I connected by SSH to one of the affected ESXi hosts and ran the following 2 commands:
- vmkping -I vmk4 192.x.x.x -s 8000 (vmk4 used for vSAN traffic here)
- esxcli vsan cluster get
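The vmkping above sends an 8000-byte payload. To exercise the full 9000-byte MTU, the usual approach is to send the largest ICMP payload that fits in one unfragmented packet (MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header) with the don't-fragment flag (-d) set. Here is a minimal Python sketch of that arithmetic. The function names are mine and purely illustrative, and the target address is the masked placeholder from this post:

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError("MTU too small for an ICMP echo over IPv4")
    return mtu - overhead


def vmkping_command(vmk: str, target: str, mtu: int) -> str:
    """Build a vmkping command line with the don't-fragment flag (-d) set."""
    return f"vmkping -I {vmk} -d -s {max_icmp_payload(mtu)} {target}"


if __name__ == "__main__":
    # vmk4 is the vSAN VMkernel interface used in this post.
    print(vmkping_command("vmk4", "192.x.x.x", 9000))
```

Without -d, a large ping can be fragmented along the path and still succeed, which may hide an MTU problem.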
Furthermore, Jumbo Frame vmkping for vSAN didn’t work between ESXi hosts located in the same UCS domain or across UCS domains on different sites, but still worked for any ESXi hosts sitting outside the Cisco UCS infrastructure!
Note that all our UCS Fabric Interconnects are configured in End Host mode and directly connected to our Cisco Nexus 9000 ACI leaf switches.
After my first investigations, our network team quickly discovered the following Cisco bug which was very similar to the issue we were facing:
Indeed, all the symptoms listed in this Cisco bug article match our issue:
- DPP feature enabled in Cisco APIC
- COS3 traffic enabled by default in Cisco UCS QoS System Class
- Packets larger than roughly 2000 bytes dropped (in my case, drops started at a size of 2284)
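A drop threshold like the 2284 mentioned above can be located quickly by bisecting the ping size instead of trying values by hand. The sketch below is only an illustration: `probe` is a hypothetical callable that would wrap a real ping (for example vmkping with -d and -s), and `simulated_probe` merely mimics the behaviour described in this post, assuming every size below 2284 passes.

```python
from typing import Callable


def largest_passing_size(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Return the largest size in [low, high] for which probe(size) is True.

    Assumes probe is monotonic: it passes up to some size and fails above it.
    Returns low - 1 if even the smallest size fails.
    """
    best = low - 1
    while low <= high:
        mid = (low + high) // 2
        if probe(mid):
            best = mid       # mid passes, so look for something larger
            low = mid + 1
        else:
            high = mid - 1   # mid is dropped, so look for something smaller
    return best


def simulated_probe(size: int) -> bool:
    """Stand-in for a real ping: sizes of 2284 and above are dropped."""
    return size < 2284


if __name__ == "__main__":
    print(largest_passing_size(simulated_probe, 64, 8972))
```

With a real probe, each step costs one ping, so the range from 64 to 8972 is narrowed down in about 14 attempts.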
Cisco DPP, what is it?
“Dynamic Packet Prioritization (DPP) prioritizes a configured number of packets of every new flow in a particular class of traffic is prioritized and sent through a configure class of traffic that DPP is mapped to.” – Cisco NX-OS Quality of Service Configuration Guide
And bingo! After disabling the DPP feature in Cisco APIC, the issue disappeared and our vSAN traffic with Jumbo Frames worked again.
If you read the Cisco article, you can see that this bug should have been fixed as of Nexus 9000 release 14.0. However, it seems that a similar problem has reappeared in the new version that we deployed (15.2 here).
Cisco ELAM
In order to gather more information and find the root cause, our network engineers installed a very useful application provided by Cisco named ELAM: ELAM Assistant – Cisco DC App Center
ELAM stands for Embedded Logic Analyzer Module and helps to perform packet capture on any ACI node (switch) belonging to the ACI Fabric.
With the packet information captured by ELAM, we can clearly see CoS 3 in the Outer L2 Header section when the DPP feature is enabled.
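For readers less familiar with CoS: in a standard 802.1Q tagged frame, the CoS value is the 3-bit Priority Code Point (PCP) stored in the top bits of the 16-bit Tag Control Information (TCI) field. The sketch below shows how that field breaks down. It is a generic illustration, not a description of how ELAM decodes packets internally, and the VLAN ID 100 is an arbitrary example value.

```python
from typing import NamedTuple


class Dot1QTag(NamedTuple):
    pcp: int  # Priority Code Point (CoS), 3 bits
    dei: int  # Drop Eligible Indicator, 1 bit
    vid: int  # VLAN ID, 12 bits


def parse_tci(tci: int) -> Dot1QTag:
    """Split a 16-bit 802.1Q TCI value into its PCP, DEI and VLAN ID fields."""
    if not 0 <= tci <= 0xFFFF:
        raise ValueError("TCI must be a 16-bit value")
    return Dot1QTag(pcp=tci >> 13, dei=(tci >> 12) & 0x1, vid=tci & 0x0FFF)


if __name__ == "__main__":
    # Example TCI carrying CoS 3 on VLAN 100 (arbitrary VLAN for illustration).
    tci = (3 << 13) | 100
    print(hex(tci), parse_tci(tci))
```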
In most environments, you don’t need to enable DPP, but it’s pretty scary to think that this bug could bring down your entire infrastructure if you’re using Jumbo Frames!
In conclusion, we have provided all the evidence to Cisco support/engineering, as we would like to understand what triggered this issue, knowing that Jumbo Frames worked fine in our UCS infrastructure with DPP enabled prior to the upgrade.
2 responses to “vSAN Stretched Cluster partition with Cisco UCS & Cisco ACI”
Hi Nico,
I just had a deja vu, reading your blog post 😉
Cheers
Richy
Hi Richy,
Indeed, it is deja vu! Thanks to your prompt action and finding the bug, we solved that issue quickly 🙂
Nico