3-Node Cluster Consisting of:
OS: ESXi 6.0.0 build 2615704, booting from SD card
HW: Dell R820 / 512GB RAM / 4 x Xeon E5-4620 (8 cores each)
Storage HBA 2x: ISP2432-based 4Gb FC PCI Express HBA
NIC 2x: Broadcom/QLogic 57810 10Gb dual port
25 LUNs over FC, provisioned from a VNX 5600
1 x 2-node Windows 2012 & SQL 2014 WSFC/MSCS cluster - 3 "physical" (passthrough) RDMs, each one a LUN mapped to both VMs
Disclaimer: since upgrading all hosts to build 2615704, the ESXi hosts had not been updated, patched or restarted.
- We have a 3-node cluster with the hardware and software config above, identical across nodes.
- For one reason or another I decided to restart one of the nodes (and while I was at it, what better time to apply updates?!).
- One node was patched to the latest updates via Update Manager (build 2809209)
- During the remediation process I noticed that it was taking far longer than normal to boot - up to 1 hour
- Opening the iDRAC console showed it stuck at a few stages, one of them being nfs41client
- ALT+F12 showed the following messages (similar to VMware KB: FCoE storage connections fail when LUNs are presented to the host on different VLANs):
- T.363Z cpu2:4616)<6>host2: fip: fcoe_ctlr_vlan_request() is done
- T.365Z cpu0:4606)<6>host2: fip: host2: FIP VLAN ID unavail. Retry VLAN discovery.
- T.365Z cpu0:4606)<6>host2: fip: fcoe_ctlr_vlan_request() is done
- T.366Z cpu2:4622)<6>host2: fip: host2: FIP VLAN ID unavail. Retry VLAN discovery.
- Thinking that it was a new driver issue, off I went and applied this fix - Zenfolio | Michael Davis | Broadcom BCM57810 FCoE and ESXi (the FCoE checks are pasted below this list for reference)
- No luck - booting still took a long, long time
- Decided to rebuild the host, as the hours spent troubleshooting were getting silly
- Disconnected the FC cables from the storage HBAs, re-installed ESXi from VMware-VMvisor-Installer-6.0.0-2494585.x86_64-Dell_Customized-A00.iso and booted it up - instantly noticed that startup time was back to normal
- Reconnected the storage HBAs, did some network config and restarted - startup now took as long as before
- While waiting for the system to start up I came across this VMware KB: ESXi/ESX hosts with visibility to RDM LUNs being used by MSCS nodes with RDMs may take a long time to sta…
- Proceeded to mark the LUN which contains the RDM mappings for my 2 MSCS nodes as perennially reserved (how I confirmed its naa ID is pasted after this list):
- esxcli storage core device setconfig -d naa.xxxx --perennially-reserved=true
- Verified that it was set with esxcli storage core device list -d naa.xxxx and rebooted
- Reboot still took a long time...
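
For reference, this is roughly how the FCoE state can be checked from the ESXi shell (vmnic4 is just a placeholder for one of the 57810 ports, and disabling FCoE on a NIC is only appropriate if it isn't actually used for FCoE storage):

   esxcli fcoe adapter list     (lists any activated software FCoE adapters)
   esxcli fcoe nic list         (lists FCoE-capable NICs and their current state)
   esxcli fcoe nic disable -n vmnic4     (I believe this stops FCoE on that NIC from the next reboot onwards)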
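
And this is how I would double-check which naa device actually sits behind an RDM (the .vmdk path is a placeholder - it should point at the RDM pointer file of one of the MSCS VMs):

   vmkfstools -q /vmfs/volumes/<datastore>/<MSCS-VM>/<MSCS-VM>_1.vmdk     (reports the vml ID the RDM maps to)
   esxcli storage core device list     (each device's "Other UIDs" line lists its vml ID, which can be matched back to its naa.xxxx name)
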
So I am not sure:
- if --perennially-reserved=true still applies for ESXi 6.0? (the field I am checking is pasted below)
- of course, to really confirm whether the other 2 nodes in the VMware cluster are experiencing this (RDM-related) issue, I would have to restart at least one of them
- if it's a driver-related thing?
- if I have missed something else?
- if I am barking up the wrong tree?
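
For what it's worth, this is the field I am looking at after setting the flag (naa.xxxx standing in for the RDM LUN) - as I read the KB, it has to be set on every host that can see the LUN and for each RDM LUN used by the MSCS nodes, and it should persist across reboots once set:

   esxcli storage core device list -d naa.xxxx
      ...
      Is Perennially Reserved: true
      ...
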
I look forward to any comments, questions, ideas, suggestions, etc!
Thanks
Corin