HPE Gen10 Server SD Card issues
Background
It started a few months ago at one of my customers. During a maintenance window the customer patched all of their ESXi hosts from 6.5 U3 to 6.7 U3 with a highly automated PowerShell script. The script not only patches the ESXi hosts but also loads the HPE SPP (firmware) ISO via the iLO port and updates the firmware to the most recent version. The host reboots after everything is done. During this maintenance window, however, something odd happened. Everything looked good after the host’s first reboot. The customer then had to reboot a second time, and during that second reboot ESXi reverted to the old 6.5 build. During a troubleshooting session we assumed the SD card was broken or no longer writable, and the customer replaced it. After that everything was fine again.
Issue
In February there was another quarterly maintenance window and the customer hit the same issue again, but this time it was more severe: 6 nodes out of a 10-node cluster showed the issue. So that was no coincidence. During a remote session with the customer we checked whether something was wrong with the installation. After running ls -lrth the output was this:
On an unaffected host the same command gave this output:
As you can see, on the corrupt installation bootbank points to /tmp, other symbolic links such as /store and /altbootbank are missing, and the remaining links point to the wrong folders. We then came across the KB2149444 article. It explains everything we had encountered so far, but for 7.0 U1. I checked the log files it mentions and correlated the time stamps, and as far as I can remember it was the same problem. Still, it was really odd that 6 out of 10 hosts were affected in this cluster and only 3 out of 10 hosts in another cluster. We dug a little deeper and checked what else was different on the Gen10 servers. In the firmware section of iLO 5 we found two components that differed between affected and unaffected hosts.
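The symlink symptom described above (bootbank pointing to /tmp, other links missing) lends itself to a scripted check across many hosts. The following is a minimal sketch in Python, not a VMware tool: the function name check_boot_links and the list of link names are chosen for illustration, and the list only contains the links named in this post, so a real host may have more.

```python
import os

# Link names taken from this post; assumption: a real host may have more.
EXPECTED_LINKS = ("bootbank", "altbootbank", "store")


def check_boot_links(root="/", names=EXPECTED_LINKS, bad_target="/tmp"):
    """Return {link_name: status} for the symbolic links directly under root."""
    report = {}
    for name in names:
        path = os.path.join(root, name)
        if not os.path.islink(path):
            # The link does not exist at all (or is not a symbolic link).
            report[name] = "missing"
            continue
        target = os.readlink(path)
        # Normalise so that "/tmp/" and "/tmp" compare equal.
        if os.path.normpath(target) == bad_target:
            report[name] = "points to " + bad_target
        else:
            report[name] = "ok -> " + target
    return report
```

Calling check_boot_links() without arguments inspects the root directory of the machine it runs on and returns one status string per link. Anything reported as "missing" or "points to /tmp" matches the pattern we saw on the affected hosts.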
Although the customer had already run an SPP firmware upgrade with the newest version, for whatever reason the Innovation Engine (IE) firmware and the Server Platform Services (SPS) firmware had not been updated along with the rest. We also checked the installation queue for the firmware and found this issue.
During the firmware update the Innovation Engine firmware threw an exception and did not continue. The Server Platform Services (SPS) firmware was not installed either, because of the other stuck firmware.
We removed both firmware packages and tried again but got the same exception. There is no information about what this exception was or how to handle it. By accident, while the server was booting and going through the BIOS initialization process, we saw for a split second that the Innovation Engine firmware had not been installed because of a “Device Error”. Unfortunately we didn’t catch the whole wording in a screenshot. Based on this information the customer contacted HPE support to ask how to proceed.
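Spotting the two odd components meant comparing the firmware lists of affected and unaffected hosts by eye. With more hosts that comparison can be scripted. The sketch below is only an illustration under assumptions: the function name differing_components is made up, and the inventory is assumed to already be available as a plain dictionary per host (for example typed in from the iLO firmware page). How that data is collected is deliberately left out.

```python
from collections import defaultdict


def differing_components(inventories):
    """Find firmware components whose version is not identical on all hosts.

    inventories: {host_name: {component_name: version_string}}
    Returns {component_name: {version_or_None: [host_names]}}, where None
    means the component is not listed for those hosts.
    """
    all_components = set()
    for components in inventories.values():
        all_components.update(components)

    result = {}
    for component in sorted(all_components):
        hosts_by_version = defaultdict(list)
        for host, components in inventories.items():
            hosts_by_version[components.get(component)].append(host)
        # More than one distinct version means the hosts disagree.
        if len(hosts_by_version) > 1:
            result[component] = {
                version: sorted(hosts)
                for version, hosts in hosts_by_version.items()
            }
    return result
```

Given one dictionary per host, the function returns only the components that differ, grouped by version, so the hosts that are still on the older IE and SPS firmware stand out immediately.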
Solution
HPE came back after some time with the solution: reset the CMOS/NVRAM on the affected Gen10 servers. For the procedure they referred to a blog post from my buddy Andreas Lesslhumer (www.running-system.com). After this procedure we tried again to update the firmware on the affected hosts, and this time it worked like a charm. The upgrade of these hosts to 6.7 U3 was no longer a problem either. We haven’t seen any SD card issue since then.
But why are these two pieces causing this weird issue? Unfortunately I don’t have an answer to that yet. So what are these two components?
Intel Server Platform Services (SPS): Designed for managing rack-mount servers, Intel® Server Platform Services provides a suite of tools to control and monitor power, thermal, and resource utilization. [1] Intel® Xeon® Processor E3-1200 v6 Product Family Product Brief, [2] Intel Management Engine
Intel Innovation Engine (IE): Innovation Engine is an embedded core in the Platform Controller Hub (PCH). It is a dedicated subsystem that system vendors (OEMs) can use to customize their firmware (not Intel firmware).
In other words, Intel Innovation Engine eliminates some limits on how OEMs can customize and differentiate their offerings by letting them run firmware of their own creation or choosing. End-user developers can tailor data center infrastructure to more fully meet organizational needs. [3] What Is Intel Innovation Engine and Why Does It Matter?, [4] Intel’s New Innovation Engine Enables Differentiated Firmware
Conclusion
So, based on our findings and on the VMware KB, the problem we were facing was a timing issue triggered by two HPE firmware updates that were stuck in the firmware update process because of a device error. Unfortunately I still have no clue how these two components interact with the SD card reader to produce such an error pattern. I have contacted an HPE support engineer to get some answers but have not received a reply so far. As soon as I get new information I will update this post.
Hi,
Just curious, will any RAID config and/or BIOS config be lost after resetting the NVRAM?
We were having a strange Innovation Engine FW issue, and your post helped a lot 🙏.
Thank you for posting!