vSphere 7.0 U3c update – VCDB size issue
Introduction
It’s been a while since we released vSphere 7.0 U3 and then pulled it again. You can read more about it here. When we released vSphere 7.0 U3c in January, most of our customers were really cautious about whether they should go for it or not. Some weeks after the release, most of my customers moved to vSphere 7.0 U3c and so far have not encountered any severe issues. You may run into some small hiccups here and there, but nothing that cannot be fixed in minutes.
The Problem
This week another customer moved to vSphere 7.0 U3c but encountered issues I had never seen before with other customers. This customer is running two vCenters in Enhanced Linked Mode. They have also connected NSX for vSphere (NSX-v), vRealize Operations, vRealize Log Insight and NSX Advanced Load Balancer to one of the vCenters. The problems started after the update to 7.0 U3c, but only in one of the vCenters. Unfortunately, it was the vCenter to which most of the solutions were connected.
The issues manifested in the following behaviors:
- vROps was not able to collect metrics through the vCenter Adapter
  - Collection State was “Collecting”
  - Collection Status was “NONE”
- NSX-v Manager came up with a blank web page after the vCenter was rebooted and was unmanageable
Apart from these issues, everything else was working normally, and the NSX-v Manager was running again after a reboot. vROps, however, is used to charge for VMs, so we had some pressure to get it working again. We opened a case with GS and received the answer that the connection to vCenter was not working correctly. This matched our own assumption, because other solutions were impacted as well and vCenter was the only component that had been updated recently.
Another case was opened, this time with the vCenter team, and we told them that we suspected it was related to the API. In parallel, I also had a look at the vCenter logs just to be sure. After some analysis and a remote session with a TSE, we had not found anything: the logs were clean, lsdoctor showed no issues and VDT (vSphere Diagnostic Tool) was also clean. Not a single hint as to why it was not working. We decided to move back to the vROps theory, although we knew for sure that it had to have something to do with vCenter.
An hour later my customer called me again and told me that their PowerCLI-based VM deployment was now also affected and not working. We checked the scripts and found out that Get-VIEvent was really slow or got stuck. We also checked the events for a VM and saw that the progress circle kept spinning for a very long time before ending up with “No events available” (or similar wording).
Knowing that events were not working correctly, my assumption was now that there was a problem with the backend vPostgres database or with the communication between the vCenter services and the DB. We connected via SSH to the affected vCenter and checked some items in the database.
# Connect to the vPostgres database
/opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB
-- Get the TOP 20 biggest tables of the VCDB (KB2147285)
SELECT nspname || '.' || relname AS "relation",
pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
AND C.relkind <> 'i'
AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) DESC
LIMIT 20;
As you can see, the VPX_TASK table is quite big, as are some of the VPX_EVENT_ARG and VPX_HIST_STAT tables. The VCDB itself had a size of 181GB, which from my perspective was way too big for this environment. We also checked the vCenter configuration (select the vCenter – Configure – General): the Task cleanup job was configured to 180 days and the Event cleanup job to approximately 50 days, which is quite high. The reason was that, in the past, a tool was in use that required this retention.
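Besides the table sizes, it can help to quantify the backlog itself. From the same psql session as above, something like the following returns the overall database size and rough row counts; the table names are taken from the TOP 20 output, and full counts can take a while on a bloated database.
-- Overall size of the vCenter database
SELECT pg_size_pretty(pg_database_size('VCDB'));
-- Rough extent of the event/task backlog
SELECT count(*) AS event_rows     FROM vpx_event;
SELECT count(*) AS event_arg_rows FROM vpx_event_arg;
SELECT count(*) AS task_rows      FROM vpx_task;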
The Solution
Based on our findings we decided to truncate the SEAT (Stats, Events, Alarms & Tasks) tables and set the maximum retention for the cleanup jobs to 30 days. For this, there is KB2110031. The confirmation of my assumption that this would fix the problem came after seeing the following log lines from the vROps vCenter Adapter, together with the explanation of a vROps TSE.
2022-03-22T23:29:17,281+0000 WARN [Collector worker thread 5] (83) com.integrien.adapter.vmware.EventProcessing.getEvents - Due to duplicate events we might miss some events.
2022-03-23T00:24:19,880+0000 WARN [Collector worker thread 11] (83) com.integrien.adapter.vmware.EventProcessing.getEvents - Due to duplicate events we might miss some events.
2022-03-23T04:59:39,039+0000 WARN [Collector worker thread 15] (83) com.integrien.adapter.vmware.EventProcessing.getEvents - Due to duplicate events we might miss some events.
2022-03-23T06:07:49,911+0000 WARN [Collector worker thread 4] (83) com.integrien.adapter.vmware.EventProcessing.queryEventsPageByPage - Query events exited with timeout of 120000 ms.
2022-03-24T13:29:39,146+0000 ERROR [Collector worker thread 9] (83) com.integrien.adapter.vmware.EventProcessing.sendAllChangeEvents - Error in sendAllChangeEvents
com.vmware.vim.vmomi.client.exception.ConnectionException: https://VCENTER.FQDN/sdk invocation failed with "java.net.SocketTimeoutException: Read timed out"
I had come across the last line in the logs, but I had no clue that it referred to gathering events from vCenter, which at that point in time was not possible because there were too many events available and the worker process ran into a timeout.
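For completeness: the actual cleanup followed KB2110031, which provides a script and requires stopping the vCenter services and taking a backup or snapshot first. Purely as an illustration of what the SEAT cleanup boils down to (a simplified sketch, not a replacement for the KB procedure), the statements look roughly like this:
-- Illustration only: follow KB2110031, stop the vCenter services and take a
-- backup/snapshot before touching the database. The exact table list may vary.
TRUNCATE TABLE vpx_event_arg CASCADE;
TRUNCATE TABLE vpx_event CASCADE;
TRUNCATE TABLE vpx_task CASCADE;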
The whole truncate of the DB took about 2.5 to 3 hours and ended up in the following final size.
Once truncating was finished, we saw that VPX_TASK was gone completely (no longer in the TOP 20) and that the other tables had also shrunk a bit. The biggest change was the overall VCDB size: it went down from 181GB to 85GB.
After this, we checked all the other solutions and, lo and behold, everything was working as expected again.
Conclusion
To be honest, this was one of the hardest issues I have ever had to troubleshoot. The challenge was that we knew it had to have something to do with vCenter itself, but we did not have any lead based on the vCenter logs and the other tools, which made it nearly impossible to get to a solution. We were really lucky that some PowerCLI scripts were failing; this led us to the “event view” hint and from there to the VCDB problem.
Personally, I don’t think that the update to U3c itself was the issue here. I suspect that the post-processing of some of the database update and migration steps marked events in a way that made the vROps vCenter adapter try to gather them again, and that, given the sheer amount of events, it ran into a timeout.
In the beginning, it was not clear why the vROps vCenter adapter was not working while the vSAN adapter was. The reason is that the vSAN adapter gathers its information through a different API (the vsan-health service) and does not gather events at all.
Based on the lessons learned here, I would recommend checking the VCDB with the query shown in this blog post before updating vCenter to 7.0 U3c. The default settings for the cleanup jobs are enabled and 30 days, so in most cases everything should work.
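If you prefer to verify those retention settings directly in the database instead of in the vSphere Client, they are stored in the vpx_parameter table. The parameter names below are an assumption based on how the event and task cleanup settings are kept in the VCDB; treat this as a quick sanity check rather than an official procedure.
-- Check the event/task cleanup settings as stored in the VCDB
-- (parameter names assumed; the supported way to change them is the vSphere Client UI)
SELECT name, value
FROM vpx_parameter
WHERE name IN ('event.maxAge', 'event.maxAgeEnabled',
               'task.maxAge', 'task.maxAgeEnabled');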
If you have any further questions on this topic, please drop me a comment or drop me an e-mail.