VCF 9 Fleet Deployment Task Fails

The Issue?

Note: I have only had this occur once, and it was in my lab, so it is likely lab related and IS NOT a widespread issue. Consider it an FYI in case you run into it.

TLDR: I always deploy my fleet appliances on VCF Networking (VMware NSX) overlay networks. When building a management domain in VCF 9, there is an option to skip deploying the fleet appliances, which gives you the flexibility to deploy them afterwards through the SDDC Manager API onto NSX networks. When I did this, the SDDC Manager task deployed VCF Operations and fleet management, then failed while deploying the VCF Operations proxy.

Unfortunately I didn’t capture the error message in SDDC Manager to paste here, but the message was quite generic.

Working The Issue

The code block below is my payload to deploy VCF Operations fleet management, VCF Operations, the VCF Operations proxy (collector), and VCF Automation. As you can see, two networks are specified at the bottom: localRegionNetwork and xRegionNetwork. Both are NSX overlay networks in my environment.

{
    "vcfOperationsFleetManagementSpec": {
        "hostname": "fleet.shank.com",
        "rootUserPassword": "VMware123!VMware123!",
        "adminUserPassword": "VMware123!VMware123!",
        "useExistingDeployment": false,
        "version": "9.0.0.0.24695816"
    },
    "vcfOperationsSpec": {
        "nodes": [
            {
                "hostname": "ops.shank.com",
                "rootUserPassword": "VMware123!VMware123!",
                "type": "master"
            }
        ],
        "useExistingDeployment": false,
        "applianceSize": "medium",
        "adminUserPassword": "VMware123!VMware123!",
        "version": "9.0.0.0.24695812"
    },
    "vcfOperationsCollectorSpec": {
        "hostname": "opsproxy.shank.com",
        "rootUserPassword": "VMware123!VMware123!",
        "version": "9.0.0.0.24695833",
        "applianceSize": "small"
    },
    "vcfAutomationSpec": {
        "ipPool": [
            "10.100.0.200",
            "10.100.0.201"
        ],
        "nodePrefix": "auto",
        "hostname": "automation.shank.com",
        "internalClusterCidr": "198.18.0.0/15",
        "adminUserPassword": "VMware123!VMware123!",
        "version": "9.0.0.0.24701403"
    },
    "vcfMangementComponentsInfrastructureSpec": {
        "localRegionNetwork": {
            "networkName": "local-region",
            "subnetMask": "255.255.255.0",
            "gateway": "10.101.0.1"
        },
        "xRegionNetwork": {
            "networkName": "mgt-overlay",
            "subnetMask": "255.255.255.0",
            "gateway": "10.100.0.1"
        }
    }
}

I am skipping the payload validation step in this article, but you should always validate your payload before POSTing it to ensure it is going to work.
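Before handing the payload to SDDC Manager for validation, a quick local sanity check can catch obvious mistakes. The sketch below is a minimal example and not a substitute for the SDDC Manager validation: the function name `precheck_payload` is hypothetical, and the rule that every `ipPool` address should fall inside one of the two declared networks is an assumption based on the payload above, not a documented requirement.

```python
import ipaddress
import json

# Top-level keys copied from the payload shown above.
EXPECTED_KEYS = {
    "vcfOperationsFleetManagementSpec",
    "vcfOperationsSpec",
    "vcfOperationsCollectorSpec",
    "vcfAutomationSpec",
    "vcfMangementComponentsInfrastructureSpec",
}


def precheck_payload(raw: str) -> list[str]:
    """Return a list of problems found; an empty list means nothing obvious."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(payload, dict):
        return ["payload must be a JSON object"]

    problems = [f"missing key: {k}" for k in sorted(EXPECTED_KEYS - set(payload))]
    if problems:
        return problems

    infra = payload["vcfMangementComponentsInfrastructureSpec"]
    networks = []
    for name in ("localRegionNetwork", "xRegionNetwork"):
        net = infra[name]
        # strict=False derives the subnet from gateway + mask.
        networks.append(
            ipaddress.ip_network(f"{net['gateway']}/{net['subnetMask']}", strict=False)
        )

    # Assumption: each VCF Automation pool address belongs to a declared network.
    for ip in payload["vcfAutomationSpec"]["ipPool"]:
        if not any(ipaddress.ip_address(ip) in n for n in networks):
            problems.append(f"ipPool address {ip} is outside the declared networks")
    return problems
```

Calling `precheck_payload` with the contents of the payload file returns a list of human-readable problems, so it can be run each time the payload is edited.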

As with any failed deployment, SDDC Manager had a failed task that could be retried, but it failed again on every retry. The domain-manager.log and vmware_vrlcm.log files weren’t highlighting any specific issues. The network was reachable, DNS records existed, and there was no firewall in the way between fleet / ops / NSX. Curl and telnet between all appliances were working perfectly.
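The DNS and reachability checks mentioned above can be scripted so they are easy to repeat after each retry. This is a rough sketch rather than the exact commands used in the lab: the helper names `resolve_all` and `port_open` are illustrative, and the default port of 443 is an assumption, so adjust it to whatever your appliances need to reach.

```python
import socket


def resolve_all(hostnames, resolver=socket.gethostbyname):
    """Map each hostname to its IP address, or None if resolution fails."""
    results = {}
    for name in hostnames:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


def port_open(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, passing the four hostnames from the payload to `resolve_all` shows at a glance which records are missing, and `port_open` can then be called against each resolved address.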

As it was a greenfield deployment, there wasn’t much else to check. Being a lab, I decided to just redeploy the fleet appliances. I deleted the VMs from vCenter, but when I attempted to deploy them again using the exact same API and payload, I hit the below error.

VCF 9 inventory all fleet appliances deleted
 "Deployment of VCF Management Components is not allowed as an existing deployment is already present".


Investigating and Fixing

From what I could tell, the only place that recorded the deployment status was the vcf_management_component table in the postgres database on SDDC Manager. The image below is what a healthy deployment looks like.

SDDC Manager postgres database for vcf_management_component

In my failed deployment, VCF Operations and fleet management deployed, and then the deployment errored on the proxy. Rather than seeing “SUCCEEDED”, I had “FAILED” for the cloud proxy and “NOT STARTED” for the rest.

Note: my deployment failed early, after ops was deployed and while the proxy was registering. If you use this fix at a later stage, ensure you have taken a snapshot / backup first!

The Fix

ALWAYS back up and/or snapshot SDDC Manager before any database change. Realistically you should avoid modifying the database altogether, especially without support. This process may leave you unsupported, so ensure you are working with support if you do encounter this issue.
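On top of a snapshot, it may be worth exporting the table itself before touching it. The sketch below only builds, and optionally runs, a `pg_dump` command for the `vcf_management_component` table in the `platform` database. The host, user, and database mirror the psql commands in this post; the helper names and the output path are hypothetical, and it is an assumption that `pg_dump` and Python are available on your SDDC Manager appliance.

```python
import subprocess


def build_table_backup_cmd(out_file, table="vcf_management_component",
                           database="platform", host="localhost", user="postgres"):
    """Build a pg_dump command that exports a single table to a SQL file."""
    return [
        "pg_dump",
        "-h", host,      # server host
        "-U", user,      # database user
        "-d", database,  # database to connect to
        "-t", table,     # dump only this table
        "-f", out_file,  # write the dump to this file
    ]


def backup_table(out_file):
    """Run the dump; raises CalledProcessError if pg_dump exits non-zero."""
    subprocess.run(build_table_backup_cmd(out_file), check=True)
```

Keeping the command construction separate from execution makes it easy to print and review the exact command before running it on the appliance.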

Putty / SSH onto SDDC Manager, then run:

psql -h localhost -U postgres
\c platform
\x

TRUNCATE TABLE vcf_management_component;

psql should confirm the command (a TRUNCATE prints TRUNCATE TABLE), and you should see the below when verifying.

verify vcf_management_component database after updating postgres

After this you should be able to revalidate and redeploy the appliances using the API. Ignore the hostname in the image above; it was taken from a fresh SDDC Manager deployment and shows how the table looks before anything is deployed.

Summary

The deployment of VCF 9 fleet components can fail part way through, and retrying the task does not recover it. You may need to forcefully delete the VMs and clear the vcf_management_component table in the SDDC Manager database, which will allow you to redeploy successfully.
