Three Controllers to Rule Them All (that just doesn’t have the same ring to it, does it?)

So we’ve got an NSX Manager now, and the uber-cool nsxcli, so what’s next? We need to build out the control plane, which means it’s NSX Controller time.

We can do this the easy way, or the hard way. We’ll talk about both. But first, let’s think about NSX and what the Controllers do.

The Control Plane is broken up into two main pieces: the Central Control Plane (CCP) and the Local Control Plane (LCP). The Control Plane programs the data plane and maintains the current, or realized, state of the network.

That sounds rather similar to NSX-V controller nodes, doesn’t it? That’s because it serves the same general purpose as the controllers in NSX-V. Except it does more. For example, when DFW rules are published, they are published from NSX Manager to the NSX CCP, and the CCP is then responsible for sending the rules to the Local Control Plane for data plane programming.

Another difference is in how NSX Manager and the controller nodes interact: they communicate via a RabbitMQ message bus, not private APIs.

So far, so good. Let’s think about another change: Controller sizing. In NSX-V, each controller had 4 vCPUs, 4 GB of RAM, and (according to my lab) a 28 GB virtual disk. NSX-T, again, provides some sizing flexibility:

  • Small: 2 vCPU, 8 GB RAM, 120 GB disk
  • Medium: 4 vCPU, 16 GB RAM, 120 GB disk
  • Large: 8 vCPU, 32 GB RAM, 120 GB disk

Why would I use the different sizes, you might ask? Well, Small is for labs or proof of concept deployments. Generally, Medium is a good starting point. And Large, well, how big of a shop are you in? You might need it.

If you’re looking at your home lab or other small lab environment, you can get away with a single controller, just like NSX-V. Just don’t blame me when something doesn’t work.

For all the general similarities between the Controllers in NSX-V and NSX-T, T introduces some significant under-the-hood changes that enable some nice modularity in the Controllers themselves.

So how do we get these things up and running? I mentioned an easy way and a hard way earlier.

The easy way is to just have NSX Manager deploy the entire Controller cluster for you.

In the NSX Manager web UI, go to System > Components and click the “Add Controllers” link. NSX Manager will deploy the controller nodes and configure them for you based on your inputs. Before that works, though, you’ve got to go to Fabric > Compute Managers and add your target vCenter server so that NSX Manager has someplace to deploy these virtual appliances. The downsides are that this only works with vCenter, and it only deploys Medium-sized Controllers. Great for quick deployment, terrible for a small lab environment.
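If you’d rather script that Compute Manager prerequisite than click through the UI, it can also be registered through the NSX Manager REST API. Here’s a minimal sketch, assuming the /api/v1/fabric/compute-managers endpoint in NSX-T 2.x; the hostnames, credentials, and thumbprint below are placeholders, so check the payload against the API guide for your version:

  # Register a vCenter Server as a compute manager (all values below are placeholders)
  curl -k -u admin:'VMware1!VMware1!' \
    -H 'Content-Type: application/json' \
    -X POST https://nsx-manager.lab.local/api/v1/fabric/compute-managers \
    -d '{
          "server": "vcenter.lab.local",
          "origin_type": "vCenter",
          "credential": {
            "credential_type": "UsernamePasswordLoginCredential",
            "username": "administrator@vsphere.local",
            "password": "VMware1!",
            "thumbprint": "<vCenter SSL thumbprint>"
          }
        }'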

So we have to look at things the hard way for a number of other use cases: proof of concept deployments, KVM-only shops, big shops that need Large controllers, and so on.

To deploy your controller nodes manually, just grab the OVF (or .qcow2, if you’re in a KVM shop) and deploy it. If you’re deploying the OVF, you provide the same kinds of information you provided for the NSX Manager deployment – admin, auditor, and root passwords, network identity, etc. There are some options to add management plane configuration as well, should you choose to go that route.
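If you want to script the OVF route instead of clicking through the vSphere Client, ovftool will do it. Here’s a rough sketch, with the caveat that the OVA file name, inventory path, deployment option, and especially the --prop: property names are assumptions on my part, so probe the OVA first and adjust to whatever properties it actually exposes:

  # Probe the OVA to list its deployment options and the property names it actually exposes
  ovftool nsx-controller.ova

  # Deploy one controller with a static network identity (every value here is a placeholder)
  ovftool --acceptAllEulas --datastore=lab-ds01 --diskMode=thin \
    --name=nsx-controller-01 --net:"Network 1"=Management \
    --prop:nsx_hostname=nsx-controller-01 \
    --prop:nsx_ip_0=192.168.1.31 --prop:nsx_netmask_0=255.255.255.0 --prop:nsx_gateway_0=192.168.1.1 \
    --prop:nsx_passwd_0='VMware1!VMware1!' --prop:nsx_cli_passwd_0='VMware1!VMware1!' \
    nsx-controller.ova 'vi://administrator@vsphere.local@vcenter.lab.local/Lab/host/Cluster01'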

Once they’re deployed, you need to join the controllers to the management plane. Before we do that, we need the API thumbprint from NSX Manager (we collected that here).
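As a quick refresher, that thumbprint comes straight from the NSX Manager CLI; the prompt below is just my lab’s hostname:

  # On NSX Manager, grab the API certificate thumbprint and keep it handy
  nsx-manager> get certificate api thumbprint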

Then it’s a simple matter of running the “join management-plane” command and passing it the IP address of NSX Manager, the admin username and password, and the API thumbprint. On all of the controllers, we should also set the controller cluster security model. Right now, the only option is a shared secret. We run the “set control-cluster security-model shared-secret secret <secret>” command to do that.
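Strung together on a controller, that looks something like this; the IP address, secret, and thumbprint are placeholders, and expect to be prompted for the NSX Manager admin password as part of the join:

  # Register this controller with the management plane (run on every controller node)
  nsx-controller-01> join management-plane 192.168.1.30 username admin thumbprint <api-thumbprint>

  # Set the cluster security model; a shared secret is currently the only option
  nsx-controller-01> set control-cluster security-model shared-secret secret VMware1!VMware1!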

On the first controller, run the “initialize control-cluster” command to, well, initialize the controller cluster.
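That one’s a one-liner on the first controller; a quick status check afterward is a nice sanity test before moving on:

  # Initialize the control cluster from the first controller, then verify it came up
  nsx-controller-01> initialize control-cluster
  nsx-controller-01> get control-cluster status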

From the remaining two controllers, get the certificate thumbprint (“get control-cluster certificate thumbprint” on the controller node). With that information, we head back to the first controller (where we initialized the cluster), and join the other two nodes with the “join control-cluster” command, passing the target controller IP address and thumbprint.
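In practice that’s two commands per additional node: one on the new controller to read its thumbprint, and one back on the first controller to join it (IPs and thumbprints below are placeholders):

  # On the second (and later the third) controller, read that node's certificate thumbprint
  nsx-controller-02> get control-cluster certificate thumbprint

  # Back on the first controller, join the new node by its IP address and thumbprint
  nsx-controller-01> join control-cluster 192.168.1.32 thumbprint <controller-02-thumbprint>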

We need to go back to the second and third controllers (one at a time) and run the “activate control-cluster” command. All this work, and we finally have our management and control planes set up and ready to go. How about that?!
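The home stretch looks something like this, with a final status check to confirm all three nodes are in the cluster (hostnames are placeholders):

  # Activate each newly joined controller, one at a time
  nsx-controller-02> activate control-cluster
  nsx-controller-03> activate control-cluster

  # Verify the cluster sees all three members and is stable
  nsx-controller-01> get control-cluster status verbose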

We’re almost ready to build logical network constructs. Almost.

 

~$ history
Introduction: From NSX-V to NSX-T. An Adventure
NSX-T: The Manager of All Things NSX
The Hall of the Mountain King, or “What Loot do We Find in nsxcli?”

NSX Controller Logs

Have you ever wondered what log files matter for day-to-day troubleshooting on the NSX Controller nodes? There is certainly a plethora to choose from if you just type show log and press ‘Enter’.

If you haven’t looked at the new VMware Documentation site yet, I encourage you to check it out. There’s a whole new layout. Once you get accustomed to it, I think it’s actually easier than the old web-based documentation.

Anyway, I specifically wanted to call out the NSX CLI Cheat Sheet [1] that’s in the documentation, which walks through common things an NSX Administrator may need to know.

In the Troubleshooting and Operations course, we mention NSX Controller logs a couple of times, and I’d like to expand on that content just a bit.

syslog is, well, the core OS system log, not entirely unlike what you’d find on any other Linux system. In addition to the standard logging content, however, some HTTP access logs are also included.

Then, there’s the Zookeeper log (cloudnet/cloudnet_java-zookeeper<timestamp>.log). This log contains the logged data related to the Zookeeper process that enables NSX Controller clustering. Some things you may see in this log are disk latency warnings that could indicate issues with Controller syncing.

Finally, we have the core NSX Controller log file, cloudnet/cloudnet.nsx-controller.root.log.INFO.<timestamp>. This file contains a wealth of information about the operation of the NSX Controller. Let’s look at some of these messages individually.
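If you’d rather read these straight from the controller CLI than pull a support bundle, something like the following should get you there. The exact file names include timestamps that vary per node, so treat these as illustrative and tab-complete your way to the real names:

  # Core OS system log (includes the HTTP access entries mentioned above)
  show log syslog

  # Zookeeper (clustering) log and the main controller log; substitute the real timestamped names
  show log cloudnet/cloudnet_java-zookeeper<timestamp>.log
  show log cloudnet/cloudnet.nsx-controller.root.log.INFO.<timestamp>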

What we’re seeing in the above screenshot is an issue with the Controller cluster. Fortunately, it’s very short-lived and does not trigger a control plane issue. The Controller cluster can’t find any other functional nodes, so it announces that the cluster will shut down in 30 seconds. Had that happened, all connections to this surviving node would have dropped, causing a control plane outage. A cluster member, however, joined before the 30-second timer completed. The cluster shutdown is aborted, and the Sharding Manager is invoked to distribute slices to the new cluster member.

The next image simply shows us a VTEP Leave Report being acted upon by the Controller:

Here’s an interesting one:

What we see here is that a host sent a VTEP Join Report to the Controller, but the VTEP was already joined to the VNI. If we look carefully, we see that the existing VTEP Join Report came across Connection ID 7 (connId=7), while the new, conflicting report came across Connection ID 8. Also worth noting here is that the control plane sync state for the original VTEP report was good (isOutOfSync=False), whereas the new connection has not yet resynchronized its control plane (isOutOfSync=True).

And have you ever wondered about hosts sending ARP information for VMs after the VM has been identified? Take a look at this:

There’s a lot to look at when you get into log analysis, but once you can narrow down the important files, interpreting them is actually pretty straightforward.

  1. https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.3/com.vmware.nsx.troubleshooting.doc/GUID-18EDB577-1903-4110-8A0B-FE9647ED82B6.html