NSX Controller Logs

Have you ever wondered what log files matter for day-to-day troubleshooting on the NSX Controller nodes?  There are certainly a plethora to choose if you just type show log and press ‘Enter’.

If you haven’t looked at the new VMware Documentation site yes, I encourage you to check it out.  There’s a whole new layout.  Once you get accustomed to it, I think it’s actually easier than the old web-based documentation.

Anyway, I specifically wanted to call out the NSX CLI Cheat Sheet 1 that’s in the documentation, which walks through common things an NSX Administrator may need to know.

In the Troubleshooting and Operations course, we mention NSX Controller logs a couple of times, and I’d like to expand on that content just a bit.

syslog is, well, the core OS system log.  Not entirely unlike any other Linux system. In addition to the standard logging content, however, some HTTP access logs are also included.

Then, there’s the Zookeeper log (cloudnet/cloudnet\_java-zookeeper<timestamp>.log). This log contains the logged data related to the Zookeeper process that enables NSX Controller Clustering. Some things you may see in this log are disk latency warnings, that could indicate issues with Controller syncing:

Finally, we have the core NSX Controller log file, cloudnet/cloudnet.nsx-controller.root.log.INFO.<timestamp>. This file contains a wealth of information about the operation of the NSX Controller. Let’s look at some of these messages individually.

What we’re seeing in the above screenshot is an issue with the Controller cluster. Fortunately, it’s very short-lived and does not trigger a control plane issue. The Controller Cluster can’t find any functional nodes, so it announces that the cluster will shut down in 30 seconds. This will trigger all connections to this surviving node to drop, causing a control plane outage. A cluster member, however, joined before the 30 second timer completed. The cluster shutdown is aborted, and the Sharding Manager is invoked to distributed slices to the new cluster member.

The next image simply shows us a VTEP Leave Report being acted upon by the Controller:

Here’s an interesting one:

What we see here is that a host sent a VTEP Join Report to the Controller, but the VTEP was already joined to the VNI. If we look carefully, we see that the existing VTEP Join Report came across Connection ID 7 (connId=7), while the new, conflicting report came across Connection ID 8. Also worth noting here is that the control plane sync state for the original VTEP Report was good (isOutOfSync=False), where the new connection has not yet resynchronized its control plane (isOutOfSync=True).

And have you ever wondered about hosts sending ARP information for VMs after the VM has been identified? Take a look at this:

There’s a lot to look at when you get into log analysis, but once you can narrow down the important files, interpreting them is actually pretty straightforward.

  1. https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.3/com.vmware.nsx.troubleshooting.doc/GUID-18EDB577-1903-4110-8A0B-FE9647ED82B6.html

Lab Network

So I’ve had a few questions about the network in my lab, since I’m teaching almost nothing but NSX these days.  So let’s talk about it for a bit.

My network is purposefully simple.  And I’ve just rebuilt pretty much everything, so it seems like a good time to document it.

At the edge of my network is a Ubiquiti Networks EdgeRouter Lite (ERL).  It deals with all of my routing inside the network, as well as routing to the outside world.  It’s a 3 interface device – one to the outside world (cable modem), one to my default VLAN and home network, and the third interface is carved into a bunch of sub-interfaces for my VLANs in the lab.

The two internal-facing interfaces are attached to a Cisco SG300-20 that I could also use for routing, but I chose to let the router deal with that.  This is where I have several VLANs set up for my different environments, and that’s all I’ve done with the Cisco switch – no IGMP Snooping, no routing, just VLANs:

  • Local Management – this is where all of my common stuff lives – the vCenter for my physical hosts, vROps, Log Insight, etc
  • Production Management – this is where my GA-versioned vESXi hosts live, along with their relevant supporting pieces – vCenter, NSX Manager, etc
  • Production NSX Control – I set this up simply to have a dedicated network for my NSX 6.1 Controllers.  These could just as easily gone into my Production Management VLAN
  • Production NSX Transport – this is here to simulate a dedicated VXLAN transport network.  Currently, this is superfluous, as NSX 6.0/6.1 VTEPs don’t deal well with VLAN tagging in a nested environment.  Not sure what that’s all about, <sarcasm>I must be running in an unsupported config </sarcasm>
  • Production Management Branch – this network provides a simulation of a remote site
  • Production NSX Transport Branch – again, simulation of a remote site, but much like the Production NSX Transport, this one’s completely superfluous at the moment.
  • I’ve got a matching set of VLANs for my non-GA environment, so that I can have a stable and unstable environments and maintain some level of isolation.  

Since my lab is completely nested, I also have VSAN and vMotion VLANs configured on my distributed switches, but they don’t map to anything in the physical network.

On the NSX side of things, well, I’m rebuilding that right now.  My thought process, since this is a lab, is to attach my outside-facing Edge VMs to the relevant Management network, depending on where I need the Edge.  This sort of flies in the face of having a dedicated Edge cluster, but hey, this is a lab 🙂  

Inside the Edge, my DLR(s?) will attach to a common Transit network, as will the inside interfaces of the Edges.  I’ll set up some OSPF areas so that the EdgeRouter Lite can advertise some networks into the Edge.  The DLR will also advertise its routes up to the Edge, which will in turn advertise back to the ERL.  This should be a pretty simple OSPF config.  I could eliminate the need for OSPF between the Edge and the ERL simply by configuring a default route, but what fun is that?

Then my workloads will attach to whatever Logical Switches I want them attached.  The sky’s the limit inside the SDN.  

For simplicity’s sake at this point, each network segment (VLAN or VXLAN) will have its own /24, though many of them could make due with a /28 or /29 pretty easily.  But I’m not strapped for IP addresses, thanks to our friend RFC 1918, so I’m not going to make things any more complicated than I need to.  

Everything works pretty well.  Sure, I run into some goofy behavior once in a while (see the VTEP VLAN tagging thing above), but this environment is entirely unsupported.  Honestly, it’s a miracle that any of this works at all, and is a galvanizing testament to what VMware software is actually capable of doing.  

Someday, maybe I’ll draw this all up.  But today is not that day.  

New Lab Server

Here I am, procrastinating on other stuff, to talk about the new lab setup.  I promised in the last video, I’d do a writeup, and I figured “no time like the present”!

So what do I have going on?  I’ve ripped out the ML370 and ASUS RS500A boxes, and replaced them with a veritable steal from the Dell Outlet.  I decided to go all-in for a nested lab, since I can’t come up with a good reason to put together a physical lab.

So I found a Precision T5610 with a pair of Xeon E5-2620v2s.  Twelve hyper threaded cores of processor get up and go.  The Scratch and Dent unit I bought had 32 GB of RAM installed, along with a 1 TB spindle.  Not a bad start.  But I had gear to work with, and needed more RAM.  

So I ordered a nice 64 GB upgrade kit from Crucial (well, 4 16 GB kits, technically, hoping it’d give me a 96 GB box to work with.  The pre-installed RAM and the Crucial RAM didn’t play too nicely together (Windows wouldn’t load from the spindle, nor would the ESXi installer start).  Boo.  So I pulled out the factory RAM, ran memtest86 for about a day, just to be safe, and am proceeding, for the moment, with 64 GB.

Still just a start, though, as I needed storage.  I had an Icy Dock 4-bay SATA chassis and an IBM M1015 SAS RAID controller in my little HP Microserver.  That box didn’t need those things anymore, so a transplant was necessary.  4 screws later, I had lots of room for 2.5” SATA drives.  I already had a pair of 120 GB Intel 520 SSDs in Icy Dock trays, and I pulled the 2 1 TB Crucial M550 SSDs from the ASUS box.  Now I have plenty of lab storage.  If I need more, I can always take the performance hit and attach something from the DS412+ (like I’ve already done with my “ISO_images” NFS share.

**EDIT** I ran into another snag.  The LSI controller and Icy Dock combination seem to not be playing nicely with the host.  The SSDs seem to randomly just drop offline periodically, which makes me sad.  It could be a power thing (these big SSDs are kind of notorious for power problems, especially in small drive chassis or NAS units), but I’m not going to push it any further.  I pulled the controller and drive cage out of the system, and I should have a SATA power splitter after UPS shows up today (I was too lazy to leave the house LOL), so I’m just going to run the SSDs off the on-boad SATA controller channels, and be happy about it.  

At this point, I thought I was ready for ESXi, but upon installation, I hit a snag.  The T5610 has an Intel Gigabit NIC.  As such, I expected no issues, but the 82579 isn’t noticed by ESXi 5.5u2 for whatever reason.  No biggie – this thing has a nice tool-less case, and less than a minute later, I had a dual-port Intel NIC installed and ready to go (82571EB if you’re curious).  One more reboot and ESXi was installing.

Everything’s pretty happy at the moment.  Here’s what it looks like now that I’ve built up all my virtual ESXi hosts:

Screenshot 2014 09 24 08 47 22


Screenshot 2014 09 24 08 48 53


Oh, and the “Remote_External” cluster is a set of ESXi virtual machines I have running in VMware Workstation on a Precision M4800 laptop.

I’ll follow up with some network details shortly, since that’s also gotten _way_ complex recently.  All in the name of scenario-based play with NSX. More fun later!


Deploying NSX Manager

Well, another day, another post.  

Ok, that may be exaggerating a bit, but I’m trying.  Again.  Still.

I’m rebuilding my lab (more to come on that later), and with the big push toward VMware NSX in my life right now, I thought I’d capture my fun and excitement in deploying everything.  Here’s Part 1, where I deploy the NSX Manager and register it with vCenter.  Nothing earth-shattering here, but it might help someone.

Still here. Or “Meandering thoughts about the lab that time forgot”

So, I do still exist.  I am still here, and look at this!  I even update once in a while.

My focus has shifted a little bit over the past few months.  I’m working a lot with NSX, and have been teaching the NSX: Install, Configure, Manage for a few months now.  Like, a lot.  Q3 has been absolutely nuts for me, especially with the announcement of NSX 6.1 at VMworld.

So, what does all that mean for me, and this blog, exactly?

Well, for the blog, it means a little bit less vSphere stuff, and a little bit more networking.  

Which leads us to “What does this mean for me?”.  Well, I’ve got a stack of Cisco gear here in the office now.  That must mean Cisco certification prep.  Someday, I’ll have a break and be able to actually do that 🙂

So, what Cisco stuff is in the pipeline?  CCNA – Routing and Switching is first.  that’s the basic stuff.  CCNA Data Center is soon to follow.  I’ve been tossing around CCDA because design is just a thing I do.  Do I pursue those to the NP/ND level?  I don’t know.  My guess is no, but I never leave anything off the table.

I’ve also got the VCP-NV on my list, and the forthcoming VCIX-NV when it launches.  I’ve taken a swing at the VCP-NV – I did that while I was out at VMworld, but I just missed the passing mark.  I was caught completely off-guard by some of the stuff in the exam.  It was my own fault, however, as I spent a grand total of zero time preparing and studying for the exam.  I even talked to the cert devs about the Exam Blueprint, that I didn’t actually read until the evening _after_ I wrote the exam.

That’ll be easy to rectify, however.  I know what I need to study to bump my score up over the pass mark.  Now I just need what everyone needs – time.

I also have some new lab gear incoming.  I caught a steal from the Dell Outlet the other day.  I have a scratch & dent T5610 with dual 6-core Xeon E5s on its way, and an additional 64 GB of RAM just showed up on my front porch today.  I’ll talk through the build I’m planning as I’m working through it.  I figure with 24 threads and 96 total GB of RAM, I should be able to run many, many nested ESXi hosts, especially with the VMware Tools fling and the MacLearn dvfilter fling for nested ESXi.  My old ASUS RS500A ran pretty well with 2 6-node clusters and a 4-node management cluster, each nested ESXi host with 8 GB of RAM each.  And the RS500A only had 64 GB of RAM.  

So I’m going to consolidate the ASUS and the old ML370 into a single, more modern (and likely more power efficient) box.  The tradeoff is that I’m giving up the iKVM and iLO capabilities of my existing hosts.  It’ll be ok.  I guess I’ll just have to look at IP KVMs now.  

In other news, I’ve moved my edge to higher-end gear.  Nope, nothing so fancy or cost-prohibitive as Cisco, but definitely powerful.  I picked up a Ubiquiti Networks EdgeRouter Lite the other day, and it’s pretty slick.  I’ve got a complementary UniFi AC on the way, and that should be waiting for me when I get home.  The EdgeRouter is pretty slick.  EdgeOS is based on Vyatta Core 6.3 (just before the Brocade acquisition), so I’m familiar, in passing, with the OS.  This thing is fast!

So this will all end up in a home network / lab series as I do more cool stuff with it.

As it stands, I gotta get back to class – we’re just about done deploying Controllers…