A CCIE Gets Fired

Collapsing the Core Since 1918

It’s Hard to Fire a CCIE

It’s actually hard to fire a CCIE. And no, this isn’t a veiled threat to my boss. Let’s put ourselves in the shoes of management for a moment. The network is a scary black box; you can’t just plug a VGA monitor into it and see what’s on the screen.

There are a bunch of different tools and monitoring systems, none of which work 100% of the time. When the network is misbehaving or really doing something unusual, the answer is never clear. The problem can only be discerned by digging through some crazy-looking log files, or worse. If you are asking yourself “what could be worse than a log file?”, then you have obviously never had to read a packet trace.

For those just joining us, this is the second part in this series. In this post I will refer to all network experts as CCIEs. If you are a CCIE, a JNCIE, any other sort of IE, or just a rock star network architect/engineer, when I talk about CCIEs I’m TALKING ABOUT YOU! It’s a big tent, come in and join us…

So let’s go back to my statement from last time. A CCIE basically does these four things:

  • Breaks the Network — My last post
  • Fixes the Network — This post
  • Moves/Adds/Changes
  • Project management

If we jump back into the shoes of our manager, we see our CCIE employees as very valuable assets. Lots of companies have one CCIE at most, a few have more than one, and the vast majority of businesses do not have any. Finding CCIEs, hiring CCIEs, employing CCIEs, and keeping CCIEs is difficult. They can be rather expensive and sometimes a bit odd to interact with. So for a manager who actually has one on their staff, what are the benefits?

CCIEs (in addition to doing actual, you know, work) are a bit like insurance. They are the top dogs in the IT organization; they are who you go to when things are REALLY broken. They can’t fix all the application layer issues, but they can at least point to which system is not doing what it should. So your CCIE, possibly by reading a packet capture, can tell your organization where the problem lies: which server is not responding. And if things are broken and you don’t have a CCIE, what sort of options do you have?

  • Call the network vendor (ex. Cisco TAC, Juniper JTAC)
  • Call your managed service provider
  • Hire a consultant and hope they are not too busy to help you
  • Go to StackOverflow, ChatGPT, Discord
  • Contact your local deity

None of these options are particularly good. Each of them involves some time delay, possibly the exchange of some money (or, in the case of the deity, your firstborn), and they all require someone else to get up to speed on your environment. If your business is DOWN HARD and you have to wait for some consultant to understand your problem, get access to your systems, and actually troubleshoot, let’s just say that those minutes/hours/days are going to tick by like an eternity. Thus, if your business is big and important enough that you cannot tolerate this delay, you hire a CCIE. And having that person down the hall from your office is your own form of insurance. They are the backstop. The buck stops there.
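
To make that “which server is not responding” triage from a moment ago concrete, here is a toy Python sketch (not anything a CCIE would literally run in production): try a TCP connection to each suspect server and flag the ones that never answer. The host names and ports are hypothetical placeholders, not anything from a real environment.

```python
import socket

# Hypothetical suspects -- replace with your own hosts and service ports.
SUSPECTS = [
    ("app01.example.com", 443),
    ("db01.example.com", 5432),
    ("cache01.example.com", 6379),
]

def is_responding(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, port in SUSPECTS:
    status = "OK" if is_responding(host, port) else "NOT RESPONDING"
    print(f"{host}:{port} -> {status}")
```

Of course, a real outage is rarely that polite: the interesting cases are the ones where the server answers but misbehaves, which is exactly when the packet capture comes out.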

But what if the CCIE can’t fix it? First off, that would push the CCIE down the mental well of impostor syndrome. There is always a problem that can befuddle any expert, including a CCIE. Luckily, they can still contact any of the third parties mentioned above, and while they are on hold they can continue to troubleshoot and capture information to help someone else understand the root of the problem. And if they engage a remote support entity like a Technical Assistance Center (TAC), they act as the expert remote hands, performing all of the physical actions needed to gather the information that identifies the problem.

All CCIEs suffer from impostor syndrome. There is always something that we can’t figure out or fix, and while that is sometimes unsettling, it is a reality when we are operating equipment built by someone else.

In my last post I had a few mea culpa (translation: I messed up) moments. But there were also times when I definitely did NOT have impostor syndrome, times when I saved the day, or, as it says on my resume, when I did indeed kick ass.

A Brief History of Me Kicking Ass

I rushed into the global headquarters of a global media and broadcast company at 2AM when the entire network was down hard, and spent hours troubleshooting and sneaking around the campus by myself like a bandit until I found a dusty old switch in an HVAC office four stories below street level in the middle of Manhattan. The switch was in a room where a facilities manager monitored the air conditioning for the entire corporate campus on a small screen, and when I burst in unannounced at 4AM, ran over to the switch, and pulled out one of the cables, it was a scene out of Lord of the Rings. This guy (who clearly never had visitors and looked like he didn’t get very much sun) was certainly freaked out that I was entering his lair and messing with his Precious. Whatever. The network came back to life and I slowly navigated my way out of the steam tunnels. HUGE WIN. Lesson: As the expert, you should take the time to physically visit and inspect every single piece of equipment, no matter how remote.

I once troubleshot a hung Cisco 7206VXR while I was on vacation, over a spotty phone connection, using a completely unskilled individual as my hands in the remote data center. Luckily I have something of a photographic memory, so I could describe the exact location of the device, where to find an aux cable and dongle, and where the console port was located amidst a nest of old cables. My remote buddy got his first experience building a crash cart and connecting to the console (I probably scarred him for life), and once we knew the device was hung we rebooted it. Network restored. On the conference bridge were my direct manager, the department manager, and the head of a recently acquired company that owns several movie studios. They were effusive in their praise. I hung up the call and got myself (another) piña colada. Lesson: Leave extra console cables everywhere you can. Lesson 2: Enjoy your time off, you deserve it.

One time I ended a week-long troubleshooting effort on a Nexus 7k that was indiscriminately deciding which packets to forward. I took a trip to the site to help a junior engineer, and it took me two minutes to spot the problem: “Why is the chassis crooked?” Fast forward a few hours to him admitting that they had dropped it “just once” while moving it into position. The crowd goes wild. Lesson: Dropped equipment is probably defective.

I single-handedly solved a problem that caused sporadic outages between two large media companies that were merging. The super-senior network architect at the other company had decided it would be cute to isolate administrative control by putting two SUP4s in a single 6509, with each company controlling one. BAD IDEA. Lesson: Redundant supervisors need to have identical configurations.
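
That lesson is easy to check for. Here’s a minimal Python sketch, assuming you’ve already exported each supervisor’s running configuration to a text file (the file names are hypothetical), that flags any lines where the two configs diverge.

```python
import difflib
from pathlib import Path

# Hypothetical file names: the running config exported from each supervisor.
config_a = Path("sup_slot5_running.cfg").read_text().splitlines()
config_b = Path("sup_slot6_running.cfg").read_text().splitlines()

# Show only the lines where the two supervisors disagree.
diff = list(difflib.unified_diff(
    config_a, config_b,
    fromfile="sup_slot5", tofile="sup_slot6", lineterm="",
))

if diff:
    print("\n".join(diff))
else:
    print("Supervisor configurations match.")
```

Five lines of diff output up front would have saved both companies a lot of sporadic outages.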

I saved a major ecommerce website from being taken down during the East Coast blackout of 2003. I was at my desk when the lights went off in the building. My first thought was “Oh sh-t, I hope I didn’t do that,” and then I leapt into action, running up three flights of stairs to a colocation center to see what had happened to our data center. Luckily the colo was on battery backup so everything was running, but in a freak prescient moment I decided to check whether our core switches were all connected to the backup power grid, which would be the sole source of power in under five minutes. I noticed that someone had miscabled the redundant power supplies on one core switch. Looking something like the Road Runner, I found a replacement cable and inserted it about three seconds before the power switched from battery to backup generator. Had I not done that, the entire business would have gone down. Lesson: Be proactive in evaluating disaster recovery scenarios, and test everything frequently.

Don’t Build Snowflakes

But not everyone can afford to have a network expert on staff 24/7. That’s why there are tons of network monitoring, analysis, and troubleshooting tools available. The industry has also come a long way in the past 10 years. We now build networks with a few fairly common designs, and many products implement design best practices by default. In fact, you can design a network online and have all of the techniques and optimizations used by the major cloud providers built into the design. Not only do these enhancements eliminate problems in advance, but well-designed, deterministic network topologies are much easier to troubleshoot, should you ever inherit one.

Are CCIEs less valuable now? If your network is down, then the answer is clearly NO; you’re likely to pay any amount of money to get things back online. But the world has changed a lot in the past twenty years. We really don’t use a ton of different protocols anymore (RIP: DLSW, Appletalk, IPX/SPX, DECnet, Vines, etc.), and certain network topologies and designs have proven to be the best technical solution for certain use cases (ex. a Leaf-Spine Clos Fabric for data center server connectivity). What that means is that architects are building repeatable (and reliable) network units instead of something wonky that is hard to troubleshoot. And that means you are less likely to need a CCIE. Let’s continue talking about this in Part 3.
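
To show what “repeatable network unit” means in practice, here’s a toy Python sketch (not any vendor’s tooling) that enumerates the links of a small leaf-spine fabric. Because every leaf connects to every spine, the cabling plan is completely deterministic, which is exactly what makes these designs easy to reason about and troubleshoot.

```python
from itertools import product

def leaf_spine_links(num_spines: int, num_leaves: int) -> list[tuple[str, str]]:
    """Every leaf uplinks to every spine -- the whole fabric is one repeated pattern."""
    spines = [f"spine{i + 1}" for i in range(num_spines)]
    leaves = [f"leaf{i + 1}" for i in range(num_leaves)]
    return [(leaf, spine) for leaf, spine in product(leaves, spines)]

# A small pod: 2 spines and 4 leaves means exactly 8 predictable links.
for leaf, spine in leaf_spine_links(num_spines=2, num_leaves=4):
    print(f"{leaf} <-> {spine}")
```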

Till next time!
