Concurrently, a longer-term plan was being developed, just as Tom Black and the data (Ethernet) team moved full steam ahead to ensure Cerent would deliver Ethernet switching capability in the product. The software engineering organization was under the gun and in a state of flux as Gary arrived to take the helm and realize the vision of the Cerent 454 as a multi-service provisioning platform (MSPP).
Spaghetti Code
Spaghetti code is a derogatory term applied to the source code of many products; it means that the code contains a significant amount of convoluted, and perhaps incorrect, logic. Software bugs are often difficult to fix in spaghetti code [2] because any fix applied often has side effects that can worsen product performance.
This situation applied to the Cerent 454 in the early days. The software code the initial software engineers put in place for the first release of the Cerent 454 product was basically sound. The code was not exactly elegant, but it was workable. It formed a reasonable base upon which to build product functionality and do so quickly. However, Chip and his fellow software engineers never put in place the structure, or allocated time in the schedule, to deal with the development of so many plug-in cards concurrently – DS3, OC-3, OC-12 (marketing needed an added 1550 nm version too), OC-48 (multiple wavelength versions), Ethernet (three versions), DS1 (an added interface), DS3 Transmux, STS-1, OC-192 (ultimately delayed) and more.
Some engineers were reluctant to acknowledge the term "spaghetti code" as it is too pejorative, but it accurately described what the new software engineering hires dealt with. The difficulty with software engineering, I’m told, is that everyone comes to a problem with their own mental framework and there are so many decisions along the way that the number of possible solutions can be quite large. And there were so many new engineers brought on board during late 1998 and early 1999 that a myriad of perspectives were brought to the table.
This diversity of ideas led to a couple of issues.
First, some engineers looked at someone else's code and said “that's crap” and wanted to rewrite it. This happened, for example, with the coding of the DS3 option of the Cerent 454 [2]. Code may be poorly designed or it may be that the second programmer does not want to reshape his thinking to conform to the structure started by the first programmer.
Second, if a subsequent programmer is not mentally flexible enough to adapt to the framework put in place by the first programmer, then multiple styles or design philosophies start to be reflected in the software code as it expands.
The Cerent team was trying to build a system with multiple cards almost in parallel—the controller, the cross connect, the optical cards, the electrical cards and then soon after the data cards (Ethernet and the planned ATM modules).
Chip recalls, “If we had the time, we could have been more disciplined by having a common card-layer platform and team to write the code, which could have provided all card-layer services like initialization, fault detection, performance monitoring, and switchover. On top of that we could have built common optical and DSn layers and to some extent we did, but not to the level of abstraction we wanted for a system of this complexity.”
Cerent initially had multiple software engineers dealing with a common card platform but each engineer duplicated the code to deal with the peculiarities of each card. Sometimes changes were made to accommodate the hardware engineers making slightly divergent decisions on how a card would operate or trying to include cost-reductions.
“Even at the optical OC-n and electrical DSn layers,” Chip recalls, “sometimes we had two or more engineers working in parallel with very little time to cooperate on ensuring each one solved the same problem in the same way on their respective card. Similarly, if something got added to the software of one card, someone had to take the time later to add it to the other cards and make sure it worked with the other cards and address any code peculiarities.” Good in theory, but . . .
As it turns out, rather than having an inverted pyramid with a common card framework at bottom and card-specific behavior on top, the initial Cerent 454 architecture effectively ended up with silos of independent software code. When there was some semblance of common code, it tended to be rife with case statements and entrenched if-then-else statements to deal with the peculiarities of each card or each programmer's method of solving the same problem. This meant that a number of software engineers had to fix the same bugs multiple times and often deal with the challenge of applying a fix on one card that affected the behavior of another card.
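To make the contrast concrete, here is a minimal sketch in Python of the inverted-pyramid idea Chip describes: a common card layer that owns initialization, fault detection, and switchover, with only the card-specific details on top. It is purely illustrative and is not Cerent's code; every class name, method name, slot number, and alarm list below is a hypothetical stand-in.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Card(ABC):
    """Hypothetical common card layer: services shared by every plug-in card."""

    def __init__(self, slot: int) -> None:
        self.slot = slot
        self.active = False
        self.raw_alarms: Dict[str, bool] = {}
        self.faults: List[str] = []

    def initialize(self) -> None:
        # Shared start-up sequence; a fix made here reaches every card type.
        self.raw_alarms = {name: False for name in self.alarm_names()}
        self.faults = []
        self.active = True

    def detect_faults(self) -> List[str]:
        # Shared fault bookkeeping; only the list of alarms is card-specific.
        self.faults = [name for name, raised in self.raw_alarms.items() if raised]
        return self.faults

    def switch_over(self, standby: "Card") -> None:
        # Shared protection switch: hand traffic to the standby card.
        self.active = False
        standby.initialize()

    @abstractmethod
    def alarm_names(self) -> List[str]:
        """Card-specific part: which alarms this card type monitors."""


class DS3Card(Card):
    def alarm_names(self) -> List[str]:
        return ["LOS", "AIS"]  # illustrative alarm list, not a real card spec


class OC12Card(Card):
    def alarm_names(self) -> List[str]:
        return ["LOS", "LOF"]  # illustrative alarm list, not a real card spec


# Usage: simulate a loss-of-signal alarm on a working card and switch over.
working, standby = OC12Card(slot=5), OC12Card(slot=6)
working.initialize()
working.raw_alarms["LOS"] = True
if working.detect_faults():
    working.switch_over(standby)
```

In a structure like this, a bug in the switchover logic would be fixed once in the base class. In the silo arrangement described above, the same logic would be copied into each card's code, or reached through per-card case statements, and the fix would have to be repeated and retested card by card.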
Oh-oh . . .
A failed ASIC can prevent a company from reaching the marketplace. Failing software can remove a company from the marketplace.
Cerent avoided this first pitfall by having Hui Liu’s ASIC team design the first ASICs without any respins during 1998.
However, the company’s very existence was in peril owing to the quality of the software initially released to support the SONET multiplexer functionality of the Cerent 454. Most of 1999 would be needed to completely rewrite the software [3].
Gary Baldwin, Anointed Engineering Savior of Cerent
“Gary Baldwin was the person that made the Cerent 454 a success,” Ajaib Bhadare recalls.
“The early software architecture of the ‘454’ wouldn’t have survived in the marketplace. Gary re-architected the software for the future. This company would not be Cerent without Gary. It took a lot of convincing to pull someone out of the east coast and come to wine country. When I recruited Gary, I told him that he could have my job, ‘I want you here,’ and I offered him my position. That’s what it took to bring in these outstanding contributors.”
Gary gives much of the credit to his software team for making the ‘454’ such a success. Looking back on the development of Release 2.0, Gary acknowledges his debt to software developers such as Mike Lilie and Mark Lambert. Like Ajaib before him, Gary recruited Mike and Mark with a plea, “I need you here.” After the initial firestorm of fixing an overwhelming number of bugs, Mike took responsibility for the design evolution of the Cerent 454 software and was charged with bringing in the CORBA layer [4] of the platform, while Mark took on more of the backend software support to ensure product robustness.
Humphrey Chin observed, “Gary brought on Mike Lilie and Mark Lambert who had done this type of work at other companies. That was the major change, building that CORBA interface for Release 2.” He added, “Chris Eich was involved with that too because he was on the receiving end of the newly inserted CORBA layer. I was only minimally involved because it was my data structure that was feeding his. That was the extent of it. [The added CORBA layer] didn’t affect the embedded side at all.”
“We scrapped a lot of the software that was initially developed,” Gary recalls. When he joined Cerent in August 1998, Gary argued the first iteration of the Cerent 454 software was good for a small product, but it was not scalable, so in order for it to be scalable, “we’re going to have to redo it.”
Indeed, Release 1.0 was shipped to customers with a lot of the components that Chip and the early software designers developed. Even so, more than 100 customers took advantage of the product’s initial capabilities in just a few months’ time.
For Release 2.0 the Cerent 454 was going to become a brand new platform. This re-architected software platform would take this optical transport towards becoming “carrier class.”
Mike is viewed as the software developer who single-handedly rewrote the software code for the ‘454.’ He has been characterized as the software version of Martin Roberts, Cerent’s prolific ASIC designer, and Mike’s code has been described as digital poetry.
Mike helped make the ‘454’ the best platform it could be. He coded in the upper echelon of the software development firmament. Gary characterized Mike’s work: “It was like haiku, crisp and concise.”
Whenever Mike’s fellow engineers examined his code in answer to a problem to be solved, they would invariably ask themselves, “Why didn’t I think of that?” Mike was not a braggart with an ego; he simply went about the business of coding. What he produced was modular, robust, and reusable.
Mark Lambert, affectionately called Marko, followed Mike to Petaluma after they got their SOS calls from Gary in late 1998. The urgent appeal for help brought Mark on board to sort out the numerous problems with the Release 1.1 software.
Gary recognized that Mark was the most tenacious debugger on the planet. This focus, to find faults to fix, set Mark apart from most other software coders. He had to have software code that worked reliably over and over again. Time and time again he’d hunt down any glitch in the software, whether it took a day, three days, or even two weeks. Mark’s thoroughness would lead him to fix the core problem. The use of software Band-Aids was not an option.
Heisenbugs [5] were the kind of elusive software bugs that Mark tracked down. The harder you looked for them, the tougher they were to find, but Mark had a natural gift for exposing them. Mark’s force of will allowed him to capture what appeared to be randomly appearing and disappearing bugs. He’d write test code to force a particular bit of software to fail and then bag his bug. It wasn’t sexy stuff to spend your time on, but he made the software code work better [6].
[1] The six-month schedule turned into a nine-month interval, with the envisioned Release 2.0 broken up into two releases: 2.0 and a later 2.1.
[2] Humphrey Chin, who worked on coding support for the DS1 hardware in late 1998, echoes the spaghetti-code nature of the initial software: “The guy in charge of the DS3 hardware looked at the code being written for him by a software colleague of mine. The decision was made to port my 1:N protection DS1 code over to the DS3 card to support that protection algorithm. It was easier to do that than use the spaghetti code the other software guy had written.”
[3] Carl Russo insisted any software rewrite be completed before Cisco announced its acquisition of Cerent on August 25, 1999. Williams Communications was the customer driving this release since much of the product’s functionality had been promised well before any proposed August date. Indeed, software Release 2.0 became generally available on Friday, August 20, 1999, primarily due to this mandate from Carl. The CORBA-layered software therefore had to be split into Release 2.0 and a subsequent Release 2.1 to ensure specific customer service dates were met.
[4] Chris Eich on adding CORBA to the Cerent 454 after the product’s initial release: “The key [challenge for us in migrating to Release 2.0] was to develop a well-defined interface that you could evolve in a controlled manner. The first, lowest level interface we dealt with was to the network element itself. And that was done using CORBA – Common Object Request Broker Architecture – which was all the rage back in the late 1990s and early 2000s. You made object oriented calls within Java but they were actually going across the network and were being implemented in the network element. This allowed you to have a very clean interface to the network element. And each network element would have its own slightly different flavor because it had different capabilities and different line cards that you could plug into it. And that would be abstracted or wrapped up by the element JAR – Java Archive – that would then expose its own interface to the Cerent Transport Controller (CTC) network layer. All the element layer JAR files for whatever type of equipment they were supporting would then have a uniform interface to the network layer. The network layer didn’t have a whole lot of entanglement. It was a fairly thin interface compared to the element interface down at the node. It really had to do mostly with navigating up and down the user interface so if you were at the network level and you wanted to drill down into a given element it would be a way to bring up its user interface and conversely if you were at the element layer and wanted to go up to the network user interface you could do the actual provisioning or monitor network level alarms and so on.”
[5] You can read more about Werner Heisenberg, in his own words, in The Physicist’s Conception of Nature (Das Naturbild der heutigen Physik), originally published in 1955 (Rowohlt Taschenbuch Verlag, GmbH) and translated from the German by Arnold J. Pomerans in 1970.
[6] Gary Baldwin adds, “Pretty much any ‘sinister SW bug’ has to be able to be reproduced reliably to be fixed. The nature of a heisenbug is that ‘the more debug work you do (e.g. print statements, breakpoints, etc.) the less likely the bug is to occur!’ It's perverse really, but sometimes observation does impede finding the bug. Part of Marko's genius was finding a way to reproduce the bugs even in these heisenbug situations and, of course, to fix them!”
[7] A key component of the Cerent 454 software was the adjunct management software, critical to making individual ‘454’ network elements work as a uniquely configured system in a customer’s existing network. This task fell to Gary’s “GUI” software team led by Dave Smith.