
From Punched Cards to SmartNICs: A Personal Journey

Preface

This series was originally a DriveScale blog series, and is recovered here thanks to the Internet Archive’s Wayback machine.

Intro

One of the DriveScale predictions for 2020 is that 2020 is the year of the SmartNIC! But, what is a SmartNIC? In this blog series – covering 7 decades – you’ll learn more than you ever wanted to know about Front-End Processors, er, I mean SmartNICs.  And you’ll see how my personal history with computers is deeply intertwined with the SmartNIC concept.  I’m a huge fan of computing history, a collector, and a pack rat, so there’ll be some interesting illustrations…

Part 1: The 1950s

Computers are older than I am, but only by about 5 years. Since the dawn of computing, computers have been helped out – offloaded – by less expensive devices.  The first computers in the early 50s, like the IBM 701, had so little memory that they were actually not very good at “data processing”, i.e., I/O intensive but compute-light operations.

But for decades before computers, there was sophisticated data processing happening with punched card machines – tabulating, sorting, searching, even some simple arithmetic. So many of the earliest “data centers” would use punched card equipment to do the data-intensive things, and reserve the computer for truly CPU-intensive things. It wasn’t particularly easy to hook up computers to all these punched card machines, and for a long time, a hand-held box of cards held more data (160K!) than most computers had memory.

My high school district (in the ’70s) had a “data processing” training room filled with punched card equipment, including one actual computer, an IBM 1130.  But most of the rest of the equipment was from the 1940s or ’50s and still mostly mechanical.

So computers were really fast at computing, but couldn’t handle much data.  Sound familiar? Today GPUs are way faster than CPUs, but have far less memory.  Is the GPU an accelerator, or is the CPU a front-end processor? All depends on your point of view.

By the late ‘50s, it was common to have two or three computers together – one fast one for computing, and the other(s) for light jobs and I/O versatility. How did they communicate with each other?  Reel-to-reel tape and Sneakernet! (They did have sneakers in the ‘50s, but probably not at work) Each tape was a “batch” of “jobs” for the computer, incoming or outgoing.

Part 2: The 1960s

By the early ‘60s, sane I/O architecture was emerging and the first real coupled products, like the IBM 7090/7040 “Direct Coupled System” came out. And then, in 1964, IBM blew the world away with the 360 series.

The IBM 360 defined a “channel” based I/O architecture, which mostly still survives on today’s mainframes, and influenced a huge number of future technologies like SCSI and Fibre Channel. And to connect a couple of 360s, they provided the “channel-to-channel” adapter (CTCA), whose hardware was remarkably easy to program.

I should clarify – writing a driver for the CTCA was easy; writing the channel “programs” themselves – not so much. Many people speak of “channel processors”, but they were not really fully programmable. More like glorified DMA engines. Compared to the CDC 6600, with its integrated I/O processors, the 360 channel was quite primitive. So the 360 had DMA controllers while the CDC had Smart “NICs”.
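For the curious, here’s roughly what driving a channel looked like. This is a simplified C model of a chain of 8-byte Channel Command Words (CCWs) – the card buffer and the toy main() are mine, so treat it as a sketch of the format, not real driver code:

```c
/* Simplified model of a System/360 channel "program": a chain of
 * 8-byte Channel Command Words that the channel (really a DMA
 * engine) walks through on its own.  Layout follows the classic
 * format-0 CCW; the buffer and main() are purely illustrative.
 */
#include <stdint.h>
#include <stdio.h>

#define CCW_CMD_READ   0x02   /* read from device into memory         */
#define CCW_FLAG_CC    0x40   /* command chaining: run the next CCW   */
#define CCW_FLAG_SLI   0x20   /* suppress incorrect-length indication */

struct ccw {                  /* one 8-byte Channel Command Word */
    uint8_t  cmd;             /* command code                    */
    uint8_t  addr[3];         /* 24-bit data address             */
    uint8_t  flags;           /* chaining / SLI / skip / PCI     */
    uint8_t  zero;            /* must be zero                    */
    uint16_t count;           /* byte count to transfer          */
};

static void ccw_set_addr(struct ccw *c, uint32_t a)
{
    c->addr[0] = (a >> 16) & 0xff;
    c->addr[1] = (a >> 8)  & 0xff;
    c->addr[2] = a & 0xff;
}

int main(void)
{
    static uint8_t card[80];          /* one 80-column card image */
    struct ccw prog[1] = {{0}};

    prog[0].cmd   = CCW_CMD_READ;     /* read one card            */
    prog[0].flags = CCW_FLAG_SLI;     /* last CCW, so no chaining */
    prog[0].count = sizeof(card);
    ccw_set_addr(&prog[0], (uint32_t)(uintptr_t)card);  /* 24-bit addresses! */

    /* A real driver would now issue Start I/O (SIO) pointing at prog[]
     * and wait for the channel-end interrupt. */
    printf("CCW: cmd=%#x count=%u\n", prog[0].cmd, (unsigned)prog[0].count);
    return 0;
}
```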

Anyhoo, so once the computers could talk, what would they say? What was the protocol stack? Well, the protocols – data communication – were much more heavily influenced by the need to communicate remotely, e.g., cross-country, than by the lucky few who had more than one computer.

But remember all that punched card equipment? Data communications and protocols existed for them before computers! And it was bug ugly under the covers. But the basic idea was to get a “deck” of punched cards from point A to point B.

When the punched card equipment was hooked to computers, it was still about getting decks of cards from point A to point B. This was Job Entry – controlled by special cards using Job Control Language (JCL). When distant, it was Remote Job Entry – RJE.

So once computers could talk directly and quickly to each other on a CTCA, what did they say? – “Have I got a job for you!”, just like remote equipment, and just like Sneakernet. Because protocols are forever.

IBM had 2 competing software products for managing RJE to mainframes – HASP and ASP. HASP was for a single mainframe talking to lots of remotes; ASP worked better for small “clusters” of mainframes sharing lots of remotes.

HASP was developed for one customer – NASA – and stood for “Houston Automatic Spooling Priority”. BTW, “spooling” was the activity of moving tapes between systems. SPOOL was an early backronym – “Simultaneous Peripheral Operations OnLine”. HASP evolved into JES2, which is still commonly used on mainframes.

ASP was developed as a full-on ambitious IBM software product, and therefore did not work as well as HASP (sigh). ASP stood for “Attached Support Processor” but was later changed to mean “Asymmetric Multi-Processing”. ASP evolved to JES3, which after about 40 years is finally being deprecated in favor of JES2.

Part 3: The 1970s

My Princeton days centered around an IBM 360/91 running ASP/JES3 with its little brother the 370/158, while bunches of 360/20 systems were remotely feeding in jobs from around the campuses and even as far as Colorado.  (360 Model 20s were not “real” 360s – not good for much except handling card readers and line printers)

The way all the remote devices, whether RJE or interactive terminals, plugged into the mainframes was via a 2703 terminal multiplexer. Think of it as a bunch of channel-attached serial ports implemented in stone age technology. More later.

When Princeton got the VM/370 operating system on the 370/158, there were then lots of virtual machines which had to communicate with each other and the outside world with – what else – virtual decks of cards. VM came with RSCS, the Remote Spooling Communications Subsystem, which was a lot more pleasant to use than ASP and JCL, but still had to interoperate. 

So that was the mid ‘70s.  Starting in 1969, a little thing called the Arpanet started to happen.  The Arpanet was used to connect the big computers among researchers into one network, but the way it was done was by putting all of the new networking stuff onto IMPs – Interface Message Processors.  These were programmed by BBN, but the hardware was the DDP-516 minicomputer from Honeywell. So the IMP was another kind of SmartNIC or front-end processor. 

Lots of history out there about the Arpanet and IMPs. But another early predecessor of the Internet was BITNET – and it all started on IBM machines running VM and RSCS.  So all that punched card communication did eventually evolve into something more familiar.

By the late ‘70s I was working on UNIX under VM/370 at Amdahl (IBM 370 clone company) and writing drivers and protocols to make UNIX fit in to the IBM world.  Yes, there was a virtual card reader driver. Virtual CTCAs, check. RSCS job submission, etc. But where it got really weird was with the 3705 – a “Front End Processor”. 

The IBM 3705 was a programmable box – truly a Smart NIC – which could emulate and replace the old 2703 terminal mux, but was also the centerpiece of SNA – IBM’s Systems Network Architecture.  They were everywhere in the IBM world by 1980 or so, and SNA was creeping in. So I figured I’d look into implementing SNA on UNIX. But the complexity of SNA utterly broke me, and ended my belief in the future of IBM mainframes!

Part 4: The 1980s

Luckily, I escaped to Sun Microsystems in 1982, where my life was all about device drivers…

And networking. And drivers. And more drivers. And protocols. Like NFS.

Early Sun systems were based on Intel’s Multibus (I) for I/O.  There were two approaches to Multibus cards: all-in-hardware, which could be complex, or “intelligent”, using some 8-bit processor to do a lot of the work.  What I learned is that it is hard for an all-hardware device to be slow, whereas it’s really easy for an intelligent device to be slow. But the market for intelligent boards tended to be larger, because an intelligent host operating system was rare!

We had to look at various intelligent networking cards which were all either too slow or too complicated to integrate with the UNIX OS.  More subtly, the cards which weren’t too slow today would be too slow when the next processor generation came out. So I developed great dislike for SmartNICs, or FEPs, because they were always too slow.

I have a paper describing some of this from the USENIX Summer 1985 conference: “All the Chips That Fit”, co-authored with my dear friend Joseph Skudlarek.

Intel came along with Multibus II – the next generation – which changed everything but pretty much required all cards to be intelligent!  Fortunately, Sun had escaped to the VME bus by then. Later, Sun’s SBus cards were also mostly non-intelligent, and were deliberately tiny to avoid creeping complexity!

Did I mention protocols?  I was the lead engineer for the SunLink family of products – something like 17 different products – that implemented all of the important non-TCP/IP/Ethernet protocols: IBM 3270 terminal emulation, Bi-Sync RJE, X.25, Token Ring, DECnet, SNA, OSI, …  Keep in mind that the dominance of TCP/IP didn’t really happen until 1995 – with the release of Windows 95. The project I had the most fun with was the SunLink Channel Adapter – connecting to an IBM channel!

Part 5: The 1990s


In 1994, I founded Ipsilon Networks, where we invented IP Switching. At that time, all IP routing still relied on software based forwarding. But the Telco world had invented Asynchronous Transfer Mode (ATM) – and was building incredibly fast all-hardware switching systems that implemented it. My idea at Ipsilon was to keep the semantics and routing protocols of IP networks, but to accelerate the forwarding by using ATM switches. So there’s that accelerator / host concept again. We built a full IP routing software stack to go with partners’ ATM switch hardware, and created some open protocols. One of them, General Switch Management Protocol, helped to inspire the Software Defined Networking industry, started at Stanford by folks familiar with Ipsilon.
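To make that concrete, here’s a rough sketch – in C, with made-up names and thresholds, definitely not Ipsilon’s actual code – of the core IP Switching idea: every flow starts on the slow software forwarding path, and once a flow looks long-lived, the router asks the ATM switch (via something like GSMP) to cut it through in hardware:

```c
/* Conceptual sketch of IP Switching (hypothetical names and
 * thresholds): short-lived flows stay on the software forwarding
 * path; long-lived flows get "cut through" onto an ATM virtual
 * circuit so the switch hardware forwards them at wire speed.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
    uint32_t pkts_seen;              /* packets forwarded in software    */
    bool     cut_through;            /* true once mapped onto an ATM VC  */
    uint16_t vci;                    /* virtual circuit assigned by GSMP */
};

#define FLOW_CUT_THRESHOLD 10        /* made-up heuristic */

/* Stubs standing in for the software router and for the GSMP client
 * that programs the ATM switch; real code would do a route lookup
 * and a switch-management exchange here. */
static void software_forward(struct flow *f, const void *pkt, int len)
{
    (void)f; (void)pkt; (void)len;
}

static uint16_t gsmp_add_branch(struct flow *f)
{
    (void)f;
    return 42;                       /* pretend the switch gave us VCI 42 */
}

/* Core idea: forward in software until a flow looks long-lived, then
 * hand the rest of it to the ATM switch hardware. */
static void handle_packet(struct flow *f, const void *pkt, int len)
{
    if (f->cut_through)
        return;                      /* hardware already handles this flow */

    software_forward(f, pkt, len);

    if (++f->pkts_seen >= FLOW_CUT_THRESHOLD) {
        f->vci = gsmp_add_branch(f);
        f->cut_through = true;
        printf("flow cut through on VCI %u\n", f->vci);
    }
}

int main(void)
{
    struct flow f = {0};
    char pkt[64] = {0};
    for (int i = 0; i < 20; i++)     /* simulate 20 packets of one flow */
        handle_packet(&f, pkt, sizeof(pkt));
    return 0;
}
```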

Unfortunately, ATM proved to be the kind of ATM where you can only put money “in”. Nokia acquired Ipsilon and used the IP stack in 3G wireless networks. And the Ethernet world figured out how to build L2/L3 switches.

In the 1998-2000 time frame, I was on the board of directors of Terraspring, a very early cloud computing company founded by friends from Sun. This was during the “Dot Com” bubble craziness, when Marc Andreessen’s LoudCloud was also operating – they were the competition. Terraspring exactly met the definition of Infrastructure-as-a-Service – but before virtual machines! The architecture depended on using front-end processors – one for networking, and one for storage (SCSI based) – to “virtualize” all of the I/O that the servers saw. Those front-end processors were entirely separate FreeBSD based boxes (sometimes called “pugs” boxes), but the function was mostly the same as what today’s SmartNICs can do.

Part 6: The 2000s

Fast forward to 2005/6.  I was a founder at Nuova Systems, which developed the UCS server series for Cisco. The world of I/O had evolved from SBus to PCI to PCI Express, 1Gb Ethernet was everywhere, and 10Gb Ethernet was just starting.  But servers still had a mess of storage controllers, SCSI and Fibre Channel, which were mostly “intelligent”.

When Intel came out with the Nehalem generation of processors (spurred on by AMD & Opteron), servers got a lot faster at compute, memory, and I/O all at once.  A lot of the intelligent I/O devices all of a sudden became the bottlenecks for storage, and the vendors had to scramble to get new chips with much more hardware acceleration into them.

At Nuova we also understood that controlling I/O could go a long way to “virtualizing” a server – making it think it had more, less, or different connectivity than it really did. So we created what has become the Cisco UCS VIC – virtual interface card – which can appear to be any number of PCI Express devices, each with network-controlled configuration.  This is also a kind of SmartNIC – quite smart in the control plane, but with a mostly hardware data path to avoid performance issues.

In 2009, Mike O’Dell, another gent of great experience, published a paper called “Network Front-end Processors, Yet Again”.  At that time TCP Offload Engines were all the rage. Mike pointed out that the protocols required between a host and a front-end are often just as complex as the protocols that you’re trying to offload!

Part 7: The 2010s and into 2020


The 2010s saw the explosion of big data and other scale-out systems using vast amounts of Direct-Attached Storage instead of the more manageable, but more expensive, SAN and NAS approaches. In the mid-2010s we started DriveScale, where we provide Composable Infrastructure – creating servers with any storage configuration on demand, not at purchase time. Until now we’ve done it without SmartNICs, but a new type of SmartNIC lets us do our job without any host involvement, by emulating NVMe devices.

Most people have heard of NVMe SSDs and their blazing speed. But NVMe itself is just a protocol standard that layers on top of PCI Express. It allows any SSD to work with any operating system as long as they both support NVMe. And every operating system supports NVMe now.
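To see what “just a protocol” means in practice, here’s a simplified sketch of the 64-byte submission queue entry a host fills in to read from an NVMe device. The field layout follows the NVMe spec, but the buffer and values are mine – an illustration, not a driver:

```c
/* Sketch of NVMe as "just a protocol": the host fills in a 64-byte
 * submission queue entry (SQE) in memory and rings a doorbell
 * register on the PCIe device.  Illustrative only, not a driver.
 */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define NVME_CMD_READ 0x02

struct nvme_sqe {                 /* 16 dwords = 64 bytes */
    uint8_t  opcode;              /* what to do (read, write, ...)       */
    uint8_t  flags;               /* fuse / PRP-vs-SGL selection         */
    uint16_t cid;                 /* command id, echoed in completion    */
    uint32_t nsid;                /* namespace (think: which "disk")     */
    uint64_t rsvd2;
    uint64_t mptr;                /* metadata pointer                    */
    uint64_t prp1;                /* data pointer: first memory page     */
    uint64_t prp2;                /* second page, or a PRP list          */
    uint64_t slba;                /* starting logical block address      */
    uint32_t cdw12;               /* low 16 bits: block count minus one  */
    uint32_t cdw13, cdw14, cdw15;
};

int main(void)
{
    static uint8_t buf[4096];     /* destination for one 4 KiB read */
    struct nvme_sqe sqe;

    memset(&sqe, 0, sizeof(sqe));
    sqe.opcode = NVME_CMD_READ;
    sqe.cid    = 1;
    sqe.nsid   = 1;
    sqe.prp1   = (uint64_t)(uintptr_t)buf;   /* would be a DMA address      */
    sqe.slba   = 0;                          /* start at block 0            */
    sqe.cdw12  = (4096 / 512) - 1;           /* 8 blocks, zero-based, if    */
                                             /* the namespace uses 512B LBAs */

    /* A real driver would copy this into the submission queue and
     * write the queue's doorbell register on the PCIe device. */
    printf("SQE is %zu bytes\n", sizeof(sqe));
    return 0;
}
```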

NVMe-over-Fabrics takes the NVMe protocol and puts it onto an Ethernet, allowing crazy fast remote storage. But NVMe-oF needs cooperation from the host OS. Linux has great support, but both VMware and Windows have been lagging. But the new SmartNICs can speak NVMe on the PCI Express bus, while translating to NVMe-oF for the network. So it allows a more universal solution that can support more use cases, and offload protocol overhead at the same time.
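Here’s a conceptual sketch – hypothetical names, nobody’s actual firmware – of what that kind of SmartNIC does: take each NVMe command the host posts to the emulated PCI Express device, wrap it in an NVMe-oF capsule for the network, and complete it back to the host as if a local SSD had answered:

```c
/* Conceptual sketch of NVMe emulation on a SmartNIC (hypothetical
 * names): wrap each command from the host's emulated NVMe device
 * into an NVMe-oF capsule, ship it to the remote target, and post
 * the completion back to the host.
 */
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdio.h>

struct nvme_cmd { uint8_t raw[64]; };    /* submission entry from the host */
struct nvme_cpl { uint8_t raw[16]; };    /* completion entry for the host  */

struct nvmf_capsule {                    /* what actually goes on the wire */
    struct nvme_cmd cmd;                 /* the NVMe command, verbatim     */
    void   *data;                        /* payload staged on the NIC      */
    size_t  len;
};

/* Stubs standing in for the NIC's fabric transport and its PCIe
 * emulation logic; real firmware would do RDMA or TCP here. */
static void fabric_send(const struct nvmf_capsule *cap, struct nvme_cpl *cpl)
{
    (void)cap;
    memset(cpl, 0, sizeof(*cpl));        /* pretend the target said "success" */
}

static void host_post_cq(const struct nvme_cpl *cpl)
{
    (void)cpl;
    printf("completion posted to host\n");
}

/* Handle one command the host posted to the emulated NVMe device:
 * the host never knows the "SSD" is on the other end of a network. */
static void smartnic_handle_cmd(const struct nvme_cmd *cmd, void *data, size_t len)
{
    struct nvmf_capsule cap = { .cmd = *cmd, .data = data, .len = len };
    struct nvme_cpl cpl;

    fabric_send(&cap, &cpl);             /* NVMe-oF to the remote target   */
    host_post_cq(&cpl);                  /* looks like a local completion  */
}

int main(void)
{
    struct nvme_cmd cmd = {{0}};
    uint8_t buf[4096];
    smartnic_handle_cmd(&cmd, buf, sizeof(buf));
    return 0;
}
```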

Have I mentioned the cloud? Hyperscaler spending is what is driving all hardware architecture now. SmartNICs have been in use in the cloud for about 5 years, but each hyperscale operator tends to have their own idea of what the SmartNICs should do. Microsoft’s Azure uses FPGA-based SmartNICs to offload network and accelerate compute functions. Most SmartNICs today are only doing network offloads (those precious x86 cycles can be sold for real money).

Amazon AWS bought the chip firm Annapurna Labs about 5 years ago and has been really pushing the meaning of SmartNIC. The AWS Nitro chips now provide both network and storage connectivity to x86 machines. The storage uses the NVMe emulation approach. But a really compelling feature that many overlook is that the Nitro chips are providing the security services for the system as well. Only the Nitro chips, not the x86, can write any firmware state. The Nitro chips provide wire speed encryption for both network and storage, and are also involved with all secure boot operations.

In the age of CPU vulnerabilities like Spectre and Meltdown, it becomes troubling to rely on any CPU based separation of user code and operator code. Letting the user have the whole x86 (bare metal or VMs) and keeping the operations code on a SmartNIC is a much better way to reduce the attack worries for the operator.

But wait! What about my old dislike for slow network controllers? Well, it turns out that silicon CPUs have hit a frequency “wall” and cores are just not getting faster anymore. And the cores inside of a SmartNIC (typically ARM64) can be just as fast as the cores inside of an x86 processor, though with less access to big caches and memory. So offloads and front-end processing are much more viable, and true acceleration can come with more exotic hardware embedded into the SmartNICs. In my early testing of SmartNICs, I was using cheap x86 servers which were substantially slower than the SmartNIC!

DriveScale’s storage provisioning is working on SmartNICs today, scaling to thousands of servers, and we plan to do a lot more to help bare metal operators, offering public or private services. So check us out and stay in touch!

Part 8: 2021 and Beyond DriveScale

Unfortunately, DriveScale failed, and the remnants were sold to Twitter. The future is in progress!
