My System Administration Philosophy

When I talk about system administration here, I’m generally following SAGE’s definition of a sysadmin as “One who, as a primary responsibility, performs system administration duties on behalf of another such as an employer or client.” Of course, if you’re running your own machine at home — regardless of what operating system it is — then you are the sysadmin for that machine, like it or not. (This is part of why home users get into such trouble with their unsecured broadband connections: not only do they not understand system administration at all, they don’t even realize that they need to, or that they are acting as sysadmins!) But that’s a different essay. Here, I’m not talking about system administration as a hobby or enjoyable home pastime, but rather as a profession, job, or career. (Though hopefully still an enjoyable one — if it’s become unappealing to you, then it’s time to find a new career!)

It’s all about serving human beings. The machines don’t exist for their own benefit, but for ours.

My system administration philosophy has been formed partly by early exposure to Æleen Frisch’s Armadillo Book, and mostly by ongoing interactions with x86 systems and their users for the past seven years, in diverse environments.

Serving the Users is Job One

Always remember that the entire reason your systems exist — the reason computers exist at all — is to serve humans. To serve users. The machines you’re maintaining exist because your employer needs them to do things for the company’s benefit. Your job is not just to make sure that the machines run smoothly, but that they do things that benefit other people.

This may sound like I’m advocating a very subservient role towards users — something that makes many sysadmins rather twitchy, as they’d generally rather be the masters of their silicon domain. But subservience would only result from a fairly myopic application of this principle. Just because one user comes to you and says, “I need the machine to do foo, and I need it right now!” doesn't mean you necessarily have to jump to it. That user is not the only user of the system, and may not even be the primary person to whom you’re responsible.

Most likely, your primary responsibility is to the company as a whole, and the company’s stated goals and policies. Naturally, if the user’s request is one that will mess things up for the rest of the company, or that violates company security rules or conduct regulations, you turn it down. But you should never get to the point where you see the users’ desires as being in opposition to your own. Taking on an adversarial relationship with users is one of the quickest routes to burnout that I can think of.

Instead of seeing them as enemies, make friends with your users. It’s a much more natural and healthy relationship. An even better model is to see yourself as curator of a shared resource — and, to some extent, a technological mentor for the rest of the company.

Providing a Stable Platform

The most basic way that a sysadmin needs to serve users is simply by providing a stable platform for them to work on. Whether this is a development server, a mail, Web, DNS or database server, an entire LAN or WAN, or something completely different, the basic feature of solid, dependable stability is one that’s crucial. Everything else the users do rests on the foundation we provide; it has to be rock-steady for them to get things done.

This doesn’t mean that we can never fix, change, or upgrade. Heck, people expect that kind of thing to happen. But at the very least, warn the users before you shift the ground underneath their feet! And remember that stable, reliable systems that always work the same way, and have uptimes measured in months or years, are better than systems with all the latest doo-dads and buzzwords if they have unscheduled downtime every week.

Naturally, it’s better to schedule the downtime for the middle of the night, when people aren’t actually working with the systems and won’t be (as) affected. But they will still be affected somewhat, when they come in the next morning to a system that behaves a little differently. Let them know ahead of time! If you’re upgrading Perl to 5.8.0, the dev team needs to know; if you’re tweaking the Samba config on the main file server, the people who use it should be made aware that things will now behave in New Manner X.

Avoiding Techspeak Overload

When you warn your users that there will be downtime — or explain to them why there was downtime — you can easily run up against a nasty double-bind problem. Some users just want to know, in fairly vague terms, that something went wrong and that it won’t happen again (or that something will happen, and how it will affect them). Others want the details — which is actually good, because it means that they want to feel some involvement and connection with what happens with the tech systems!

Don’t try to second-guess how much information to provide. Provide it all; just put the simple stuff first. Build up in complexity, and people will choose for themselves when to stop reading.

The first group, the users who only want the vague overview, will simply zone out if you tell them that /dev/hdc1 was corrupted and that fsck took a while to complete. Even worse, you can’t just assume that one group (for example, Sales and Marketing) just wants the quick overview while another (R&D, Biz Dev) wants all the techy details. Everyone’s an individual, and just because someone’s in a very non-tech division at work doesn’t mean they aren’t running FreeBSD at home. To complicate matters further, some people shift back and forth from one group to the other, depending on how busy they are, how much coffee they’ve had recently, and other factors that you know nothing about.

Of course, it’s possible that entire divisions won’t be affected, and don’t even need to know. If you’re upgrading the Accounting department’s print server, HR and Engineering simply don’t care at all. But you still might need to Cc: the VP of Finance (who wants to know about things that affect her department, but doesn’t need every little detail) and your own supervisor (who’s keeping a detailed paper trail of things the tech team does, to justify a bigger budget allocation for the IT department next quarter).

My usual solution to this problem is to give all the detail anyone could need — but put it at the bottom of the message. The top consists of a broad overview in standard-user-speak; no techspeak at all unless absolutely necessary. The email starts off as high-level abstraction, then moves into lower levels and greater detail as it progresses. Here are a couple of examples:

From: sysadmin@random.com
To: affected-users@random.com
Cc: vp-finance@random.com, it-director@random.com
Subject: Upcoming System Maintenance

Greetings, folks. I want to warn you all that the main file server will be going down for a hardware upgrade, starting at 3pm on Monday. I'll email and messenger everyone five minutes before shutting down the server, so you'll have time to log out and save your files. If you want to keep working on anything on the file server, please save a local copy to your own hard drives for the duration, and remember to put it back on the file server on Tuesday morning, when the server will be back up.

If you want more detail, read on; otherwise, you can stop reading here.

Over the past few weeks, we've noticed increasing congestion on the file server. This has taken a variety of forms: Network access has been slow when more than 5 people try to access the server simultaneously; CPU and RAM consumption show high loads when moving large directories around. Additionally, the drive is nearing its capacity.

Accordingly, I'm going to upgrade practically everything. The CPU will be upgraded from a PIII-733MHz to an Athlon 1.6 GHz; the RAM will be doubled (from 256 to 512 MB); the network card will be changed from 10 megabit to 100 megabit, and I'll swap the 80 GB drives for 120 GB versions. Naturally, all the data on the old drives will be copied to the new ones; you don't need to back up your files anywhere else. We expect the box to be back online by EOD Monday PST, but don't count on it before Tuesday AM.

Please don't hesitate to ask me if you need any more information.

This lets the “don’t bother me with techspeak” users know what they need to know: “file server down, happens then, I’ll be warned, I don’t need to worry, back on Tuesday.” They can get these necessary details by scanning one paragraph, and then go on with their day. At the same time, the more tech-friendly users get all that information, and can then read on and find out things like: Why is this happening? What, exactly, is going to happen? What can I expect to see afterwards?

(Of course, I’d normally start such an operation around end-of-business, rather than the middle of the afternoon. This is just an example.)

This is also handy for the chronically busy types in management: they can simply skim the “executive summary” at the top, and the presence of the extra details reassures them that you’re on top of things. Executives like it when their staff are on top of things; it means one less thing for them to worry about.

For the following “after an accident” situation, note how the same principles apply:

From: sysadmin@random.com
To: affected-users@random.com
Subject: Apology for Recent Server Down-Time

Good afternoon. I'd like to apologize for this morning's server crash. I hope nobody's work was too badly affected by this accident. The server's main hard disk stopped working, taking the server down until the disk drive could be replaced.

Please note that the drive that went bad *did not* have any of your files on it. That drive contained the server's operating system (the "Unix stuff"); all of the user files (aka "your stuff") are on a different disk, which is still fine. (You may also take comfort from the knowledge that your files are backed up every night.)

If you want more detail, read on; otherwise, you can stop reading here.

The file server has four physical hard drives, even though it only looks like one when you browse it on the network. The first drive is a 20 GB disk, which holds the Unix operating system (or "OS") that allows the computer to do things like read and write files, participate on the network, and so on. The other three drives are each 80 GB in size, and they are combined using RAID technology to make one 160 GB partition (the one you see on the network). (The other 80 gigabytes get used for parity and error checking, which make it easy to recover information in case of a drive crash. For more information on RAID, see http://www.webopedia.com/TERM/R/RAID.html .)

The three 80 GB drives are all fine. However, the 20 GB drive went dead (with six months still left in its warranty!), meaning that none of the other drives could be accessed. Nor could the server be accessed via the network. The unfortunate down-time caused by this was the time required to pull a new drive from storage, install Unix on it, and configure it to participate in our network as a file server.

This problem should not occur again, as hard drives are not prone to failure so early in their lives. (For comparison, the drive failing this early would be sort of like a car's engine failing completely after only about 70,000 miles.) Please don't hesitate to ask me if you need any more information.

Note what things go first: an apology, along with hopes that the problem didn’t bother people too much. Then a brief explanation... which is mostly focused on soothing people’s fears that their files may have been munged. There is just enough information there to let people know that no, it wasn’t your fault, but the main focus is on telling them what they want to know, rather than what you want them to know.

In the extended technical information, some techspeak does get used... but that’s okay, because this is the section that nobody’s under any compulsion to read. If they don’t want techspeak, they should have stopped earlier. But it’s still not completely impenetrable, and also contains a link to more “educational” information for those who are having trouble following along. Again, there’s some emphasis on letting people know that their files are okay, and it assumes that during the downtime, you heard people around the office going “Gee, I hope my files are still there.” Naturally, if you know that the big question in your office is something different, you should target that instead.
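The capacity figures quoted in the second message (three 80 GB drives, 160 GB visible, 80 GB given over to parity) can be checked with a few lines of arithmetic. This is only a sketch, and it assumes a RAID 5 style layout, which the message itself never names; the function `raid5_capacity` is a hypothetical helper invented for the illustration.

```python
def raid5_capacity(drive_count: int, drive_gb: int) -> tuple[int, int]:
    """Return (usable_gb, parity_gb) for equal-sized drives.

    Assumes a RAID 5 style array: one drive's worth of space in total
    goes to parity, and the rest is available for data.
    """
    if drive_count < 3:
        raise ValueError("a RAID 5 style array needs at least three drives")
    parity_gb = drive_gb
    usable_gb = (drive_count - 1) * drive_gb
    return usable_gb, parity_gb


# The file server from the example message: three 80 GB drives.
print(raid5_capacity(3, 80))  # (160, 80)
```

The result matches the message: 160 GB for the users, and the "other 80 gigabytes" for parity.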

But overall, in both of these communiqués, there’s a general structure of moving from elementary information that is of interest even to non-technical users, on up the scale toward more “meaty” technical detail. This subtly encourages the tech-phobic users, over time, to become more interested and less hostile. Demystifying technology and getting people to be more friendly to it has been on my agenda for some time now.
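If you send a lot of these notices, the overview-first structure is easy to turn into a template. The following is a minimal sketch, assuming nothing more than plain-text email; the `Notice` class and its field names are hypothetical, made up for this illustration, and the divider and closing lines are borrowed from the two example messages.

```python
from dataclasses import dataclass, field


@dataclass
class Notice:
    """A notice ordered from plain-language summary to technical detail."""
    subject: str
    summary: str  # overview in standard-user-speak; no techspeak
    details: list[str] = field(default_factory=list)  # simplest paragraph first
    divider: str = ("If you want more detail, read on; "
                    "otherwise, you can stop reading here.")
    closing: str = ("Please don't hesitate to ask me "
                    "if you need any more information.")

    def render(self) -> str:
        """Assemble the message body: summary, divider, details, closing."""
        parts = [self.summary]
        if self.details:
            # The divider only makes sense when there is something below it.
            parts.append(self.divider)
            parts.extend(self.details)
        parts.append(self.closing)
        return f"Subject: {self.subject}\n\n" + "\n\n".join(parts)


notice = Notice(
    subject="Upcoming System Maintenance",
    summary="The main file server will be down from 3pm Monday "
            "until Tuesday morning.",
    details=[
        "Why: the server has been slow and the drive is nearly full.",
        "What: new CPU, more RAM, faster network card, bigger drives.",
    ],
)
print(notice.render())
```

The point of the template is the ordering, not the wording: whatever goes into `summary` has to stand on its own for the reader who stops at the divider.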

User Education

I’ve already mentioned treating your users as friends and allies, rather than enemies. One way of doing this is by educating them — which isn’t a one-sided effort, either, as users will often come to you with questions. This is a good thing, not an interruption; it means they’re trying to learn. This is always worth encouraging.

One of my favorite and most successful tactics has been to start off light and easy, then slowly ramp up the techspeak until the user starts to look just a little bit out of their depth. Then, back off. Re-explain whatever you’ve just said, in simpler terms. Then stop, and give the user time to absorb whatever you’ve just told them. The next time, you can build on that: “Remember what I told you about the swap file being like a big note pad?”

Finding real-world analogies for computer concepts is important, because most computer concepts are abstractions. Most people relate better to concrete things.

Users remember such things fairly well, if you’ve presented them with at least one or two vivid images. When dealing with non-technical users, it’s incredibly helpful to relate computers to some real-world item that they understand well... cars, music, a TV show, photography, whatever. If your company’s small enough, it’s worth keeping track of the hobbies and other interests of some of your less technical users, just so you can make personalized analogies with things in their own lives.

Naturally, you shouldn’t try to draw invalid analogies, just for the sake of getting in a reference to a user’s hobby. Don’t claim that “the CPU is like the quarterback in football” just because you know this user’s a sports fan; the analogy makes a very poor fit. Where does the CPU throw the ball to? What’s the equivalent of the running back or the tackle?

Far better to analogize with something else, even if it isn’t the user’s favorite activity: “The CPU’s like the conductor in a symphony orchestra” — even if this user hates symphonies, he or she must have seen conductors before, and has a good idea of what they do. (Then you can build on that analogy, comparing the sheet music to the program that’s running...) The analogy doesn’t even have to be with any kind of “hobby” activity; it can be something as generic as a traffic light or “people trying to get past each other in a hallway”, as long as it’s a real-world object or event that the user has some experience with. That real-world referent is crucial because, to most users, the things inside a computer are completely unreal.

To us, things like swap partitions, named pipes, and device drivers are so real, we easily forget that they’re actually just abstractions; they have no presence at all in the world of mass and tangibility that most people inhabit. Even physical things like DIMMs and Athlon chips are vaguely mystical, indistinct concepts to many users; they may be “things inside a computer”, but they’re not things these users ever see or touch. They’re strange, fuzzy ideas, in much the same way that racing shocks or yellow-bellied thrushes are strange and fuzzy to me: I’ve never actually seen either one, even though I know there are millions of them out there. (Auto racing and bird-watching don’t happen to be among my hobbies.)

So you tell a user: “The swap file is sort of like a scratch pad, or a memo pad near your phone for jotting stuff down on; you use it when there’s too much stuff to remember in your head, but it takes longer to read and write than it does to just remember something.” And now, they have a real-world image that they can associate with the swap file, and have a general idea of what it does. This sort of analogizing can easily be extended.

It’s also crucial to wait for the users to come to you before doing such education. You can’t just hit them with it at any random time and expect them to be anything other than annoyed. But users will occasionally start coming to you with questions — sometimes, they’ll even be prompted by the messages you send out, as in the previous section. “Say, you said something in your email about upgrading the file server, and making it an ‘Athlon’. How is an Athlon better than what we already have?”

Paranoia and Trust

In discussing security, I’ll necessarily have to delve a bit into the field of a network administrator — securing a system that’s not on a network is generally as simple as locking the door to the room it’s in. Of course, there is some overlap between sysadmins and netadmins anyway. Regardless, I’ll continue to just use the term “sysadmin”.

Sysadmins and other techies are often accused of being “overly paranoid” — generally by users who want to do things that are insecure. Of course, if anything actually gets into the network, we weren’t being paranoid enough — but this is never phrased as such; instead, it’s usually that we “weren’t doing our jobs”.

Remembering that you and the users are on the same side, not in opposition, can get you out of lots of mental double-binds.

Obviously, then, part of the job of a system (or network) administrator is to be paranoid. But paranoia isn’t an end in itself; really we’re just being paranoid on behalf of our users — and, by implication, on behalf of the systems we maintain for those users.

And remembering that is the trick to escaping the double-bind represented by this “how paranoid is too paranoid?” question. Like many other situations, it doesn’t have to be “you versus the users” when it can just as easily be — and most naturally is and should be — “you working on behalf of the users”.

Explain it to them. You don’t always have to go into complete technical detail, but you can at least tell users things like: “Just because that email claims to come from your friend Bob Smith, that doesn’t mean it’s safe to run the attachment. See, most viruses, when they infect someone’s computer, will simply mail themselves to everyone in that person’s address book. That means you’re most likely to get a virus from a friend, not a stranger.” You can even explain seemingly “difficult, technical” topics, like why a firewall doesn’t stop email viruses: “A firewall simply blocks certain channels of the Internet, just like blocking the X-rated channels from your kids. Email messages are one channel, Web pages are another, IM messages a third, and so on. So a firewall could block out the whole email channel, but it can’t just block ‘emails with viruses’; that’s like telling your cable box to only block ‘the bad comedies’ on the Comedy Channel. Now, you can get your TiVo to take a guess at what’s good and what’s bad, but first it has to be trained and learn your tastes. Anti-virus software is a lot like that; it will try to block out the virus emails, but whenever a new one comes out, it takes time for the anti-virus software to learn about it.”

Technical topics? You betcha! But you don’t have to get into service port numbers, IP addresses, and the like to communicate the basic gist of the problem to even the most unlearned user. You can make the user understand the issues without using jargon, especially if you can find a metaphor that corresponds to something they already understand, as described earlier.
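For fellow admins who want the mechanics behind the cable-channel analogy, here is a deliberately simplified sketch. It assumes a firewall that decides purely by port number (the “channel”) and never looks at content; real firewalls are more elaborate, and the `Packet` class and `firewall_allows` function are hypothetical names invented for this illustration. The port numbers are the well-known ones for mail transfer and Web traffic.

```python
from dataclasses import dataclass

# Well-known service ports, standing in for the "channels" of the analogy.
SMTP_PORT = 25   # mail transfer
HTTP_PORT = 80   # Web pages


@dataclass(frozen=True)
class Packet:
    """Hypothetical, heavily simplified unit of traffic."""
    port: int      # which channel it arrives on
    payload: str   # the content, which the filter below never reads


def firewall_allows(packet: Packet, blocked_ports: frozenset[int]) -> bool:
    """Decide by channel only; the payload plays no part in the decision."""
    return packet.port not in blocked_ports


clean = Packet(SMTP_PORT, "Lunch on Friday?")
infected = Packet(SMTP_PORT, "Run the attachment! <virus>")

# Mail channel open: both messages get through, the virus included.
open_mail = frozenset()
print(firewall_allows(clean, open_mail), firewall_allows(infected, open_mail))

# Mail channel blocked: both are stopped, the harmless one included.
no_mail = frozenset({SMTP_PORT})
print(firewall_allows(clean, no_mail), firewall_allows(infected, no_mail))
```

Either way, the clean and the infected message get the same treatment, which is exactly the “bad comedies” problem: telling them apart takes something that reads the content, and that is the anti-virus software’s job.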

When you explain to users what the logic is behind your security decisions, the user will see that leaving that port open, or running that email client that auto-executes all attachments, or whatever, is just as bad an idea as always leaving their car keys in the ignition so they won’t get lost, or turning off the whole Comedy Channel because they don’t like Eddie Izzard. The issues become real to them — it’s their computer that’s going to be open for attack, after all. (Once the user really understands, he or she can even go and evangelize the secure view to other users at your site — a double win.)

Making allies of your users may take a bit of effort in education, but it more than pays for itself in terms of the effort you save, by not having to treat them as enemies.

Allow Users Their Choice of Tools

Many large computing environments like to standardize on a particular choice of software — everyone uses Outlook for email, IE for Web browsing, and so on. And I’ve heard endless tales of annoyance and frustration from users who were forced to use software they didn’t like, just because “that’s the corporate policy.”

I recognize that the corporation (or other group) had reasons for standardizing — they can easily clone one reference system and copy that installation onto every new desktop; their IT staff only need to learn and support one application for any given task; etc. But I generally wonder if the amount of user frustration and diminished productivity is really worth it.

Just because you let a user install something doesn’t mean you have to provide complete support for it.

I like to work on a sort of middle course: I’ll let a user work with whatever tool most easily helps him or her get the job done, assuming it’s not violating company policy or security, but I might not support it. In email clients, for example, I’m a Eudora partisan (though I’m gaining an appreciation for Thunderbird as well). If I’m the one who gets to choose the company email client as well as set up the drive images, Eudora gets installed on everyone’s desktop. But if someone really can’t stand Eudora, and would rather use Outlook, or Pegasus, or Netscape Mail, or whatever else... I’m fine with that. I’ll tell the user: “You install it, you maintain it. If something in it breaks, you can ask me to look at it... but I may shrug and say ‘Heck, I dunno.’ That’s the risk you take.”

Of course, if the policy against a particular piece of software is for security reasons, that’s a very different kettle of fish. Then, I’ll explain the security problems to the user as above.

This lets the user know that he or she is getting a special bonus or perk; it’s not something to rely on. They cannot just assume that they’ll be able to get tech support on their chosen, non-standard software. But it also gives them the freedom to do things their way; it lets them know that they don’t just have to be a faceless cog in the corporate machine (thus raising morale), and it makes you look like a friendly and helpful human being instead of an obstacle to the person’s goals. Plus, of course, it means the user will get to work with their preferred software, thus increasing their productivity.

The Importance of Planning

Almost every sysadmin agrees that planning some things out ahead of time is a good idea. Whether you’re setting up network architecture and hostnames, a security policy, or a filesystem partition structure, it’s useful to have some idea of where you’re going with it. Growing things in an ad-hoc manner generally leads to troubles later on.

Unfortunately, so much of system administration seems to devolve into continual firefighting that it can often seem impossible to ever find the time to do any planning in. Especially when you’ve just inherited a site whose previous admin wasn’t on the ball, it becomes too easy to look at the mail server that’s getting overwhelmed by spam, and the five machines that need motherboard upgrades, and the other two that have strange, unknown issues, and the three dozen user requests marked “urgent” stacked up in your in-box, and so on... and fall into the trap of seeing “planning” as an impossible, Quixotic dream.

It doesn’t take very long to develop a plan for where you’re going and in what order. And that small time will pay for itself quickly.

And honestly, in a situation like that, planning really is out of reach. At that point, you just need to use triage, and start getting the problems dealt with in quick order. But one of your triage categories should be “not as important as developing a plan”. And once you’ve dealt with all the things that legitimately are more crucial, then it’s time to grab at least a spare half-hour and quickly detail what order other things should happen in.
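One way to picture that triage category is to give planning a place in the queue like any other job. The sketch below is only an illustration under invented assumptions: the urgency scores, the threshold of 5, and the `triage` function are all hypothetical, and in practice the ranking usually lives in your head or on a whiteboard.

```python
PLANNING = "spend half an hour writing the plan"


def triage(tasks: dict[str, int], planning_urgency: int = 5) -> list[str]:
    """Order task names from most to least urgent, planning included.

    Anything scored above planning_urgency is a real fire and comes first;
    everything else falls under "not as important as developing a plan".
    """
    queue = dict(tasks)
    queue[PLANNING] = planning_urgency
    # sorted() is stable, so on a tie planning (added last) comes after
    # the real tasks with the same score.
    return sorted(queue, key=lambda name: -queue[name])


# Hypothetical scores for the kind of backlog described above.
backlog = {
    "mail server overwhelmed by spam": 9,
    "motherboard upgrades": 4,
    "dead drive on the file server": 10,
    "toner cartridge request": 1,
}
for position, name in enumerate(triage(backlog), start=1):
    print(position, name)
```

With these scores, the dead drive and the mail server come first, planning comes third, and the upgrades and the toner wait until the plan says where they belong.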

Once you’ve done that, you’ll find time to do it again, soon. Because imposing structure and order on the chaos of your situation will help ease the pressure. And the more you impose order — the more you actually administer your systems, instead of just reacting to problems — the more you’ll bring things under control.

Taking a plan to your supervisor or management shows, for one thing, that you’ve got your eye on the future, but it’s crucial to explain to folks why you want to make changes. “What’s wrong with your current system?” they will ask. “Why’d we hire this admin if they can’t work with things the way they are?” If you can say, “Given six hours of downtime and 300 bucks for some new cables and a UPS, I can reorganize the servers so that things will run 15 percent faster and the Web and mail servers won’t fall over every evening when the cleaning crew plugs in their vacuum cleaners,” that’s a visible, understandable improvement, and a good idea.

Eventually, all this planning gets you to the point where you have a good grasp of everything that’s going on. And when a new fire breaks out, you’ll have a plan ready to deal with it. It will no longer be a crisis, but a known contingency, that can be fixed in short order.

It’s easy to give up on planning in the face of chaos. But it’s worth it to keep imposing order, to keep looking ahead.

User Requests are Your Job

As mentioned above, the entire raison d’être of computers is to serve humans, and fulfill user requests. As “the person who makes the computers go,” your job is to make sure that those user requests get fulfilled by the computer in a timely and satisfying fashion. Bottom line: when a user comes up to you and asks, “can we make Thing X happen?”, it is not an interruption. It’s what you are there for.

Of course, this doesn’t mean that any and all user requests, no matter how trivial, take automatic precedence over anything else. We’ve all had occasions when a user asks us to do something like change the toner cartridge in the laser printer while we’re trying to deal with a DDoS attack or a drive crash — and when that happens, of course you say, “Sorry, I’ve got more important things going on right now; it will have to wait.”

But I’ve also seen many sysadmins get so accustomed to dealing with “deniable” user requests that they start to look upon all user requests as irrelevant... as trivial, not worth their time, or as “presumed stupid until proven otherwise.” And what I’m advocating here is a recognition and remembrance of the fact that user requests are really the primary focus of keeping a system running. At the very least, we can default to assuming that a user request will be reasonable until proven otherwise — it makes a big difference in our attitudes, and the users notice this.

I saw a fellow admin at a previous employer get subtly shifted out the door because of his attitude toward user requests. His technical skills were fine. And he was not a glasses-wearing nerd with poor social skills; he was generally a friendly, outgoing sort of guy. But after a few months on the job, he seemed to have decided that all users were stupid, and that making the users happy was less important than doing arcane back-end optimizations. Users started complaining (amongst themselves, then to management — not directly to him) that if they asked him for help, he’d patronize them, condescend to them, and kludge together any solution that would get them out of his hair, because he felt their problems weren’t worth his time.

And so they stopped asking him for help. And he got to spend his time with the back-end, not having to deal with the messiness and illogic of user requests. And management noticed that he wasn’t serving the users anymore... and slid him out the door.

Conclusion

System administration is frequently a nerve-wracking profession, and one that’s infamous for garnering no accolades. But it has its joys. Many of them, from the ethereal bliss of rolling out a new service or cluster and having it work perfectly, the first time, to the vengeful glee of seeing a spammer get canned from his provider because you tracked him down to his home IP, are ones that the average person can’t really relate to.

But for me, the greatest joy in system administration is providing a space for others to work in — to create, to build, to meet and communicate... to do whatever computers can allow people to do in today’s world. And then seeing them use, and enjoy, the platform I’ve provided for them. I think that’s a fairly universal joy, much like seeing someone enjoy a meal you cooked, or open a present you bought.

To me, it’s fitting that the most central reward in system administration is one that’s universal, because I’m focused on the human side of administration as well as the technical end. Because, if it doesn’t serve some human need, then it doesn’t matter to me how fast or stable or nifty it is. It’s just a machine — and the humans are more important.