Tag Archives: btrfs

Fedora 18 – A Sysadmin’s view

The road less traveled

At our school we have around 100 desktops, the vast majority of which run Fedora, and somewhere around 900 users. We switched from Windows to Fedora shortly after Fedora 8 was released, and we've hit 8, 10, 13, 16, and 17 along the way (deploying a local koji instance has made it easier to upgrade).

As I finished putting together our new Fedora 18 image, there were a few things I wanted to mention.

The Good

  1. Offline updates: Traditionally, our systems automatically updated on shutdown. In the 16-17 releases, that became very fragile, as any systemctl scriptlets in the updates would block because systemd was in the process of shutting down. Now, with systemd's support for offline updates, we can download the updates on shutdown, reboot the computer, and install the updates in a minimal system environment (a sketch of the flow follows this list). I've packaged my offline updater here.
  2. btrfs snapshots: This isn’t new in Fedora 18, but, with the availability of offline updates, we’ve finally been able to take proper advantage of it. One problem we have is that we have impatient students who think the reset button is the best way to get access to a computer that’s in the middle of a large update. Now, if some genius reboots the computer while it’s updating, it reverts to its pre-update state, and then attempts the update again. If, on the other hand, the update fails due to a software fault, the computer reverts to its pre-update state and boots normally. Either way, the system won’t be the half-updated zombie that so many of my Fedora 17 desktops are.
  3. dconf mandatory settings: Over the years we’ve moved from gconf to dconf, and I love the easy way that dconf allows us to set mandatory settings for Gnome. This continued working with only a small modification from Fedora 17 to Fedora 18, available here and here.
  4. Javascript config for polkit: I love how flexible this is. We push out the same Fedora image to our school laptops, but the primary difference from the desktops is that we allow our laptop users to suspend, hibernate and shut down their laptops, while our desktop users can't do any of the above. What I would really like is to have the JS config check for the existence of a file (say /etc/sysconfig/laptop) and do different things based on that, but I haven't managed to work out how to do that yet. My first attempt is here, and a rough sketch of the file-check idea follows this list.
  5. systemd: This isn’t a new feature in 18, but systemd deserves a shout-out anyway. It does a great job of making my workstations boot quickly and has greatly simplified my initscripts. It’s so nice to be able to easily prevent the display manager from starting before we have mounted our network directories.
  6. Gnome Shell: We actually started experimenting with Gnome Shell when it was first included in Fedora, and I switched to it as the default desktop in Fedora 13. As we’ve moved from 13 to 16, then 17, and now 18, it’s been a nice clean evolution for our users. When I first enabled Gnome Shell in our Fedora 13 test environment, the feedback from our students was very positive. “It doesn’t look like Windows 98 any more!” was the most common comment. As we’ve upgraded, our users have only become more happy with it.
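
For the curious, the offline-update flow systemd supports is simple enough to sketch. The yum invocation and staging path below are illustrative rather than the exact contents of my package; the only fixed point is the /system-update symlink that systemd checks for at boot.

    # At shutdown: download (but don't install) the pending updates, then
    # flag the next boot as an update boot. The staging path is illustrative;
    # systemd only cares about the /system-update symlink.
    yum update -y --downloadonly        # needs the downloadonly plugin
    ln -s /var/cache/yum /system-update
    # On the next boot, systemd sees /system-update, boots into
    # system-update.target, and a single service installs the updates,
    # removes the symlink, and reboots into the normal system.

And since I brought up the polkit file check: the sketch below is the sort of thing I have in mind, but I haven't tested it, so treat it as a starting point rather than a working config. It leans on polkit.spawn(), which throws if the spawned command exits non-zero, so test(1) can double as a file-existence check.

    # Allow suspend/hibernate/shutdown only when /etc/sysconfig/laptop
    # exists - an untested sketch
    cat > /etc/polkit-1/rules.d/10-laptop.rules <<'EOF'
    polkit.addRule(function(action, subject) {
        if (action.id.indexOf("org.freedesktop.login1.") == 0) {
            try {
                polkit.spawn(["/usr/bin/test", "-e", "/etc/sysconfig/laptop"]);
                return polkit.Result.YES;   // laptop: allow
            } catch (error) {
                return polkit.Result.NO;    // desktop: deny
            }
        }
    });
    EOF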

The Bad

The bad in Fedora 18 mainly comes down to the one area where Linux in general, and Fedora specifically, is weak: backwards compatibility. This was noticeable in two very specific places:

  1. Javascript config for polkit: While I was impressed with the new javascript config’s flexibility, I was most definitely not impressed that my old pkla files were completely ignored. As a system administrator, I find it frustrating when I have to completely rewrite my configuration files because “now we have a better way”. I’ve read the blog post explaining the reasoning behind the switch to the JS config, but how hard would it have been to either keep the old pkla interpreter, or, if it was really desired, rewrite the pkla interpreter in javascript? The ironic part of this is that the “old” pkla configuration was itself a non-backwards-compatible change from the even older PolicyKit configuration a little less than four years ago.
  2. dconf mandatory settings: With the version of dconf in Fedora 18, we now have the ability to have multiple user dconf databases. This is a great feature, but it requires a change in the format of the database profile files, which meant my database profile files from Fedora 17 no longer worked correctly. In fact, they caused gnome-settings-daemon to crash, which crashed Gnome and left users unable to log in. Oops. To be fair, this was a far less annoying change because I only had to change a couple of lines (shown below), but I'm still not impressed that dconf couldn't just read my old db profile files.
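
For anyone hitting the same crash, the fix was roughly the following; my system database is named "site", so adjust to taste.

    # /etc/dconf/profile/user in Fedora 17 was just bare database names:
    #   user
    #   site
    # In Fedora 18, each line needs a database type prefix:
    cat > /etc/dconf/profile/user <<'EOF'
    user-db:user
    system-db:site
    EOF
    dconf update   # rebuild the binary databases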

As a developer, I totally understand the “I have a better way” mindset, but I think backwards compatibility is still vital. That’s why I love rsync and systemd, but have very little time for unison (three different versions in the Fedora repositories because newer versions don’t speak the same language as older versions).

I know some people will say, “If you want stability, just use RHEL.” That’s fine, but I’m not necessarily looking for stability. I like the rate of change in Fedora. What I dislike is when things break because someone wanted to do something different.

All in all, I’ve been really happy with Fedora as our school’s primary OS, and each new release’s features only make me happier. Now I need to go fix a regression in yum-presto that popped up because of some changes we made because we wanted to do something different.

Under the hood

Two years ago, as mentioned in btrfs on the server, we set up btrfs as the primary filesystem on our data servers. After we started running into high load as our network expanded (and after a brief experiment with GlusterFS, described in GlusterFS Madness), in March we switched over to ext4 with the journal on an SSD.
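
For reference, putting an ext4 journal on an SSD looks something like this; the device names are hypothetical.

    # Create a dedicated journal device on the SSD, then an ext4
    # filesystem that uses it
    mke2fs -O journal_dev /dev/ssd1
    mkfs.ext4 -J device=/dev/ssd1 /dev/sdb1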

So, as of March, we had three data servers. datastore01 was the primary server for usershare, our shared data. datastore03 was the primary server for users, which, surprisingly enough, held our users' home directories. datastore02 was secondary for both usershare and users, which were synced using DRBD.

One of the things I had originally envisioned when I set up our system was a self-correcting system. I played around with both the Red Hat Cluster suite and heartbeat and found that they were a bit much for what we were trying to achieve, but I wanted a system where, if a single data server went down, the only notice I would have would be a Nagios alert, and not a line of people outside my office asking me what the problem is.

While I never achieved that level of self-correction, I could switch usershare from datastore01 to datastore02 in under 30 seconds, and the same applied to switching users from datastore03 to datastore02. NFS clients connected to an aliased IP that moved when the filesystem moved, so they would only freeze for about 30 seconds and then come back.
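
The switch itself was just a handful of commands, something like the sketch below; the DRBD resource name, device, mount point, and service IP are all hypothetical.

    # On datastore01: let go of usershare
    umount /srv/usershare
    drbdadm secondary usershare
    ip addr del 192.168.1.50/24 dev eth0
    # On datastore02: take over and re-export
    drbdadm primary usershare
    mount /dev/drbd1 /srv/usershare
    exportfs -r
    ip addr add 192.168.1.50/24 dev eth0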

This made updating the systems pretty painless. I would update datastore02 first, reboot into the new kernel and verify that everything was working correctly. Then, I would migrate usershare over to datastore02 and update datastore01. After datastore01 came back up, I would migrate usershare back, and then repeat the process with users and datastore03.

We also had nightly rsync backups to backup01 which was running btrfs and which would create a snapshot after the backup finished. We implemented nightly backups after a ham-fisted idiot of a system administrator (who happens to sleep next to my wife every night) managed to corrupt our filesystem (and, coincidentally, come within a hair’s breadth of losing all of our data) back when we were still using btrfs. The problem with DRBD is that it writes stuff to the secondary drives immediately, which is great when you want network RAID, but bad when the corruption that you just did on the primary is immediately sent to the secondary. Oops. Anyhow, after we managed to recover from that disaster (with lots of prayer and a very timely patch from Josef Bacik), we decided that a nightly backup to a totally separate filesystem wouldn’t be a bad idea.
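
The nightly backup itself is nothing fancy; conceptually it boils down to the following, with hypothetical paths, and assuming the backup target is a btrfs subvolume.

    # Nightly on backup01: sync, then snapshot the result
    rsync -aHAX --delete datastore01:/srv/usershare/ /backup/usershare/
    btrfs subvolume snapshot /backup/usershare /backup/usershare-$(date +%Y%m%d)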

We also had two virtual hosts, virtserver01 and virtserver02. Our virtual machines’ hard drives were synced between the two using DRBD. We could stop a virtual machine on one host and start it on the other, but live migration didn’t work and backups were a nightly rsync to backup01.

So, after the switchover, our network looked something like this:

[Server chart]

I was pretty happy with our setup, but our load problem popped up again. While it was better than it was before the switch, it would still sometimes peak during breaks and immediately after school.

As I was asking myself what other system administrators do, it hit me that one of my problems was my obsession with self-correcting systems. More specifically, an obsession with automatic correction of a misbehaving server, rather than the more common automatic "correction" of misbehaving hard drives. Because of that, I had been ignoring NAS appliances, as none of them seemed to offer anything that worked along the same lines as DRBD.

I started looking at FOSS NAS solutions and found NAS4Free, a FreeBSD-based appliance that comes with the latest open-source version of ZFS. The beauty of ZFS when it comes to speed is that, unlike btrfs, it allows you to set up an SSD as a read cache or as the intent log.

After running some tests over the summer, I found that ZFS with an SSD cache partition and an SSD log partition was quite a bit faster than our ext4 partitions with SSD journals, especially with multiple systems hitting the server at the same time with lots of small writes.

So we switched our data servers over to NAS4Free, reduced them to two, and added another backup server. The data servers are configured with RAIDZ1 plus SSD caches and logs. The backups are configured with RAIDZ1, no cache, no SSD log.
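
In zpool terms, the two configurations look something like this; the disk names are hypothetical (FreeBSD ada devices).

    # Data server: RAIDZ1 over four disks, with SSD partitions for the
    # read cache (L2ARC) and the intent log
    zpool create tank raidz1 ada1 ada2 ada3 ada4 log ada0p1 cache ada0p2
    # Backup server: RAIDZ1 only, no cache, no log
    zpool create backup raidz1 ada1 ada2 ada3 ada4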

A nice feature of ZFS (which I believe btrfs also recently got) is the ability to send a diff between two snapshots from one server to another. Using this feature (which isn’t exposed in the NAS4Free web interface, but accessible using a bash script that runs at 2:00 every morning), I’m able to send my backups to the backup servers in far less time than it used to take to run rsync.
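
The guts of that 2:00 AM script are essentially the following; the pool and dataset names are hypothetical, and it assumes yesterday's snapshot already exists on both sides.

    # Snapshot, then send only the delta since yesterday
    TODAY=$(date +%Y-%m-%d)
    YESTERDAY=$(date -v-1d +%Y-%m-%d)   # FreeBSD date syntax
    zfs snapshot tank/users@$TODAY
    zfs send -i tank/users@$YESTERDAY tank/users@$TODAY | \
        ssh backup01 zfs receive tank/users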

One other nice feature of NAS4Free is ZFS's ability to create a "volume", which is basically a disk device backed by the data pool, and then export it using iSCSI. I switched our virtual machines' hard drives from DRBD to iSCSI, which now allows us to live migrate from one virtual host to the other. We also get the bonus of automatic backups of the ZFS volumes as part of the snapshot diffs.
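
Creating one of those volumes is a one-liner; the name and size below are hypothetical. ZFS exposes the volume as a device node, and the iSCSI target is then pointed at it (in NAS4Free's case, through the web interface).

    # A 20GB zvol for one virtual machine's disk
    zfs create -V 20G tank/vm-web01
    # FreeBSD exposes it at /dev/zvol/tank/vm-web01 for the iSCSI target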

Now our network looks something like this:

[Computer graph]

There is one major annoyance and one major regression in our system, though. First, the annoyance: ZFS has no way of removing a drive. You can swap out a drive in a RAIDZ or mirror set, but once you've added a RAIDZ set, a mirror, or even a single drive to a pool, you cannot remove it without destroying the pool. Apparently enterprise users never want to shrink their storage. More on this in my next post.

The major regression is that if either of our data servers goes down, the whole network goes down until I get the server back up. I can switch us over to the backups, but we’ll be using yesterday’s data if I do, so that’s very much a last resort. This basically means that I need to be ready to swap the drives into a new system if one of our data servers does go down. And there will be downtime if (when?) that happens. Joy.

So now we have a system that gives us the speed we need, but not the redundancy I'd like. What I'd really like is a filesystem that is fully distributed, has no single point of failure, and allows you to store volumes on it. GlusterFS fits the bill (mostly), but I'm gun-shy at the moment. Ceph looks like it may fit the bill even better with RBD as well as CephFS, but the filesystem part isn't considered production-ready yet.

So where does that leave us? As we begin the 2012-2013 school year, file access and writing is faster than ever. We’d need simultaneous failure of four hard drives before we start losing data, and, once I deploy our third backup server for high-priority data, it will take even more to lose the data. We do have a higher risk of downtime in the event of a server failure, but we’re not at the point where that downtime would keep us from our primary job, teaching.

GlusterFS Madness

Background
As mentioned in Btrfs on the server, we have been using btrfs as our primary filesystem for our servers for the last year and a half or so, and, for the most part, it’s been great. There have only been a few times that we’ve needed the snapshots that btrfs gives us for free, but when we did, we really needed them.

At the end of the last school year, we had a bit of a problem with the servers and came close to losing most of our shared data, despite using DRBD as a network mirror. In response to that, we set up a backup server which has the sole job of rsyncing the data from our primary servers nightly. The backup server is also using btrfs and doing nightly snapshots, so one of the major use-cases behind putting btrfs on our file servers has become redundant.

The one major problem we’ve had with our file servers is that, as the number of systems on the network has increased, our user data server can’t handle the load. The configuration caching filesystem (CCFS) I wrote has helped, but even with CCFS, our server was regularly hitting a load of 10 during breaks and occasionally getting as high as 20.

Switching to GlusterFS
With all this in mind, I decided to do some experimenting with GlusterFS. While we may have had high load on our user data server, our local mirror and shared data servers both had consistently low loads, and I was hoping that GlusterFS would help me spread the load across the three servers.

The initial testing was very promising. When using GlusterFS over ext4 partitions with SSD journals on just one server, the speed was just a bit below NFS over btrfs over DRBD. Given the distributed nature of GlusterFS, adding more servers should increase the speed roughly linearly.

So I went ahead and broke the DRBD mirroring for our eight 2TB drives and used the four secondary DRBD drives to set up a production GlusterFS volume. Our data was migrated over, and we used GlusterFS for a week without any problems. Last Friday, we declared the transition to GlusterFS a success, wiped the four remaining DRBD drives, and added them to the GlusterFS volume.

I started the rebalance process for our GlusterFS volume Friday after school, and it continued to rebalance over the weekend and through Monday. On Monday night, one of the servers crashed. I went over to the school to power cycle the server, and, when it came back up, continued the rebalance.
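
For the record, the expansion and rebalance amounted to something like this; the volume name and brick paths are hypothetical.

    # Add the four wiped drives as new bricks, then rebalance
    gluster volume add-brick data datastore01:/bricks/d3 datastore02:/bricks/d3
    gluster volume rebalance data start
    gluster volume rebalance data status   # check progress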

Disaster!
Tuesday morning, when I checked on the server, I realized that, as a result of the crash, the rebalance wasn’t working the way it should. Files were being removed from the original drives but not being moved to the new drives, so we were losing files all over the place.

After an emergency meeting with the principal (who used to be the school's sysadmin before becoming principal), we decided to ditch GlusterFS and go back to NFS over ext4 over DRBD. We copied over the files from the GlusterFS partitions and then filled in the gaps from our backup server. Twenty-four sleepless hours later, the user data was back up, and the shared data was up twenty-four sleepless hours after that.

Lessons learned

  1. Keep good backups. Our backups allowed us to restore almost all of the files that the GlusterFS rebalance had deleted. The only files lost were the ones created on Monday.
  2. Be conservative about what you put into production. I’m really not good at this. I like to try new things and to experiment with new ideas. The problem is that I can sometimes put things into production without enough testing, and this is one result.
  3. Have a fallback plan. In this case, our fallback was to wipe the server and restore all the data from the backup. It didn’t quite come to that as we were able to recover most of the data off of GlusterFS, but we did have a plan if it did.
  4. Avoid GlusterFS. Okay, maybe this isn’t what I should have learned, but I’ve already had one bad experience with GlusterFS a couple of years ago where its performance just wasn’t up to scratch. For software that’s supposedly at a 3.x.x release, it still seems very beta-quality.

The irony of this whole experience is that by switching the server filesystems from btrfs to ext4 with SSD journals, the load on our user data server has dropped to below 1.0. If I’d just made that switch, I could have avoided two days of downtime and a few sleepless nights.

Nuclear explosion credit – Licorne by Pierre J. Used under the CC-BY-NC 2.0 license.

Slipping over the edge

[Image: Man on edge of cliff]

On a Sunday a few weeks ago, I finally decided to take the plunge and install the Fedora 15 Alpha on my primary workstation. I’ve been using GNOME Shell pretty much exclusively since Fedora 13, and I was looking forward to an even cleaner setup as it got closer to its first official release. The installation went smoothly, and, soon enough, I had the new interface up and running, and, I have to say, it’s looking great!

Just a quick aside to say that I really appreciate where GNOME Shell is going. I love the favorite apps/running apps on the left, desktops on the right concept. I do miss having persistent desktops (Firefox is always alone in the first desktop, and it’s a bit of a pain to get it back there when I restart Firefox). It’s also much harder to get to my files since the recent documents list on the left has gone. But though it’s taken some getting used to, the notifications menu on the bottom has ended up being really nice.

Anyhow, on Monday after the installation, I brought my laptop to my office in the morning, booted it up, and then went off to teach my first two classes. When I returned to my office, my screen had a nice kernel panic on it saying something about sda write errors, unable to write to disk, the end of the world, etc.

Being the incredibly sophisticated hacker (read “complete idiot”) that I am, I proceeded to do a hard reboot of the laptop without taking a picture of the screen. Oops. Then, on the reboot, the system ran into a small problem. It wasn’t able to mount any of my filesystems. A quick reboot into my livecd and one e2fsck later, and my system partition was back up again (apparently with no major errors). Unfortunately, that wasn’t the end of the story.

I should probably take this opportunity to explain my incredibly cunning partition setup. You see, I have a boot partition, two 20GB system partitions that I switch between every time I update my system, and a swap partition. I then have a 400+GB encrypted btrfs home partition, created back in the F13 days. This home partition contains all of my data. In the world. Everything.

So I booted from my now repaired system partition and… my home partition refused to mount. It said the filesystem was unrecognizable. Oookay. How about a btrfsck /dev/sda5. “No valid Btrfs on /dev/sda5.” That can’t be good. Ok, I’m a system administrator, I should have a good backup somewhere. Check around, and there it is, dated… May 8, 2010. Well, that sucks.

This was the point where I started to get slightly worried. I hopped onto the #btrfs IRC channel on freenode and that’s where my bacon was saved. Apparently whatever caused the kernel panic also caused some major problems when btrfs was writing its metadata. Unfortunately, since I had no record of the panic except my spotty memory, we weren’t able to track down the cause. All we knew was that the primary superblock had been corrupted, and that was why none of the tools could read it.

At this point, Chris Mason (the creator of btrfs) walked me through compiling btrfs-progs from the "next" branch in git, and then compiling btrfs-select-super (which isn't built using the normal Makefile). I used btrfs-select-super to switch to the second superblock, and voilà, I was able to mount the filesystem (read-only, of course)! Some of the metadata was pointing to junk, and I ended up losing all my files that had been changed in the last few days, but most of them were emails, which I did have backed up elsewhere.
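
In case it helps someone else in the same hole, the recovery boiled down to something like this. I'm reconstructing from memory, so treat it with suspicion.

    # In the btrfs-progs checkout, on the "next" branch:
    make                      # the normal tools
    make btrfs-select-super   # not built by the default target
    # Point the filesystem at backup superblock copy 1, then mount
    # read-only to rescue what's rescuable
    ./btrfs-select-super -s 1 /dev/sda5
    mount -o ro /dev/sda5 /mnt/rescue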

I still don’t know what caused the problem. There were some SMART errors on the drive, but repeated extended offline scans found no errors, and manually overwriting the entire partition using dd and then reading it found nothing amiss. There was some talk of it possibly being related to luks, but no evidence pointing in that direction.

So, now I’m running Fedora 15 Alpha, with a newly created encrypted btrfs filesystem as my home partition… and daily backups. A huge thank you to Chris and the others on #btrfs on freenode who gave me such great help!

Steep cliff credit: Steep cliff by Rob Lee. Used under CC BY-ND

Config Caching Filesystem (ccfs)

[Image: Man with jetpack]

One of the problems we’ve had to deal with on our servers is high load on the fileserver that holds the user directories. I haven’t worked out if it’s because we’re using standard workstation hardware for our servers, or if it’s a btrfs problem.

The strange thing is that the load will shoot up at random times when the network shouldn’t be that taxed, and then be fine when every computer in the school has someone logged into it.

Anyhow, we hit a point where the load on the server hit something like 60, and the workstations would lock for sixty seconds (or more) while waiting for the NFS server to respond again. This seemed to happen most often when all of the students in the computer room opened Firefox at the same time.

In a fit of desperation, I threw together a python fuse filesystem that I have cunningly called the Config Caching Filesystem (or ccfs for short). The concept is simple. A user’s home directory at /netshare/users/[username] is essentially bind-mounted to /home/[username] using ccfs.

The thing that separates ccfs from a simple fuse bind-mount is that every time a configuration file (one that starts with a “.”) is opened for writing, it is copied to a per-user cache directory in /tmp and opened for writing there. When the user logs out, /home/[username] is unmounted, and all of the files in the cache are copied back to /netshare/users/[username] using rsync. Any normal files are written directly to /netshare/users/[username], bypassing the cache.
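
From the login scripts' point of view, the whole thing looks something like this; the ccfs command line and cache path are mine to illustrate, so don't treat them as gospel.

    # At login: interpose ccfs between the NFS home and /home
    ccfs /netshare/users/$USER /home/$USER
    # At logout: unmount, then flush the cached config files back;
    # --delay-updates keeps half-copied configs from going live
    fusermount -u /home/$USER
    rsync -a --delay-updates /tmp/ccfs-$USER/ /netshare/users/$USER/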

Now the only time the server is being written to is when someone actually saves a file or when they log out. The load on the server rarely goes above five, and even then it’s only when everyone is logging out simultaneously, and the server recovers quickly.

A few bugs have cropped up, but I think I've got the main ones. The biggest was that some students were resetting their desktops when the system didn't log out quickly enough and were ending up with corrupted configuration directories, mainly for Firefox. I fixed that by using --delay-updates with rsync, so you either get the fully updated configuration files or you're left with the configuration files that were there when you logged in.

I do think this solution is a bit hacky, but it’s had a great effect on the responsiveness of our workstations, so I’ll just have to live with it.

Ccfs is available here for those interested, but if it breaks, you get to keep both pieces.

Jetpack credit: Fly with U.S. poster by Tom Whalen. Used under CC BY-NC-ND

btrfs on the server

As mentioned back here and here, our current server setup looks something like this:

storage-server01+storage-server02->drbd->lvm->ext3->nfs->clients

Current server configuration

One thing not noted in the diagram is that our fileserver, dns server, ldap server, web server, and a few others all run as virtual machines on storage-server01 and storage-server02.

The drawback to this is that when disk I/O gets heavy, our virtual machines start struggling, even though they're on separate hard drives.

Another problem with our current system is that we don’t have a good method of backup. Replication, yes, but if a student accidentally runs rm ./ -rf in their home directory, it’s gone.

So, with a bit of time over the summer after setting up the school's Fedora 13 image, I thought I'd tackle these problems. We now have three new "servers" (well, 2GB desktop systems with lots of big hard drives shoved in them). Our data has been split into three parts, and each server is primary for one part and backup for another.

The advantage? Now our virtual machines have full use of the (now misnamed) storage-server01 and storage-server02, both of which are still running CentOS 5.5. Our three new datastore servers, running Fedora 13, now share the load that was previously put on one storage server.

But this doesn't solve the backup problem. A few years back, I experimented with LVM snapshots, but they were just way too slow. Ever since then, though, I've been very interested in the idea of snapshots, and btrfs has them for free (at least in terms of extra IO, and I'm not too worried about space). Btrfs also handles multiple devices just fine, which means goodbye LVM. With btrfs, our new setup looks something like this:

datastore01+datastore02+datastore03->drbd->btrfs->nfs->clients

New server configuration

I have hit a couple of problems, though. By default, btrfs will RAID1 the metadata if you have more than one device in a btrfs filesystem. I'm not sure whether my problem was related to this, but when I tried to manually balance the user filesystem, which was spread across a 2TB and a 1TB disk, I got -ENOSPC, a kernel panic, and a filesystem that was essentially read-only. This happened when the data on the drive was under 800GB (though most of the files are small hidden files in our users' home directories). After checking out the btrfs wiki, I upgraded the kernel to the latest 2.6.34 available from koji (at that point in time), and then copied the data over to a newly created filesystem with RAID0 metadata and data (after all, my drives are already RAID1 using DRBD). A subsequent manual balance had no problems at all.
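
Concretely, the fix looked something like this; the device names are hypothetical.

    # Recreate the filesystem with RAID0 data and metadata; DRBD
    # already mirrors the drives underneath
    mkfs.btrfs -d raid0 -m raid0 /dev/drbd2 /dev/drbd3
    # After copying the data back, a manual balance completes cleanly
    btrfs filesystem balance /networld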

The second problem is not so easily solved. I wanted to do a speed comparison between our new configuration and our old one, so I ran bonnie++ on all of the computers in our main computer lab. I set it up so each computer was running its instance in a different directory on the NFS share (/networld/bonnie/$HOSTNAME).
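
Each workstation ran something along these lines:

    # Stress the NFS server from every lab machine at once
    mkdir -p /networld/bonnie/$HOSTNAME
    bonnie++ -d /networld/bonnie/$HOSTNAME -u nobody   # -u only matters as root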

Yes, I knew it would take a while (and stress-test the server), but that's the point, right? The server froze after a few minutes. No hard drive activity. No network activity. The flashing cursor on the display stopped flashing (and, yes, it was in runlevel 3). Num lock and caps lock wouldn't toggle. Nothing in any logs. Frozen dead.

I rebooted the server, and tried the latest 2.6.33 kernel. After a few minutes of the stress test, it was doing a great imitation of an ice cube. I tried a 2.6.35 Fedora 14 kernel rebuilt for Fedora 13 that I had discarded because of a major drop in DRBD sync speed. This time the stress test barely made it 30 seconds.

So where does that leave me? Tomorrow I plan on running the stress test on our old CentOS server. If it freezes too, then I'm not going to worry too much. It hasn't ever frozen like that in normal use, so I'll just put it down to NFS disliking 30+ computers writing gigabytes of data at the same time. I did file this bug report, but I'm not sure if I'll hear anything on it. It's kind of hard to track down a problem when there aren't any error messages on screen or in the logs.

The good news is that I do have daily snapshots set up, shared read-only over NFS, that get deleted after a week. So now we have replication and backups.
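
The snapshot rotation is a simple cron job, roughly as follows; the paths are hypothetical, and the read-only part comes from the NFS export options rather than the snapshots themselves.

    # Daily: snapshot the share, drop the snapshot from a week ago
    DATE=$(date +%Y-%m-%d)
    WEEK_AGO=$(date -d '7 days ago' +%Y-%m-%d)
    btrfs subvolume snapshot /networld /networld/.snapshots/$DATE
    btrfs subvolume delete /networld/.snapshots/$WEEK_AGO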

I’d like to keep this configuration, but that depends on whether the server freeze bug will show up in real-world use. If it does, we’ll go back to CentOS on the three servers, and probably use ext4 as the base filesystem.

Update: 08/26/2010 After adding a few boot options, I finally got the logs of the freeze from the server. It looks like it’s a combination of relatively low RAM and either a lousy network card design or a poor driver. Switching the motherboard has mitigated the problem, and I’m hoping to get some more up-to-date servers with loads more RAM.