Tag Archives: nfs

Config Caching Filesystem (ccfs)

Man with Jetpack

One of the problems we’ve had to deal with on our servers is high load on the fileserver that holds the user directories. I haven’t worked out if it’s because we’re using standard workstation hardware for our servers, or if it’s a btrfs problem.

The strange thing is that the load will shoot up at random times when the network shouldn’t be that taxed, and then be fine when every computer in the school has someone logged into it.

Anyhow, we hit a point where the load on the server hit something like 60 and the workstations would lock up for sixty seconds (or more) while waiting for the NFS server to respond again. This seemed to happen most often when all of the students in the computer room opened Firefox at the same time.

In a fit of desperation, I threw together a Python FUSE filesystem that I have cunningly called the Config Caching Filesystem (or ccfs for short). The concept is simple. A user’s home directory at /netshare/users/[username] is essentially bind-mounted to /home/[username] using ccfs.

The thing that separates ccfs from a simple FUSE bind-mount is that every time a configuration file (one that starts with a “.”) is opened for writing, it is copied to a per-user cache directory in /tmp and opened for writing there. When the user logs out, /home/[username] is unmounted, and all of the files in the cache are copied back to /netshare/users/[username] using rsync. Any normal files are written directly to /netshare/users/[username], bypassing the cache.
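
For the curious, the core idea looks roughly like the sketch below. This isn’t the actual ccfs code: it assumes the fusepy bindings, a hypothetical user “alice”, and a made-up cache path, and a usable passthrough needs a lot more operations (create, truncate, unlink, and so on), but it shows where the redirection happens.

    # Rough sketch of the ccfs idea, not the real code. Assumes fusepy,
    # a hypothetical user "alice", and a made-up cache directory.
    import os
    from fuse import FUSE, Operations

    BACKING = '/netshare/users/alice'   # the NFS-backed home directory
    CACHE = '/tmp/ccfs-cache/alice'     # local cache for dot-file writes

    class CCFS(Operations):
        def _real(self, path):
            # Dot files that already have a cached copy are served from the
            # cache; everything else comes straight from the backing store.
            cached = CACHE + path
            if os.path.basename(path).startswith('.') and os.path.exists(cached):
                return cached
            return BACKING + path

        def getattr(self, path, fh=None):
            st = os.lstat(self._real(path))
            return {key: getattr(st, key) for key in
                    ('st_mode', 'st_nlink', 'st_uid', 'st_gid',
                     'st_size', 'st_atime', 'st_mtime', 'st_ctime')}

        def readdir(self, path, fh):
            return ['.', '..'] + os.listdir(BACKING + path)

        def open(self, path, flags):
            target = self._real(path)
            # A config file opened for writing gets copied into the local
            # cache first, and the cached copy is what actually gets written.
            if (os.path.basename(path).startswith('.')
                    and flags & (os.O_WRONLY | os.O_RDWR)):
                cached = CACHE + path
                if not os.path.exists(cached):
                    os.makedirs(os.path.dirname(cached), exist_ok=True)
                    if os.path.exists(target):
                        with open(target, 'rb') as src, open(cached, 'wb') as dst:
                            dst.write(src.read())
                target = cached
            return os.open(target, flags)

        def read(self, path, size, offset, fh):
            os.lseek(fh, offset, os.SEEK_SET)
            return os.read(fh, size)

        def write(self, path, data, offset, fh):
            os.lseek(fh, offset, os.SEEK_SET)
            return os.write(fh, data)

        def release(self, path, fh):
            return os.close(fh)

    if __name__ == '__main__':
        FUSE(CCFS(), '/home/alice', foreground=True, nothreads=True)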

Now the only time the server is being written to is when someone actually saves a file or when they log out. The load on the server rarely goes above five, and even then it’s only when everyone is logging out simultaneously, and the server recovers quickly.

A few bugs have cropped up, but I think I’ve got the main ones. The biggest bug was that some students were resetting their desktops when the system didn’t log out quickly enough and were getting corrupted configuration directories, mainly for Firefox. I fixed that by using --delay-updates with rsync, so you either get the fully updated configuration files or you’re left with the configuration files that were there when you logged in.
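
For reference, the flush that runs at logout boils down to something like this (the cache path is a placeholder and the real code differs a bit). With --delay-updates, rsync stages each changed file and renames the whole batch into place at the end of the transfer, so an interrupted copy leaves the old config files untouched.

    # Sketch of the logout flush; the cache path is a placeholder.
    import subprocess

    def flush_cache(user):
        cache = '/tmp/ccfs-cache/%s/' % user      # placeholder cache layout
        home = '/netshare/users/%s/' % user
        subprocess.check_call(['rsync', '-a', '--delay-updates', cache, home])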

I do think this solution is a bit hacky, but it’s had a great effect on the responsiveness of our workstations, so I’ll just have to live with it.

Ccfs is available here for those interested, but if it breaks, you get to keep both pieces.

Jetpack credit: Fly with U.S. poster by Tom Whalen. Used under CC BY-NC-ND

I hate virtual machines (was I hate NFS)

(Please note that you’ll probably want to read the previous post before this one)

So, I set up a new virtual machine running Fedora rather than CentOS 5.4 and migrated the services over to it. We did see an improvement, but just not enough. I went into the computer room during break, and several students had gray screens for Firefox and OpenOffice.org.

So I’ve switched us back over to the original configuration (running NFS off of the real servers). I have to admit that I’m quite curious as to what the load will be tomorrow when everyone logs in.

Thanks to those who commented on my last post. The general consensus seems to be that this just isn’t the best area to use a virtual machine.

EDIT:
We’ve been running the new system for a few days now and it’s much more responsive. Logins never take longer than 30 seconds, and none of the students are getting gray windows. Load during breaks now ranges from 7 to 20. I’d still love to see a much lower load, but at least we’re back to a reasonably fast system.

I hate NFS

On our network we have about 100 client computers, most of which are running Fedora 11.  We have two real servers running CentOS 5.4, using DRBD to keep the virtual machine data on the two real machines in sync and Red Hat’s cluster tools for starting and stopping the virtual machines.

We have five virtual machines running on the two real machines, only one of which, our fileserver, is important to this post.

Under our old configuration, /networld was mounted on one of the real servers, and then shared to our clients using NFS. Our virtual machine, fileserver, then mounted /networld over NFS and shared it using Samba for our few remaining Windows machines (obviously, a non-optimal solution).

Diagram of the old configuration
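
In concrete terms, the old double export looked something like this (the host name, network range, and options below are placeholders, not copied from our real configs):

    # On the active real server: export /networld over NFS.
    #   /etc/exports
    /networld  192.168.1.0/24(rw,sync,no_subtree_check)

    # On fileserver: mount that export ...
    #   /etc/fstab
    storage-server01:/networld  /networld  nfs  defaults  0 0

    # ... and re-share the NFS mount with Samba for the Windows machines.
    #   /etc/samba/smb.conf
    [networld]
        path = /networld
        read only = no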

There were a couple of drawbacks to this configuration:

  1. I had to turn a number of services on and off as the clustered storage service moved from storage-server01 to storage-server02.
  2. Samba refused to share an NFSv4-mounted /networld, and, when it was mounted using NFSv3, the locking daemon would crash at random intervals (I suspect a race condition, as it mainly happened when storage-server0x was under high load).

My solution was to pass the DRBD disks containing /networld directly to fileserver and let fileserver share /networld over both NFS and Samba, which seemed far less hacky.

Diagram of the current configuration
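
The passthrough itself is just a block-device disk stanza in fileserver’s libvirt definition, roughly along these lines (the DRBD device name is a placeholder, and I’ve trimmed it to the relevant bits):

    <!-- Hedged sketch: hand the DRBD block device holding /networld straight
         to the guest, attached on the virtio bus. /dev/drbd0 is a placeholder. -->
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw'/>
      <source dev='/dev/drbd0'/>
      <target dev='vda' bus='virtio'/>
    </disk>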

I knew there would be a slight performance hit, but since I’m using virtio to pass the hard drives to the virtual machine, I expected a maximum of 10-15% degradation.

Or not. I don’t have any hard numbers, but once we have a full class logging in, the system slows to a crawl. My guess would be that our Linux clients are running at 1/2 to 1/3 of the speed of our old configuration.

The load values on fileserver sit at about 1 during idle times and get pumped all the way up to 20-40 during breaks and computer lessons.

So now I’m stuck. I really don’t want to go back to the old configuration, but I can’t leave the system as slow as it is. I’ve done some NFS tuning based on miscellaneous sites found via Google, and tomorrow will be the big test, but, to be honest, I’m not really hopeful.

(To top it off, I spent three hours Friday after school tracking down this bug after updating fileserver to CentOS 5.4 from 5.3. I’m almost ready to switch fileserver over to Fedora.)