Tag Archives: centos

From NFS to LizardFS

If you’ve been following me for a while, you’ll know that we started our data servers out using NFS on ext4 mirrored over DRBD, hit some load problems, switched to btrfs, hit load problems again, tried a hacky workaround, ran into problems, dropped DRBD for glusterfs, had a major disaster, switched back to NFS on ext4 mirrored over DRBD, hit more load problems, and finally dropped DRBD for ZFS.

As of March 2016, our network looked something like this:

Old server layout

Our NFS over ZFS system worked great for three years, especially after we added SSD cache and log devices to our ZFS pools, but we were starting to overload our ZFS servers and I realized that we didn’t really have any way of scaling up.

This pushed me to investigate distributed filesystems yet again. As I mentioned here, distributed filesystems have been a holy grail for me, but I never found one that would work for us. Our problem is that our home directories (including config directories) are stored on our data servers, and there might be over one hundred users logged in simultaneously. Linux desktops tend to do a lot of small reads and writes to the config directories, and any latency bottlenecks tend to cascade. This leads to an unresponsive network, which then leads to students acting out the Old Testament practice of stoning the computer. GlusterFS was too slow (and almost lost all our data), CephFS still seems too experimental (especially for the features I want), and there didn’t seem to be any other reasonable alternatives… until I looked at LizardFS.

LizardFS (a completely open source fork of MooseFS) is a distributed filesystem that has one fascinating twist: All the metadata is stored in RAM. It gets written out to the hard drive regularly, but all of the metadata must fit into the RAM. The main result is that metadata lookups are rocket-fast. Add to that the ability to direct different paths (say, perhaps, config directories) to different storage types (say, perhaps, SSDs), and you have a filesystem that is scalable and fast.

LizardFS does have its drawbacks. You can run hot backups of your metadata servers, but only one will ever be the active master at any one time. If it goes down, you have to manually switch one of the replicas into master mode. LizardFS also has a very complicated upgrade procedure. First the metadata replicas must be upgraded, then the master and finally the clients. And finally, there are some corner cases where replication is not as robust as I would like it to be, but they seem to be well understood and really only seem to affect very new blocks.

So, given the potential benefits and drawbacks, we decided to run some tests. The results were instant… and impressive. A single user’s login time on a server with no load… doubled. Instead of five seconds, it took ten for them to log in. Not good. But when a whole class logged in simultaneously, it took only 15 seconds for them to all log in, down from three to five minutes. We decided that a massive speed gain in the multiple user scenario was well worth the speed sacrifice in the single-user scenario.

Another bonus is that we’ve gone from two separate data servers with two completely different filesystems (only one of which ever had high load) to five data servers sharing the load while serving out one massive filesystem, giving us a system that now looks like this:

New server setup

New server layout

So, six months on, LizardFS has served us well, and will hopefully continue to serve us for the next (few? many?) years. The main downside is that Fedora doesn’t have LizardFS in its repositories, but I’m thinking about cleaning up my spec and putting in a review request.

Updated to add graphics of old and new server layouts, info about Fedora packaging status, LizardFS bug links, and remove some grammatical errors

Updated 12 April 2017: I’ve just packaged up LizardFS to follow Fedora’s guidelines and the review request is here.


A brave new world (of traffic shaping)

Traffic through a bottleneck

When administering a network of hundreds of computers, phones and tablets that all share a 3 Mbit/s link, one of the more important requirements is some form of traffic shaping. In fact, when you’re watching your emails download at a cool rate of five words a minute because someone is uploading the complete works of Shakespeare (the Blu-ray edition) onto YouTube, the choice becomes that of traffic shaping or homicide. While homicide is the easy option, unfortunately it has become illegal in most countries, so we have to go with the hard option if we want to avoid jail time.

The idea behind traffic shaping isn’t that complex. Imagine that each packet you send and receive is a car and your internet connection is the highway. Now, imagine that your highway has no lines painted on it and that every car pushes its way through as fast as possible. If you only have a few cars on the highway, this setup works fine. Traffic gets through as quickly as possible as there’s no build-up at either end. This is a normal connection with no traffic shaping.

Now, imagine this same highway with a huge amount of traffic. Two words: Traffic jam. Traffic gets backed up at the end of the highway, and, due to the lack of organization, everybody (including the emergency services) has to wait until they’ve managed to push their way through. Obviously not a very optimal way to organize traffic. This is a normal connection when you’re uploading or downloading a movie. Everything else slows to a crawl.

The thing is, not all traffic is created equal. In the real world, we’d like to think that emergency services will be able to make it through any traffic jam quickly, and most of us wish that the truck convoys would get off the road when traffic is really bad. In the same way, some internet traffic depends on being delivered in realtime (think Skype, video conferencing or SSH sessions), while normal traffic should be reasonably fast (think web browsing), and some traffic is best allowed through only when the road is empty (think large downloads or P2P stuff).

Traffic shaping allows us to separate our metaphorical highway into multiple lanes that can expand or shrink depending on need within limits that we set. And in our school, we need lots of lanes. You see, normally you would split your traffic into the three segments listed above, but we want to have our traffic split among teachers, students and guests, with each of their lanes further split in the above segments (realtime, normal, slow).

For the last few years we’ve used a CentOS 5 box running a customized version of the Wonder Shaper script to shape our traffic, but (mainly because of my deficiencies) it’s not quite been the wonder we’ve been looking for. Slow teacher traffic was put into the fast student lane and a guest watching a YouTube video would slow down the net for everyone else.

After some major problems adapting our Wonder Shaper script to multiple WANs (we have two ISPs, one giving us 2M/1M and the other 1M/512K), I finally decided to look around and see what the alternatives were. PfSense is something that I had been playing around with and I decided to try its traffic shaping capabilities.

It’s amazing! You create queues (lanes in our metaphorical highway), and each queue can contain other queues. So we have a teacher’s queue, a student’s queue, a guest queue and a few other top level queues. Inside each top-level queue is a set of child queues for realtime, normal and slow internet. For example, our teachers get an average bandwidth of 30% and a maximum bandwidth of 50%. In other words, if our internet connection is being fully utilized, teachers will get 30%. If nobody is on the net at all, teachers can get up to 50%. But, it gets even better. Within these percentages, realtime stuff gets 30% of the teacher’s bandwidth, normal web stuff gets another 30%, junk (Facebook, YouTube) gets 25% with a hard limit of 60% of the teachers’ maximum bandwidth, and any bulk stuff gets 15% with a hard limit of 30% of the teachers’ maximum bandwidth.

Duplicate the same percentages for the students, and then again for our guests (except they get a lower average bandwidth and much lower maximum bandwidth) and you get the picture. Add in the bandwidth set aside for our servers, and you end up with lane rules that are incredibly complex, but with smoothly-moving traffic that doesn’t get piled up at either end of the highway. And you didn’t have to kill anyone to achieve it.
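
To make the teacher percentages concrete, here is a rough worked example in Python. It is only a sketch: it assumes the 3 Mbit/s link mentioned at the top of the post, it assumes the child percentages are shares of the teachers' average (guaranteed) bandwidth, and it assumes that queues with no stated hard limit may borrow up to the teachers' maximum. The function and queue names are made up for illustration and are not pfSense settings.

```python
LINK_KBIT = 3000  # assumption: the 3 Mbit/s link, expressed in kbit/s


def share(total_kbit: int, percent: int) -> int:
    """Return `percent` percent of `total_kbit`, rounded down."""
    return total_kbit * percent // 100


def teacher_queues(link_kbit: int) -> dict:
    """Map each (hypothetical) queue name to (guaranteed, hard limit) in kbit/s."""
    guaranteed = share(link_kbit, 30)  # average bandwidth when the link is full
    ceiling = share(link_kbit, 50)     # maximum bandwidth when the link is idle

    # (child queue, % of the teachers' guaranteed bandwidth,
    #  hard limit as % of the teachers' maximum, or None if not stated)
    children = [
        ("realtime", 30, None),
        ("normal", 30, None),
        ("junk", 25, 60),
        ("bulk", 15, 30),
    ]

    queues = {"teachers": (guaranteed, ceiling)}
    for name, percent, cap_percent in children:
        # Assumption: no stated hard limit means "up to the teachers' maximum".
        cap = ceiling if cap_percent is None else share(ceiling, cap_percent)
        queues[f"teachers/{name}"] = (share(guaranteed, percent), cap)
    return queues


if __name__ == "__main__":
    for queue, (minimum, maximum) in teacher_queues(LINK_KBIT).items():
        print(f"{queue:18} guaranteed {minimum:5} kbit/s, limit {maximum:5} kbit/s")
```

The student and guest lanes would reuse the same child percentages with their own average and maximum figures.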

If there’s interest, I’ll publish a more technical post including a partial rule list and explain how I got this mess to work with squid (which was necessary for being able to sort the different web destinations into different queues).

lesloueizeh.com Fedora mirrors discontinued

Just a heads up for anyone who cares. The presto test repositories for Fedora at http://lesloueizeh.com have been removed. They were only available for Fedora Core 6 – Fedora 10, which have all been EOL’d.

The official Fedora repositories have been carrying deltarpms since Fedora 11, so there was no longer any reason to keep the test repositories around.

The CentOS 5 presto repositories are still available and will be until the CentOS project enables deltarpms for their repositories (if they ever do).

Trash cans credit: trash can lids and handles by shooting brooklyn under CC BY-NC 2.0

Setting up a netboot server in Fedora/CentOS

I’ve had a request for an explanation on how we use PXE and syslinux in our school. In a previous post, I talked a bit about chain-loading pxe, but didn’t explain much on how our system is set up.

So here goes…

Our primary goal in setting up a PXE environment was to have some way of imaging our computers without having to screw around with an ancient version of Norton Ghost and without having to put a floppy in every computer.

The problem was that we really didn’t want students to be able to reimage the computers whenever they wanted to, and there were other tools we wanted to use that we wanted restricted.

The solution was PXELINUX’s simple menu, and it works beautifully! This post will walk you through the process of setting up PXELINUX (and gPXE while we’re at it).

For this post, I am assuming that you already have a DHCP server and a web server set up.

There are three things we need to set up:

  1. TFTP server
  2. gPXE
  3. Syslinux

The first step is to set up a TFTP server to carry our gPXE images.

  1. Run
    yum install tftp-server
  2. Edit /etc/xinetd.d/tftp and change the line that says
    disable = yes
    to
    disable = no
  3. If this is the initial installation of xinetd, you may need to run
    chkconfig --levels 2345 xinetd on
    service xinetd start
    at this point. Otherwise, it might be a good idea to run
    service xinetd reload

Now, for the next step, we need to download our gPXE images. gPXE is an extended version of PXE that allows you to load images over http and https in addition to the usual tftp. As most (all?) network cards don’t come with gPXE drivers, we will be using PXE to download and bootstrap our gPXE drivers.

As mentioned in my previous post, some of our motherboards seem to have issues mixing gPXE and their normal PXE UNDI drivers, so I prefer to use gPXE’s native drivers rather than its UNDI driver.

However, we have four computers whose network cards just don’t work with gPXE’s native drivers, so we will direct those four computers to the gPXE UNDI driver.

So let’s grab and set up these drivers:

  1. Go to ROM-o-matic and choose the latest production release
  2. For output format, choose “PXE bootstrap loader image [Unload PXE stack] (.pxe)”
  3. Choose NIC type “all-drivers”
  4. Click on “Customize”
  5. Check the box that says “DOWNLOAD_PROTO_HTTPS”
  6. Click on the button that says “Get Image”
  7. Save file to /tftpboot/gpxe.pxe
  8. Change NIC type to “undionly”
  9. Click on the button that says “Get Image”
  10. Save file to /tftpboot/undi.pxe
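
Once both images are saved, it can be worth checking that the TFTP server really hands them out before touching DHCP. The Python sketch below builds a TFTP read request as described in RFC 1350 and fetches only the first data block. The host name tftp.example.lan is a placeholder, and the request path /gpxe.pxe assumes the server treats /tftpboot as its root; adjust both to your setup.

```python
import socket
import struct

TFTP_RRQ = 1    # read request opcode (RFC 1350)
TFTP_DATA = 3   # data block opcode
TFTP_ERROR = 5  # error opcode


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a read request: opcode, file name, NUL, transfer mode, NUL."""
    return (struct.pack("!H", TFTP_RRQ)
            + filename.encode("ascii") + b"\x00"
            + mode.encode("ascii") + b"\x00")


def tftp_first_block(host: str, filename: str,
                     port: int = 69, timeout: float = 3.0) -> bytes:
    """Request `filename` and return the payload of the first reply block.

    Raises RuntimeError on a TFTP error reply and socket.timeout if the
    server never answers. The block is not acknowledged, so the server will
    retry and give up on its own; that is fine for a quick check.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (host, port))
        packet, _ = sock.recvfrom(4 + 512)

    opcode, number = struct.unpack("!HH", packet[:4])
    if opcode == TFTP_ERROR:
        message = packet[4:].rstrip(b"\x00").decode("ascii", "replace")
        raise RuntimeError(f"TFTP error {number}: {message}")
    if opcode != TFTP_DATA:
        raise RuntimeError(f"unexpected TFTP opcode {opcode}")
    return packet[4:]


# Example (placeholder host name):
#   data = tftp_first_block("tftp.example.lan", "/gpxe.pxe")
#   print(len(data), "bytes received in the first block")
```

A timeout here usually points at xinetd not running or a firewall blocking UDP port 69, while a "file not found" error means the image is not where the server expects it.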

I’m assuming you’re running the ISC dhcp server (dhcp package on both Fedora and CentOS). If not, you’ll have to work out these next steps yourself.

You need to edit /etc/dhcpd.conf and add the following lines:
next-server ip address;

if exists user-class and option user-class = "gPXE" {
    filename "http://webserver/netboot/pxelinux.0";
} else {
    filename "/gpxe.pxe";
}

Where ip address is the ip address of your TFTP server and webserver is the name/ip address of your web server.

If you have some computers that won’t pxeboot using gPXE’s native drivers (you’ll be able to tell because the computers will show the gPXE loading screen, but won’t be able to get an IP address using DHCP while in gPXE), change the last five lines above to:

if exists user-class and option user-class = "gPXE" {
    filename "http://webserver/netboot
} else {
    if binary-to-ascii(16, 8, ":",
       substring(hardware, 1, 6)) = "mac address 1"
    or binary-to-ascii(16, 8, ":",
       substring(hardware, 1, 6)) = "mac address 2" {
        filename "/undi.pxe";
    } else {
        filename "/gpxe.pxe";
    }
}

Where “mac address 1” and “mac address 2” are the MAC addresses of the computers that don’t work with gPXE’s native drivers. Please note that the MAC addresses must be written without leading zeros in each octet (i.e. 00:19:d1:3a:0e:4b becomes 0:19:d1:3a:e:4b).
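
Dropping the leading zeros by hand is easy to get wrong, so here is a small Python sketch that does the conversion and mirrors the decision the dhcpd rules make. The function names are my own, and the boot URL is the one from the simpler rule set earlier in this post; this only illustrates the logic and is not part of dhcpd.

```python
def dhcpd_mac(mac: str) -> str:
    """Rewrite a MAC address in the form the dhcpd rules compare against:
    lowercase hex octets separated by colons, with no leading zeros."""
    octets = mac.replace("-", ":").split(":")
    if len(octets) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(format(int(octet, 16), "x") for octet in octets)


def boot_filename(user_class, mac, undi_macs):
    """Pick the boot file the way the nested if/else rules do.

    `user_class` is None for a plain PXE client and "gPXE" once gPXE is
    running; `undi_macs` holds addresses already passed through dhcpd_mac().
    """
    if user_class == "gPXE":
        # gPXE is already loaded, so hand it PXELINUX over http.
        return "http://webserver/netboot/pxelinux.0"
    if dhcpd_mac(mac) in undi_macs:
        # Card does not work with gPXE's native drivers.
        return "/undi.pxe"
    return "/gpxe.pxe"


if __name__ == "__main__":
    print(dhcpd_mac("00:19:d1:3a:0e:4b"))
```

Running it on the address from the example above gives the string to paste into dhcpd.conf.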

At this point, if you boot any computer on your network off the NIC, you should see something like this:
Picture of a screen with gPXE starting

The next step is to set up PXELINUX, a part of the Syslinux Project. PXELINUX is a small bootloader designed for booting off a network.

  1. On your web server, create a directory called “netboot” in your web root (normally /var/www/html on Fedora/CentOS).
  2. Run
    yum install syslinux
    or, as an alternative, build a newer version of syslinux. I recommend at least 3.75 (the version in Fedora 12), though I’m using 3.82 at the school.
  3. Copy (at minimum) chain.c32, menu.c32, vesamenu.c32 and pxelinux.0 to “netboot” in your web root. (These files will be located in /usr/share/syslinux if you installed the package using yum.) At this point, you’ll probably want to check for other modules that might have some potential. We use ifcpu64.c32 to decide between 32-bit and 64-bit Fedora on the computers.
  4. Run
    yum install memtest86+
    cp /boot/elf-memtest86+-4.00 \

    (Note that “your_web_root” will most likely be /var/www/html)
  5. Download this picture and save it to your_web_root/netboot
  6. Change directory to your_web_root/netboot
  7. Run
    mkdir pxelinux.cfg
    cd pxelinux.cfg
  8. Create a file called “default” that contains the following:
    default vesamenu.c32
    timeout 40
    prompt 0
    noescape 1

    menu title Boot Options
    menu background menu.png
    menu master passwd

    label local
        menu label ^Boot from hard drive
        kernel chain.c32
        append hd0

    label admin
        menu label ^Administrative tools
        kernel vesamenu.c32
        append pxelinux.cfg/admin
        menu passwd

    Please note that both hashed passwords (starting with $4$) should be on the previous lines. There’s just not enough space for it to show correctly.

  9. Create a file called “admin” that contains the following:
    default vesamenu.c32
    timeout 40
    prompt 0
    noescape 1

    menu title Administrative Tools
    menu background menu.png
    menu master passwd

    label memtest
        menu label ^Memory tester
        kernel memtest

    Once again the hashed password (starting with $4$) should be on the previous line. There’s just not enough space for it to show correctly.

If you boot any computer on your network off the NIC, you should see something like this:

Picture of netboot menu

Picture of netboot menu asking for a password

Picture of Administrative tools menu

So now you have a double layered menu system with a password required to get to the second layer. For reference’ sake, the current password is “purple”, and you can generate your own password by running sha1pass (included in the syslinux package).

If you wanted to add other administrative tools, you would add them to the file “admin” in netboot. For more information on how to add items to the menu, see this page.
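
If you end up with a lot of tools, typing the stanzas by hand gets tedious. As a rough illustration, the Python sketch below prints a stanza in the same layout as the ones in “default” and “admin” above. The function name and the idea of generating the file are my own additions; it only covers the label, menu label, kernel and append lines used in this post.

```python
def menu_entry(name, label, kernel, append=None):
    """Return one menu stanza as text, in the layout used in this post.

    `label` may contain a ^ in front of the hotkey letter; `append` is left
    out of the stanza when it is not given.
    """
    lines = [
        f"label {name}",
        f"    menu label {label}",
        f"    kernel {kernel}",
    ]
    if append:
        lines.append(f"    append {append}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Reproduces the memtest entry from the "admin" file above.
    print(menu_entry("memtest", "^Memory tester", "memtest"))
    # Reproduces the hard drive entry from the "default" file above.
    print(menu_entry("local", "^Boot from hard drive", "chain.c32", "hd0"))
```

The output can be appended to the “admin” file as-is; password lines still have to be added by hand.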

I hate virtual machines (was I hate NFS)

(Please note that you’ll probably want to read the previous post before this one)

So, I set up a new virtual machine running Fedora rather than CentOS 5.4 and migrated the services over to it. We did see an improvement, but just not enough. I went into the computer room during break, and several students had gray screens for Firefox and OpenOffice.org.

So I’ve switched us back over to the original configuration (running NFS off of the real servers). I have to admit that I’m quite curious as to what the load will be tomorrow when everyone logs in.

Thanks to those who commented on my last post. The general consensus seems to be that this just isn’t the best area to use a virtual machine.

We’ve been running the new system for a few days now and it’s much more responsive. Logins never take longer than 30 seconds, and none of the students are getting gray windows. Load during breaks now ranges from 7 to 20. I’d still love to see a much lower load, but at least we’re back to a reasonably fast system.

I hate NFS

On our network we have about 100 client computers, most of which are running Fedora 11.  We have two real servers running CentOS 5.4, using DRBD to keep the virtual machine data on the two real machines in sync and Red Hat’s cluster tools for starting and stopping the virtual machines.

We have five virtual machines running on the two real machines, only one of which is important to this post, our fileserver.

Under our old configuration, /networld was mounted on one of the real servers, and then shared to our clients using NFS. Our virtual machine, fileserver, then mounted /networld over NFS and shared it using Samba for our few remaining Windows machines (obviously, a non-optimal solution).

Diagram of old configuration

Old configuration (click on image for full size)

There were a couple of drawbacks to this configuration:

  1. I had to turn on and off a number of services as the storage clustered service moved from storage-server01 to storage-server02
  2. Samba refused to share an nfs4-mounted /networld, and, when mounted using nfs3, the locking daemon would crash at random intervals (I suspect a race condition as it mainly happened when storage-server0x was under high load).

My solution was to pass the DRBD disks containing /networld to fileserver, and allow fileserver to share /networld using both NFS and Samba, which seemed a far less hacky solution.

Diagram of current configuration

Current configuration (click on image for full size)

I knew there would be a slight hit in performance, though I’m using virtio to pass the hard drives to the virtual machine, so I would expect a maximum of 10-15% degradation.

Or not. I don’t have any hard numbers, but once we have a full class logging in, the system slows to a crawl. My guess would be that our Linux clients are running at 1/2 to 1/3 of the speed of our old configuration.

The load values on fileserver sit at about 1 during idle times and get pumped all the way up to 20-40 during breaks and computer lessons.

So now I’m stuck. I really don’t want to go back to the old configuration, but I can’t leave the system as slow as it is. I’ve done some NFS tuning based on miscellaneous sites found via Google, and tomorrow will be the big test, but, to be honest, I’m not real hopeful.

(To top it off, I spent three hours Friday after school tracking down this bug after updating fileserver to CentOS 5.4 from 5.3. I’m almost ready to switch fileserver over to Fedora.)