Benchmarking small file performance on distributed filesystems

The actual benches

As I mentioned in my last post, I’ve spent the last couple of weeks doing benchmarks on the GlusterFS, CephFS and LizardFS distributed filesystems, focusing on small file performance. I also ran the same tests on NFSv4 to use as a baseline, since most Linux users looking at a distributed filesystem will be moving from NFS.

The benchmark I used was compilebench, which was designed to emulate real-life disk usage by creating a kernel tree, simulating a compile of the tree, reading all the files in the tree, and finally deleting the tree. I chose this benchmark because it does a lot of work with small files, very similar to what most file access looks like in our school. I did modify the benchmark to do only one read rather than the default of three, to match the single creation, compile simulation and deletion performed on each client.
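To make the four phases concrete, here is a minimal Python sketch of the same create, compile, read and delete pattern on a small tree of files. It is not compilebench and does not reproduce its workload or file-size distribution; the function name `run_phases`, the directory and file counts, and the 4 KB file size are all assumptions chosen purely for illustration.

```python
import os
import shutil
import tempfile
import time


def run_phases(root, n_dirs=5, files_per_dir=20, file_size=4096):
    """Time four small-file phases under `root`.

    Returns (timings, bytes_read), where timings maps phase name to seconds.
    All sizes and counts are illustrative, not compilebench's real values.
    """
    tree = os.path.join(root, "tree")
    payload = b"x" * file_size
    timings = {}

    # Phase 1: create a tree of many small "source" files.
    start = time.perf_counter()
    for d in range(n_dirs):
        dirpath = os.path.join(tree, f"dir{d}")
        os.makedirs(dirpath)
        for f in range(files_per_dir):
            with open(os.path.join(dirpath, f"file{f}.c"), "wb") as fh:
                fh.write(payload)
    timings["create"] = time.perf_counter() - start

    # Phase 2: simulate a compile by writing one ".o" file per ".c" file.
    start = time.perf_counter()
    for dirpath, _dirs, files in os.walk(tree):
        for name in files:
            if name.endswith(".c"):
                obj = os.path.join(dirpath, name[:-2] + ".o")
                with open(obj, "wb") as fh:
                    fh.write(payload)
    timings["compile"] = time.perf_counter() - start

    # Phase 3: read every file in the tree exactly once.
    start = time.perf_counter()
    bytes_read = 0
    for dirpath, _dirs, files in os.walk(tree):
        for name in files:
            with open(os.path.join(dirpath, name), "rb") as fh:
                bytes_read += len(fh.read())
    timings["read"] = time.perf_counter() - start

    # Phase 4: delete the whole tree.
    start = time.perf_counter()
    shutil.rmtree(tree)
    timings["delete"] = time.perf_counter() - start

    return timings, bytes_read


if __name__ == "__main__":
    # A temporary directory stands in for a mounted distributed filesystem.
    with tempfile.TemporaryDirectory() as tmp:
        timings, _ = run_phases(tmp)
        for phase, seconds in timings.items():
            print(f"{phase}: {seconds:.4f}s")
```

Pointing `root` at a directory on a mounted filesystem would exercise that filesystem instead of local disk, though the page cache will flatter the read phase unless it is dropped between phases.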

The benchmarks were run on three i7 servers with 32GB of RAM, connected using a gigabit switch, running CentOS 7. GlusterFS is version 3.8.14, CephFS is version 10.2.9, and LizardFS is version 3.11.2. For GlusterFS, CephFS and LizardFS, the three servers operated as distributed data servers with three replicas per file. I first had one server connect to the distributed filesystem and run the benchmark, giving us the single-client performance. Then, to emulate 30 clients, each server made ten connections to the distributed filesystem and ten copies of the benchmark were run simultaneously on each server.
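The "ten copies at once per server" setup can be sketched as a small Python launcher. This is only an assumed approach, not the script actually used for these tests: `run_copies` and `make_command` are hypothetical names, and the placeholder command just prints a line. In a real run, each command would be the benchmark pointed at that client's own mount of the distributed filesystem.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_copies(make_command, copies=10):
    """Run `copies` commands concurrently, one per emulated client.

    `make_command(i)` must return the argument list for client number i.
    Returns a list of (exit_code, stdout) tuples in client order.
    """
    def run_one(index):
        result = subprocess.run(
            make_command(index), capture_output=True, text=True
        )
        return result.returncode, result.stdout

    # One worker thread per client, so all copies run at the same time.
    with ThreadPoolExecutor(max_workers=copies) as pool:
        return list(pool.map(run_one, range(copies)))


if __name__ == "__main__":
    # Placeholder command: a real run would invoke the benchmark here,
    # with a separate working directory for each client.
    def placeholder(index):
        return [sys.executable, "-c", f"print('client {index} finished')"]

    for code, output in run_copies(placeholder, copies=10):
        print(code, output.strip())
```

Running this launcher on all three servers at roughly the same moment gives the 30 simultaneous clients described above.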

For the NFS server, I had to do things differently because there are apparently some major problems with connecting NFS clients to an NFS server on the same system. For this one, I set up a fourth server that operated just as an NFS server.

All of the data was stored on XFS partitions on SSDs for speed. After running the benchmarks with one distributed filesystem, it was shut down and its data deleted, so each distributed filesystem had the same disk space available to it.

The NFS server was set up to export its shares async (also for speed). The LizardFS clients used the recommended mount options, while the other clients just used the defaults (I couldn’t find any recommended mount options for GlusterFS or CephFS). CephFS was mounted using the kernel module rather than the FUSE filesystem.

So, first up, let’s look at single-client performance:

Initial creation didn’t really have any surprises, though I was really impressed with CephFS’s performance. It came really close to matching the performance of the NFS server. Compile simulation also didn’t have many surprises, though CephFS seemed to start hitting performance problems here. LizardFS initially surprised me in the read benchmark, though I realized later that the LizardFS client will prioritize a local server if the requested data is on it. I have no idea why NFS was so slow, though. I was expecting NFS reads to be the fastest. LizardFS also did really well with deletions, which didn’t surprise me too much. LizardFS was designed to make metadata operations very fast. GlusterFS, which did well through the first three benchmarks, ran into trouble with deletions, taking almost ten times longer than LizardFS.

Next, let’s look at multiple-client performance. With these tests, I ran 30 clients simultaneously, and, for the first three tests, summed up their speeds to give me the total speed that the server was giving the clients. CephFS ran into problems during its test, claiming that it had run out of disk space, even though (at least as far as I could see) it was only using about a quarter of the space on the partition. I went ahead and included the numbers generated before the crash, but I would take them with a grain of salt.
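The aggregate figure is just the sum of what each client reported for a phase. A minimal Python sketch, with the function name `aggregate_throughput` and the sample values being placeholders rather than measured results:

```python
def aggregate_throughput(per_client_mb_s):
    """Sum per-client throughput (MB/s) into the total delivered to all clients."""
    return sum(per_client_mb_s)


if __name__ == "__main__":
    # Placeholder numbers for three clients, not real benchmark output.
    sample = [1.5, 2.0, 2.5]
    print(f"aggregate: {aggregate_throughput(sample):.1f} MB/s")
```

Summing only makes sense when the clients ran over the same time window. If some clients finish or fail early, as with CephFS here, the total overstates or understates what the filesystem sustained.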

Once again, initial creation didn’t have any major surprises, though NFS did really well, giving much better aggregate performance than it did in the earlier single-client test. LizardFS also bettered its single-client speed, while GlusterFS and CephFS both were slower creating files for 30 clients at the same time.

LizardFS started to do very well with the compile benchmark, with an aggregate speed over double that of the other filesystems. LizardFS flew with the read benchmark, though I suspect some of that is due to the client preferring the local data server. GlusterFS managed to beat NFS, while CephFS started running into major trouble.

The delete benchmark seemed to be a continuation of the single-client delete benchmark with LizardFS leading the way, NFS just under five times slower, and GlusterFS over 25 times slower. The CephFS benchmarks had all failed by this point, so there’s no data for it.
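Statements like "five times slower" are ratios of elapsed time against the fastest filesystem. The sketch below shows that arithmetic in Python, assuming deletion is measured as elapsed seconds. The name `slowdown_vs_fastest` and the input values are hypothetical and are not the numbers behind the charts.

```python
def slowdown_vs_fastest(elapsed_seconds):
    """Map each name to its elapsed time divided by the fastest elapsed time."""
    fastest = min(elapsed_seconds.values())
    return {name: t / fastest for name, t in elapsed_seconds.items()}


if __name__ == "__main__":
    # Arbitrary placeholder timings for three unnamed filesystems.
    sample = {"fs_a": 10.0, "fs_b": 20.0, "fs_c": 40.0}
    for name, factor in slowdown_vs_fastest(sample).items():
        print(f"{name}: {factor:.1f}x")
```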

I would be happy to re-run these tests if someone has suggestions on optimizations especially for GlusterFS and CephFS.

Summer work

The dog and the river

It’s summer, we’re in the US, and I’m thoroughly enjoying the time with my family. It’s been quite a while since we’ve seen everyone here, and we’ve all been having a blast. The one downside (though my wife is convinced it’s an upside) is that my parents have limited internet, so my work time has been, out of necessity, minimal. It has nothing to do, I assure you, with our beautiful beach on the river.

I have managed to push through a bugfix LizardFS update for Fedora and EPEL, and I’ve been working on some benchmarks comparing GlusterFS, LizardFS and NFS. I’ve been focusing on the compilebench benchmark, which basically simulates compiling and reading kernel trees, and is probably the closest thing to our usage pattern at the school (lots of relatively small files being written, changed, read and deleted).

Using NFS isn’t really fair, since it’s not distributed, but it’s still the go-to for networked storage in the Linux world, so I figured it would be worth getting an idea of exactly how much slower the alternatives are. If I can get Ceph up and running, I’ll see if I can benchmark it too.

In other news, I have the privilege of attending Flock again this year. I’m really looking forward to getting a better feel for Fedora’s movement towards modules, something that I hope to put into practice over the next year on the systems at school.

Hopefully, I’ll get a chance to get my benchmarks out within the next couple of weeks, and I’m sure I’ll have a lot to say about Flock.

GlusterFS Madness

Background
As mentioned in Btrfs on the server, we have been using btrfs as our primary filesystem for our servers for the last year and a half or so, and, for the most part, it’s been great. There have only been a few times that we’ve needed the snapshots that btrfs gives us for free, but when we did, we really needed them.

At the end of the last school year, we had a bit of a problem with the servers and came close to losing most of our shared data, despite using DRBD as a network mirror. In response to that, we set up a backup server which has the sole job of rsyncing the data from our primary servers nightly. The backup server is also using btrfs and doing nightly snapshots, so one of the major use-cases behind putting btrfs on our file servers has become redundant.

The one major problem we’ve had with our file servers is that, as the number of systems on the network has increased, our user data server can’t handle the load. The configuration caching filesystem (CCFS) I wrote has helped, but even with CCFS, our server was regularly hitting a load of 10 during breaks and occasionally getting as high as 20.

Switching to GlusterFS
With all this in mind, I decided to do some experimenting with GlusterFS. While we may have had high load on our user data server, our local mirror and shared data servers both had consistently low loads, and I was hoping that GlusterFS would help me spread the load between the three servers.

The initial testing was very promising. When using GlusterFS over ext4 partitions using SSD journaling on just one server, the speed was just a bit below NFS over btrfs over DRBD. Given the distributed nature of GlusterFS, adding more servers should increase the speed linearly.

So I went ahead and broke the DRBD mirroring for our eight 2TB drives and used the four secondary DRBD drives to set up a production GlusterFS volume. Our data was migrated over, and we used GlusterFS for a week without any problems. Last Friday, we declared the transition to GlusterFS a success, wiped the four remaining DRBD drives, and added them to the GlusterFS volume.

I started the rebalance process for our GlusterFS volume Friday after school, and it continued to rebalance over the weekend and through Monday. On Monday night, one of the servers crashed. I went over to the school to power cycle the server, and, when it came back up, continued the rebalance.

Disaster!
Tuesday morning, when I checked on the server, I realized that, as a result of the crash, the rebalance wasn’t working the way it should. Files were being removed from the original drives but not being moved to the new drives, so we were losing files all over the place.

After an emergency meeting with the principal (who used to be the school’s sysadmin before becoming principal), we decided to ditch GlusterFS and go back to NFS over ext4 over DRBD. We copied over the files from the GlusterFS partitions, and then filled in the gaps from our backup server. Twenty-four sleepless hours later, the user data was back up, and the shared data was up twenty-four sleepless hours after that.

Lessons learned

  1. Keep good backups. Our backups allowed us to restore almost all of the files that the GlusterFS rebalance had deleted. The only files lost were the ones created on Monday.
  2. Be conservative about what you put into production. I’m really not good at this. I like to try new things and to experiment with new ideas. The problem is that I can sometimes put things into production without enough testing, and this is one result.
  3. Have a fallback plan. In this case, our fallback was to wipe the server and restore all the data from the backup. It didn’t quite come to that as we were able to recover most of the data off of GlusterFS, but we did have a plan if it did.
  4. Avoid GlusterFS. Okay, maybe this isn’t what I should have learned, but I’ve already had one bad experience with GlusterFS a couple of years ago where its performance just wasn’t up to scratch. For software that’s supposedly at a 3.x.x release, it still seems very beta-quality.

The irony of this whole experience is that by switching the server filesystems from btrfs to ext4 with SSD journals, the load on our user data server has dropped to below 1.0. If I’d just made that switch, I could have avoided two days of downtime and a few sleepless nights.

Nuclear explosion credit – Licorne by Pierre J. Used under the CC-BY-NC 2.0 license.