Tag Archives: fedora

Multiseat systems and the NVIDIA binary driver

Building mesa

Ever since our school switched to Fedora on the desktop, I’ve either used the onboard Intel graphics or AMD Radeon cards, since both are supported out of the box in Fedora. With our multiseat systems, we now need three external video cards on top of the onboard graphics on each system, so we’ve bought a large number of Radeon cards over the last few years.

Unfortunately, our local supplier has greatly reduced the number of AMD cards that they stock. In their latest price lists, they have a grand total of two Radeon cards in our price range, and one of them is almost seven years old!

This has led me to take a second look at NVIDIA cards, and I’m slowly coming back around to the concept of buying them and maybe even using their binary drivers. Our needs have changed since we first started using Linux, and NVIDIA’s binary driver does offer some unique benefits.

As we’ve started teaching 3D modeling using Blender, render time has become a real bottleneck for some of our students. We allow students to use the computers before and after school, but some of them don’t have much flexibility in their transportation and need to get their rendering done during the school breaks. Having two or three students all trying to render at the same time on a single multiseat system can lead to a sluggish system and very slow rendering. The easiest way to fix this is to do the rendering on the GPU, which Blender does support, but only using NVIDIA’s binary driver.

So about a month ago, I ordered a cheap NVIDIA card for testing purposes. I swapped it with an AMD card on one of our multiseat systems and powered it up. Fedora recognized the card using the open-source nouveau driver and everything just worked. Beautiful!

Then, a few hours later, I noticed the system had frozen. I rebooted it, and, after a few hours, it had frozen again. I moved the NVIDIA card into a different system, and, after a few hours, it froze while the original system just kept running.

Some research showed that the nouveau driver sometimes has issues with multiple video cards on the same system. There was some talk about extracting the binary driver’s firmware and using it in nouveau, but I decided to see if I could get the binary driver working without breaking our other Intel and AMD seats.

The first thing I did was upgrade the test system to Fedora 25 in hopes of taking advantage of the work done to make mesa and the NVIDIA binary driver coexist. I then installed the binary NVIDIA drivers from this repository (mainly because its version of Blender already has the CUDA kernels compiled in). The NVIDIA seat came up just fine, but I quickly found that mesa in Fedora 25 isn’t built with libglvnd enabled (libglvnd is a shim that sits between your applications and either the mesa or NVIDIA OpenGL implementation, depending on which card you’re using), so none of the seats based on open drivers came up. But, even when it was enabled, I ran into this bug, so I ended up extending this patch so it would also work with Gallium drivers and applying it.

This took me several steps closer, but apparently the X11 GLX module is not part of libglvnd and NVIDIA sets the Files section in xorg.conf to use its own GLX module (which, oddly enough, doesn’t work with the open drivers). I finally worked around this via the ugly hack of creating two different xorg.conf.d directories and telling lightdm to use the NVIDIA one when loading the NVIDIA seat.
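For anyone wanting to try the same kind of workaround, here is a minimal sketch in Python of what such a per-seat override could look like. It is an illustration under assumptions, not the configuration linked further down: the seat name `seat-nvidia` and the directory `/etc/X11/xorg.conf.d.nvidia` are hypothetical, and it assumes lightdm's per-seat `xserver-command` option combined with the X server's `-configdir` flag.

```python
def lightdm_seat_section(seat, xorg_config_dir=None):
    """Render a lightdm.conf section for one seat.

    If xorg_config_dir is given, the X server for that seat is started
    with -configdir so it reads a separate set of xorg.conf.d snippets.
    """
    lines = [f"[Seat:{seat}]"]
    if xorg_config_dir is not None:
        # Hypothetical override: only this seat sees the alternate directory.
        lines.append(f"xserver-command=X -configdir {xorg_config_dir}")
    return "\n".join(lines) + "\n"


def lightdm_config(seats):
    """seats maps a seat name to an optional alternate xorg.conf.d directory."""
    return "\n".join(lightdm_seat_section(name, cfg) for name, cfg in seats.items())


if __name__ == "__main__":
    print(lightdm_config({
        "seat0": None,                                 # open drivers, default config
        "seat-nvidia": "/etc/X11/xorg.conf.d.nvidia",  # hypothetical NVIDIA-only config
    }))
```

The point of keeping the override per seat is that the Intel and AMD seats never see the NVIDIA GLX module path at all.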

Voilà! We now have a multiseat system with one Intel built-in card using the mesa driver, two AMD cards using the mesa Gallium driver, and one NVIDIA card using the NVIDIA binary driver. And it only cost me eight hours and my sanity.

So what needs to happen to make this Just Work™? Either libglvnd needs to also include the X11 GLX module or we need a different shim to accomplish the same thing. And Fedora needs to build mesa with libglvnd enabled (but not until this bug is fixed!)

My mesa build is here and the source rpm is here. There is a manual “Provides: libGL.so.1()(64bit)” in there that isn’t technically correct, but I really didn’t want to recompile negativo17’s libglvnd to add it in and my mesa build requires that libglvnd implementation.

My xorg configs are here and my lightdm configuration is here. Please note that the xorg configs have my specific PCI paths; yours may differ.

And I do plan to write a script to automate the xorg and lightdm configs. I’ll update this post when I’ve done so.

Sidenote: As I was looking through my old posts to see if I had anything on NVIDIA, I came across a comment by Seth Vidal. He was an excellent example of what the Fedora community is all about, and I really miss him.

From NFS to LizardFS

If you’ve been following me for a while, you’ll know that we started our data servers out using NFS on ext4 mirrored over DRBD, hit some load problems, switched to btrfs, hit load problems again, tried a hacky workaround, ran into problems, dropped DRBD for glusterfs, had a major disaster, switched back to NFS on ext4 mirrored over DRBD, hit more load problems, and finally dropped DRBD for ZFS.

As of March 2016, our network looked something like this:

Old server layout

Our NFS over ZFS system worked great for three years, especially after we added SSD cache and log devices to our ZFS pools, but we were starting to overload our ZFS servers and I realized that we didn’t really have any way of scaling up.

This pushed me to investigate distributed filesystems yet again. As I mentioned here, distributed filesystems have been a holy grail for me, but I never found one that would work for us. Our problem is that our home directories (including config directories) are stored on our data servers, and there might be over one hundred users logged in simultaneously. Linux desktops tend to do a lot of small reads and writes to the config directories, and any latency bottlenecks tend to cascade. This leads to an unresponsive network, which then leads to students acting out the Old Testament practice of stoning the computer. GlusterFS was too slow (and almost lost all our data), CephFS still seems too experimental (especially for the features I want), and there didn’t seem to be any other reasonable alternatives… until I looked at LizardFS.

LizardFS (a completely open source fork of MooseFS) is a distributed filesystem that has one fascinating twist: All the metadata is stored in RAM. It gets written out to the hard drive regularly, but all of the metadata must fit into the RAM. The main result is that metadata lookups are rocket-fast. Add to that the ability to direct different paths (say, perhaps, config directories) to different storage types (say, perhaps, SSDs), and you have a filesystem that is scalable and fast.
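As a rough illustration of the path-to-storage-type idea, the Python sketch below renders a goal definition and the command that would apply it to a directory. Everything specific here is an assumption: the goal id `11`, the goal name `ssd_two`, the chunkserver label `ssd`, and the mount path are all hypothetical, and the `id name : labels` line format and the `lizardfs setgoal -r` invocation reflect my understanding of LizardFS's goal configuration, so check them against the documentation for your version.

```python
def goal_line(goal_id, name, labels):
    """Render one goal definition in the assumed 'id name : labels' format."""
    return f"{goal_id} {name} : {' '.join(labels)}"


def setgoal_command(goal_name, path, recursive=True):
    """Build (but do not run) the command that assigns a goal to a path."""
    cmd = ["lizardfs", "setgoal"]
    if recursive:
        cmd.append("-r")  # apply to everything below the directory
    cmd += [goal_name, path]
    return cmd


if __name__ == "__main__":
    # Hypothetical goal: keep two copies, both on chunkservers labelled "ssd".
    print(goal_line(11, "ssd_two", ["ssd", "ssd"]))
    # Hypothetical path: one user's config directory on the mounted filesystem.
    print(" ".join(setgoal_command("ssd_two", "/mnt/lizardfs/home/student/.config")))
```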

LizardFS does have its drawbacks. You can run hot backups of your metadata servers, but only one will ever be the active master at any one time. If it goes down, you have to manually switch one of the replicas into master mode. LizardFS also has a very complicated upgrade procedure. First the metadata replicas must be upgraded, then the master and finally the clients. And finally, there are some corner cases where replication is not as robust as I would like it to be, but they seem to be well understood and really only seem to affect very new blocks.

So, given the potential benefits and drawbacks, we decided to run some tests. The results were instant… and impressive. A single user’s login time on a server with no load… doubled. Instead of five seconds, it took ten for them to log in. Not good. But when a whole class logged in simultaneously, it took only 15 seconds for them to all log in, down from three to five minutes. We decided that a massive speed gain in the multiple user scenario was well worth the speed sacrifice in the single-user scenario.
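To put those numbers side by side, here is the arithmetic as a tiny Python sketch. It uses only the timings quoted above; the function name is just for illustration.

```python
def speedup(before_seconds, after_seconds):
    """How many times faster the new setup is (values below 1 mean slower)."""
    return before_seconds / after_seconds


if __name__ == "__main__":
    print("single user:", speedup(5, 10))             # 0.5, i.e. twice as slow
    print("whole class, best case:", speedup(3 * 60, 15))   # 12.0
    print("whole class, worst case:", speedup(5 * 60, 15))  # 20.0
```

So the trade is a 2x slowdown for a lone login against a 12x to 20x speedup when a whole class logs in at once.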

Another bonus is that we’ve gone from two separate data servers with two completely different filesystems (only one of which ever had high load) to five data servers sharing the load while serving out one massive filesystem, giving us a system that now looks like this:

New server setup

New server layout

So, six months on, LizardFS has served us well, and will hopefully continue to serve us for the next (few? many?) years. The main downside is that Fedora doesn’t have LizardFS in its repositories, but I’m thinking about cleaning up my spec and putting in a review request.

Updated to add graphics of old and new server layouts, info about Fedora packaging status, LizardFS bug links, and remove some grammatical errors

Flock 2016

Man and woman driving a horse-drawn buggy

Downtown Kraków

I have just returned from a vacation in beautiful Kraków, where, entirely coincidentally, there just happened to be a Fedora conference! My family and I enjoyed the amazing sights around Kraków (if you haven’t visited the salt mine, you really should), but my personal highlight was getting to attend my first Flock, where I got to meet people face to face who I’d only previously talked with via IRC and email.

I got the chance to speak about how we use Fedora in the classroom in our school (slides here). There were some excellent questions from the audience at the end, and I realized (again!) that my biggest problem is creating decent documentation about what we’re doing so others can follow.

One of my goals over the next year is to make sure that our work is easily reproducible in other schools, both from the sysadmin side and from the educational side.

My biggest take-away from the conference is that Fedora is moving into some very interesting times as it starts to expand from rpms being the only system delivery mechanism. I’m very interested in ostree with its concept of a read-only system partition and the work they’re doing on layered trees so you can have multiple system images branching off of one base image.

I’d really like to thank the event organizers for all the work they did putting Flock together, the design team for the beautiful t-shirts, and the Fedora community for just being great. And, while I’m at it, I’d like to extend personal thanks to Rafał Luzynski and his wife, Maja, for their hospitality.

Talk – Using Fedora in the classroom

Spreadsheet assignment

So I’m sitting here in Kraków, doing some last-minute preparation for my talk (Fedora in the Classroom) at the upcoming Flock conference next week.

I’ll be looking at why we use Fedora in our school, what tools we use to set up and maintain our workstations, and the actual subjects that we teach our students, complete with actual projects[1] that our students have done.

If you’re a teacher looking for ways to use open source software in the classroom, an administrator looking for a computer curriculum that emphasizes creativity and comprehension over memorization and rote learning, or you’re just interested in seeing how Fedora is effectively used in a school environment, please do come check it out.

 
[1] Projects have been anonymized to protect student privacy

Notes on a mass upgrade to Fedora 23

Picture of Fedora 23 desktop

Fedora 23

One of the hardest parts of running Fedora in a school setting is keeping on top of the upgrades, and I ended up falling a few months behind. Fedora 23 was released back in November, and it took me until February to start the upgrade process.

For our provisioning process, we’ve switched from a custom koji instance to ansible (with our plays on github), and this release was the first time I was really able to take advantage of it. I changed our default kickstart to point to the Fedora 23 repositories, installed it on a test system, ran ansible on it, and voilà, I had a working Fedora 23 setup, running perfectly with all our school’s customizations. It was the easiest upgrade experience I’ve ever had!

Well, mostly.

As usual, the moment you think everything is perfect is the moment everything goes wrong. On our multiseat systems, we have three external AMD graphics cards along with the internal Intel graphics. The first bug I noticed was that the Intel card wasn’t doing any graphics acceleration. It turns out that VGA arbitration is automatically turned on if you have more than one video card, and Intel cards don’t support it in DRI2. DRI3 does handle arbitration just fine, but it was (and still is) disabled in the latest xorg-x11-drv-intel in the updates repository. Luckily for me, there’s a build in koji that re-enables DRI3. Problem solved.

The second bug was…odd. While we use gnome-shell as the default desktop environment in the school, we use lightdm for logging in, mainly because of its flexibility. We run xscreensaver in the login screen (and only in the login screen) to make it clear which computers are off, which are on, and which are logged in. GDM doesn’t support xscreensaver, but lightdm does. And this brings us back to the bug. On the Intel seat, moving the mouse or pressing a key would stop the screensaver as expected, but the screen would remain black except for the username control. It seems that the “VisibilityNotify” event isn’t being honored by the driver (though don’t ask me why it should be passed down to the driver). I filed a bug, and then finally figured out that fading xscreensaver back in works around the problem.

The third bug is even stranger. On the teacher’s machine, we have a small script that starts x11vnc (giving no control to anyone connecting to it) so the teacher can give a demonstration to the students. But after installing Fedora 23 on the teacher’s machine, the demo kept showing the same three frames over and over. The teacher’s system isn’t multiseat and is using the built-in Intel graphics, so, oddly enough, disabling DRI3 fixed the problem. I filed another bug.
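The script itself isn't shown here, but a view-only x11vnc session can be sketched along these lines. This is a hypothetical Python wrapper, not the school's script: the display `:0` and the choice of `-shared` and `-forever` are assumptions, while `-viewonly` is the x11vnc flag that stops connected clients from sending any input.

```python
import subprocess


def x11vnc_demo_command(display=":0"):
    """Build an x11vnc command line that lets viewers watch but not control."""
    return [
        "x11vnc",
        "-display", display,  # X display to share (assumed to be :0)
        "-viewonly",          # clients can only watch
        "-shared",            # allow many students to connect at once
        "-forever",           # keep serving after the first client disconnects
    ]


def start_demo(display=":0"):
    """Start the demo server in the background and return the process handle."""
    return subprocess.Popen(x11vnc_demo_command(display))
```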

When upgrading the staff room systems, I ran into a bug in which cups runs screaming into the night (ok, slight exaggeration) if you have a server announcing printers over both the old cups and new dnssd protocols. Since we don’t have any pre-F21 systems any more, I’ve just disabled the old cups protocol on the server.

And, finally, my principal, who teaches computers to grades 11 and 12, came in to ask me why LibreOffice was crashing for a couple (and only a couple) of his students when they were formatting cells on a spreadsheet that he gave them. After some fancy footwork involving rm’d .config/libreoffice directories and files saved into random odd formats and then back into ods, we finally managed to format the cells without a crash. Lovely.

All this brings me back to ansible. In each of the bugs that required changes to the workstations, all I had to do was update the ansible scripts and push the changes out. Talk about painless! Ansible has made this job so much easier!

And I do want to finish by saying that these bugs are part of the reason that I love Fedora. With Fedora, I have the freedom to fix these problems myself. For both the cups bug and the xscreensaver bug, I was able to dig into the source code to start tracking down where the problem lay and come up with a workaround. And if I can just get the LibreOffice bug to reproduce, I could get a crash dump off of it and possibly figure it out too. Hurrah for source code!

Virtualizing Windows (and simplifying my life)

Picture of fireworks

Freedom

At our school, we’ve been running Fedora on most of the desktops since Fedora 8, but the one department that’s stuck with Windows is the accounting department, mainly because their software is Windows-only.  This has long been a problem because most of our infrastructure is built around Linux and we haven’t put nearly as much energy into making sure Windows systems are maintained properly.

Obviously, this led to problems that started out small, but grew until the systems were bordering on unusable.  When it reached the point that we were considering yet another reinstall of Windows, I suggested switching the accountants over to Fedora and having them use a virtual machine for the software that required the other OS.

It took a few days to get something that worked, and another week (including one very late night) to tie down the little glitches and get the virtual machine beyond just-usable to easy-to-use.

I started with VirtualBox, but there were a number of issues with stability, so I decided to take another look at QEMU.  I thought about using libvirt, but one of my requirements was that everything needed to run under the user’s permissions, so it turned out to be easier to run qemu-kvm directly.  I used SPICE and installed the guest agent, which gave us a far better experience with QEMU than the last time I used it for a desktop OS (which, granted, was over five years ago).

Most of my time was spent fixing problems inherent to Windows 7 itself, rather than the virtualization process.  It turns out that there are bugs in how it handles network printers, causing delays every time you want to print.  Oddly enough, the fix was pretty simple, but it took a while to figure it out.  There was also the bug where network drives aren’t mapped properly if the system boots so quickly that the network isn’t up in time, which was only fixable by using a batch file for mapping the network drives.

One change I made was to insist that we use throw-away snapshots for day-to-day work (the data is stored on a network drive) and only keep changes when we’re updating the accounting software.  This should help protect us from viruses and malware that can’t be easily removed.
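For readers who want to try the same approach, here is a minimal Python sketch of launching qemu-kvm directly as an ordinary user with SPICE and throw-away disk changes. The actual start scripts aren't reproduced here, so every value is an assumption: the image path, memory size, and SPICE port are hypothetical, and using QEMU's `-snapshot` flag is just one way to get throw-away sessions, not necessarily the one used at the school.

```python
import subprocess


def qemu_command(disk_image, memory_mb=4096, spice_port=5930, keep_changes=False):
    """Build a qemu-kvm command line for a desktop guest with SPICE."""
    cmd = [
        "qemu-kvm",
        "-m", str(memory_mb),
        "-drive", f"file={disk_image},format=qcow2",
        "-vga", "qxl",
        # SPICE display, listening on localhost only (port is hypothetical).
        "-spice", f"port={spice_port},addr=127.0.0.1,disable-ticketing=on",
        # Channel used by the SPICE guest agent inside the VM.
        "-device", "virtio-serial-pci",
        "-chardev", "spicevmc,id=vdagent,name=vdagent",
        "-device", "virtserialport,chardev=vdagent,name=com.redhat.spice.0",
    ]
    if not keep_changes:
        # Write disk changes to a temporary file that is discarded on exit.
        cmd.append("-snapshot")
    return cmd


def run_vm(disk_image, keep_changes=False):
    """Run the VM under the current user's permissions and wait for it to exit."""
    return subprocess.call(qemu_command(disk_image, keep_changes=keep_changes))
```

Day-to-day use would call `run_vm(image)`, and only a software update session would pass `keep_changes=True`.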

The best part of all this is that the new accounting VM and the scripts necessary to start it are sitting in a network folder only accessible by the accountants.  This means that they can now do their work from any computer in the school, if necessary, while still protecting them.

And I’m no longer stuck keeping unmanaged Windows systems running.  What a way to close out the year!

Colorful Fireworks by 久留米市民(Kurume-Shimin) used under a CC BY-SA 3.0 unported license

Solving the mystery of the disappearing bluetooth device

This is a true[1] story

One of the features my laptop comes with is Bluetooth, which I’ve found to be quite handy considering all the highly important uses I have for Bluetooth (using Bluetooth tethering on my phone when traveling, controlling my presentations with my phone, using a Wii-mote for playing SuperTuxKart… er, a portable Bluetooth controller with built-in accelerometer to analyze the consistency of the matrices used when rendering three-dimensional objects onto a two-dimensional field).

About three months ago, I started to run into problems. Not the easy kind of problem where “BUG: unable to handle kernel paging request at 0000ffffd15ea5e” brings the laptop to an abrupt stop, but instead the kind of problem that causes real trouble.

My Bluetooth module starts to randomly reset itself. I’ll be working merrily, trying to connect my phone or the… portable Bluetooth controller… and, halfway through the process, it will hang. Kernel logs show that the Bluetooth module has been unplugged from the USB bus and then reconnected. Which, when you think about it, makes a whole lot of sense, given that the Bluetooth module is built into the WiFi card which is screwed onto the motherboard.

When faced with kernel logs that boggle the mind, the most logical thing to do is downgrade the kernel. I know that I was able to successfully… analyze the matrices used for, oh, whatever it was… back at the beginning of June, which means I had working Bluetooth on June 1. Let’s see what kernel was latest then, download and install it, boot from it, and…

kernel: usb 8-4: USB disconnect, device number 3
kernel: usb 8-4: new full-speed USB device number 4 using ohci-pci

#$@&%*!

Ok, the hardware must be dying.  Stupid Atheros card.  No idea why it’s just the Bluetooth and not the WiFi as well, but we’re in Ireland and I’m on eBay, so I’ll just order another one.  Made by a different company.  A week later, a slightly used Ralink combo card shows up. I plug it in, fire her up, and…

kernel: usb 8-4: USB disconnect, device number 3
kernel: ohci-pci 0000:00:13.0: HC died; cleaning up
kernel: ohci-pci 0000:00:13.0: frame counter not updating; disabled

Double #$@&%*! Now the Bluetooth module is completely gone and the only way to get it back is to reboot. Grrrrr.

At this point I’ve got a hammer in my hand, my laptop in front of me, and the only thing keeping me from submitting a video for a new OnePlus One is my wife warning me that we’re not going to be buying me a new laptop any time this decade.

So I take a deep breath, calmly return the hammer to the toolbox (no, dear, I have no idea how that dent got on the toolbox), and decide to instead go down the road less traveled. I open up Fedora’s bugzilla and start preparing my bug report, taking special care to only use words that I’d be willing to say in front of my children. “…so the Bluetooth module keeps getting disconnected. It’s almost like the USB bus is cutting its power for some stupid…”

Wait a minute! Just before we traveled to Ireland, I remember experimenting with PowerTOP. And PowerTOP has this cool feature that allows you to automatically enable all power saving options on boot. And I might have enabled it. So I check, and, yes I have turned on autosuspend for my Bluetooth module. I turn it off, try to connect my… portable Bluetooth controller… and it works, first time. I do some… matrix analysis… with it and everything continues to work perfectly.
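If you suspect the same thing on your own machine, a quick check is to look at each USB device's `power/control` file in sysfs, where `auto` means the kernel may autosuspend the device and `on` means it stays powered. The Python sketch below is a minimal illustration of that check; the function names are made up for this example, and writing to sysfs on a real system needs root.

```python
from pathlib import Path

USB_DEVICES = Path("/sys/bus/usb/devices")


def autosuspended_devices(sysfs_usb=USB_DEVICES):
    """Return the names of USB devices whose power/control is set to 'auto'."""
    found = []
    for control in sorted(Path(sysfs_usb).glob("*/power/control")):
        if control.read_text().strip() == "auto":
            found.append(control.parent.parent.name)
    return found


def disable_autosuspend(device, sysfs_usb=USB_DEVICES):
    """Write 'on' to power/control so the device is kept powered."""
    (Path(sysfs_usb) / device / "power" / "control").write_text("on\n")


if __name__ == "__main__":
    for name in autosuspended_devices():
        print(f"{name}: autosuspend enabled")
```

A change made this way lasts only until the next reboot, so a boot-time PowerTOP setting would turn autosuspend back on.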

So I am an idiot. I close the page with the half-finished bug report and go to admit to my wife that I just wasted €20 on a WiFi card that I didn’t really need.  And, uh, if any Atheros or Ralink people read this, well, I’m sorry for any negative thoughts I may have had about your WiFi cards.

[1] Well, mostly true, anyway. Some of the details might be mildly exaggerated.