
Brittle deltas – a possible solution?

Picture of broken eggs

Deltarpm is brittle. When it works correctly, it’s brilliant. But, like a tightrope walker crossing Niagara Falls while balancing an egg on his head, all it takes is one slip and…*splat*.

At the beginning of the Fedora 15 release cycle, a new version of xz was pushed in which the defaults for compression level 3 were changed (as far as I can tell, to what used to be level 4). This doesn’t cause any problems for newly compressed data, but if you decompress an rpm whose payload was compressed using old level 3 (like makedeltarpm does) and then recompress it with new level 3 (like applydeltarpm does), the compressed files no longer match. *Splat*.
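To make the failure mode concrete, here is a minimal sketch using Python's standard lzma module. It is not deltarpm's code: the payload is a made-up stand-in for an rpm's contents, and using preset 4 to play the part of "new level 3" is an assumption based on my guess above. The idea is just decompress, recompress, and compare checksums.

```python
import hashlib
import lzma


def rebuild_matches(original_xz: bytes, preset: int) -> bool:
    """Decompress a payload, recompress it, and compare checksums."""
    payload = lzma.decompress(original_xz)
    rebuilt = lzma.compress(payload, preset=preset)
    return hashlib.sha256(rebuilt).digest() == hashlib.sha256(original_xz).digest()


# Hypothetical payload standing in for an rpm's file archive.
payload = b"".join(b"file-%d: some package contents\n" % i for i in range(5000))

# "Old level 3": what the package was originally compressed with.
original = lzma.compress(payload, preset=3)

# Recompressing with the same settings reproduces the original bytes.
print("same preset:     ", rebuild_matches(original, 3))

# Recompressing with different settings (standing in for the changed
# defaults) still yields valid xz data, but the bytes need not match.
print("different preset:", rebuild_matches(original, 4))
```

Both recompressed payloads decompress to the same data. Only the first is guaranteed to match the original checksum, and that checksum is the one the package signature depends on.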

I wrote about the root problem here over a year ago, but to summarize: almost no compression algorithms ever guarantee that, over all releases, they will create the same compressed output given the same uncompressed input.

Our fix for Fedora 15 was pretty simple. Delete all of the old deltarpms in Rawhide. As long as the users have the new xz before doing a yum update, all new deltarpms will work correctly. Yay.

The problem is that this is all still extremely fragile. Take Fedora bugs #524720, #548523, and #677578 for example. All three bugs have cropped up because of mistakes in handling changes in the compression format, and it’s all a bit ridiculous. Would anyone use gzip if an old version couldn’t decompress data compressed with a newer version?

A possible solution?
There is no simple solution. So what if we change the rules? Instead of trying to keep the compression algorithms static, what if we stored just enough information in the deltas to recompress using the exact same settings, whatever they are?

For gzip, this would mean recording things like each block size, dictionary, etc. For xz, it would mean recording the LZMA2 settings. The problem is that this information is different for each compression type and the functions to extract the needed information haven’t been included in any compression libraries (to my knowledge).
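As a rough illustration of "recording the settings" for xz, the sketch below stores an explicit LZMA2 filter specification next to the data and uses it to recompress. It relies on Python's lzma module, which lets you pass the filter chain yourself. The particular values and the JSON storage format are assumptions made for the example, not anything deltarpm does. Note that this only pins the settings: it would have survived a change in what "level 3" means, but not a change in the encoder itself, which is why the real solution needs help from upstream.

```python
import json
import lzma

# Hypothetical record of the encoder settings. The keys are the ones
# Python's lzma module accepts in a filter specification.
settings = {"id": lzma.FILTER_LZMA2, "preset": 3, "dict_size": 4 * 1024 * 1024}


def compress_with(data: bytes, recorded: dict) -> bytes:
    """Compress to .xz using an explicit, recorded filter chain."""
    return lzma.compress(
        data,
        format=lzma.FORMAT_XZ,
        check=lzma.CHECK_CRC64,
        filters=[recorded],
    )


data = b"example payload " * 1000

# Store the settings alongside the delta (here: as a JSON string).
stored = json.dumps(settings)

first = compress_with(data, settings)
second = compress_with(data, json.loads(stored))

# With the same library and the same recorded settings, the output matches.
print(first == second)
```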

However, if we could write these functions and get them into the upstream libraries, it would benefit all programs that try to generate deltas. Deltarpm would continue to work when compression algorithms change. Rsync could actually delta gzipped files, even if the “--rsyncable” switch hasn’t been used in gzip.

There are a couple of possible problems with this solution. First, I’m not sure how big the extra needed information is. Obviously, for each compression format, it’s different, but, unless it’s at most 1/100th the size of the uncompressed file, storing the extra data in the deltarpm will probably not be worth the effort.

Second, no code has actually been written. In an open source world of “Show me the code”, this is obviously a major issue. I’d love to do a reference implementation for one of the simpler compression formats (like zlib), but just haven’t had the time yet.

Obviously, the best solution would be for the various upstreams to provide the necessary functions, as they understand both their algorithms and what information should be stored. However, most upstreams have enough on their plates without needing extra stuff thrown in from random blogs.

Another good solution would be for someone who is interested in deltas and compression to take on this project themselves. Any volunteers? 🙂

Broken eggs credit: Broken Eggs by kyle tsui. Used under CC BY-NC-ND

Deltarpm problems (Part II)

Crumpled piece of paper

About six weeks ago, I looked at one of the problems we currently face with deltarpm, that of colored executables.

Today, I want to look at one of the other major problems that we’ve currently papered over without really fixing. That is compression.

Background
When we switched over to using xz compression in Fedora 12, we ran into two problems, one expected and one not. The first problem was that deltarpm didn’t have the code to handle xz-compressed rpms. That was solved quickly thanks to the work that SUSE developers had already put into handling lzma-compressed rpms. We had to do a little bit of adapting to xz, but it was pretty straightforward.

The second problem popped up right after the switchover and was completely unexpected. When doing some updates on a Rawhide machine, I noticed that a number of noarch deltarpms were giving me a checksum error on rebuild (prompting a download of the full rpm). It soon became obvious that xz wasn’t producing the same compressed files on PPC and x86 machines.

A noarch rpm (one that could be installed on any architecture machine) would sometimes be randomly built on a PPC builder, and a deltarpm for that package would be generated. The deltarpm would be applied on my x86 laptop and the resulting uncompressed data would be identical to the original rpm’s uncompressed data. However, when that uncompressed data was then recompressed so that we would have the original compressed rpm, it compressed slightly differently, breaking the package signatures.

The Problem
Most compression formats don’t guarantee repeatability. They do not promise that the compressed file you generated today will be identical to the compressed file you generate tomorrow. They just promise that you’ll be able to decompress your file tomorrow.

To understand this, remember that any compression format has a standard (which must always be followed) and an algorithm (which may change slightly). Look at the two following math formulas:

(1 + 5) / 3 + (3 + 9) / 4
(3 + 9) / 4 + (1 + 5) / 3

Though they are different, they still parse to exactly the same result. Now imagine that the formulas are two different compressed files, and the result is the uncompressed file. As far as the compression format is concerned, both compressed files are valid.
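The same thing can be shown with real compressed data. This is a small sketch using Python's zlib module, with a made-up input string: two compression levels give two different byte streams, and both are valid and decompress to identical data.

```python
import zlib

# Hypothetical input; any reasonably sized data will do.
data = b"the quick brown fox jumps over the lazy dog\n" * 200

fast = zlib.compress(data, 1)   # fastest setting
best = zlib.compress(data, 9)   # best-compression setting

# The two compressed streams are different byte strings...
print("identical compressed bytes:", fast == best)

# ...but both follow the format and decompress to the same result.
print("identical uncompressed data:",
      zlib.decompress(fast) == zlib.decompress(best) == data)
```

Like the two formulas, `fast` and `best` are different "expressions" with the same "result". The format only cares about the result.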

The problem is that because deltarpm must rebuild the original compressed rpm, it’s built on the assumption that compression is repeatable. And the assumption is mostly true, mainly because gzip and bzip2 haven’t made changes to their compression algorithms in years. But xz is a much newer algorithm that is still being fine-tuned.

One advantage of this is that upstream changed xz so it is repeatable across different architectures, fixing the PPC/x86 problem. However, upstream made it very clear that they were not promising repeatability over time. They may change the compression algorithm to improve speed or compression, while still sticking to the standard.

A Related Problem
This is closely related to another problem we hit when generating deltas: compressed files in rpms.

How many files are stored on the filesystem in a compressed format? All of our man pages, to start with. A lot of game data. And more…

Guess how good our deltas are of compressed data? Pretty bad, because a small change in an uncompressed file normally creates big changes in compressed files, throwing away the benefit of deltas.
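A quick way to see this effect is to change a single byte in some data and compare the compressed results. The sketch below uses zlib on made-up, seeded pseudo-text. The byte-by-byte comparison is deliberately crude and is not how deltarpm computes a delta, but it gives a feel for how far a one-byte change spreads once the data is compressed. The exact counts depend on the input and the zlib version, so I haven't listed them here.

```python
import random
import zlib

# Build some hypothetical text-like data from a fixed word list.
random.seed(0)
words = [b"delta", b"rpm", b"payload", b"xz", b"gzip", b"header", b"file", b"page"]
old = b" ".join(random.choice(words) for _ in range(20000))

# Change exactly one byte at the very start of the data.
new = b"X" + old[1:]


def differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions where two byte strings differ (plus any length gap)."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


old_z = zlib.compress(old)
new_z = zlib.compress(new)

print("uncompressed:", differing_bytes(old, new), "of", len(old), "bytes differ")
print("compressed:  ", differing_bytes(old_z, new_z), "of", len(old_z), "bytes differ")
```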

And we can’t uncompress those files before doing a delta on them because we can’t guarantee that they will be recompressed to the exact same file (were they compressed with gzip -5? gzip -9? gzip -1?).
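One naive workaround is to guess: decompress the file, recompress it at every level, and see whether any of them reproduces the original bytes. The sketch below does that for a raw zlib stream using Python's zlib module. The man page content is made up, and real gzip files carry extra header fields (timestamp, file name) that would also need to match. It only works when the file was produced by the same implementation we are guessing with, which is exactly the guarantee we don't have.

```python
import zlib
from typing import Optional


def guess_level(compressed: bytes) -> Optional[int]:
    """Return a zlib level that reproduces `compressed` exactly, or None."""
    data = zlib.decompress(compressed)
    for level in range(0, 10):
        if zlib.compress(data, level) == compressed:
            return level
    return None


# Hypothetical man page, compressed at a level we pretend not to know.
man_page = b".TH EXAMPLE 1\n.SH NAME\nexample \\- a hypothetical man page\n" * 300
blob = zlib.compress(man_page, 5)

level = guess_level(blob)
print("a level that reproduces the file:", level)
```

The level found may not be the one originally used, since several levels can produce identical output for a given input. For delta purposes that doesn't matter, as long as recompressing gives back the exact same bytes. If `guess_level` returns None, we are back to shipping the compressed file as-is.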

The Current Situation
So where does this leave us? In a bit of a mess with deltarpms if xz does change its compression output. It is a solvable mess, but it’s still a mess.

We also have lousy deltas if there are any compressed files in the rpm. In many ways, this problem is more immediate.

What we need is some way to recompress data in such a way that guarantees that it’s identical to the original compressed data…