Last Comment Bug 504624 - Consider using the Courgette algorithm in product updates
: Consider using the Courgette algorithm in product updates
Status: NEW
: sec-want
Product: Toolkit
Classification: Components
Component: Application Update (show other bugs)
: Trunk
: All All
: -- enhancement with 8 votes (vote)
: ---
Assigned To: Nobody; OK to take it and work on it
:
Mentors:
http://dev.chromium.org/developers/de...
: 662102 910219 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2009-07-16 11:38 PDT by Gary Kwong [:gkw] [:nth10sd]
Modified: 2016-03-23 16:54 PDT (History)
46 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments

Description Gary Kwong [:gkw] [:nth10sd] 2009-07-16 11:38:13 PDT
Courgette is supposedly an open source diff'ing algorithm by Google. This enhancement is to consider using it over whatever we're using now, to see if it's better than whatever we have.

http://blog.chromium.org/2009/07/smaller-is-faster-and-safer-too.html

http://dev.chromium.org/developers/design-documents/software-updates-courgette
Comment 1 Serge Gautherie (:sgautherie) 2009-07-18 03:24:42 PDT
See 'Test MAR generation with Courgette?' at
http://groups.google.com/group/mozilla.dev.platform/browse_thread/thread/2d19b1795e6e0fc0
Comment 2 Reed Loden [:reed] (use needinfo?) 2010-03-03 14:04:31 PST
Note that there are allegations of patent infringement relating to Courgette, including a patent lawsuit brought against Google. (See Red Bend Ltd. and Red Bend Software Inc. v. Google Inc., Civ. Action No. 09-cv-11813 (E.D. Ma.)).
Comment 3 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-06-04 16:15:30 PDT
*** Bug 662102 has been marked as a duplicate of this bug. ***
Comment 4 Vic 2011-06-04 17:15:04 PDT
cross-posted from Bug 662102

Making updates to Aurora smaller size - update compression algorithm - Courgette 

"Rather than push put a whole new 10MB update, we send out a diff that takes the previous version of Google Chrome and generates the new version." ... 

Right now every day there is a 17MB update in Aurora.

"We want smaller updates because it narrows the window of vulnerability. If the update is a tenth of the size, we can push ten times as many per unit of bandwidth," 

Google has open-sourced the Courgette code, so now Firefox can use it.

See: http://www.theregister.co.uk/2009/10/28/red_bend_sues_google/
Comment 5 Vic 2011-06-04 17:20:26 PDT
Nick Thomas [:nthomas] wrote:

"FWIW, Mozilla apps already support differential updates, which we call a partial update. So if you keep up to date then there is an update of a few MB, rather than 17 or so. eg in http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2011/06/2011-06-03-04-mozilla-aurora/:
firefox-6.0a2.en-US.win32.complete.mar                                    18M  
firefox-6.0a2.en-US.win32.partial.20110602042006-20110603042008.mar      2.5M  

It's certainly possible that courgette can make smaller diffs, particularly for PGO rearranging libxul.dll, but that's for bug 504624 to look at."

--------------

In Aurora Channel, when I click "About Firefox", every day it pushes the whole 17 MB update. Or maybe once every 2 days. I am sure there are not that many fixes landed, but I have to download the whole 17 MB every 1-2 days.
Comment 6 Ted Mielczarek [:ted.mielczarek] 2011-06-05 06:36:25 PDT
Nick's note was to tell you that we already use binary diffs for partial updates. The fact that you're getting full updates is a separate issue. It's possible that using Courgette could make our partial updates somewhat smaller, but it's not going to have a huge impact.
Comment 7 christian 2011-06-05 16:44:52 PDT
A couple of points:

1. We only do binary deltas to the last release. With Aurora building daily, if you miss one you will likely get the full update. So, unless you use Aurora and install an update every day you'll get the full. And (I haven't checked this) there is likely the potential that the update check + install nag timers are skewed in just the right way with the build times so some people always are > 1 release behind and always get the full.

We are tracking building patch packages for more than the last release in bug 575317.

2. As Reed says in comment 2, Google is currently being sued for using it (see http://www.computerworlduk.com/news/it-business/17350/google-faces-chrome-legal-battle/ as well). We should wait to see how that turns out or at the very least need our own lawyers to do a legal review.
Comment 8 Nick Thomas [:nthomas] 2011-06-05 17:43:49 PDT
I've filed bug 662208 for the 'always get a complete' issue, since it's not related to Courgette.
Comment 9 Jean-Marc Desperrier 2011-07-23 14:56:27 PDT
(In reply to comment #7)
> Google is currently being sued for using it
> [...]. We should wait to see how that turns out [...]

Since it's hard to find any news more recent than 2009, and some might be left wondering what happened, here's the info I found.
In short, the litigation is still going on, and is not looking very rosy for Google.

Google failed last year to get the patent office to invalidate the patent (despite finding some potential prior art)
Red Bend failed to get an injonction to stop Google from publishing the source.
This will certainly go on for quite some time more.
Comment 10 Matt H 2011-12-21 15:07:21 PST
Join the Firefox Aurora channel! Download 20 MB updates nightly and click through the intrusive installer every time. Why not update when the user isn't using?

http://www.codinghorror.com/blog/2011/05/the-infinite-version.html
Comment 11 Ted Mielczarek [:ted.mielczarek] 2011-12-22 04:35:35 PST
That is completely unrelated to this bug.
Comment 12 Ted Mielczarek [:ted.mielczarek] 2012-05-18 17:23:18 PDT
I'd be interested to see if this actually does a better job than bsdiff on our PGO'ed builds. Note that Google doesn't build Chrome with PGO, so it's possible it has the same issues. (Our nightly updates used to be a few hundred KB before we turned on PGO, then they jumped up into the 2MB range.)
Comment 13 Ted Mielczarek [:ted.mielczarek] 2012-05-18 17:26:28 PDT
Oh, the link in comment 1 has some tests run by rhelmer a few years ago that seem to show a pretty big improvement.
Comment 14 Asa Dotzler [:asa] 2012-05-18 17:28:46 PDT
Yeah. This could be a significant win. Let's do it :-)
Comment 15 Robert Strong [:rstrong] (use needinfo to contact me) 2012-05-21 13:47:52 PDT
(In reply to Asa Dotzler [:asa] from comment #14)
> Yeah. This could be a significant win. Let's do it :-)
Agreed!

This will need to be prioritized with the existing planned work.
Comment 16 Gary Kwong [:gkw] [:nth10sd] 2012-05-21 14:00:28 PDT
(In reply to Robert Strong [:rstrong] (do not email) from comment #15)
> (In reply to Asa Dotzler [:asa] from comment #14)
> > Yeah. This could be a significant win. Let's do it :-)
> Agreed!
> 
> This will need to be prioritized with the existing planned work.

Setting tracking-firefox15? but I'm not sure if it's the correct flag, so pls feel free to change if necessary.
Comment 17 Robert Strong [:rstrong] (use needinfo to contact me) 2012-05-21 14:04:54 PDT
Not going to happen for Firefox 15. We'll evaluate for what release this will land when we assign resources to implement it.
Comment 18 Chris AtLee [:catlee] 2012-07-11 12:31:34 PDT
I just built the latest version of courgette and ran some numbers with firefox 12.0 -> 13.0.1.

The existing bsdiff based partial mar is 10,244,590 bytes. Using courgette to do the diffs instead of bsdiff yields a mar which is 8,414,954 bytes - a 17% reduction. In particular, xul.dll.patch shrinks from 4,358,810 bytes to 2,925,995 bytes.

courgette doesn't help with omni.ja at all - we save about 16k vs the original omni.ja.patch. omni.ja.patch is now the largest file in the partial mar taking up around 40% of the size.
Comment 19 Robert Strong [:rstrong] (use needinfo to contact me) 2012-07-11 12:41:48 PDT
It would be interesting to know if bug 772841 also caused the inflation of the omni.ja patch since it would have changed the ordering of the files within omni.ja.
Comment 20 Chris AtLee [:catlee] 2012-07-11 12:59:37 PDT
according to 'unzip -l', the ordering between omni.ja in 12.0 and 13.0.1 seems to be the same, modulo new/deleted files.

Also, bsdiff should be able to handle files with the same content that change position within the archive since individual files are compressed in zip rather than the entire archive.
Comment 21 :Ehsan Akhgari (out sick) 2012-07-11 13:10:10 PDT
(In reply to comment #18)
> courgette doesn't help with omni.ja at all - we save about 16k vs the original
> omni.ja.patch. omni.ja.patch is now the largest file in the partial mar taking
> up around 40% of the size.

That's not surprising, courgette actually attempts to disassemble executable files and diff the disassembled version while handling things like relative offsets.  I think we should consider using bsdiff for non-executable files when we switch to courgette.
Comment 22 Robert Strong [:rstrong] (use needinfo to contact me) 2012-07-11 13:59:09 PDT
(In reply to Ehsan Akhgari [:ehsan] from comment #21)
> (In reply to comment #18)
> > courgette doesn't help with omni.ja at all - we save about 16k vs the original
> > omni.ja.patch. omni.ja.patch is now the largest file in the partial mar taking
> > up around 40% of the size.
> 
> That's not surprising, courgette actually attempts to disassemble executable
> files and diff the disassembled version while handling things like relative
> offsets.  I think we should consider using bsdiff for non-executable files
> when we switch to courgette.
or evaluating other tools to handle patching zip files. There has been a request to zip up the files during install for example though I am loathe to add that type of complexity especially since we will want to order the files inside of the compressed file.
Comment 23 :Ehsan Akhgari (out sick) 2012-07-11 14:06:53 PDT
(In reply to comment #22)
> (In reply to Ehsan Akhgari [:ehsan] from comment #21)
> > (In reply to comment #18)
> > > courgette doesn't help with omni.ja at all - we save about 16k vs the original
> > > omni.ja.patch. omni.ja.patch is now the largest file in the partial mar taking
> > > up around 40% of the size.
> > 
> > That's not surprising, courgette actually attempts to disassemble executable
> > files and diff the disassembled version while handling things like relative
> > offsets.  I think we should consider using bsdiff for non-executable files
> > when we switch to courgette.
> or evaluating other tools to handle patching zip files. There has been a
> request to zip up the files during install for example though I am loathe to
> add that type of complexity especially since we will want to order the files
> inside of the compressed file.

Do you know of any such tools?  Compressing zip files well is pretty tricky...
Comment 24 Robert Strong [:rstrong] (use needinfo to contact me) 2012-07-11 14:08:43 PDT
No, just saying that it should be evaluated though not as part of this bug.
Comment 25 Ted Mielczarek [:ted.mielczarek] 2012-07-12 05:02:07 PDT
For reference, bug 772868 covers handling zip files better in updates.
Comment 26 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2012-08-31 07:12:41 PDT
This better algorithm might save us a factor of 10x on update size.  In addition to bug 772868, that would be a Big Deal.
Comment 27 :Ehsan Akhgari (out sick) 2012-08-31 10:32:28 PDT
(In reply to comment #26)
> This better algorithm might save us a factor of 10x on update size.  In
> addition to bug 772868, that would be a Big Deal.

Last I checked (which is, right now!), courgette only supports Elf and PE x86.  So, how is this going to help b2g?
Comment 28 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2012-08-31 10:37:51 PDT
Yeah, we just refreshed ourselves on this too :).
Comment 29 Benjamin Smedberg [:bsmedberg] 2013-08-28 07:23:33 PDT
*** Bug 910219 has been marked as a duplicate of this bug. ***
Comment 30 Steve Fink [:sfink] [:s:] 2013-09-23 10:28:28 PDT
I was attempting to follow this bug and bug 772868, together with some experiments of my own, and I sort of follow what the current state is but not completely. So let me try to summarize my flawed understanding, and hopefully people can correct me:

1. Aurora gets updated just about every night, deltas are only based on the previous version, and people are seeing frequent (constant?) full downloads instead of delta updates. This might be because of a timing issue. [bug 504624 comment 7]. No bug seems to be open on this.

2. We have the capability to generate partials from multiple base versions, but it doesn't apply to Nightly or Aurora channels. [bug 575317 comment 18]

3. Our partials are currently made using bsdiff on an uncompressed archive that contains a compressed archive member, omni.ja [bug 772868 comment 0]. omni.ja is compressed using zip, which stores individual files separately, so bsdiff is still able to produce savings for unmodified files within omni.ja even though it can't do much of anything for modified files (because they are compressed.) [bug 504624 comment 20].

Background {

Generic compression works mostly by finding identical repeated substrings within one version.

Differential compression works by finding identical substrings between two versions, so that only the changed portions are stored/transmitted.

bsdiff is differential compression that additionally allows small mismatches within the matching substrings (in-place modifications only, no insertions or deletions), and subtracts the substring from its parent to reduce the cost when all of the mismatches differ by a constant amount (common in recompiled object files.) Thus, bsdiff is a general binary delta algorithm that will do particularly well with machine code produced by recompiling relatively small source modifications.

Courgette is even more specialized for machine code. It decompiles machine code and computes the delta based on that, which exposes many more compression opportunities when it knows how to handle the input. (Both bsdiff and courgette will fall back to standard differential compression on data that they aren't specialized for, so there is a small but not significant compression cost to using them on input they are not optimized for.)

}

4. Courgette achieves substantially better compression than bsdiff on the main machine code file (the "MAR", apparently?) [bug 504624 comment 18 and https://groups.google.com/forum/#!topic/mozilla.dev.platform/LRmxeV5uD8A]. bsdiff by itself saves a large amount over generic binary deltas for the machine code files, but also handles many of the non-machine code files (and machine code files for architectures that Courgette does not yet handle.)

5. I experimented with generating binary deltas using rsync's batch files and small difference block sizes. It is fully generic, can do far better than untuned binary deltas, and is nowhere near as good as bsdiff.

6. Courgette has big scary patent encumbrances. Patents are why we can't have nice things. Google is fighting the patent fight and may not be doing so well, and the comments in these bugs give no indication that we would have an easier time. [bug 504624 comment 9]

7. Courgette is naturally very limited in what formats it can handle, and b2g uses something that is not one of those formats [bug 504624 comment 27]. I think Courgette handles ELF x86 objects, and b2g uses ELF ARM objects? So it would need an ARM "port" to be useful there.

8. We build with pgo, which substantially reduces the effectiveness of bsdiff. Presumably, Courgette would not be affected nearly as much.

9. To reduce link-time memory overhead, Windows PGO is now restricted to a smaller subset of files [bug 871712]. I would guess that this should help bsdiff (as in, reduce the PGO-induced hit that bsdiff suffers from.)

10. bsdiff has some pathological behavior on certain inputs, where the initial suffix sorting phase takes insane amounts of time [http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=409664]. Solutions to this have been proposed and appear fairly straightforward. We are not seeing this in practice.

Anything I got wrong? Anything to add?
Comment 31 Benjamin Smedberg [:bsmedberg] 2013-09-23 10:42:29 PDT
> 1. Aurora gets updated just about every night, deltas are only based on the
> previous version, and people are seeing frequent (constant?) full downloads
> instead of delta updates. This might be because of a timing issue. [bug
> 504624 comment 7]. No bug seems to be open on this.

People on Nightly and Aurora often get more than one build out of date and we're not currently optimizing for that case. I believe we shouldn't worry about that now. Betas and releases have less of a problem with users skipping builds and needing full updates because of the lower frequency.

> 2. We have the capability to generate partials from multiple base versions,
> but it doesn't apply to Nightly or Aurora channels. [bug 575317 comment 18]

More that we *don't* do it for nightly/aurora, not that we couldn't if it were deemed to be important.

> 4. Courgette achieves substantially better compression than bsdiff on the
> main machine code file

Yes. primarily xul.dll/libxul.so

 (the "MAR", apparently?)

No. The MAR is an archive of individual files or patches. The .bsdiff for xul.dll is just one file within the MAR.

> 7. Courgette is naturally very limited in what formats it can handle, and
> b2g uses something that is not one of those formats [bug 504624 comment 27].
> I think Courgette handles ELF x86 objects, and b2g uses ELF ARM objects? So
> it would need an ARM "port" to be useful there.

Getting updates to be smaller on Windows x86 (and in the future Windows x86-64) is a primary concern; it will reduce our release-day costs as well as significantly increase the number of users who are stay up to date.

> 8. We build with pgo, which substantially reduces the effectiveness of
> bsdiff. Presumably, Courgette would not be affected nearly as much.

This is probably true, but needs to be measured.
Comment 32 André Reinald 2013-09-23 11:59:23 PDT
Here is what I can add to the discussion:

(In reply to Steve Fink [:sfink] from comment #30)
> 6. Courgette has big scary patent encumbrances. Patents are why we can't
> have nice things. Google is fighting the patent fight and may not be doing
> so well, and the comments in these bugs give no indication that we would
> have an easier time. [bug 504624 comment 9]

On 2012-11-16, a stackoverflow post mentions the legal status of the patent matter:
http://stackoverflow.com/questions/13415147/how-to-compute-differences-between-two-binaries-i-e-two-executables-in-linux
"Yeah, it looks like Red Bend failed to get a preliminary injunction against Google, but Google failed to get the patent invalidated, so the fight goes on."

> 7. Courgette is naturally very limited in what formats it can handle, and
> b2g uses something that is not one of those formats [bug 504624 comment 27].
> I think Courgette handles ELF x86 objects, and b2g uses ELF ARM objects? So
> it would need an ARM "port" to be useful there.

Having checked out and compiled courgette from the Chromium repository, it presently handles 3 executable formats:
- elf 32
  - arm
  - x86
- win 32
  - x86

And the architecture is thought as to ease implementation of additional executable formats / processor instructions set.
For our own usage it would make sense to add :
- win 64 executable format
- mach-o executable format (we can apparently discard Mac OS "fat binaries" as we only build x86_64 versions of Firefox Mac)
- x86_64 instructions

I am currently investigating if Google had some reasons not to implement those yet.

> 8. We build with pgo, which substantially reduces the effectiveness of
> bsdiff. Presumably, Courgette would not be affected nearly as much.

I can confirm that part of the algorithm is about "adjustment" which tries to find similar segments of code even if they have been moved around.

Finally, some topics have been discussed on our dev.platform mailing list:
https://groups.google.com/forum/#!topic/mozilla.dev.platform/LRmxeV5uD8A
Comment 33 André Reinald 2013-09-23 12:29:22 PDT
An idea to circumvent Courgette patent:

If we can leave functions/method names inside the binaries, or in some way provide those names inside a dictionary (like a .sym file), we could have almost the same functionality as Courgette's, but being different enough so to avoid patent claims. Indeed, most address changes are functions/method calls. Of course this only applies to open source projects, which we are.
Comment 34 Jean-Marc Desperrier 2013-09-24 05:10:53 PDT
(In reply to André Reinald from comment #33)
> An idea to circumvent Courgette patent:
> *** leave functions/method names inside the binaries, ***

The problem is not with Courgette, but with the Red Bend patent. 
The RB patent is strongly different from Courgette, and a much more generic principle of optimization that your idea will not escape from : http://www.google.com/patents/US6546552
Since it's very generic, Google had good hopes of finding prior art before the priority date of 1998, but they failed to convince and patentability has been confirmed in February 2011 : http://www.google.com/patents/US6546552#legal-events
However Red Bend also failed to get a preliminary injunction forbidding Google to publish the source code : http://www.gpo.gov/fdsys/pkg/USCOURTS-mad-1_09-cv-11813/content-detail.html

Apparently still nothing new happened since then. Google and Red Bend might have made an undisclosed back-room deal, and stopped actually pursuing the litigation.
There was talks of IBM acquiring Red Bend, but it apparently didn't work out.

Mozilla legal team needs to try to sort this out, AFAIC contacting Red Bend to get a clear idea of what their current position is might be a good idea.
Comment 35 André Reinald 2013-09-24 06:07:56 PDT
(In reply to Jean-Marc Desperrier from comment #34)
> Mozilla legal team needs to try to sort this out, AFAIC contacting Red Bend
> to get a clear idea of what their current position is might be a good idea.

Seems they have sorted this out already. In short: we can use Courgette.
Comment 36 Ted Mielczarek [:ted.mielczarek] 2013-09-24 12:57:04 PDT
(In reply to Steve Fink [:sfink] from comment #30)
> 8. We build with pgo, which substantially reduces the effectiveness of
> bsdiff. Presumably, Courgette would not be affected nearly as much.

This is the most interesting thing, to me. Our Windows updates are much larger than they used to be before we shipped PGO binaries. If courgette performs much better on these builds then that would be a good win.

Note You need to log in before you can comment on or make changes to this bug.