Closed Bug 68686 (Opened 23 years ago, Closed 10 years ago)

Shrink .jar files by stripping out whitespace, comments

Categories: Firefox Build System :: General (enhancement)

RESOLVED DUPLICATE of bug 903149

(Reporter: sspitzer, Unassigned)

(Keywords: memory-footprint, privacy)

Attachments: (1 file)

would it make our .jar files any smaller if we removed whitespace, comments, etc.
from the .xul, .css, .js, and .dtd files before we made the jars?  (or afterwards,
before we shipped to the user?)

the .jar files are zip archives, so removing whitespace might not be a big win,
but there are plenty of comments in our shipping .jars that the user has to pay for.

they have to pay for it once on download, and then again on startup.  I don't have
numbers, but I'm sure smaller .jar files would make for a faster startup.

I had heard that brendan was investigating shipping pre-compiled .js, which
would mean there would be no .js to compress, but there is still .xul, .dtd, .css,
etc.

comments?
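
A minimal sketch of the kind of stripping being proposed, as a naive regex pass
(the extensions and patterns here are illustrative, not an actual build step; a
real pass would need the same comment-finding rules as Mozilla's own parsers):

  import re
  from pathlib import Path

  # Naive strippers; these will mangle regex literals in JS and
  # "<!--" sequences inside CDATA, which is exactly the hard part.
  JS_COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
  CSS_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
  XML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

  def strip_comments(path: Path) -> str:
      text = path.read_text(encoding="utf-8", errors="replace")
      if path.suffix == ".js":
          return JS_COMMENT.sub("", text)
      if path.suffix == ".css":
          return CSS_COMMENT.sub("", text)
      if path.suffix in (".xul", ".xml", ".rdf", ".dtd"):
          return XML_COMMENT.sub("", text)
      return text  # leave unknown types untouched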
hrm.. I had once thought about some sort of XML-specific compression
mechanism, but I'm not sure I see the value if we're already using compression
in the .jar files.. I just have a feeling this is going to be a 5% gain in
space and a 1% gain in time, for a whole lot of work
alecf, it might not be worth it.  we'd need numbers to justify it.

here's something I (or someone with more cycles) could try:

get some numbers (size and performance) for loading messenger.jar as is.
then, hand strip out the comments (let's forget whitespace for now) and create a
new messenger.jar and get a new set of numbers.

if it looks promising, we can think about what to do next.

I just wanted to throw this out there, for consideration.

on a related note: legally, would we even be allowed to strip the mozilla
license from the files in the jar?  that's a lot of unused text in almost every file.

by my crude measurement, I think there is about 100k of comments and
localization notes in en-US (which becomes en-US.jar) that could be removed.

adding leaf to the list; always good to have release engineering in on this type
of thing, since it would involve the build / package process.
The license issue is easily satisfied by putting a prominent notice at the root 
of each .jar file (say, a README or LICENSE file), at least by my 
interpretation of section 3.5:

  3.5. Required Notices. 
  You must duplicate the notice in Exhibit A in each file of the
  Source Code.  If it is not possible to put such notice in a particular
  Source Code file due to its structure, then You must include such notice
  in a location (such as a relevant directory) where a user would be likely
  to look for such a notice. [...]

It would be interesting to try an uncompressed .jar file. We'd spend more time 
reading from disk, but less time decompressing. Which matters more? Either way it 
would be nice if the parser didn't have to munge through extraneous comments. 
It would make our chrome a little more daunting to experimenters, so I hope we 
don't do this unless it really turns out to help.
the uncompressed jar idea is a very good one. that is how 4.x stored its .class
files for Java. 
disk bandwidth is pretty cheap these days relative to the cost of opening any
file, etc. It would be interesting to do some timing on compressed vs.
uncompressed jars. cc'ing jrgm for some possible measurements.. maybe we should
take this up on .porkjockeys.
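
One rough way to get such timings, assuming a stored (uncompressed) copy of a
jar has been prepared for comparison; a sketch only, and it ignores the cold
vs. warm disk-cache distinction that matters most here:

  import time
  import zipfile

  def time_full_read(jar_path: str) -> float:
      # Read every entry; deflated entries are decompressed by zipfile,
      # stored entries are copied out as-is.
      start = time.perf_counter()
      with zipfile.ZipFile(jar_path) as zf:
          for name in zf.namelist():
              zf.read(name)
      return time.perf_counter() - start

  for jar in ("comm.jar", "comm-stored.jar"):  # deflated vs. stored copy
      print(jar, round(time_full_read(jar), 3), "s")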
sorry, I got lost.  were you suggesting that it would be useful to see how much
faster we'd be if we compared:

reading uncompressed files
vs
reading uncompressed files that were stripped of comments (and whitespace?)


or did you want to see compressed jar vs. uncompressed jar?  (to see if we are
paying the price for the decompression)

of course, jars are a big win on mac (since file i/o is slow).
No precompiled .js files would be shipped; rather, bug 68045 proposes that
Mozilla precompile on first run after install.

/be
I meant test compressed vs. uncompressed as a separate issue from the 
comment/whitespace removal.

As a bonus I'd guess that an uncompressed .jar file would end up slightly 
smaller in the compressed download package. Let me test that...

jar                     on disk   in .xpi   contents
std comm.jar             574951    530982   mostly text
uncompressed comm.jar   1885913    366206     "     "
std modern.jar           554959    429592   lots of images
uncompressed modern      812316    381036     "     "
std en-US.jar            235293    204005   text
uncompressed en-US       658055    129626     "

An extra 2MB on disk (2 1/2 times bigger) probably isn't worth a 290K (25%) 
savings on the download, unless it's a performance win, and it might in fact be 
a loss. I'll add this to my list of things to try.
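
Numbers like these can be reproduced along the following lines: rewrite a jar
with no per-entry compression, then deflate each variant inside an outer
archive standing in for the .xpi (a sketch; the filenames are placeholders):

  import os
  import zipfile

  def repack_stored(src: str, dst: str) -> None:
      # Rewrite a jar with no per-entry compression (ZIP_STORED).
      with zipfile.ZipFile(src) as zin, \
           zipfile.ZipFile(dst, "w", zipfile.ZIP_STORED) as zout:
          for info in zin.infolist():
              zout.writestr(info.filename, zin.read(info.filename))

  def size_in_outer(jar: str) -> int:
      # Size of the jar once deflated inside an outer .xpi-like zip.
      outer = jar + ".outer.zip"
      with zipfile.ZipFile(outer, "w", zipfile.ZIP_DEFLATED) as zf:
          zf.write(jar)
      size = os.path.getsize(outer)
      os.remove(outer)
      return size

  repack_stored("comm.jar", "comm-stored.jar")
  for jar in ("comm.jar", "comm-stored.jar"):
      print(jar, os.path.getsize(jar), size_in_outer(jar))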
footprint, and a triage-able summary
Keywords: footprint
Summary: is it possible to make the .jar files any smaller by removing whitespace and comments from the files? → Shrink .jar files by stripping out whitespace, comments
I'm not a good owner for this bug.  one of you want to take it?
giving to dveditz.  this seems right up his alley.
Assignee: asa → dveditz
dveditz: personally, I think I disagree. 290k is 58 seconds on a 56k modem.. and
this is just one package! Admittedly you probably can't knock 25% off the
other packages, but our current full download is 20-30 megs. if you could knock
off 2 megs, that's 6 minutes off of a 1-hour download.. seems significant :)
Keywords: nsbeta1
Ok, so I did all of our chrome files instead of just the three main ones.
               compressed      uncompressed     diff
on disk:        2283579          5896895        3613316
download:       1949833          1418815         531018

Saving half a meg on the download is nothing to sneeze at, but before jumping 
into this let's attack the comment stripping first. It's possible the savings is 
entirely due to compressing out redundant license headers, which zip isn't able 
to do when compressing each chrome file individually. And it may be the very 
fact that the chrome is compressed that helped Mac performance when we switched 
to .jars.
diffs should be signed.
               compressed      uncompressed     diff
on disk:        2283579          5896895        3613316
download:       1949833          1418815        -531018

personally i'm against stripping licenses because i like to work from the 
.jars that i receive when i retrieve builds. there's another bug about 
removing the expanded version of files from all dist packages; the result is 
that it would be impossible for someone to create a correct diff from packaged 
builds.

Assuming that different OSes have different performance costs, would people 
object to mac having compress-xpi[compress-jar] while win,lin have 
compress-xpi[uncompress-jar]? or some other similar disparity?
ok, I am fiddling with stats about what exactly we have, and here are some
interesting statistics. This is the total space used by each file type (in a
mozilla build):
       .css:   695926 ( 12.0%)
       .dtd:   311344 (  5.4%)
       .gif:   847212 ( 14.7%)
       .htm:      128 (  0.0%)
      .html:   210439 (  3.6%)
        .js:  2120408 ( 36.7%)
       .png:     5777 (  0.1%)
.properties:   125398 (  2.2%)
       .rdf:    49383 (  0.9%)
       .txt:     1876 (  0.0%)
       .wav:    24266 (  0.4%)
       .xml:   223405 (  3.9%)
       .xul:  1167043 ( 20.2%)

so it looks like even if we could get rid of 10-20% (a guess on my part) of the JS
through comment stripping, it's still only 3-6% of the resource files, let alone
the total .xpi package sizes
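
For what it's worth, a breakdown like the one above can be recomputed from the
shipped jars themselves; a sketch counting uncompressed member sizes (the jar
names are placeholders):

  import zipfile
  from collections import Counter
  from pathlib import PurePosixPath

  def extension_breakdown(jar_paths):
      totals = Counter()
      for jar in jar_paths:
          with zipfile.ZipFile(jar) as zf:
              for info in zf.infolist():
                  if info.is_dir():
                      continue
                  ext = PurePosixPath(info.filename).suffix or "(none)"
                  totals[ext] += info.file_size  # uncompressed size
      grand = sum(totals.values())
      for ext, size in sorted(totals.items()):
          print(f"{ext:>12}: {size:8d} ({100 * size / grand:5.1f}%)")

  extension_breakdown(["comm.jar", "modern.jar", "en-US.jar"])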
here's a thought: what if we could 'pre-compile' XML? I mean, the files are well
formed, so maybe there is a more compact format in which we could store them.
we can certainly refactor some xul; my rewrite of mail is considerably 
smaller (i just need to look at it and get it reviewed...). however i'm not sure 
how much duplication there is left to squeeze out beyond mail.

properties can be squeezed through a reorg of identical values, but they're tiny 
and compress well...
ok, I don't know what this "mail rewrite" that you're referring to is, but I
think that's a bit beyond the scope of THIS bug :)

Anyhow, a little research and I came up with:
elements:
     commandset:     4
       popupset:     5
       menuitem:     5
     menubutton:     6
         button:     7
           rule:     9
            box:    12
      menupopup:    14
         script:    15
attributes:
            uri:     8
    tooltiptext:     8
        persist:     9
           flex:    10
        tooltip:    13
      oncommand:    14
           type:    15
            src:    17
          value:    19
          class:    28
             id:    59

these are the top 10 or so tags and attributes in navigator.xul. I'm sure much
of our XUL has lots of similar tags and attributes. One could imagine a
compression algorithm based on this knowledge - atomizing the strings and
encoding them in a stream of some kind.

Hey, sure enough someone has come up with something along these lines:
http://www.research.att.com/sw/tools/xmill/
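
Counts like those are easy to regenerate with any XML parser, with one caveat:
chrome XUL references external DTD entities (&foo;), which a strict parser will
reject unless they are expanded or stubbed out first. A rough sketch:

  import xml.etree.ElementTree as ET
  from collections import Counter

  def xul_frequencies(path):
      # Count element and attribute names in a XUL file. Files using
      # external DTD entities must be pre-expanded, or ElementTree
      # will fail on the undefined entities.
      tags, attrs = Counter(), Counter()
      for _, elem in ET.iterparse(path):
          tags[elem.tag.rsplit("}", 1)[-1]] += 1   # drop any ns prefix
          for name in elem.attrib:
              attrs[name.rsplit("}", 1)[-1]] += 1
      return tags, attrs

  tags, attrs = xul_frequencies("navigator.xul")
  print(tags.most_common(10))
  print(attrs.most_common(10))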
wait a second, xmill is screwy, ignore that.
reminder to alecf to comment on his byte-code compiler idea for XUL/XML.
Severity: normal → enhancement
Component: Browser-General → Build Config
QA Contact: doronr → granrose
Hardware: PC → All
Blocks: 104166
Status: NEW → ASSIGNED
Dan mentioned this bug in an email exchange; I didn't realize it existed :-)

When working on localized embed.jars for several embedding customers we (L10n)
encountered the need to reduce bloat (i.e. distribution size) and this idea
popped up again.

I found a perl module by the name of "HTML Clean", and since it looked cheap I ran
some quick tests with it. Intrigued by the results, I started hacking the script
a bit and let it work its magic on a 1.2a build. 

http://aspn.activestate.com/ASPN/CodeDoc/HTML-Clean/HTML/Clean.html

The numbers below are preliminary; I didn't have time to fine tune the script
and there might be more room left for improvement. 

* current results indicate that XUL bloat could be cut by at least 35%
  (embed.jar 600kB->400kB, Mozilla chrome 3.5MB->2.2MB)

* proof of concept: reduced Mozilla's distribution size by 1.1 MB, improved
  startup time by 3-5% (measured with a fresh profile, fast load was turned on)

http://rocknroll/users/jbetak/Mozilla_1.2a_Optimized_XUL.zip
wow! those results are fantastic! 
maybe mcafee or seawood could help you get some scripts going that would strip
.jars, and then we could roll that into the release-only build?
Does this address the license notice visibility issues brought up in bug 73661?
Are the links from 'about:' good enough?

*** This bug has been marked as a duplicate of 73661 ***
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → DUPLICATE
Oops! Mid-afternoon naps are hazardous to bugs.  That was backwards.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
*** Bug 73661 has been marked as a duplicate of this bug. ***
Chris, ahhh, I'm so glad you reopened this. I'm trying to get people excited
about the potential gains XUL runtime optimization could yield, and then the
bug gets closed down ;-)

What do you think of Alec's idea? Could we strip license headers, comments and
whitespace in release builds? A little while back I thought Blackbird might be
a potential candidate, but it might be too late for that.

AFAIK Tao approached Cathleen and there was some talk about creating additional
optimized builds and possibly looking into modifying some of the build scripts
in the Buffy time frame.
Hmm. Since (post 1.0) we now serialize both JS and XUL (plus DTD) content
into the fastload file, I'm not sure we would actually gain much runtime
benefit from compressing whitespace in those files (aside from the initial
serialization which should only happen once for a typical end user). After
we have serialized a XUL or JS file, we will never read from the jar file 
again for that file (or at least, that's how it's supposed to work).

Nonetheless, reducing download size (and time) is a worthwhile goal, and 
cutting a minute or more from modem download times is a win for the end user.
jrgm, I agree. Although the improvement seems improbable, I've posted it because
that's what my test bed said. 

I don't consider my observation to be conclusive. I only had one optimized test
build and a very limited time to play with it :-)
well, it's both download size and first-time speedup for new windows - i.e.
beyond just the browser, if I open mail for the first time, it will be faster. I
think download size is the biggest bonus here, myself.
I assume this will be an extra release-build step (like rebasing or symbol
stripping) that doesn't need to run on each build?  If so it'd be easy to save off
a copy of the unstripped jars for people who are trying to use Patch Maker.

The license isn't that hard to satisfy. We could do nothing and claim it's a
binary/processed file, but since it will look like text to people who poke
around and it's easy to add license boilerplate, we should do that. Either or
both of a LICENSE file inside the archive (stating that the contents are not
source, but that the source is available under the MPL, and where) and the
comment field of the zip archive (though not that many people look at those)
would do.
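
Both notices are cheap to add mechanically; a sketch, with the wording of the
notice being illustrative only:

  import zipfile

  NOTICE = (b"The contents of this archive are processed chrome, not source.\n"
            b"The corresponding source is available under the MPL; see\n"
            b"http://www.mozilla.org/MPL/ for details.")

  with zipfile.ZipFile("comm.jar", "a") as zf:
      zf.writestr("LICENSE", NOTICE)  # prominent notice at the archive root
      zf.comment = NOTICE             # the zip archive comment field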
just curious, do we have a build for evaluation?
Cathleen,

the only currently available evaluation build might be this optimized 1.2a
talkback-enabled Mozilla release:

http://rocknroll/users/jbetak/Mozilla_1.2a_Optimized_XUL.zip

It's based on this binary distribution:

ftp://ftp.mozilla.org/pub/mozilla/releases/mozilla1.2a/mozilla-win32-1.2a-talkback.zip
I've been working on the first part of a patch for this. Though it still has
some bugs, it works fairly well. I've written a standalone program 'Strip' that
can remove JS-style comments (// and /* */) and XML-style comments ( <!-- --> ).
Furthermore it can trim lines on the left and right side and remove empty lines.
It is a standalone program that should be fairly easy to integrate into the
build process, but since I know nothing about that, someone else could perhaps
give that a try (hint! ;-). It recognizes several extensions by default. Just
compile and run strip -? for more help, or look at the source.

There are a few regressions right now (Chatzilla stopped working, for example),
but most of Mozilla works. I'll investigate the bugs further.

These are the results so far. All the numbers are in kilobytes.

stripped: *.js, *.css, *.xul, *.rdf, *.xml, *.xsl


jar               uncompressed  compressed   compressed+stripped    gain
chatzilla.jar         544           148          102                 46
classic.jar           549           296          237                 59
comm.jar             3192           944          610                334
inspector.jar         543           188          116                 72
messenger.jar        1909           537          341                196
modern.jar            859           568          478                 90
pippki.jar            275           100           60                 40
toolkit.jar           822           221          166                 55
venkman.jar           827           258          198                 60

Total                9520          3260         2308                952

Compression is based on maximum zip compression, because I don't know how to make
.xpi files. Nor do I know how well .xpis are compressed, but I assume they're
comparable with zip. Assuming that, we can save approx. 952kb on the
download size. I've only tested the larger jars with the extensions above.

The program can already strip *.dtd with manual settings but doesn't (yet)
recognize the extension automatically. We could also trim *.html, and with a
small modification *.properties can also be stripped. This process can of course
be taken one step further: we can remove the spaces from "var x = 3;". It
shouldn't be very hard; I'll look at it later on. So far I think the results are
encouraging.
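
The trimming part of such a tool is the easy bit; a minimal sketch of the
whitespace pass described above (the comment removal is the part that needs
parser-accurate rules):

  def trim_whitespace(text: str) -> str:
      # Strip leading/trailing whitespace per line and drop empty lines.
      # Generally safe for XUL and CSS, but not for whitespace-significant
      # content such as <pre> blocks or xml:space="preserve" sections.
      lines = (line.strip() for line in text.splitlines())
      return "\n".join(line for line in lines if line)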
Rene: this would only ever be applied to release builds, as all the information
you are removing is very useful for debugging. Also, changing the files would
break tools like Patch Maker (http://www.gerv.net/software/patch-maker/). 

So, you have to be completely certain that running your tool over the files will
not cause regressions - because the bugs will be very hard to track down. I'd
say the best way to do this would be to be certain that your comment-finding
rules are exactly the same as those used by the relevant parsers in the Mozilla
code.

Gerv
this is great stuff. Do you have a measure of the savings for uncompressed+stripped?
We found recently that using uncompressed .jars saves us lots of allocations on
startup (because we don't have to go through zlib).

I don't think we should be burdening all users because this might break
patchmaker. The best thing we can do is integrate this into the build early in
1.5alpha such as right now (thus adding it to nightly builds, not just release
builds) and then seeing if any regressions crop up.
> The best thing we can do is integrate this into the build early in
> 1.5alpha such as right now (thus adding it to nightly builds, not just release
> builds) and then seeing if any regressions crop up.

Alec: do you mean that we should integrate it on a temporary basis, just in
order to find regressions? As I said, if you put it into nightly builds
permanently, that breaks Patch Maker...

Gerv
Sure, it could be temporary. We could tell people that patchmaker is going to
break for a few weeks.
Rene,

I just wanted to provide you with a few pointers to build scripts. Ian's preprocessor
for Mozilla Firebird might be of particular interest to you. I adapted a Perl
script named "HTML Clean" for my test run last year. If past experience is worth
anything, optimizing .properties and .dtd is most definitely worth the effort. 

http://lxr.mozilla.org/seamonkey/source/config/preprocessor.pl
http://lxr.mozilla.org/seamonkey/search?string=preprocessor.pl

http://lxr.mozilla.org/seamonkey/source/config/make-jars.pl
http://lxr.mozilla.org/seamonkey/search?string=make-jars.pl

I would imagine that if implemented, XUL optimization will be a build option.
When the dust has settled, it will only affect optimized builds, right?
Using it temporarily on the trunk sounds good to me. Right now is not a good
idea since it still contains some known bugs, but I'll try to get them out
asap (I'm rather busy with some exams coming up though), hopefully by the end of
this week. Remember that this is just a first naive version that is supposed to
be a proof of concept and food for thought rather than a final functional
version. Right now it still has some known problems, such as regexps in JS (e.g.
/\// is seen as a comment) and CDATA blocks in XML files that can become crippled
if they contain a <!--. 
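
The regexp problem is easy to trip over; a tiny illustration of why JS needs a
real lexer rather than a pattern match:

  import re

  JS_LINE_COMMENT = re.compile(r"//[^\n]*")

  src = 'var isSlash = /\\//.test(ch); // check for a forward slash'
  print(JS_LINE_COMMENT.sub("", src))
  # Prints: var isSlash = /\
  # The "//" inside the regex literal was treated as a comment start,
  # leaving behind invalid JavaScript.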

I have no numbers on uncompressed+stripped, but as a rule of thumb you can use
the compression ratio of uncompressed vs. compressed. In most cases this is
around 30%, except for classic.jar and modern.jar where it is higher due to the
large number of images.  So uncompressed+stripped should be roughly 3MB smaller
than uncompressed.
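
A quick sanity check of that rule of thumb against the table above (the 30%
figure is the estimate from this comment, not a measurement):

  # From the table: 3260 KB compressed vs. 2308 KB compressed+stripped.
  compressed_savings_kb = 3260 - 2308   # 952 KB saved after compression
  # If stripped text compresses to roughly 30% of its original size,
  # the same stripping removes about 952 / 0.30 KB before compression,
  # i.e. roughly 3 MB, consistent with the estimate above.
  print(round(compressed_savings_kb / 0.30))  # -> 3173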
Blocks: 171082
Rene,
any news on this? I would be interested in testing something newer than your
06/2003-version of Strip. Have you done any work on that in the last four months?
Hmm, no. After the first version it ended up under a big pile of dust. I was
working on a rewrite to handle some issues better, such as regular expressions
and removing spaces from "var x = 3;", but I never finished it. I can't promise
anything w.r.t. finishing Strip any time soon, since I'm still pretty occupied,
as usual.
Product: Browser → Seamonkey
This is particularly annoying because some viruses pick up e-mail addresses from
these files, and my e-mail address is in at least one of them.  As a result, I
have received several million virus e-mails to that account over the past couple
of years, due to people having both Mozilla products and viruses on their
machines.  So, if we're not doing this for a space reduction/performance gain,
how about doing it for developer sanity?  Per bug 279698, I'm under the
impression Firefox already tries to do this.
Keywords: privacy
Assignee: dveditz → nobody
Status: REOPENED → NEW
QA Contact: granrosebugs → build-config
Product: SeaMonkey → Core
(In reply to Rene Pronk from comment #41)
> Hmm, no. After the first version it ended up under a big pile of dust. I was
> working on a rewrite to handle some issues better such as reg. expr. and
> removing spaces from "var x = 3;" but I never finished it. I can't promise
> anything w.r.t. finishing Strip any time soon since I'm still pretty
> occupied,
> as usual.

There was a patch for this at one time. Is this even worth pursuing any more, or should it just be marked WONTFIX?
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → DUPLICATE
Product: Core → Firefox Build System