Closed Bug 69426 Opened 24 years ago Closed 8 years ago

bytecode compression for xml

Categories

(Core :: XML, defect, P2)

Hardware: x86
OS: Windows 2000

Tracking

Status: RESOLVED WONTFIX
Target Milestone: Future

People

(Reporter: alecf, Assigned: alecf)

Details

(Keywords: perf)

I've been fiddling with an XML parser, and may have found a nice way to tokenize XML into a faster-loading bytecode... this could make loading XUL much faster.

The basic idea is this: XML is a very verbose language that often has as much overhead in syntax as it has raw data. It would be very easy to atomize tags and attributes, and serialize well-formed XML into a more compact format on disk, which could then be further compressed by a true compression algorithm such as the one in zip or gzip.

Tags, attribute names, <, >, and = take up about 35% of an average XUL file. Early analysis based on existing XUL suggests that we could cut the space used by tags and attribute names by about 60%, an overall savings of about 20% of the file. My theory is that this atomized file will actually compress even better than existing XML. This analysis does not include stripping of XML comments (like the license, which takes up about 800 bytes per file) or whitespace.
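For illustration, a toy sketch of the atomization idea (hypothetical code, not the actual compiler; it assumes fewer than 256 distinct names, skips text nodes, and doesn't write out the atom table that a real file would carry up front): each distinct tag or attribute name gets a small integer the first time it appears, and the tree is re-serialized as those integers plus the raw attribute values.

import struct
import xml.etree.ElementTree as ET

def atomize(doc_text):
    atoms = {}                                   # name -> small integer id
    def atom(name):
        return atoms.setdefault(name, len(atoms))

    out = bytearray()
    def emit(el):
        out.append(atom(el.tag))                 # tag name as one byte
        out.append(len(el.attrib))               # attribute count
        for name, value in el.attrib.items():
            out.append(atom(name))               # attribute name as one byte
            data = value.encode("utf-8")
            out += struct.pack(">H", len(data)) + data   # value kept verbatim
        out += struct.pack(">H", len(el))        # child count
        for child in el:
            emit(child)

    emit(ET.fromstring(doc_text))
    return atoms, bytes(out)

atoms, blob = atomize('<window id="main"><button label="OK"/><button label="Cancel"/></window>')
print(len(atoms), "atoms,", len(blob), "bytes of body")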
Reassigning to me (wanted to get the default XML owner cc'ed).
Assignee: heikki → alecf
What if we just used compression on the JAR files?
The .jar files are already compressed. I think the goal here is to save parser time rather than disk space.
Gotcha. alecf, how much of an issue is parser time?
It's a bit of both. I think I can warp XML into a format that will compress better and parse faster... this makes the jar files (already compressed) smaller.

Further analysis with a larger sample (gotta love perl5's XML::Parser) seems to indicate that I could actually compact the XML markup by about 70%. The average XUL file is 22% whitespace and comments, and 27% XML markup (tags, attribute names, and <, >, =, and "). With the xml-to-bytecode compiler, we could eliminate the comments and 70% of that 27% markup, a grand total of about 30% of the file, before compression.

This is kind of a blue-sky sort of thing, so marking mozilla1.1 for now.
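Spelling out the arithmetic behind that ~30% figure (the split of the 22% between whitespace and comments is an assumption for illustration, not a measurement):

whitespace_and_comments = 0.22
markup                  = 0.27    # tags, attribute names, <, >, =, "
comments_only           = 0.11    # hypothetical comment share of the 22%

savings = comments_only + 0.70 * markup    # strip comments, compact 70% of markup
print(f"pre-compression savings: {savings:.0%}")    # about 30%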
Status: NEW → ASSIGNED
Priority: -- → P2
Target Milestone: --- → mozilla1.1
one way this will make parsing faster is by atomizing the strings. This will greatly reduce the number of allocations done by the parser's tokenizer because the tokenizing will be done at compile-time.
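The effect is roughly the same as string interning: once the names are indexes into a table built at compile time, the runtime tokenizer never re-allocates them. A one-line Python analogy (illustrative only):

import sys

raw = "button button menuitem button".split()      # tokens as they come off the input
interned = [sys.intern(t) for t in raw]
assert interned[0] is interned[1] is interned[3]    # shared objects after interning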
Another blue-sky way to address this problem (parsing XUL takes time) is just to compile the XUL all the way into the structs (nsXULPrototypeElement, etc.) that you actually want to have around in memory. Then you have no parsing to do at all: you just do a binary read and relocation fixup. (If I understand correctly, this is how `.fasl' files work in many common lisp implementations.) gagan has suggested that it would be possible to write a cache module (in the New New Cache Architecture, of course) that could do this sort of thing.
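A hedged sketch of what that could look like (hypothetical record layout, nothing to do with the real nsXULPrototypeElement): the tree is flattened into fixed-size records with index-based child references, so "loading" is a single read plus index fixup rather than a parse.

import struct

REC = struct.Struct(">HHH")          # (tag atom, first-child index, child count)

def write_tree(nodes):
    return b"".join(REC.pack(*n) for n in nodes)

def read_tree(blob):
    return [REC.unpack_from(blob, i * REC.size) for i in range(len(blob) // REC.size)]

nodes = [(0, 1, 2), (1, 0, 0), (2, 0, 0)]   # window -> [button, menuitem]
assert read_tree(write_tree(nodes)) == nodes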
alecf's idea sounds a lot like WML (the language used in WAP applications). They replace tags with numbers (or something like that, it's been a while...) so that a WAP server needs to transmit less and a WAP browser can have a simpler parser. If you want to pursue this, I'd advise you to have a look at some of the WAP/WML docs...

[By the way, the ViewPort SGML/HyTime engine could also store SGML in a fast-loading binary format that was readable only by applications based on ViewPort. The speed typically improved by a factor of 10.]

However, I find waterson's idea more appealing. If we could compile XUL into a binary that we could load directly into memory, it would be even better. Regardless of the approach, I believe we should keep the original XUL in normal XML as it is now, and have some sort of cache for the compiled, fast-loading versions. I wouldn't want to lose the benefits of XML here. Maybe this has been implied all along; I just wanted to make it explicit.
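For comparison, the WBXML-style approach fixes the tag numbers in a shared code page agreed up front, instead of building an atom table per document (the numbers below are made up for illustration, not the real WBXML tables):

XUL_CODE_PAGE = {"window": 0x05, "button": 0x06, "menuitem": 0x07}

def encode_tag(name):
    return bytes([XUL_CODE_PAGE[name]])

assert encode_tag("button") == b"\x06"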
waterson - that sounds pretty cool. As far as this goes, I have two goals here, in this order: 1) reduce download size, 2) speed up reading of XUL from disk. While the idea of reading in structs from disk does appeal to me, it looks like WML is really what I should be looking at (thanks for the reference).

As far as caching vs. distributing raw data is concerned, since my biggest concern is reducing disk/download footprint, I would rather not distribute XML as it is today... 0.1% of consumers may want to unpack .jar files and muck with the contents, but the other 99.9% want a fast browser. My preference is for: 1) distributing some sort of XP compacted format, perhaps WML, and 2) permanently caching this compressed WML in the new new cache architecture as structs, like waterson said, so that it's even faster.

Anyway, before I read these comments I fiddled with my perl program a bit more and determined that I could compact the files to save about 36% overall, pre-compression. Assuming 50% compression (my format is still very compressible) we could get it down to about 32% of its original size. Now to explore WML.
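The size estimate at the end, spelled out with the same numbers:

# Compaction removes ~36% before compression; ~50% compression of the
# remainder leaves roughly 32% of the original size.
original   = 1.00
compacted  = original * (1 - 0.36)   # 0.64
compressed = compacted * 0.50        # 0.32
print(f"final size: {compressed:.0%} of the original")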
Yep, given alecf's investigations, it looks like going one step further with waterson's pre-compilation will give maximum performance... no parsing and related chores; just load, do some pointer arithmetic/initialization, and everything is ready to go at full speed! Keeping the original files (as heikki reminded us), plus some other code paths to still be able to handle things the old text way, is all that is needed to get set... sounds really appealing, and may cause a great divide amongst those who keep complaining about the overhead/speed of XUL :-)
Keywords: perf
Of course, as alecf noted, the original files need not be shipped -- except in cvs and debug builds... as usual, developers get all the crap...
nav triage team: This would be way cool, but not a beta stopper ;-) Marking nsbeta1-
Keywords: nsbeta1-
Target Milestone: mozilla1.1alpha → Future
QA Contact: petersen → rakeshmishra
Hoping to wake up interest in this bug, because I have a real customer intranet application that needs help. The problem is that the app is designed around a sort of homegrown 'GUI framework', implemented as an XML/XSL library. The pages use xsl:import href= for the library files. In this way, even a simple page builds up a 170K XML document. It takes a Very Long Time to parse through this document; actual 'execution time' is very small compared to parse time. The problem is that it is all re-parsed on each use, and nothing much is optimized for subsequent reuse (as opposed to the IE XML engine, which is much faster on reload of these docs).

Okay, a problem here is that the customer considers their app confidential and they do not want it posted in the public domain, so I can't put the testcase here, at least for now. I am far from being an expert in XML stylesheet usage, and would also appreciate any ideas about how to speed up the app itself on Mozilla. Suggesting the obvious answer to them - use more granular 'xsl library files' and import only what's needed as needed - gets the same old answer: "IE does not have a problem, it is mozilla that has the problem, not our application." ... I hate when that happens. Appreciate any comments or suggestions.
QA Contact: rakeshmishra → ashishbhatt
Given fastload, how much of an issue is this?
QA Contact: ashishbhatt → xml
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX