Open Bug 231925 Opened 21 years ago Updated 2 years ago

xml pretty printing is more bloated than necessary

Categories

(Core :: XML, defect)

People

(Reporter: axel, Unassigned)

References

(Blocks 1 open bug)

Details

I bet we could cut down on the additional bloat compared to IE as mentioned in
bug 197956 if we get rid of the view-source html and instead use plain XML and
CSS. We could get rid of all attributes, I bet we could get rid of the span for
the '=' in attrs in favour of CSS :before {content: "=";}.
Not sure if we should get rid of the table for the expander, too. It may suffice
to just use CSS display and not expose the full table elements.
Having a simpler markup might eventually ease the transition to a non-XSLT
prettyprinter, as I don't think we can fix bug 175946 with it.

I'll take a stab at this one.
Blocks: 197956
On a 600k XML file, I am about four times as fast, at least when I compare
pretty-printing with an XSLT PI'd version linking to my stylesheet.
I removed all the html, so that we don't go through the rather expensive html
content generation. I removed almost all attributes and use just plain xml
elements; that should get us a significant improvement in size.
The generated tree is a good deal shallower (sp?).
I removed the predicates from the tests in favour of xsl:choose, and added
priorities so that we deal with texts and elements first, then comments,
then PIs and then documents. The expander is done in xbl alone, so the 
call-template died, too.
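
The ordering idea above (test for the most frequent node kinds first) can be
sketched outside XSLT. This is only an analogy in Python using xml.dom.minidom,
not the stylesheet itself; the names DISPATCH_ORDER, classify and walk are made
up for illustration.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# Node kinds in the order described above: texts and elements first,
# then comments, then PIs, then documents.
DISPATCH_ORDER = [
    (Node.TEXT_NODE, "text"),
    (Node.ELEMENT_NODE, "element"),
    (Node.COMMENT_NODE, "comment"),
    (Node.PROCESSING_INSTRUCTION_NODE, "pi"),
    (Node.DOCUMENT_NODE, "document"),
]


def classify(node):
    """Return (kind, checks), where checks is the number of tests that ran."""
    for checks, (node_type, kind) in enumerate(DISPATCH_ORDER, start=1):
        if node.nodeType == node_type:
            return kind, checks
    return "other", len(DISPATCH_ORDER)


def walk(node):
    """Yield node and all of its descendants in document order."""
    yield node
    for child in node.childNodes:
        yield from walk(child)


doc = minidom.parseString('<?pi x?><r a="1"><!--c-->t<e>u</e></r>')
kinds = [classify(n)[0] for n in walk(doc)]
total_checks = sum(classify(n)[1] for n in walk(doc))
```

Since most nodes in a typical document are texts and elements, putting them
first keeps the average number of tests per node low.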

Of course, all of this is nothing until I manage to get collapsing undone.
I need to put a testcase online so that I get some info from layout folks on
why this thing is acting up.

(Note that the collapsing and expanding is faster than the stuff IE does.)

Jonas, do you notify an observer to each expandable element in the generated
doc? If so, why did you do that? It seems like that is causing another
20% of the total time or so.
some numbers, solaris 1.6 build (so no IE comparison, but you'll get the idea)

TestGTKEmbed about:blank 25MB

testfile with 2k nodes and 600k datasize (about one third of
http://bugzilla.mozilla.org/show_bug.cgi?id=197956#c12).

pretty-printing takes 8:30 mins and 92MB, xslt with my mods takes 2:40 mins and
69MB
pretty-printing makes it up to elem27, my version makes it up to elem98.
though it's getting pretty confused when one scrolls the other elements into
the view. Dang, I get broken layout all over my ass.
Would you be able to get any numbers on how the new stylesheet compares to the
old one in a post optimized-xpath world? The reason i'm asking is that (at least
for some testcases) optimized xpath makes the old stylesheet 6 times faster,
whereas this new one is "only" 4 times faster and from your description doesn't
seem to get as good benefit from optimized xpath.
> Jonas, do you notify an observer to each expandable element in the generated
> doc? If so, why did you do that? It seems like that is causing another
> 20% of the total time or so.

Not really sure what you mean, but as far as i remember i don't perform any
notifications manually at all.
The XSLT code does send out a lot of notifications during content-creation
though, that's covered by bug 221335
(In reply to comment #5)
> Would you be able to get any numbers on how the new stylesheet compares to the
> old one in a post optimized-xpath world? The reason i'm asking is that (at least
> for some testcases) optimized xpath makes the old stylesheet 6 times faster,
> whereas this new one is "only" 4 times faster and from your description doesn't
> seem to get as good benefit from optimized xpath.

As I mentioned in our favorite bug 197956, we shouldn't confuse factors for
differently sized testcases. I just didn't take the time to wait for the full
testcase, so I cut down its size. That of course brings down the factor if the
two approaches scale differently with input size.

I'll first try to get down to the odd facts in the layout before I start
attaching testcases.

Btw, IE has pretty odd layout problems with their stylesheet and deep documents,
too. Hihihi.
numbers on Nested_Chapter_Test.xml:
pp: 72MB, 4:47
me: 54MB, 1:30

That's roughly a factor 3.
http://bugzilla.mozilla.org/show_bug.cgi?id=208172#c3 says *6 for xpath optim, but
I think that is from pre-walker days.
Note that speedups do vary from architecture to architecture, too, as memory vs.
cpu speed change as well as alloc overhead and such.
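
For reference, the factors can be recomputed from the numbers quoted in this
bug. A small sketch; the dictionary layout and the helper names seconds and
ratios are just for illustration, and the memory figures are total process
size as reported, so they include the baseline.

```python
def seconds(mmss):
    """Convert a 'm:ss' string to seconds."""
    minutes, secs = mmss.split(":")
    return int(minutes) * 60 + int(secs)


# (wall time, memory in MB) as reported in the comments above.
runs = {
    "600k testfile": {"pp": ("8:30", 92), "new": ("2:40", 69)},
    "Nested_Chapter_Test.xml": {"pp": ("4:47", 72), "new": ("1:30", 54)},
}


def ratios(run):
    """Return (time factor, memory factor) of old pretty-printer vs. new."""
    (pp_time, pp_mb), (new_time, new_mb) = run["pp"], run["new"]
    return seconds(pp_time) / seconds(new_time), pp_mb / new_mb


for name, run in runs.items():
    speedup, memory = ratios(run)
    print(f"{name}: time factor {speedup:.2f}, memory factor {memory:.2f}")
```

Both testcases come out at a time factor a bit above 3 and a memory factor of
about 1.33.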
Blocks: 232990
Is any work being done on XML Pretty print at the moment?

I have studied the xsl pretty printer and found that it makes roughly eight
times (8x) larger output; this gets very slow on larger XML files as they begin
to hog large amounts of memory (I often work with files in the range of 500kb
to 5000MB).

The html output file could be minimized easily, by using shorter class names
and similar approaches. The result should be quicker response times (? I
guess), and use of less memory.

I am willing to assist if and where I am needed. ;)
It's much more important to cut down the number of involved elements.
Not sure if class names would have an equivalent impact. (I'm afraid that 
attribute values aren't stored as atoms for class="", though.)
It'd be much more compact to store the information in element names instead of 
spans and classes, though.
classnames are actually stored as atoms so changing the classname to something
shorter will have no effect at all. However what could have a small effect as
far as classes goes is to avoid having more than one class for a single element.
In those cases we store the attribute as a list of atoms which takes much more
space than a single atom.
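
A toy model of the point about atoms, in Python. This is not how Gecko
implements it; AtomTable and parse_class are hypothetical names, and the class
names are made up. It only shows why the length of a class name stops
mattering once it is interned, and why several classes cost more than one.

```python
class AtomTable:
    """Toy intern table: one shared string object per distinct name."""

    def __init__(self):
        self._atoms = {}

    def atomize(self, text):
        # Return the stored object if the name was seen before,
        # otherwise store and return this one.
        return self._atoms.setdefault(text, text)

    def __len__(self):
        return len(self._atoms)


def parse_class(table, value):
    """Store a single class as one atom, several classes as a list of atoms."""
    names = value.split()
    if len(names) == 1:
        return table.atomize(names[0])
    return [table.atomize(name) for name in names]


table = AtomTable()
first = parse_class(table, "attribute-name")
# Build an equal string at runtime so it starts out as a separate object.
second = parse_class(table, "".join(["attribute", "-name"]))
multi = parse_class(table, "attribute-name expander-open")
```

Every element with class "attribute-name" ends up pointing at the same atom, so
a shorter name saves nothing per element; the two-class case needs an extra
list per element.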

However it would probably have a much greater effect to cut down on the number
of elements. Getting rid of the tables would be great if it's possible.

Though the best way to increase the speed of something coming up is bug 208172.
The patch in there increases the speed of the prettyprint stylesheet by a factor
of 6. I'm not marking that bug a blocker of this one though since this bug is
about bloatyness and not speed, just figured i should mention it since comment
10 talked about slowness.

Finally, if you're working with 5GB xml files mozilla won't be the tool to use.
There's no way in hell we'll be able to render files like that prettyprinted in
any sane sort of way.
Hi, and thanks for the feedback. 
I am sure long classnames have no big effect once in the gecko dom layout, but
the fact that the source is xslt transformed into a much bigger document, which
subsequently has to be parsed and rendered, does seem a little bloated to me.
Now consider that the xslt transformation output was 8 times bigger than the
source (I only tested one large example so far, but it should give a rough
factor); I'm guessing it could be streamlined a bit.
Perhaps a version only featuring the folding part, omitting the color-coding
would be beneficial for people working with larger XML documents (500kb+).

Hopefully I will have some time to experiment more with this during the holidays.

(In reply to comment #12)
> Finally, if you're working with 5GB xml files mozilla won't be the tool to use.
> There's no way in hell we'll be able to render files like that prettyprinted in
> any sane sort of way.

Whoops, I meant 5MB not 5000MB, sorry for the mistake.

(In reply to comment #13)

We never serialize and parse the XSLT output, thus the string size doesn't matter,
just the content size does.
(In reply to comment #16)

I'm a little confused by that comment. The XSLT output is the very html code
that shows the pretty printed xml, thus it must be parsed - and it does make a
difference whether the xslt output is 8MB or say 4MB, since it would quite
simply be less data to handle.

I have worked a lot with XSLT, but not with the inner workings of mozilla. Are
you catching the xslt output as a domtree and passing it directly to gecko?
(In reply to comment #17)
No, in Mozilla XSLT output is a DOM document, not its string representation.
Serializing it into a string would be an additional step that isn't done here.
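
A rough illustration of the difference between string size and content size,
using Python's xml.etree.ElementTree as a stand-in for the DOM. The class names
and the build helper are made up for the example and are not taken from the
pretty-print stylesheet.

```python
import xml.etree.ElementTree as ET


def build(class_name, count=1000):
    """Build a tree in memory, the way a transform hands over a document."""
    root = ET.Element("div")
    for i in range(count):
        ET.SubElement(root, "span", {"class": class_name}).text = str(i)
    return root


long_tree = build("attribute-value-with-long-name")
short_tree = build("av")

# Content size: number of element nodes in each tree.
nodes_long = sum(1 for _ in long_tree.iter())
nodes_short = sum(1 for _ in short_tree.iter())

# String size: only exists if somebody serializes the tree.
bytes_long = len(ET.tostring(long_tree))
bytes_short = len(ET.tostring(short_tree))
```

Both trees have the same number of nodes; only the serialized form differs in
size, and that form is never produced when the tree is handed over directly.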
(In reply to comment #18)
ok, that makes good sense. I will return when I have made tests; my goal is to
reduce the xsl output DOM object in size, which to me still seems reasonable
using existing technology and implementation

- however I think on a longer timescale that a more low-level solution would
greatly benefit in terms of speed and memory efficiency.
So as of bug 379683 there are no more tables.  Only divs and spans, some styled as inline-block (and some using the default block styling).

All the uncolored plaintext is now just that -- plaintext.  No spans around it.

So if desired it should be easy to transition to XML instead of HTML if there's a win to it (as suggested by comment 2).  Is there such a win?
I think those benefits are gone these days since we no longer need to case-convert for LREs (i.e. for things like <foo> in the stylesheet). That is all done at compile time.
Niceness :-)

Anyway, I guess there could be still a win, in both content creation and CSS resolution.

Things like <span class="foo"> could just be <foo>, as long as we have only one static class. That way, we don't have to create attributes for those elements, which should make us get rid of some allocations, and some modification events, too. I have no idea if it would help our style resolution if we don't have to access class attributes.
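
A small sketch of what that would save in terms of attributes, again with
Python's xml.etree.ElementTree. Both markup strings are hypothetical; the real
generated markup uses different names.

```python
import xml.etree.ElementTree as ET

# Hypothetical output for one attribute of a pretty-printed element.
HTML_STYLE = '<div><span class="name">a</span><span class="value">1</span></div>'
XML_STYLE = '<attr><name>a</name><value>1</value></attr>'


def count(markup):
    """Return (number of elements, number of attributes) for the markup."""
    elements = list(ET.fromstring(markup).iter())
    return len(elements), sum(len(e.attrib) for e in elements)


html_counts = count(HTML_STYLE)
xml_counts = count(XML_STYLE)
```

The element count is the same in both forms, but the second form creates no
attributes at all.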
Looking at the last profile in bug 197956, we spend about 25000 hits setting attributes, out of 190000 for the XSLT part overall. Given the last comment there (bug 197956 comment 56), which says the XSLT part is now around 50%, that means that we'd save some 5% by fixing this bug.
The XSLT part is about 33%, not 50%, actually.
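
The estimate can be redone for both shares. A quick sketch in Python; it
assumes that all of the attribute-setting time would go away, so the results
are upper bounds, and the variable names are only illustrative.

```python
attr_hits = 25_000    # profile hits spent setting attributes
xslt_hits = 190_000   # profile hits for the XSLT part overall

# Share of the XSLT part that goes into setting attributes (about 13%).
attr_share_of_xslt = attr_hits / xslt_hits


def overall_saving(xslt_share_of_total):
    """Upper bound on the total-time saving if attribute setting vanished."""
    return attr_share_of_xslt * xslt_share_of_total


saving_at_50 = overall_saving(0.50)
saving_at_33 = overall_saving(0.33)
```

With XSLT at 50% of the total that is about 6.6%; with 33% it is about 4.3%.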
For what it's worth, I tried converting this stuff to XML, and didn't see much of a speed or memory difference...  Maybe I didn't do a very good job on the XML.
QA Contact: ashshbhatt → xml
Assignee: axel → nobody
Severity: normal → S3