Closed Bug 512272 Opened 10 years ago Closed 10 years ago

Store CSS in UTF16 to facilitate zero-copy IO

Categories

(Core :: DOM: CSS Object Model, defect, P3)

x86
Linux
defect

Tracking

()

RESOLVED WONTFIX

People

(Reporter: taras.mozilla, Assigned: benedict)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [ts])

Attachments

(2 files, 2 obsolete files)

zwol came up with the idea of storing our css in the target utf format. This would allow us to 
a) save on memory copying
b) utf conversion
Is that really such a good idea?  Vlad has been doing a lot of profiling and IO is by far the biggest bottleneck at least in startup time.  Furthermore, a bunch of us have been doing a lot of work in Bug 506430 to make the UTF 8->16 conversion a lot cheaper in terms of runtime.
Blocks: 506431
I do not believe i/o is anywhere near dominated by UTF-16 high bytes -- profiling data proving this, or it didn't happen :-).

Plus, as this bug's subject says it is facilitating zero-copy (mmap) i/o by changing representation. How much that might win, I don't know, but transcoding won't win, it will be a net loss in cycles retired.

Speeding up conversion still takes time converting. The web standards really want "UCS-2" nowadays stored as UTF-16. Although we might get the transcoding time down to noise for common cases, I'm skeptical this is a good use of coding time, never mind code footprint or cycles.

/be
Whiteboard: [ts]
We're about to land some stuff for optimizing utf8 to utf16 conversion (in bug 560430), at least speeding up the ASCII case.  But there are a lot of wins here, one of the main ones being that we could hand a mmap'd buffer straight to our parsers (js, CSS, others), so at the very least there would be a win in doing many fewer temporary buffer allocations, fewer memory copies, etc.
Well, as one of the CSS parser guys, what I really want (since almost everything in CSS is plain ASCII) is UTF-8.  But we'd have to convert most of layout/style to use UTF-8 internally for that to be a win, I think.

In the mean time, killing two memory copies should more than compensate for having to store 2x as much data on disk, especially when reading from mmapped jar files.
Using UTF16 for css files stored in jar files, will still mean more I/O (even if mmapped, it is still I/O for the operating system), twice the number of bytes, and more memory allocation (even if we can save the read buffer.
The real improvement would be to allow the CSS parser to directly read the UTF8 (or really ASCII) encoded CSS files stored in jar's, instead of first converting UTF8 to UTF16 and then parsing... Parsing UTF8 based CSS files directly from mmap'ed jarfiles would the fatest solution.

It would only require the CSS parser to also handle UTF8 based streams/strings.
(In reply to comment #5)
> Using UTF16 for css files stored in jar files, will still mean more I/O (even
> if mmapped, it is still I/O for the operating system), twice the number of
> bytes, and more memory allocation (even if we can save the read buffer.
> The real improvement would be to allow the CSS parser to directly read the UTF8
> (or really ASCII) encoded CSS files stored in jar's, instead of first
> converting UTF8 to UTF16 and then parsing... Parsing UTF8 based CSS files
> directly from mmap'ed jarfiles would the fatest solution.
> 

Indeed the size will double. From my measurements doubling the size of css files will not make a big difference because the extra io is continuous. We will not land this if it regresses performance in any way. Avoiding 2 memory copies, lower memory usage and no utf16 conversion overhead seems like it would be a win.
Attached patch very rough draft (obsolete) — Splinter Review
this patch kills one memcpy by storing the CSS on disk in UTF-16, and awkwardly short-circuiting in the converterStream.

areas I want to iterate on: getting the CSSParser to read straight from a provided buffer (killing another memcpy), and finding a more elegant way to determine which files are already converted. Ideally, we would just be able to detect the encoding, but currently the detection code returns "UTF8" when given a UTF-16(LE) encoded file.
Attachment #398468 - Flags: review?(tglek)
Attachment #398468 - Attachment is patch: true
Attachment #398468 - Attachment mime type: application/octet-stream → text/plain
> Areas I want to iterate on: getting the CSSParser to read straight from a
> provided buffer (killing another memcpy),

I got this (see bug 513149).

Depending on how that and a bunch of other work comes out, we may want to go back to UTF-8 on disk; running the CSS parser directly on UTF-8, or even the entire style system, is on my list of things to experiment with.  CSS is pure ASCII except for occasional string-type tokens, so that ought to be a win at least by cutting memory bandwidth in half.

I won't be working on it till next week at the earliest, though.
Attached patch readsegment (obsolete) — Splinter Review
changes nsCSSScanner to use ReadSegment, should avoid one more memcopy.

still need to write some tests for this. also, I think I should split my changes to nsConverterInputStream into a different class that just handles already-converted files.
Attachment #398468 - Attachment is obsolete: true
Attachment #398468 - Flags: review?(tglek)
> changes nsCSSScanner to use ReadSegment, should avoid one more memcopy.

So, if I understand that patch, the ReadSegment callback puts the scanner's read pointer directly into the input stream's internal buffer, and then claims to have consumed all of the data in that buffer.  Is that really safe?  How long will the data in that buffer be valid?
> Is that really safe? 

No.

> How long will the data in that buffer be valid?

Per the interface, until the callback function passed to ReadSegments returns.  The stream is free to discard the data immediately afterward.

Now it happens that the only current implementations of the interface either never invalidate the buffer (if the stream wraps a string) or don't invalidate it until the next Read/ReadSegments call (for the converter stream).  But I don't think we want to depend on this.

Basically, ReadSegments is designed for a push model of data flow, and the CSS parser does a pull instead...
> Basically, ReadSegments is designed for a push model of data flow, and the CSS
> parser does a pull instead...

It might not be too hard to turn that around.  We could set up the data stream and call ReadSegments in the various parser entry points, then within the callback, set up the scanner's read pointer and run the parser.  The difficult part would be suspending the parser, instead of generating a spurious EOF, when we hit end-of-segment.
Worst case, should be able to do a small buffer in between though, no?  Pull data until the parser stops pulling, then buffer any of the rest until the next iteration.  Will require some copies, but should be substantially fewer than before.
Instead of doing ReadSegment (which is btw NOT_IMPLEMENTED for jar streams),
one could also try to use the StreamBuffer interface access from the parser the underlying buffer of the stream directly. This will prevent the need to switch the parser from pull to push...
One could easily add nsIStreamBufferAccess support to the JAR input stream,
and in the nsCSSScanner test if the input stream has the nsIStreamBufferAccess interface and then access the 'buffers' of the underlying stream for the token scanning.
> Instead of doing ReadSegment (which is btw NOT_IMPLEMENTED for jar streams),

Why does that matter?  The stream the CSS scanner has is never a jar stream.

> One could easily add nsIStreamBufferAccess support 

Not to nsIUnicharInputStream implementations, which are what the CSS scanner is working with.
But to be clear, we _can_ make API changes to nsIUnicharInputStream if needed.  I believe our string/utf8/converter input streams are the only implementations of it that are floating around.

We could also consider changing stylesheets to use a byte stream, I suppose, but that seems rather counterproductive...
(In reply to comment #16)
> 
> We could also consider changing stylesheets to use a byte stream, I suppose,
> but that seems rather counterproductive...

I'm currently suspecting the CSS scanner/parser would be faster if it did all its internal work in UTF-8, even if the data structures it generates remain UTF-16. (I'm doing some experiments to test that right now - more later today.)  Does that change your opinion there?
It depends.  Is the cost of the UTF-16 to UTF-8 conversion for every single foo.style.whatever call ok?  If so, working in UTF-8 might be reasonable....  That would also work better with all the places where we atomize, since atoms are UTF-8.

The thing is that for the common case of ISO-8859-1 we'd need to convert to UTF-16 and then to UTF-8, right?
Well, it should fast-path in most cases since CSS tends to be pure ASCII.  That would also cover the conversion from iso-8859-n (if we don't have a fast path for that, it shouldn't be hard to add).  I don't know yet, though, how much overhead we'd be getting rid of.  It might not be an overall win for JS manipulation.
Attached patch the actual patchSplinter Review
Sorry, the patch submitted yesterday was incomplete. This doesn't address any of the comments made above, just fixes the previous patch.

This patch depends on https://bug510844.bugzilla.mozilla.org/attachment.cgi?id=398462
Attachment #399373 - Attachment is obsolete: true
this last patch is promising. From initial testing, looks like small css files now load 1.2-2x quicker.

Seems like there is a 15% regression on large files, but that might just be highlighting deficiencies in jar mmap setup.
(In reply to comment #21)
> this last patch is promising. From initial testing, looks like small css files
> now load 1.2-2x quicker.
> 
> Seems like there is a 15% regression on large files, but that might just be
> highlighting deficiencies in jar mmap setup.

after tweaking the mmap code with madvise this way of doing css appears to be conclusively slower. We should put this on hold until until css lexing is done directly on utf8.
I'm pretty opposed to the hardcoding of paths, etc that patch does.  Especially since it is in fact wrong and will break extensions, hard.

Was there a reason to not in fact use the existing @charset detection, with fallback to UTF-8 if the stream doesn't implement ReadSegments, say, and with the jar-maker thing adding the right @charset rule to sheets?

I haven't looked all that carefully through the converter input stream machinery, but wouldn't it be easier to just implement a new subclass of nsIUnicharInputStream that simply wraps an nsIInputStream that has known UTF-16 little-endian data instead of adding branchy codepaths in the converter input stream?
Or is the problem that we don't think we can trust general UTF-16LE data to the extent that we want to trust this data and hence need some magic something that will flag that we should use this high-trust mode?
(In reply to comment #25)
> I'm pretty opposed to the hardcoding of paths, etc that patch does.  Especially
> since it is in fact wrong and will break extensions, hard.

this is a proof-of-concept for benchmarking. A final patch would be very different.
Ah, gotcha.  That wasn't at all clear.  As a POC, looks just fine.  ;)
not a big win until parser is redesigned by zach (bug 91242), per taras -> p3.
Priority: -- → P3
Blocks: 447581
We can revisit this in the future if it looks like the parser, etc, are redesigned enough to make this a win.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Attachment #400579 - Flags: review?(tglek)
(In reply to comment #18)
> It depends.  Is the cost of the UTF-16 to UTF-8 conversion for every single
> foo.style.whatever call ok?  If so, working in UTF-8 might be reasonable.... 
> That would also work better with all the places where we atomize, since atoms
> are UTF-8.
> 
> The thing is that for the common case of ISO-8859-1 we'd need to convert to
> UTF-16 and then to UTF-8, right?

I know this comment is probably useless since this bug is currently already WONTFIX (and IMHO should remain so) but:
- ASCII codepoints (i.e. 0x00 to 0x7F) have identical representation in all three of US-ASCII, ISO-8859-1 and UTF-8 but not in UTF-16.
- ISO-8859-1 bytes 0x80 to 0xFF can be converted directly to UTF-8 just as easily as if the data were UTF-16, since conversion of ISO-8859-1 to UTF-16 simply means "convert byte to word by means of a null high byte". In MASM it's XOR AH,AH (one additional i86 operation with no memory access, which might perhaps even be optimized away depending on how the TO_UTF8 function is coded) or in C it means using an unsigned char rather than an unsigned short for the incoming character data. Sorry, I don't know Motorola machine code but I suspect it's just as easy.

All in all I believe that conversion of ISO-8859-1 to UTF-8 would be made faster by converting directly, *without* using UTF-16 as a waystation.
Yes, but the point is that our existing character conversion infrastructure can convert various encodings to and from UTF-16.  While it would be possible to create new direct-to-UTF-8 converters, it would be a good bit of work.  All solvable, but a nontrivial undertaking.
(In reply to comment #32)
> Yes, but the point is that our existing character conversion infrastructure can
> convert various encodings to and from UTF-16.  While it would be possible to
> create new direct-to-UTF-8 converters, it would be a good bit of work.  All
> solvable, but a nontrivial undertaking.

If nothing else, I suppose you could use the iconv library, which IIUC uses UTF-8 as a pivot. It is available on both Unix (meaning Linux and OSX) and Windows.

All I'm doing here is idle speculation, of course. No need to change anything any time soon.
You need to log in before you can comment on or make changes to this bug.