Closed Bug 786108 Opened 11 years ago Closed 11 years ago

Large SVG files encoded as data URIs make firefox use 100% CPU and allocate several GB of memory, if they post CSS errors to the error console

Categories

(Core :: SVG, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla18

People

(Reporter: pere.jobs, Assigned: seth)

References

Details

(Whiteboard: [MemShrink:P3])

Attachments

(2 files, 6 obsolete files)

1.03 MB, text/html
Details
10.66 KB, patch
seth
: review+
Details | Diff | Splinter Review
Attached file bug.html
User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:14.0) Gecko/20100101 Firefox/14.0.1
Build ID: 20120713224749

Steps to reproduce:

I used a data URI containing a base64-encoded SVG file as the src attribute of an img tag.


Actual results:

Firefox started using 100% CPU and allocated all my memory (3Gb), whereas the base64 data is 1Mb and the svg file in itself is 900Ko. I killed it before it finished.


Expected results:

Firfox should have displayed the image.
Component: Untriaged → SVG
Product: Firefox → Core
Attachment #655822 - Attachment mime type: text/plain → text/html
Daniel, do we end up creating lots of copies of the data: URI for some reason?  If so, why?
Yes, we do, for a somewhat-hilarious reason.

From a non-scientific sampling (3 random instances of interrupting-the-hanging-process-in-gdb), we're spending all our time allocating giant strings, for use in nsCSSScanner::OutputError().  That calls into nsScriptError::InitWithWindowID() (passing in the URL as "mFilename"), and InitWithWindowID() dutifully copies that URL  (" mSourceName.Assign(sourceName);")

So basically, the file has tons of CSS issues, and we report messages to the error console for each one.  And each error-console message apparently gets its own copy of the (huge) URL.  And collectively, these error reports end up eating all our memory.
(In reply to Daniel Holbert [:dholbert] from comment #2)
> So basically, the file has tons of CSS issues

er, s/the file/the encoded SVG document/

And the CSS errors are all instances of:
 "Warning: Unknown property 'writing-mode'.  Declaration dropped."

There are 892 instances of the string "writing-mode" in the encoded SVG document, each of which triggers a warning-message with a ~1 MB URL embedded in it.  So that's 1 GB of memory right there.
So ideally it'd be nice to share the underlying string buffer for this, but I don't know our string classes' magic-sharing-properties well enough to know if that's possible.  (My gut says it's not possible, since the source string (which owns the buffer) is a value that will go away when we close the tab.)

So, a simpler solution might be to just clamp the maximum length of a URL that we include in error console messages.  (say, to the first maximum 1000 characters?)  That'd probably break the ability to click the link and jump straight to the problematic line of source, though, and that's probably bad...
Status: UNCONFIRMED → NEW
Ever confirmed: true
Could we pass an nsIURI (that we already have) instead of a string as the filename, and then reference count it?
That could probably work, yeah.

Side-note: we also spend a long time making a fully copy of the URI for each "style" attribute, because nsCSSScanner::Init() copies our full URI into the member-var "mFilename" (a string).  

If we made mFilename a refcounted URI, as dbaron suggests, that copy would be a lot cheaper be OK.  I'm not sure why we have mFilename at all, actually -- it looks like the only values it can take on are mURI.GetSpec() and the string literal "from DOM" in some special cases.  Maybe we could just use a boolean flag to indicate "mIsFromDOM" in those cases, and drop mFilename entirely.
(apologies for poor editing in prev. comment; it's a bit late :))
(In reply to Daniel Holbert from comment #3)
> There are 892 instances of the string "writing-mode" in the encoded SVG
> document, each of which triggers a warning-message with a ~1 MB URL embedded
> in it.  So that's 1 GB of memory right there.

Fortunately we only keep the last 250 console messages.

(In reply to Daniel Holbert from comment #4)
> So ideally it'd be nice to share the underlying string buffer for this, but
> I don't know our string classes' magic-sharing-properties well enough to
> know if that's possible.
mFilename is a UTF-8 string but the console service wants a UTF-16 string. So every time we log the error, we convert the string from UTF-8 to UTF-16...
So the basic issue here is that the API for initWithWindowID is braindead: it takes a PRUnichar*, which means any sort of sharing goes out the window.

If initWithWindowID took an AString in the IDL, we could definitely have a single string shared across all the error reports.  We'd just need to store the UTF-16 string in the scanner (possibly initialized lazily the first time we need to report an error?) and keep passing it through.

Want to give this a shot and see how it affects memory use here?
Version: 14 Branch → Trunk
(In reply to Boris Zbarsky (:bz) from comment #9)
> Want to give this a shot and see how it affects memory use here?

Sure!

> If initWithWindowID took an AString in the IDL

That's a decently-sized self-contained change on its own (it means every client of this API has to change), so I'm filing a helper-bug for it: bug 789382.
Depends on: 789382
Whiteboard: [MemShrink]
Yes, separate bug on the initWithWindowId bit is totally the way to go.
dholbert, can we assign this to you?
Whiteboard: [MemShrink] → [MemShrink:P3]
Were there any improvements by bug 789382?
I'm not sure -- but I don't think we're expecting any improvements on this particular testcase until we've got the latter half of what bz was suggesting in comment 9. (which I'm hoping to get to soon)
Seth Fowler is going to take a crack at fixing this.

Seth: See bz's comment 9.  We basically want to save the result of the "NS_ConvertUTF8toUTF16" conversion that happens here, in nsCSSScanner::OutputError()...
https://mxr.mozilla.org/mozilla-central/source/layout/style/nsCSSScanner.cpp?mark=410-410#408
...in an nsString member-variable (maybe "mFilenameLazilyConvertedToUTF16").

Then, we want to pass that new member-variable to errorObject->InitWithWindowID() instead of calling the conversion function for every single parse error.
Assignee: dholbert → mfowler
So the problem seems clear now, and I have a fix, but there is a tradeoff involved that we should think about. The problem, as discussed above, is that the nsCSSScanner repeatedly reports errors on the same _huge_ URI and converts it to UTF-16 repeatedly.

I added code to cache the converted URI, but it requires that we not reset mFileName and mURI whenever nsCSSScanner::Close() is called. Since CSSParserImpls (and by extension the nsCSSScanners that they hold) are stored for reuse in a freelist, the downside of not resetting mFileName and mURI is that that memory doesn't get freed until the CSSParserImpl gets reused to parse something else.

In the case of huge data URIs like the one we're dealing with in this bug, this means that this patch will greatly improve performance but will waste a noticeable amount of memory for a nondeterministic amount of time.

We could fix this with some sort of 'garbage collection' that periodically walked the freelist and cleaned out the caches, or some sort of deterministic reclamation based upon the lifetime of the associated document. Not sure going down that path necessarily gains us much over the nondeterministic case, though. Thoughts?
Attachment #664318 - Flags: feedback?(bzbarsky)
Comment on attachment 664318 [details] [diff] [review]
Cache UTF-16 version of URI to prevent the CSS scanner from converting large URIs repeatedly.

Seth, you probably want to add "-U 8 -p" to your diff flags.  Just throwing

  [defaults]
  diff = -U 8

  [diff]
  git = 1
  showfunc = true

should do the trick.

You don't need to explicitly truncate or null things out in the destructor.  That'll happen automatically.

>+        mUTF16FileName.Assign(NS_ConvertUTF8toUTF16(mFileName));

  NS_CopyUTF8toUTF16(mFileName, mUTF16FileName);

but also, why do we need an mFileName at all?  Seems like it's only used to set mUTF16FileName.  So we could just have a stack variable for the UTF-8 string in OutputError once we detect that we have no mUTF16FileName yet, I think.

Good catch on the cached scanners.  One simple solution would be to give the scanner a boolean for "posted filename cleanup event" and in Close() if the boolean is not set throw an event into the event loop, or off a timer if we think we'll hit the event loop in between inline style attrs, to clear out the filename stuff from this scanner.
Attachment #664318 - Flags: feedback?(bzbarsky) → feedback+
Thanks for the feedback, Boris. I've addressed all of your concerns, I believe, although I could not find NS_CopyUTF8toUTF16, so I didn't make that change.

We now only cache the UTF16 version of the filename, and we create it lazily when we actually need it for error reporting. To clean up both the cached filename and the cached URI, as you suggest we trigger an nsRunnable in nsCSSScanner::Close() that takes care of the task. If for some reason the nsCSSScanner is destroyed before the nsRunnable gets a chance to run, we notify the nsRunnable to ensure that it doesn't try to dereference an invalid pointer.

In short, this patch should both address the performance issue we're trying to fix and ensure that the memory used for caching is reclaimed after a short time. Hopefully we've now covered all the bases.

Let me know what you think!
Attachment #664748 - Flags: feedback?(bzbarsky)
I think Boris meant CopyUTF8toUTF16 (no "NS_").
> although I could not find NS_CopyUTF8toUTF16, so I didn't make that change.

Sorry, my bad.  It's called "CopyUTF8toUTF16".  Naming consistency-r-us.  ;)
Comment on attachment 664748 [details] [diff] [review]
Bug 786108 - Cache UTF-16 version of URI to prevent repeated conversions in the CSS scanner, and free the cache after a short time.

Hmm.  I guess we can't use nsRunnableMethod because we're not refcounted, right.

It might be better to rename CancelCleanup() to Revoke(), then make mDeferredCleaner be of type nsRevocableEventPtr<DeferredCleanupRunnable>.  Then you won't need the manual cancel in the destructor (though you'll still need to mDeferredCleaner.Forget() in PerformDeferredCleanup.  And make Close() check mDeferredCleaner.IsPending().

Two more things:

1)  Instead of Clean() maybe make it ResetState() or Reset()?
2)  I think in the setup I describe above you want to Revoke mDeferredCleaner
    if the NS_DispatchToCurrentThread fails.  In your setup, you just want to
    null it out.
Attachment #664748 - Flags: feedback?(bzbarsky) → feedback+
Heh, I had tried some variants of NS_CopyUTF8toUTF16, but I didn't think to leave off the prefix. =) I made that change now. I also switched to nsRevocableEventPtr (nice to see that there's existing support for this pattern!) and took care of the other miscellaneous changes.

NS_DispatchToCurrentThread() failure is now handled, and will cause an immediate cleanup of the cached date. For cleaning up mDeferredCleaner in this situation, I (indirectly) use mDeferredCleaner.Forget() instead of mDeferredCleaner.Revoke(), in the belief that if NS_DispatchToCurrentThread fails the dispatch has not occurred and mDeferredCleaner is the only reference.
Attachment #664318 - Attachment is obsolete: true
Attachment #664748 - Attachment is obsolete: true
Attachment #665062 - Flags: review?(bzbarsky)
fwiw: Seth is working on making a crashtest for this as well.

Also, removing "base64" from the summary since that's not a significant part of this bug. (This affects any long data-URI-encoded SVG image, regardless of whether it's base64-encoded.)
Flags: in-testsuite?
Summary: SVG files encoded as base64 in data URIs make firefox use 100% CPU and allocate several GB of memory → Large SVG files encoded as data URIs make firefox use 100% CPU and allocate several GB of memory
Summary: Large SVG files encoded as data URIs make firefox use 100% CPU and allocate several GB of memory → Large SVG files encoded as data URIs make firefox use 100% CPU and allocate several GB of memory, if they post CSS errors to the error console
Attached is the final version with an appropriate crashtest.
Attachment #665062 - Attachment is obsolete: true
Attachment #665062 - Flags: review?(bzbarsky)
Attachment #665236 - Flags: review?(bzbarsky)
It's got a commit message, but it doesn't have your commit info.
https://developer.mozilla.org/en-US/docs/Creating_a_patch_that_can_be_checked_in
Yup -- Seth, you just need to run "hg qref -U" -- that should add your author-info to the patch headers.

(Pretty sure he's currently got all the stuff from that MDN page -- he just didn't have it all when we originally initialized the patch-file in his queue)
Thanks folks. Fixed the user info issue; patch is otherwise identical.
Attachment #665236 - Attachment is obsolete: true
Attachment #665236 - Flags: review?(bzbarsky)
Attachment #665522 - Flags: review?(bzbarsky)
Comment on attachment 665522 [details] [diff] [review]
Bug 786108 - Cache UTF-16 version of URI to prevent repeated conversions in the CSS scanner, and free the cache after a short time.

r=me.  Looks great!
Attachment #665522 - Flags: review?(bzbarsky) → review+
TryServer run:
  https://tbpl.mozilla.org/?tree=Try&rev=49f4f3c6299d

TryServer run with just the new crashtest, & no bugfix (so it'll hopefully fail):
  https://tbpl.mozilla.org/?tree=Try&rev=4492131caf3c
Carrying forward r+ from bz.

The original crashtest takes a little too long on Android and times out; this updated version adds a different test for Android and skips the original test on that platform. We still want the original test too since beefy machines will probably pass this new test even without the patch.
Attachment #665522 - Attachment is obsolete: true
Attachment #665702 - Flags: review+
so Comment 31's try run looked good, except it didn't run crashtests on android (which is what we were really interested in), because I forgot that the TryChooser invocation for those is different from the other platforms. (since crashtests are split into multiple sub-testsuites on android)

So -- I pushed comment 30's patch to try again (android-specific this time, since this already got a try run for the other platforms, and using the right crashtest invocation now). https://tbpl.mozilla.org/?tree=Try&rev=62ef54e30b7a
Try runs in comment 31 and 32 look good, so this is ready to land!
Pushed: https://hg.mozilla.org/integration/mozilla-inbound/rev/ec0dcd401a3f
Flags: in-testsuite? → in-testsuite+
edmorley backed this out because the crashtest appears to fail on Linux Cipc (crashtests over IPC):  https://hg.mozilla.org/integration/mozilla-inbound/rev/5fbf17bdec32


The failures look like this:
*********************************
REFTEST TEST-START | file:///home/cltbld/talos-slave/test/build/reftest/tests/layout/style/crashtests/786108-1.html | 1549 / 2117 (73%)
[Child 2153] ###!!! ABORT: file ../../../ipc/chromium/src/base/pickle.cc, line 60
[Child 2153] ###!!! ABORT: file ../../../ipc/chromium/src/base/pickle.cc, line 60
TEST-UNEXPECTED-FAIL | file:///home/cltbld/talos-slave/test/build/reftest/tests/layout/style/crashtests/786108-1.html | application timed out after 330 seconds with no output
*********************************
https://tbpl.mozilla.org/php/getParsedLog.php?id=15640070&tree=Mozilla-Inbound

It looks like we're timing out.  Looks like IPC might be similar to Android, in terms of how much time this bug's crashtest takes.

(In retrospect, it actually looks like TryServer had this error too, in the first run in comment 29.  Not sure why I didn't notice it at the time -- maybe the Cipc run was delayed or something.)

In any case -- looks like we need some more fine-tuning of the tests.  Maybe the less-taxing "786108-2" variant is sufficient to bust all of our testers in un-patched builds? If so (and if it doesn't mess up Cipc like 786108-1 does), then maybe we could *only* use that crashtest.
Flags: in-testsuite+ → in-testsuite?
(Alternately: I wonder if there's some additional overhead involved with posting to the error console in IPC? I believe the idea in IPC builds is that the UI is on a different thread from the rendering process, so maybe there are additional copies of error-console messages introduced from that separation, and we lose the benefit of string-sharing... (?)  Just throwing out a possibility.)
There's a good chance that for IPC builds we make a new copy for each IPC message, yes.  :(
Well sure, but what happens in the one-process case if a user actually opens the console?
Yes, there's a huge difference in memory consumption and performance between running the 786108-1.html crashtest with and without IPC.

Without IPC, on my machine, it takes ~20 seconds with ~140 MB maximum memory consumption.

With IPC, it takes ~300 seconds with ~870 MB maximum memory consumption.

Chris, you're right: if the user has the console open, memory usage is massive and performance takes a nosedive. Rendering these huge error messages resulted in a memory consumption of ~10 GB on my machine.

I'm going to try a new version of the crashtest that should hopefully crash without the patch on at least some of the build machines; if that works, I'll update the patch appropriately.

I suspect that I should file two additional bugs here, one for the IPC-related issue and one for the error console issue (likely there we need a UI change to avoid rendering the huge URL while still storing it internally to make it clickable).
The try for the less brutal crashtest is running here: https://tbpl.mozilla.org/?tree=Try&rev=ed6c89b2c1ad
(To clarify, that's the case _without_ the patch. It's expected to fail. Should've updated the commit message.)
(Carrying forward approval from bz.)

OK, the less brutal crashtest does not seem to do a good job of producing the crash. The best bet seems to me to disable all crashtests for this bug from running on the IPC version, and address the remaining issues (IPC, error console rendering) in separate bugs, since as far as I can tell they are really distinct problems.

This is the same patch, but with 'browserRemote' added to the skip-if clauses on both crashtests.

I've set up a try run for this version here: https://tbpl.mozilla.org/?tree=Try&rev=3b89c125f1ed
Attachment #665702 - Attachment is obsolete: true
Attachment #666141 - Flags: review+
I was trying to suggest in comment 37 that the IPC case is representative of what would happen if a user actually opened the console.  Just disabling the IPC test seems like a band-aid; it won't resolve the underlying crash bomb.  Unless I'm missing something.
Well the user doesn't usually open the console ...

but yeah, we should fix that too.
Chris: yeah, I wasn't clear whether the two situations were identical internally or just analogous, but I agree this needs to be fixed as well. I'm in the process of writing up a bug in another tab.

However, this patch fixes a different problem, and the user won't experience any issue once this patch lands except if they're using an IPC build or they open the error console. I think it makes sense to separate these things into different bugs. Landing this patch would be a strict improvement on the situation that existed before.
Whoops, realized I forgot to put the bug numbers for the related bugs here.

The error console issue is being resolved in bug 796179.
The IPC issue is being discussed in bug 795534.

Neither of those issues should block this patch from landing; they're separate except insofar as this test case can trigger all of them in different circumstances.
Try run from comment 41 looks good, so I pushed Seth's latest patch here (which has the test disabled for Cipc, per comment 34 etc.):
 https://hg.mozilla.org/integration/mozilla-inbound/rev/94ab602328ff

Before pushing, I also tweaked the crashtest.list file to mention bug 795534 as the reason for the "skip-if(browserIsRemote)" annotation.
Status: NEW → ASSIGNED
Flags: in-testsuite? → in-testsuite+
https://hg.mozilla.org/mozilla-central/rev/94ab602328ff
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla18
You need to log in before you can comment on or make changes to this bug.