Closed
Bug 1106264
Opened 10 years ago
Closed 9 years ago
Large OOM in nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase
Categories
(Firefox :: Session Restore, defect)
Tracking
People
(Reporter: alex_mayorga, Unassigned)
References
Details
(Keywords: crash)
Crash Data
Attachments
(3 files)
1.65 KB, patch | jimm: review+, ritu: approval-mozilla-aurora+ | Details | Diff | Splinter Review
702.70 KB, application/x-gzip | Details
722.60 KB, application/x-gzip | Details
This bug was filed from the Socorro interface and is report bp-ece146fc-dc12-41fa-a8a4-b8b032141129. =============================================================
This is by no means a new crash (the oldest report points to Firefox 30), but it's reaching a significant volume. There have been 4988 reports of this crash in the last week, and it accounts for nearly 12% of all DOM crashes.

Firefox 40 - #275 with 3 crashes
Firefox 39 - #56 with 42 crashes
Firefox 38 - #26 with 989 crashes
Firefox 37 - #36 with 2923 crashes

[Tracking Requested - why for this release]: By rank this does not yet count as a topcrash, but it is our #1 crash for DOM right now.
tracking-firefox37:
--- → ?
Component: General → DOM
Product: Firefox → Core
Version: 34 Branch → Trunk
Reporter
Comment 2•9 years ago
¡Hola Anthony! Still a problem on 38.0b99.

Report ID: bp-1f570d4d-a3f3-47aa-967b-b5c162150506, submitted 06/05/2015 05:15 p.m.

Sadly I still don't have STR; the salient symptoms just prior to the crash are heavy usage of the "Physical Memory" at 98% or more, and major black artifacting on the pages that are loaded or loading in Firefox Beta. Please let me know if there's anything worth collecting from the affected system. Shall tracking be re-requested for 38? ¡Gracias!
status-firefox37:
--- → affected
status-firefox38:
--- → affected
status-firefox39:
--- → affected
status-firefox40:
--- → affected
Flags: needinfo?(anthony.s.hughes)
Unfortunately there isn't much we can do without STR. Perhaps the next time you encounter these symptoms you could capture a profile using the Gecko Profiler addon. Additionally, an about:memory log would be useful.
Flags: needinfo?(anthony.s.hughes)
Comment 4•9 years ago
Marking Firefox 37 as no for tracking and wontfix for status, as this release will be EOL when Firefox 38 is released on Tuesday.
Comment 5•9 years ago
[Tracking Requested - why for this release]: This crash signature is now 1.1% of 39.0b2 data so far.
tracking-firefox39:
--- → ?
Comment 6•9 years ago
This is OOM. Not sure what we can do about this.
Comment 7•9 years ago
It sounds like a fix is unlikely until we have better steps to reproduce the bug. Andrew, since this seems to be a crash signature on the rise for 39, can you find someone to look into it a little more? I can also track this and we can keep an eye on the crashes.
tracking-firefox40:
--- → +
Flags: needinfo?(overholt)
Comment 8•9 years ago
This is rather hard to debug based on the crash-stats. Some JS code somewhere calls sendAsyncMessage(msg, someLargeData); and we don't have enough memory to serialize that large data.
Comment 9•9 years ago
But we're doing some structured clone data copying. Do we really need to? Could we perhaps just share the data? This is all within one process. Perhaps baku has some ideas.
Flags: needinfo?(amarchesini)
Comment 10•9 years ago
(In reply to Olli Pettay [:smaug] (Reviewing overload, no new review requests before June 7, please) from comment #8)
> This is rather hard to debug based on the crash-stats. Some JS code
> somewhere calls sendAsyncMessage(msg, someLargeData); and we don't have
> enough memory to serialize that large data.

Well, for one thing, if it's OOM, I think we should actually fail the right way and annotate the allocation size on the crash. For another, should we really let JS crash us on a large allocation? Shouldn't we make this fallible instead and return an error, catching it on the JS side?
Comment 11•9 years ago
JS passes something large to the C++ side, and then C++ crashes when it copies the serialized form of some data it got from JS. The issue is that this is in message manager messaging, so presumably it is some browser functionality which is causing the issue. Is that functionality important enough that we'd rather crash than break it in case of OOM? Hard to say without knowing what this is about. But I believe we should be able to use less memory here.
Comment 12•9 years ago
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #10)
> Well, for one thing, if it's OOM, I think we should actually fail the right
> way and annotate allocation size on the crash.

Funny enough, I just added that annotation in bug 1167189 comment 8. I'll request uplift so we can get a handle on the size of these allocations.
Comment 13•9 years ago
> Perhaps baku has some ideas.
There is a bug on file about using shared memory for big strings in postMessage, but nothing beyond that.
I'll talk with bent to find out if we have any plan to support something similar for IPDL; maybe we can have something shared.
Flags: needinfo?(overholt)
Flags: needinfo?(amarchesini)
Comment 14•9 years ago
Well, this has nothing to do with IPDL.
Updated•9 years ago
Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*)] → [@ mozalloc_abort(char const* const) | NS_DebugBreak | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*)]
[@ OOM | large | N…
Summary: crash in mozalloc_abort(char const* const) | NS_DebugBreak | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*) → Large OOM in nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase
Comment 15•9 years ago
Probably too late to take anything in 39; we could still take a patch for 40.
Comment 17•9 years ago
Basically forever. We have crash reports going back at least as far as Fx17. (The signatures were different before bug 982646)
Flags: needinfo?(lhenry)
Comment 18•9 years ago
But we started seeing it in higher volume during 34, and it continued to climb in 35, 36, 37. I suspect that's just because we've eliminated many large OOMs that may have hidden this earlier. I.e. memory-constrained machines would have previously crashed somewhere else; now that we've fixed those callsites, they're starting to crash here.
Comment 19•9 years ago
Interesting, and useful data. I was worried that we had added some new message manager usage where some large data packages are sent via the in-process message manager.
Comment 20•9 years ago
Now that we have allocation size numbers... by eyeball most of these are 1-2MB but occasionally going as high as 10MB. Looks like this is going to be our top OOM on beta. Any chance for a fix in 40?
Flags: needinfo?(bugs)
Comment 21•9 years ago
Do we have any correlation with addons? If FF itself was using the message manager for such large messages, that clearly indicates a bug. (But thinking about whether there is anything safe for beta.)
Comment 22•9 years ago
The correlation files don't show any addons at all for that signature. (I assume that means no addons met the threshold for being suspicious)
Comment 23•9 years ago
felipe, can you think of anything in FF which might pass _huge_ messages via the message manager?
Updated•9 years ago
Flags: needinfo?(felipc)
Comment 24•9 years ago
I don't know of anything that would get close to an OOM situation unless things are being leaked. I know that the select boxes can send a big amount of data if there's a <select> with many <option>s (e.g. the "Component" dropdown in this bugzilla page: https://bugzilla.mozilla.org/request.cgi?action=queue&requestee=nobody%40mozilla.org&group=type ), but I doubt that would get past 1MB or so. I don't know if thumbnails use the message manager to do anything, but if they do, that would be somewhere to look. I guess it wouldn't be hard to add some logging when a message larger than 1MB is received, right? And see if we spot it during mochitests or in the wild by logging to the browser console.
Flags: needinfo?(felipc)
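The logging idea above could look roughly like this minimal sketch. Everything here is hypothetical (the names `LogIfOversized` and `kLogThreshold` are not the Gecko API, and the real diagnostic patch landed later took a different form); it just illustrates a size-threshold check at the dispatch point.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical sketch: report any message payload over a threshold.
// In the browser this output would go to the Browser Console; here, stderr.
constexpr size_t kLogThreshold = 1 << 20;  // 1 MiB, per the comment above

bool LogIfOversized(const std::string& aMessageName, size_t aPayloadBytes) {
  if (aPayloadBytes <= kLogThreshold) {
    return false;  // nothing to report
  }
  std::fprintf(stderr, "Large message: %s (%zu bytes)\n",
               aMessageName.c_str(), aPayloadBytes);
  return true;
}
```

A wrapper like this, called from the dispatch path, would flag offenders during mochitests without affecting normal-size messages.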
Comment 25•9 years ago
Jim: do you know if there's any usage from the Thumbnails service that could be passing huge messages through the message manager?
Flags: needinfo?(jmathies)
Comment 26•9 years ago
(In reply to :Felipe Gomes [away Jul 8 - 21, will reply to needinfos when back] from comment #25)
> Jim: do you know if there's any usage from the Thumbnails service that could
> be passing huge messages through the message manager?

That was my first thought reading this bug. http://mxr.mozilla.org/mozilla-central/source/toolkit/components/thumbnails/PageThumbs.jsm#291

We really need to identify the caller here; knowing size alone isn't helpful.
Flags: needinfo?(jmathies)
Comment 27•9 years ago
> We really need to identify the caller here, knowing size alone isn't helpful.
Can you elaborate why you're interested in the caller? Do you suspect that these huge messages shouldn't be happening at all?
If there's reasonable chance that the message size is legit, maybe we can just use a fallible allocator and call it a day?
Comment 28•9 years ago
(In reply to David Major [:dmajor] from comment #27)
> > We really need to identify the caller here, knowing size alone isn't helpful.
>
> Can you elaborate why you're interested in the caller? Do you suspect that
> these huge messages shouldn't be happening at all?

I have no idea, that's why I'm interested. Sounds like we could optimize whatever this API is doing; serializing large chunks of data can't be good for performance.

> If there's reasonable chance that the message size is legit, maybe we can
> just use a fallible allocator and call it a day?

I don't know how this fixes this problem:

Total Virtual Memory: 2,147,352,576
Available Virtual Memory: 266,911,744
System Memory Use Percentage: 79%
Available Page File: 1,321,902,080
Available Physical Memory: 702,009,344

Why would this system OOM trying to serialize a large message? Does a fallible allocator fix the problem?
Comment 29•9 years ago
> Available Virtual Memory
> 266,911,744
For large allocations the limiting factor is contiguous address space. We may have 266 megs available in total, but it's fragmented into a bunch of small regions. With our current allocator, once we get below roughly 300 megs, the available regions are usually not larger than 1 megabyte in size. So in general, we should not expect an allocation over 1MB to succeed.
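The fallible-allocation alternative being discussed can be sketched in plain C++. This is only an illustration (Gecko has its own fallible allocation machinery rather than `std::nothrow`, and `TryAllocSerializationBuffer` is a made-up name): the point is that the multi-megabyte request becomes a recoverable error instead of an abort.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Illustrative only: attempt a large buffer without aborting on failure.
// new (std::nothrow) returns nullptr instead of throwing on OOM, so the
// caller can report an error (or retry later) rather than crash.
std::unique_ptr<uint8_t[]> TryAllocSerializationBuffer(size_t aBytes) {
  return std::unique_ptr<uint8_t[]>(new (std::nothrow) uint8_t[aBytes]);
}
```

A caller would check the result for null and propagate a failure code up to the JS side instead of letting the allocator abort.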
Jim, Olli, we are hitting this crash on Beta channel at about 1.4%. Given the high volume, could you please help find an owner? Thanks!
Flags: needinfo?(jmathies)
Comment 31•9 years ago
(In reply to David Major [:dmajor] from comment #29)
> > Available Virtual Memory
> > 266,911,744
>
> For large allocations the limiting factor is contiguous address space. We
> may have 266 megs available in total, but it's fragmented into a bunch of
> small regions. With our current allocator, once we get below roughly 300
> megs, the available regions are usually not larger than 1 megabyte in size.
> So in general, we should not expect an allocation over 1MB to succeed.

So if we switch to fallible allocation here per your suggestion in comment 27, does that mean we get better crash signatures with caller info? This wouldn't help us on beta, but we need to start somewhere.

Also, if these machines are memory starved and our allocator barfs in these situations, what's the proper fix here: try to reduce the allocation size of the caller, whatever that is? This feels like a bug that doesn't have a real chance of getting fixed barring ripping our allocator out and using something better.
Flags: needinfo?(jmathies)
Updated•9 years ago
Flags: needinfo?(dmajor)
Comment 32•9 years ago
There are several possible approaches for dealing with large OOMs. I think in this case we'd be looking at the last one.

- Make the allocation smaller. Data structure rearrangements, lazy construction, general MemShrink-type work.
- For long arrays, use a SegmentedVector to break the allocations into easier-to-swallow pieces.
- Use a fallible allocator. This papers over the problem but in practice it can be good enough. Some users will be hopelessly low on memory and will just go on to crash somewhere else instead. We can't do much about that. But for other users, a fallible allocation might let them get by just enough to finish what they were doing, or close a tab, or until a GC kicks in, etc. Average uptimes get longer. This becomes more and more true with larger allocation sizes (larger proportion of users who can't survive this allocation but could survive subsequent smaller ones).
Flags: needinfo?(dmajor)
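The SegmentedVector option in the list above can be illustrated with a toy chunked buffer. This is a deliberate simplification (the real `mozilla::SegmentedVector` is a generic template with very different internals); it only shows why chunking sidesteps the contiguous-address-space problem: no single allocation ever exceeds the chunk size.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy sketch: store a large payload as fixed-size chunks so that no
// multi-megabyte contiguous allocation is ever requested.
class SegmentedBuffer {
 public:
  explicit SegmentedBuffer(size_t aChunkSize) : mChunkSize(aChunkSize) {}

  void Append(const uint8_t* aData, size_t aLen) {
    while (aLen > 0) {
      if (mChunks.empty() || mChunks.back().size() == mChunkSize) {
        mChunks.emplace_back();           // start a new chunk
        mChunks.back().reserve(mChunkSize);
      }
      std::vector<uint8_t>& chunk = mChunks.back();
      size_t room = mChunkSize - chunk.size();
      size_t take = aLen < room ? aLen : room;
      chunk.insert(chunk.end(), aData, aData + take);
      aData += take;
      aLen -= take;
    }
  }

  size_t Length() const {
    size_t total = 0;
    for (const auto& c : mChunks) total += c.size();
    return total;
  }

  size_t ChunkCount() const { return mChunks.size(); }

 private:
  size_t mChunkSize;
  std::vector<std::vector<uint8_t>> mChunks;  // each at most mChunkSize bytes
};
```

With, say, 1 MiB chunks, a 10 MB serialized message would need ten 1 MiB regions rather than one 10 MB contiguous region, which is far more likely to succeed in a fragmented address space.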
Comment 33•9 years ago
So based on that, to fix this we need information on the callers. Looks like we have a message string here we could add to annotations. I think adding debugging on m-c to get a feel for which calls commonly fail here is the next step. http://hg.mozilla.org/releases/mozilla-beta/annotate/664b98d33a1b/dom/base/nsFrameMessageManager.cpp#l2099 This is an ipc-related abort on machines with limited memory. Moving this into our e10s triage.
tracking-e10s:
--- → ?
Comment 34•9 years ago
This bug is not about e10s or ipc. This is about _same_process_ messaging.
Flags: needinfo?(bugs)
Comment 35•9 years ago
(In reply to Olli Pettay [:smaug] (away-ish until July 26, expect slower than usual reviews) from comment #34)
> This bug is not about e10s or ipc. This is about _same_process_ messaging.

Hmm, ok, then it's not suitable for e10s triage as the e10s team won't pick it up. I'm at a loss for an owner here; anyone who knows the crash reporter and ipc a bit should be able to add some additional annotations to help diagnose.
Comment 37•9 years ago
Comment on attachment 8637304 [details] [diff] [review]
Log the message name

Yep, that looks good to me.
Attachment #8637304 -
Flags: review?(jmathies) → review+
Comment 39•9 years ago
Comment on attachment 8637304 [details] [diff] [review]
Log the message name

Note that this is a diagnostic patch only.

Approval Request Comment
[Feature/regressing bug #]: unknown
[User impact if declined]: The volume of this crash is so low on Nightly that we won't get enough data.
[Describe test coverage new/current, TreeHerder]: nope
[Risks and why]: very low risk as it only affects codepaths where we've already decided to crash
[String/UUID change made/needed]: none
Attachment #8637304 -
Flags: approval-mozilla-aurora?
Keywords: leave-open
Comment on attachment 8637304 [details] [diff] [review]
Log the message name

Diagnostic messaging is good. Let's uplift to Aurora.
Attachment #8637304 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 42•9 years ago
https://hg.mozilla.org/releases/mozilla-aurora/rev/22d307fa2327
status-firefox41:
--- → affected
status-firefox42:
--- → affected
tracking-firefox41:
--- → ?
tracking-firefox42:
--- → ?
Tracked.
Comment 44•9 years ago
Just two so far, but interesting:

2385920, SessionStore:update
2927296, SessionStore:update

Does anyone know who currently works on session store?
Comment 45•9 years ago
(In reply to Jim Mathies [:jimm] from comment #44)
> Does anyone know who currently works on session store?

Bill would know.
Flags: needinfo?(wmccloskey)
Tim, we're OOMing trying to send the SessionStore:update message. I guess we could try to break up the data. The other option would be to have sendAsyncMessage throw in this case.
Flags: needinfo?(wmccloskey) → needinfo?(ttaubert)
Comment 47•9 years ago
Breaking up the data and supporting incremental updates shouldn't be too hard. Can we figure out which data the message includes that makes us OOM? I fear it might be DOMSessionStorage where it might be caused by a single huge entry that we couldn't split. What would we do when sendAsyncMessage() throws? Just retain the data and keep trying?
Flags: needinfo?(ttaubert)
(In reply to Tim Taubert [:ttaubert] from comment #47)
> Breaking up the data and supporting incremental updates shouldn't be too
> hard. Can we figure out which data the message includes that makes us OOM? I
> fear it might be DOMSessionStorage where it might be caused by a single huge
> entry that we couldn't split.

We should probably assume that DOMSessionStorage is possible then. It might be hard to find out too much, though, since we probably don't have enough memory to serialize the message into the minidump.

> What would we do when sendAsyncMessage()
> throws? Just retain the data and keep trying?

Yeah, I'm not sure what else to do. Hopefully we've fired a memory pressure event and things will get better.
Comment 49•9 years ago
smaug stated that this code only executes for in-process messaging (comment 34); why is session store sending huge blocks of data over a "local" connection?
It uses the message manager regardless of whether we're in e10s or not. Generally that makes for a nicer design.

Avoiding the structured clone is hard in general. If we just pass the object from the sender to the receiver, then the receiver can modify it and the modifications will be visible to the sender. Also, it has very different semantics for things like XPCOM objects, which don't work under structured clone.

One horrible hack we could do here is to check if we're in the chrome process. If we are, then we can send the data using the CPOWs argument rather than the data argument. We'd have to make sure that we don't modify the data in the session store code, but we could probably make it work.

We'll still have the same problem in e10s, but maybe we can put off fixing it until we get multiple content processes. At that point, address space limitations should be much less of an issue. Or maybe people will actually start using 64-bit Windows builds.
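The copy-vs-share trade-off described above can be shown with a toy C++ example (all names hypothetical; real structured clone is far more involved, but the ownership semantics are the same): a deep copy isolates the receiver's mutations from the sender, while a shared reference does not.

```cpp
#include <memory>
#include <vector>

// Deep copy, analogous to structured clone: the receiver gets its own
// data, so its mutations stay local to the copy.
std::vector<int> ReceiveByCopy(std::vector<int> aData) {
  aData.push_back(42);  // receiver-side mutation, invisible to the sender
  return aData;
}

// Shared reference, analogous to the proposed "just share the data"
// optimization: the receiver's mutation is visible to the sender too.
void ReceiveShared(const std::shared_ptr<std::vector<int>>& aData) {
  aData->push_back(42);
}
```

This is why sharing requires a guarantee (or an enforced immutability convention) that neither side mutates the payload.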
Comment 51•9 years ago
We aren't seeing a lot of these on nightly and aurora. Of those that have shown up, 99% are session updates. I found one sdk related message ('sdk/remote/process/message') and that's it.
Comment 52•9 years ago
This is now wontfix for 40.
Comment 53•9 years ago
out of 100 reports on beta now: {u'vdh': 3, u'sdk/remote/process/message': 4, u'ContextMenu:SetAsDesktopBackground:Result': 2, u'SessionStore:update': 91}
Component: DOM → Session Restore
Product: Core → Firefox
Comment 54•9 years ago
Bill, it looks like this is indeed SessionStore. Can we do anything in time for 41?
Flags: needinfo?(wmccloskey)
Comment 55•9 years ago
njn, this might interest you. Perhaps we could use SegmentedVector somewhere here.
Flags: needinfo?(n.nethercote)
Comment 56•9 years ago
mData is a JSAutoStructuredCloneBuffer. Another option is to try to reuse the data and not do extra cloning.
Comment 57•9 years ago
(In reply to Olli Pettay [:smaug] (high review load) from comment #55)
> njn, this might interest you. Perhaps we could use Segmented Vector
> somewhere here

Thanks for the tip. I'll take a look.
Flags: needinfo?(n.nethercote)
Brad, I am trying to follow up on tracked 41 bugs. Given that this is now a top crasher for Nightly, could you please help find an owner? That might be one step in trying to see if we have a speculative fix for FF41/42.
Flags: needinfo?(blassey.bugs)
Comment 59•9 years ago
(In reply to Ritu Kothari (:ritu) from comment #58)
> Brad, I am trying to follow up on tracked 41 bugs. Given that this is now a
> top crasher for Nightly, could you please help find an owner? That might be
> one step in trying to see if we have a speculative fix for FF41/42.

This was very rare a couple weeks ago, do we know what bug regressed this?
Comment 60•9 years ago
This isn't a top crasher on nightly; it's a bad crash in release (I can't get stats off crash-stats for 40.0.2??). 43 isn't listed in the meta report. https://crash-stats.mozilla.com/report/list?signature=OOM%20|%20large%20|%20NS_ABORT_OOM%28unsigned%20int%29%20|%20nsSameProcessAsyncMessageBase%3A%3AnsSameProcessAsyncMessageBase%28JSContext*%2C%20nsAString_internal%20const%26%2C%20mozilla%3A%3Adom%3A%3AStructuredCloneData%20const%26%2C%20JS%3A%3AHandle%3CT%3E%2C%20nsIPrincipal*%29#tab-sigsummary

Poking at a few of the reports, it looks likely that this is caused by the same session store code. I see really big allocations on memory-starved systems. This has nothing to do with e10s either.
Flags: needinfo?(wmccloskey)
Updated•9 years ago
Flags: needinfo?(ttaubert)
Comment 61•9 years ago
Not sure what more information I can provide... It sounds like making sendAsyncMessage() throw and simply keep trying to send that data might be the best we can do.
Flags: needinfo?(ttaubert)
Comment 62•9 years ago
Ritu, this isn't a top crasher in Nightly and isn't e10s related. Did you NI me on the wrong bug?
Flags: needinfo?(blassey.bugs) → needinfo?(rkothari)
Brad, my bad. I meant to n-i billm.
Flags: needinfo?(rkothari)
It is getting too late to fix this in FF41 given that we don't have a patch ready. Let's try to get this fixed in 42.
Comment 65•9 years ago
(In reply to Nicholas Nethercote [:njn] from comment #57)
> (In reply to Olli Pettay [:smaug] (high review load) from comment #55)
> > njn, this might interest you. Perhaps we could use Segmented Vector
> > somewhere here
>
> Thanks for the tip. I'll take a look.

(In reply to Tim Taubert [:ttaubert] from comment #61)
> Not sure what more information I can provide... It sounds like making
> sendAsyncMessage() throw and simply keep trying to send that data might be
> the best we can do.

Tim, Nick - it isn't clear which of those paths is going to be taken here. Could you please work out a plan and an owner?
Flags: needinfo?(ttaubert)
Flags: needinfo?(n.nethercote)
Comment 66•9 years ago
Would either of these options risk breaking session store for these systems, if the system is starved for memory and every attempt to update session data fails? Also, it sounds like we need a bug filed and prioritized by a team that aims to break this data up so we can really fix the problem.
Comment 67•9 years ago
(In reply to Jim Mathies [:jimm] from comment #66)
> Would either of these options risk breaking session store for these systems?
> If the system is starved for memory and every attempt to update session data
> fails.

Yeah, that would mean we wouldn't update session data for one or multiple tabs. That's somewhat related to bug 581510 and bug 903621 where the same is happening in different places. We wouldn't crash (so that's better) but we could also lose data if we can't find memory before Firefox quits.

> Also, sounds like we need a bug filed and prioritized by a team that aims to
> break this data up so we can really fix the problem.

We don't have a "team" for sessionstore unfortunately; the latest changes we made were mostly to make it work with multiple processes, so some of the e10s folks and I worked on that.

Breaking this data down is not an easy thing to do if it's really caused by one "type" of data. It's a non-negligible amount of work to implement incremental updates from child processes. But if even those incremental updates are huge (what we guess happens here with DOMSessionStorage) then we can't split that data. We could try to compress it, but that operation needs memory too.

The biggest problem though (I think) is that even if we somehow rewrite code and manage to get this data over in chunks without going OOM, then we very likely can't save that to disk anyway. We can't create that huge JSON string we want to send to the session worker that handles our I/O.
Flags: needinfo?(ttaubert)
Comment 68•9 years ago
(In reply to Tim Taubert [:ttaubert] from comment #67)
> (In reply to Jim Mathies [:jimm] from comment #66)
> > Would either of these options risk breaking session store for these systems?
> > If the system is starved for memory and every attempt to update session data
> > fails.
>
> Yeah that would mean we wouldn't update session data for one or multiple
> tabs. That's somewhat related to bug 581510 and bug 903621 where the same is
> happening in different places. We wouldn't crash (so that's better) but we
> could also lose data if we can't find memory before Firefox quits.

My obviously not great suggestion would be to make sendAsyncMessage() throw so that we can catch that in the content script and simply keep trying. That would be in line with the somewhat shitty UX from bug 581510 & bug 903621, but at least we don't crash.
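The throw-and-retry idea discussed here can be sketched as follows. All names are hypothetical (`PendingUpdate`, `FlushOrRetain`, and the callback standing in for a fallible sendAsyncMessage are illustrative, not the eventual fix): on failure the caller simply retains the payload and retries on the next flush.

```cpp
#include <functional>
#include <string>

// Hypothetical sketch of "throw and keep trying": the caller retains the
// session update until a fallible send succeeds.
struct PendingUpdate {
  std::string payload;
  bool delivered = false;
};

// aTrySend models a fallible sendAsyncMessage: returns false on OOM.
void FlushOrRetain(PendingUpdate& aUpdate,
                   const std::function<bool(const std::string&)>& aTrySend) {
  if (aUpdate.delivered) return;
  if (aTrySend(aUpdate.payload)) {
    aUpdate.delivered = true;  // success: drop the retained copy
  }
  // On failure we keep the data and try again on the next flush, hopefully
  // after a memory-pressure event has freed some address space.
}
```

The cost, as noted above, is potential data loss if memory never frees up before shutdown; the benefit is turning a crash into a retryable error.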
Comment 69•9 years ago
I experienced two crashes in the last two days: https://crash-stats.mozilla.com/report/index/c98c0575-04f0-49bc-933f-9aedf2150909 https://crash-stats.mozilla.com/report/index/1eec3c5b-2d99-41e2-bda5-1ee8d2150911 Before each occurred I saw black painting on content. Shortly before the second crash I took a memory report. I haven't read through this entire bug, so if you guys have this issue fully grasped, great, but if you need info from me like the memory report, my graphics info or more, please let me know. Twice in two days is quite the escalation and I'd love to see it fixed.
Updated•9 years ago
Flags: needinfo?(n.nethercote)
Comment 70•9 years ago
Comment 71•9 years ago
As I mentioned previously, I've been experiencing this crash somewhat frequently and so when I saw it was about to crash (because of the black drawing) I was able to generate an anonymized memory report shortly before 2 separate crashes. I have attached the two memory reports and their associated crashes ( 1eec3c5b-2d99-41e2-bda5-1ee8d2150911 and https://crash-stats.mozilla.com/report/index/00cb32d2-6533-4a1c-869c-6a4bf2150917 ).
Comment 72•9 years ago
Caspy, thanks for the reports, I think the problem is understood at this point. Your crash reports show a recurring high memory usage so you are likely to have problems even after this bug is fixed. (Fixing this bug will mainly benefit those users who only occasionally spike into high memory territory.) Perhaps you may want to open a separate bug for "I frequently run out of memory" where you can get advice from the MemShrink team.
Comment 74•9 years ago
dmajor thinks that it is actually actionable. Tracking it again. David, I think I remember you told me that you worked on the session mechanism; would you be able to help here?
Comment 75•9 years ago
For the record I understand that there's concern about the complexity of implementation here. Perhaps this may require a more involved fix that isn't uplift material. If so, that's fine; it's still better than accepting this high-volume crash signature forever. That's why I'd still like to keep it on the tracking radar so that we don't forget about this bug.
I see the following ways forward.

1/ Make sendAsyncMessage throw, as suggested by billm. As mentioned by Tim, this would, however, leave us with users who cannot write their Session Restore data to disk, so this is at best a temporary measure, which should perhaps be accompanied by a UX "Firefox is running low on memory, you should restart now".

2/ Bite the bullet and make Session Restore updates piecewise. This is possible, I had an early prototype a long time ago, but this is lots of delicate work and I am not planning to work on this again. Also, as Tim mentions, this still leaves us with the problem that we may not be able to write such large JSON to disk in the first place.

3/ Bite the other bullet and move Session Restore to sqlite. There is a chance that this might have a negative impact on startup time, but this should improve jank and get us rid of these insane issues into which we run with JSON.

Any thoughts, Tim?
Flags: needinfo?(dteller) → needinfo?(ttaubert)
Comment 77•9 years ago
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #76)
> 1/ Make sendAsyncMessage throw, as suggested by billm. As mentioned by Tim,
> this would, however, leave us with users who cannot write their Session
> Restore data to disk, so this is at best a temporary measure, which should
> perhaps be accompanied by a UX "Firefox is running low on memory, you should
> restart now".

That might be the easiest fix for now. It doesn't make OOM situations any easier, but it doesn't crash Firefox, which is nice.

> 2/ Bite the bullet and make Session Restore updates piecewise. This is
> possible, I had an early prototype a long time ago, but this is lots of
> delicate work and I am not planning to work on this again. Also, as Tim
> mentions this still leaves us with the problem that we may not be able to
> write such large JSON to disk in the first place.

Also, just one piece (a huge DOMSessionStorage entry) might be enough to go OOM. I will certainly not have the time to implement or even mentor large changes like this.

> 3/ Bite the other bullet and move Session Restore to sqlite. There is a
> chance that this might have a negative impact on startup time, but this
> should improve jank and get us rid of these insane issues into which we run
> with JSON.

Same resource/time problem as with (2). Also, I'm not convinced that this helps here? We go OOM just collecting data, not even trying to write it to disk. Sure, we might go OOM partially because we hold that big JSON string in memory, but we're talking about a few MBs (usually not more than 10). Anything could be pushing us over that boundary easily at that point.

I think there might be other data formats to consider, but we should be confident they solve the problem at hand before considering any of those. SQLite will probably bring a whole class of different problems too.
Flags: needinfo?(ttaubert)
Well, 3/ will definitely take resources and open a different can of unpredictable worms. What makes you say that we can crash collecting data, though? What did I miss? Is this an OOM that crashes when we send data from the content not-process-because-we're-not-in-e10s to the parent?
Flags: needinfo?(ttaubert)
Comment 79•9 years ago
Yeah, sendAsyncMessage() crashes the child process, or everything if we don't actually have multiple processes. The nsSameProcessAsyncMessageBase() constructor aborts on OOM. http://hg.mozilla.org/releases/mozilla-beta/annotate/664b98d33a1b/dom/base/nsFrameMessageManager.cpp#l2099
Flags: needinfo?(ttaubert)
Ah, in that case, yeah, we need both 1/ and a brand-new 4/: break up data from `SessionStore:update`, as you mentioned above. I wonder which part of the data is so large that it cannot be sent. Session Storage, perhaps?
Comment 81•9 years ago
Session storage should use less memory, but I think we can also optimize memory usage in message sending. StructuredCloneData could use nsStringBuffer to store the data so that the data could be shared, not copied.
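The sharing idea suggested here can be sketched with `std::shared_ptr` standing in for a refcounted buffer like nsStringBuffer (illustrative only; `CloneBuffer`, `QueuedMessage`, and `QueueSameProcessMessage` are hypothetical names): queuing a same-process message then adds a reference instead of making a second multi-megabyte copy.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Sketch: the serialized clone lives in one reference-counted, immutable
// buffer; the queued message shares it rather than copying it.
using CloneBuffer = std::shared_ptr<const std::vector<uint8_t>>;

struct QueuedMessage {
  CloneBuffer data;  // shared, not copied
};

QueuedMessage QueueSameProcessMessage(const CloneBuffer& aSerialized) {
  return QueuedMessage{aSerialized};  // O(1): bumps the refcount only
}
```

Making the shared buffer `const` is what keeps this safe: neither side can mutate data the other still observes.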
I am a bit nervous that if the message is large enough to cause an OOM – any kind of OOM – then we have a structural weakness. As mentioned by ttaubert, if the message is large enough to cause an OOM in the child frame/process, it is almost certainly large enough to cause an OOM in the parent a bit later when we attempt to write it to disk.

So, if we want to improve the efficiency of message-passing (either by improving StructuredCloneData or by cutting the message into chunks), we also need to rewrite how we write data to disk; otherwise, we're most likely just shifting the OOM crash around. This pretty much means sqlite, I believe.

For the time being, we indeed need to throw from sendAsyncMessage and handle this semi-correctly, but this will turn this crash into a dataloss, which is not really satisfying.
Reporter
Comment 83•9 years ago
¡Hola! Just another data point in case it is useful... ¡Gracias!

bp-334794a8-76ba-4397-957a-d7ad92150930 30/09/2015 06:06 p.m.

Crashing Thread
Frame / Module / Signature / Source
0 xul.dll NS_ABORT_OOM(unsigned int) xpcom/base/nsDebugImpl.cpp
1 xul.dll nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*) dom/base/nsFrameMessageManager.cpp
2 mozglue.dll moz_xmalloc memory/mozalloc/mozalloc.cpp
3 xul.dll nsFrameMessageManager::DispatchAsyncMessageInternal(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*) dom/base/nsFrameMessageManager.cpp
4 xul.dll nsFrameMessageManager::DispatchAsyncMessage(nsAString_internal const&, JS::Value const&, JS::Value const&, nsIPrincipal*, JSContext*, unsigned char) dom/base/nsFrameMessageManager.cpp
5 xul.dll js::SetObjectElement(JSContext*, JS::Handle<JSObject*>, JS::Handle<JS::Value>, JS::Handle<JS::Value>, bool, JS::Handle<JSScript*>, unsigned char*) js/src/vm/Interpreter.cpp
6 xul.dll PRMJ_Now() js/src/vm/Time.cpp
7 xul.dll js::jit::CanEnter(JSContext*, js::RunState&) js/src/jit/Ion.cpp
8 xul.dll js::RunScript(JSContext*, js::RunState&) js/src/vm/Interpreter.cpp
9 xul.dll js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) js/src/vm/Interpreter.cpp
10 xul.dll js::fun_apply(JSContext*, unsigned int, JS::Value*) js/src/jsfun.cpp
11 xul.dll js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) js/src/vm/Interpreter.cpp
12 xul.dll js::DirectProxyHandler::call(JSContext*, JS::Handle<JSObject*>, JS::CallArgs const&) js/src/proxy/DirectProxyHandler.cpp
13 xul.dll js::CrossCompartmentWrapper::call(JSContext*, JS::Handle<JSObject*>, JS::CallArgs const&) js/src/proxy/CrossCompartmentWrapper.cpp
14 xul.dll js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) js/src/vm/Interpreter.cpp
15 xul.dll js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value const*, JS::MutableHandle<JS::Value>) js/src/vm/Interpreter.cpp
16 xul.dll js::jit::DoCallFallback js/src/jit/BaselineIC.cpp
17 @0x14b94300
18 @0x30a74337
19 @0x14b90967
20 xul.dll EnterBaseline js/src/jit/BaselineJIT.cpp
21 xul.dll js::jit::EnterBaselineMethod(JSContext*, js::RunState&) js/src/jit/BaselineJIT.cpp
22 xul.dll js::RunScript(JSContext*, js::RunState&) js/src/vm/Interpreter.cpp
23 xul.dll js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) js/src/vm/Interpreter.cpp
24 xul.dll JS_CallFunctionValue(JSContext*, JS::Handle<JSObject*>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>) js/src/jsapi.cpp
25 xul.dll nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS*, unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) js/xpconnect/src/XPCWrappedJSClass.cpp
26 xul.dll nsXPCWrappedJS::CallMethod(unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) js/xpconnect/src/XPCWrappedJS.cpp
27 xul.dll PrepareAndDispatch xpcom/reflect/xptcall/md/win32/xptcstubs.cpp
28 xul.dll SharedStub xpcom/reflect/xptcall/md/win32/xptcstubs.cpp
29 nss3.dll PR_Assert nsprpub/pr/src/io/prlog.c
30 nss3.dll md_UnlockAndPostNotifies nsprpub/pr/src/md/windows/w95cv.c
31 user32.dll GetInputBits
As a stopgap measure, I'll try and make sending throw (and add Telemetry) and we'll see how this goes.
Depends on: 1212244
Updated•9 years ago
Crash Signature updated:
[@ OOM | large | NS_ABORT_OOM | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase ]
→ [@ OOM | large | NS_ABORT_OOM | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase ]
[@ mozalloc_abort | NS_DebugBreak | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase]
Reporter
Comment 85•9 years ago
Hello! Just another data point in case it is useful... Thanks!

bp-9a3bff39-5edd-47ca-9af8-e0e892151012  12/10/2015 03:09 p.m.

Crashing Thread
Frame  Module  Signature  Source
0  xul.dll  NS_ABORT_OOM(unsigned int)  xpcom/base/nsDebugImpl.cpp
1  xul.dll  nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*)  dom/base/nsFrameMessageManager.cpp
2  mozglue.dll  moz_xmalloc  memory/mozalloc/mozalloc.cpp
3  xul.dll  nsFrameMessageManager::DispatchAsyncMessageInternal(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*)  dom/base/nsFrameMessageManager.cpp
4  xul.dll  nsFrameMessageManager::DispatchAsyncMessage(nsAString_internal const&, JS::Value const&, JS::Value const&, nsIPrincipal*, JSContext*, unsigned char)  dom/base/nsFrameMessageManager.cpp
5  xul.dll  mozilla::jsipc::IdToObjectMap::trace(JSTracer*)  js/ipc/JavaScriptShared.cpp
6  xul.dll  js::InitPropertyOperation(JSContext*, JSOp, JS::Handle<JSObject*>, JS::Handle<jsid>, JS::Handle<JS::Value>)  js/src/vm/Interpreter-inl.h
Comment 86•9 years ago
Looks like this bug won't be fixed during the 42 beta cycle. However, we still have time for 43 and we would be happy to take a patch during this cycle.
Comment 87•9 years ago
For 43 we could try either bug 1212244 or bug 1213437. Both probably just move the OOM somewhere else, though, or make the browser behave inconsistently (when sending large messages) if we're close to OOM. Bug 1213437 seems to depend on a change that is only in 44, so bug 1212244 might be easier and safer to port to 43.
So, bug 1212244 should be safe wrt OOM but will cause Session Restore data loss instead. On the other hand, bug 1213437 (which is generally a Good Thing) will probably only move the OOM in this case, until we have bug 1214346... which will also turn the OOM into Session Restore data loss. I will try and work on bug 1214346 asap, but in the meantime, I'm not sure we should uplift anything. Executive decision needed.
Depends on: 1216250
Actually, the real solution is probably bug 1216250.
Comment 90•9 years ago
Work is ongoing in bug 1216250 for now; wontfixing this for 43.
Reporter
Comment 91•9 years ago
Hello! Is this useful?

bp-95ce8741-a64e-4057-ab99-bf9ee2151204  04/12/2015 12:35 p.m.

Crashing Thread (0)
Frame  Module  Signature  Source
0  xul.dll  NS_ABORT_OOM(unsigned int)  xpcom/base/nsDebugImpl.cpp
1  xul.dll  nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::ipc::StructuredCloneData&, JS::Handle<JSObject*>, nsIPrincipal*)  dom/base/nsFrameMessageManager.cpp
2  mozglue.dll  moz_xmalloc  memory/mozalloc/mozalloc.cpp
3  xul.dll  nsFrameMessageManager::DispatchAsyncMessageInternal(JSContext*, nsAString_internal const&, mozilla::dom::ipc::StructuredCloneData&, JS::Handle<JSObject*>, nsIPrincipal*)  dom/base/nsFrameMessageManager.cpp
4  xul.dll  xul.dll@0x1b83107

Thanks!
Comment 92•9 years ago
(In reply to alex_mayorga from comment #91)
> Hello!
>
> Is this useful?
>
> bp-95ce8741-a64e-4057-ab99-bf9ee2151204
> 04/12/2015 12:35 p.m.

This is with Firefox Beta 43, which will not be getting any further fixes. It'd be useful to see if you can replicate this in Firefox 44, since that's where bug 1216250 landed.
Comment 93•9 years ago
|
||
Looks like it is gone in 44 thanks to bug 1216250! Congrats, David! Thank you very much!