Closed Bug 1265168 Opened 9 years ago Closed 4 years ago

Frequent (GC?) crashes since April 14 in developer channel

Categories

(Core :: JavaScript: GC, defect)

47 Branch
x86_64
Unspecified
defect
Not set
major

RESOLVED WORKSFORME

People

(Reporter: kael, Unassigned)

Details

Since an update installed on April 14, FF Dev Edition (x64) has been crashing multiple times per day. There's no consistent stack but it seems likely that it is GC related, since a couple of the stacks are during a GC operation and it tends to happen when I'm interacting with sites that trigger frequent GCs.

Here are the IDs from about:crashes. I seem to remember a couple others that aren't in the list here, maybe I didn't click OK on the report dialog.

bp-e18ee48e-da37-486e-ab45-fdac72160416 2016-04-16 09:21
bp-ed27517c-edaa-497a-964a-f7eaa2160416 2016-04-16 04:52
bp-671dcfbf-581a-4144-b76d-57f7a2160415 2016-04-15 12:37
bp-f41b7ceb-9c53-49ed-afec-40c832160415 2016-04-14 21:56
bp-d0f1f6b9-88e3-4b44-bd98-db0642160415 2016-04-14 21:03

The most frequent site I see this on is Twitter. It usually happens when I'm scrolling, and I assume some code they run to reflow/load new tweets is setting this off.
Thanks for filing this.

(In reply to K. Gadd (:kael) from comment #0)
> Since an update installed on April 14, FF Dev Edition (x64) has been
> crashing multiple times per day.

Do you update regularly? Just trying to figure out the start of the regression range.

Your first crash was with build id 20160413004016. As a start, someone should just check what landed before that.
(In reply to Jan de Mooij [:jandem] from comment #1)
> Thanks for filing this.
>
> (In reply to K. Gadd (:kael) from comment #0)
> > Since an update installed on April 14, FF Dev Edition (x64) has been
> > crashing multiple times per day.
>
> Do you update regularly? Just trying to figure out the start of the
> regression range.
>
> Your first crash was with build id 20160413004016. As a start, someone
> should just check what landed before that.

I usually update within a day of seeing the message, and I almost always have FF open. I don't know whether that influences how long it takes for me to see the update notification.
Amusingly I just crashed while scrolling this bug page, immediately after adding the previous comment. https://crash-stats.mozilla.com/report/index/c70867f8-cb63-48fc-9ae7-ceeaf2160418
That machine doesn't have bad RAM? I glanced at the push log but nothing stands out.
Flags: needinfo?(terrence)
Those stacks don't really have much in common other than dereferencing a bunch of memory. Given that something like 8% of RAM sticks start experiencing memory errors within a year of installation[1], I think it's certainly worth letting memtest86 loose on it overnight to check. 1- http://www.zdnet.com/article/dram-error-rates-nightmare-on-dimm-street/
Flags: needinfo?(terrence)
I left a Windows Memory Diagnostic running for an entire day and it didn't find any errors, and it's new RAM (maybe a month old?). Hrm. Wonder if something else could be the problem... Maybe memtest86 will find something.
(In reply to K. Gadd (:kael) from comment #6)
> I left a Windows Memory Diagnostic running for an entire day and it didn't
> find any errors, and it's new RAM (maybe a month old?). Hrm. Wonder if
> something else could be the problem... Maybe memtest86 will find something.

Maybe also try mprime? I've had it detect some soft CPU faults once when the cooling was marginal.
It handles mprime just fine, though Haswell-E chips definitely run hotter on average. I'm going to try downgrading to an early-April build to see if the problem goes away. If it doesn't, it's definitely my machine :-) Could be GPU, maybe?
Thank you for taking the time to bisect! In theory, a GPU shouldn't be corrupting main memory, but who knows? Something certainly is though. The crashes you posted definitely indicate that there are corrupt edges stored in the CC, in the GC, and in the C++ heaps. Whatever it is, it's certainly not a run-of-the-mill missing barrier! Generally when those bugs sneak in (and they certainly do!), we see a massive crash spike everywhere. We didn't really see one then, which is why I thought hardware would be likely. If the hardware tests out (and it sounds like it's going to), then something unique to your workflow must be tripping over a latent bug.
I've been running an Aurora build from April 3 for three days now, and haven't had a single crash (with the exception of a single instance of https://bugzilla.mozilla.org/show_bug.cgi?id=1259699, which appears to have been fixed a couple of days after this build was created). Are there any ways I could try to narrow down whether this could be an obscure/latent bug, like an ASan build of Aurora?
So I ran the April 3 build until yesterday and didn't see this crash once. Then I updated to 2016-04-30 and I've already seen three crashes. :( What can I do to narrow this down?
(In reply to K. Gadd (:kael) from comment #11)
> Then I updated to 2016-04-30 and I've already seen three crashes. :(

That should now be 48.0a2. Was that also on Twitter? Do you have some crash reports?

If the bug was introduced between April 2-14, it should be in this range:

https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=60828c17117a&tochange=0f441874c8dd

Nothing really stands out...

> What can I do to narrow this down?

You could
(a) try to disable extensions (especially the ones that don't have a lot of users or do interesting things) or
(b) try a build after April 3, maybe the 2016-04-10 build or
(c) use a debug build for a while (we may hit an assertion failure somewhere) but that will probably be too slow...
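Option (b) amounts to bisecting by build date. A minimal sketch of picking the next build to try, assuming one build per calendar day and treating the April 3 build as last known good and the build from 20160413004016 as first known bad. The function name is hypothetical and this is not an existing Mozilla tool:

```python
from datetime import date, timedelta
from typing import Optional


def next_build_to_try(last_good: date, first_bad: date) -> Optional[date]:
    """Return the build date halfway between a good and a bad build.

    Returns None once the two dates are adjacent, meaning the
    regression landed in the first_bad build.
    """
    gap = (first_bad - last_good).days
    if gap <= 1:
        return None
    return last_good + timedelta(days=gap // 2)


# Dates taken from this bug: April 3 ran clean, April 13 build crashed.
print(next_build_to_try(date(2016, 4, 3), date(2016, 4, 13)))
```

Because these crashes are intermittent, a build would need to run for a few days before it can be called good, so each bisection step is slow.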
I loaded the crash dumps for the crashes in comment 0 in Visual Studio and it's pretty weird. The crashes have different stacks but the crash reason seems to be similar.

For instance, one of them has us loading a byte from *r8 and r8 is 0x200002A380ECE098. That's a bogus pointer on x64 (because of the 48-bit address space). It doesn't seem completely bogus, because other registers contain very similar values like 0x000002A380EE1A10 (without the high byte set).

The other crashes are similar: a pointer with the high byte set to 1 or 2, but the rest of the bytes are similar to other register values.

(Note that Socorro shows the crash address as 0xffffffffffffffff. That's a known issue: the CPU (IIRC) does that when the value is outside the 48-bit address space. The registers hold the right values though.)
(In reply to Jan de Mooij [:jandem] from comment #13)
> The other crashes are similar: a pointer with the high byte set to 1 or 2

I meant 0x10 or 0x20 (or: the high nibble is 1 or 2).

The crash in comment 3 is another one of these: a JSObject pointer that's 0x1000000000000000. It was likely a nullptr that got corrupted somewhere.
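The pointer check described in comments 13 and 14 can be sketched in a few lines of Python. This is an illustration only, assuming the usual 48-bit x86-64 virtual address space, where bits 47 through 63 of a valid ("canonical") address must all be equal. The function names are hypothetical; the pointer values are the ones quoted above.

```python
CANONICAL_SHIFT = 47  # bits 47..63 must all match in a canonical address


def is_canonical(addr: int) -> bool:
    """True if addr is a canonical 48-bit x86-64 virtual address."""
    top = (addr & 0xFFFFFFFFFFFFFFFF) >> CANONICAL_SHIFT
    return top in (0, 0x1FFFF)  # all 17 high bits clear, or all set


def stray_high_bits(addr: int) -> list:
    """Bit positions in 47..63 that are set.

    For a user-space pointer these should all be clear, so any
    position listed here is a candidate flipped bit.
    """
    return [bit for bit in range(47, 64) if (addr >> bit) & 1]


for ptr in (0x200002A380ECE098, 0x000002A380EE1A10, 0x1000000000000000):
    print(hex(ptr), is_canonical(ptr), stray_high_bits(ptr))
```

Both bad values differ from a plausible pointer by exactly one set bit (bit 61 and bit 60 respectively), which is the pattern a single bit flip would produce.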
OK, it sounds like it is almost certainly bit flips. I have spare old RAM I can swap in, so I'll just do that to rule things out - it's weird that I could pass memtest and still have problems, but stranger things have happened :-)

All better?

Flags: needinfo?(kg)

Yep

Flags: needinfo?(kg)
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME