Closed Bug 530955 Opened 15 years ago Closed 14 years ago

New crash [@ ExecuteTree] in Firefox 3.6b3

Categories

(Core :: JavaScript Engine, defect)

1.9.2 Branch
defect
Not set
critical

Tracking

RESOLVED FIXED
Tracking Status
blocking1.9.2 --- .5+
status1.9.2 --- .5-fixed
status1.9.1 --- unaffected

People

(Reporter: jst, Assigned: dvander)

Details

(4 keywords, Whiteboard: [sg:critical][critsmash:investigating])

Attachments

(4 files)

There's a new crash in Firefox 3.6b3 with the signature "ExecuteTree" that hasn't been seen in any of the 3.5.* versions. So far we've seen 78+ of these crashes in the wild.

Please see http://crash-stats.mozilla.com/query/query?product=Firefox&version=Firefox%3A3.6b3&range_value=1&range_unit=weeks&query_search=signature&query_type=exact&query=ExecuteTree&do_query=1 for more crash info.
Flags: blocking1.9.2?
21 total crashes for ExecuteTree on 20091122-crashdata.csv
3 start up crashes inside 3 minutes

signature list
  18 ExecuteTree
   3 js_ExecuteTree

os breakdown
   6 ExecuteTree Windows NT 5.1.2600 Service Pack 2
   5 ExecuteTree Windows NT 5.1.2600 Service Pack 3
   4 ExecuteTree Windows NT 6.1.7600
   3 js_ExecuteTree Windows NT 5.1.2600 Service Pack 2
   1 ExecuteTree Windows NT 6.0.6002 Service Pack 2
   1 ExecuteTree Windows NT 6.0.6001 Service Pack 1
   1 ExecuteTree Windows NT 5.1.2600 Service Pack 2, v.2055

distribution of all versions where the ExecuteTree crash was found on 20091122-crashdata.csv
  14 Firefox 3.6b3
   3 Firefox 3.6b1
   3 Firefox 3.5.5
   1 Firefox 3.6b2
I think this first shows up in 3.6 data on 10/16, then ramps up just after beta 1 on Nov 5.
In the lists in comment 2, js_ExecuteTree is probably a different bug that is low volume on both 3.5.x and 3.6.

http://crash-stats.mozilla.com/report/list?product=Firefox&query_search=signature&query_type=exact&query=js_ExecuteTree&date=&range_value=1&range_unit=weeks&do_query=1&signature=js_ExecuteTree

That should probably be tracked separately.

The volumes shown in comment 2 for the ExecuteTree signature alone are higher, and they are increasing as the 3.6 betas get more users.
It looks like the higher volume crash is in the called tree itself, so we'll probably need minidump inspection here to have any joy.  The lower-volume 3.5+3.6 one looks to be crashing trying to call LeaveTree, can't tell much more without minidumpery.
Flags: blocking1.9.2? → blocking1.9.2+
Assignee: general → gal
Looks like I won the lottery. I will start looking at the stacks and take it from there.
The LeaveTree one might be an OOM condition. Focusing on the higher-volume on-trace crash. A minidump won't help much because we can't see the jitted code from the minidump. We should look at the URLs.
Keywords: qawanted
I spent some time browsing the URLs jst pulled for me. No luck. The list is very heavy on Facebook, but that might just be representative of average web use. None of the URLs crash for me. At this point my gut is telling me this is a GC issue (we GC and something isn't kept alive, causing us to die on trace). jst has 2 core files for Linux in the same general area as this bug. I will look at those next.
Ok, after some digging around I have convinced myself that this bug was fixed by the following patch:

https://bugzilla.mozilla.org/show_bug.cgi?id=528048

The patch makes sure sprops stay alive after a GC if we embed them on trace. The patch landed on m-c a week ago but is not on 1.9.2 yet.

There isn't enough crash data for trunk to tell whether the patch fixed anything there. I only see 1 crash with ExecuteTree for the last 4 weeks on trunk.
This crash shows as #40 in 3.6 B4. Did the patch in Bug 528048 make it into the beta (from the checkin date it looks as if it might have)? If it is still an issue and someone can give me some URLs I will see if I can reproduce.
marcia, jst pulled urls for me but this is a GC bug so I was never able to reproduce it. I was hoping we could confirm based on crash stats that the patch fixed the problem.
From http://hg.mozilla.org/releases/mozilla-1.9.2/pushloghtml it looks like the patch in bug 528048 went into the 1.9.2 branch well after b4 was tagged.  So I think we have to wait for b5 (or RC) crash data.
Keywords: qawanted
Whiteboard: [waiting for 3.6b5 crash data]
I will look at crash stats for beta 5 and renominate for blocking if needed.
Flags: blocking1.9.2+
adding dependency so we don't lose track of rechecking this.  sounds like beta 5 might be going out today.
Depends on: 528048
[@ ExecuteTree] went from #83 in b4 to #10 in b5 :(
http://crash-stats.mozilla.com/topcrasher/byversion/Firefox/3.6b4
http://crash-stats.mozilla.com/topcrasher/byversion/Firefox/3.6b5
Flags: blocking1.9.2?
Whiteboard: [waiting for 3.6b5 crash data]
yeah, something caused a big spike yesterday.  one thing that happened was 200,000 new users moved up from beta4 to beta5 or joined the beta as new users.

https://wiki.mozilla.org/CrashKill/Crashr#Relases_3.6b5
Flags: blocking1.9.2? → blocking1.9.2+
are these crashes all the latest beta? iow, have we eliminated older buildids?
some of these are on 3.5.x but at a much higher rate on 3.6

checking --- 20091220-crashdata.csv ExecuteTree
release  total-crashes  ExecuteTree-crashes  ratio (ExecuteTree/total)
all      215848         112                  0.000518884
3.0.15   6811           0                    0
3.0.16   31577          0                    0
3.5.5    18251          2                    0.000109583
3.5.6    105854         3                    2.83409e-05
3.6b5    16669          94                   0.00563921
3.6b4    5000           10                   0.002
3.6b3    644            0                    0
3.6b2    661            2                    0.00302572
3.6b1    2101           0                    0
before this recent spike the distribution was still tilted toward 3.6bx

checking --- 20091215-crashdata.csv ExecuteTree
release  total-crashes  ExecuteTree-crashes  ratio (ExecuteTree/total)
all      231799         47                   0.000202762
3.0.15   41098          0                    0
3.0.16   414            0                    0
3.5.5    127012         7                    5.51129e-05
3.5.6    3080           0                    0
3.6b5    209            0                    0
3.6b4    22729          38                   0.00167187
3.6b3    677            1                    0.0014771
3.6b2    793            0                    0
3.6b1    2265           1                    0.000441501
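
For readers checking the numbers: the last column in the two tables above is the ExecuteTree crash count divided by that release's total crash count, so it is a fraction rather than a percentage. A minimal Python sketch that recomputes it from the 20091220 rows (the dictionary layout and the function name are illustrative, not part of any crash-stats tooling):

```python
# Counts copied from the 20091220 table above:
# release -> (total crashes, ExecuteTree crashes).
counts_20091220 = {
    "all":   (215848, 112),
    "3.5.5": (18251, 2),
    "3.5.6": (105854, 3),
    "3.6b5": (16669, 94),
    "3.6b4": (5000, 10),
    "3.6b2": (661, 2),
}


def execute_tree_fraction(total, execute_tree):
    """Share of a release's crash reports that carry the ExecuteTree signature."""
    return execute_tree / total if total else 0.0


fractions = {
    release: execute_tree_fraction(total, hits)
    for release, (total, hits) in counts_20091220.items()
}

# Print releases from the highest fraction to the lowest, also as a percentage.
for release, frac in sorted(fractions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{release:6s} {frac:.6g}  ({frac * 100:.4f}%)")
```

On that day's data the 3.6b5 fraction is roughly 200 times the 3.5.6 fraction, which is the tilt toward 3.6 described above.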
This is almost certainly multiple bugs; ExecuteTree is tracemonkeyese for "running generated code".  All the ILLEGAL_INSTRUCTION (and PRIV_INSTRUCTION -- exciting! are we running garbage?) ones are on AMDs, so there may be some instruction selection issues at hand.
I looked through a bunch of these and found many different stacks, with a large percentage of crashes happening on Hungarian operating systems. Even within that subset I found different stacks. We're going to have to minus this and look for more data.
Flags: wanted1.9.2+
Flags: blocking1.9.2-
Flags: blocking1.9.2+
(In reply to comment #20)
> All the ILLEGAL_INSTRUCTION (and PRIV_INSTRUCTION -- exciting! are we 
> running garbage?) ones are on AMDs

Most, but not all; e.g. https://crash-stats.mozilla.com/report/index/0817bbaa-f4bc-48d0-8c52-e438f2091219

Anyway, I agree this is probably multiple bugs. Unfortunately, this crash happens too rarely to be found in nightlies (although I may search harder to try to find it), so we cannot try to guess at a patch that may have started it.

choffman: is there any way we can get these crash numbers compared to ADUs? I'm curious if the increase we see around b3 or so is simply due to more users, or if there is evidence that a patch/patches introduced more of these at that time.

Sadly, minidumps can't help us at this time because the code at the crash point is generated, and is not on the stack. We need new ideas in order to get anywhere on this. Here are two:

1. dvander suggested stashing the script filename before we call the trace so that we can recover it from the crashreport. For example, we could copy it to a char[] buffer in ExecuteTree, and then it would be in the minidump. We could see if the same script keeps showing up, or if we're really lucky find a test case.

2. Teach breakpad to send back the generated trace code. One idea is to create a breakpad API to register a memory range of interest. We call that just before calling a trace with the range of that trace. If we crash, then breakpad includes that memory in the minidump. After returning from the trace, we unregister that range. The problem with this idea is that traces can call other traces, so we can't easily and compactly represent the memory range of interest. 

A simpler idea that would work is to add the page that contains EIP to the minidump. We could then refine that with tracer knowledge later.
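
To make the "page that contains EIP" idea concrete: the region to capture is the crash address rounded down to a page boundary, plus one page. A rough sketch of that arithmetic, assuming a 4 KiB page size (typical on x86, but an assumption here); the function name is hypothetical and this is not a Breakpad API:

```python
PAGE_SIZE = 4096  # assumed page size; must be a power of two


def page_range_for(eip, page_size=PAGE_SIZE):
    """Return the [start, end) address range of the page that contains eip."""
    start = eip & ~(page_size - 1)  # clear the low bits to round down
    return start, start + page_size


# Illustrative address inside generated code.
start, end = page_range_for(0x006DF418)
print(f"capture {start:#010x}..{end:#010x}")
```

A single page can miss part of a trace that straddles a page boundary, which is presumably where the later refinement with tracer knowledge would come in.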
(In reply to comment #22)
> A simpler idea that would work is to add the page that contains EIP to the
> minidump. We could then refine that with tracer knowledge later.

file pls!
Depends on: 536271
(In reply to comment #22)
> 
> choffman: is there any way we can get these crash numbers compared to ADUs? I'm
> curious if the increase we see around b3 or so is simply due to more users, or
> if there is evidence that a patch/patches introduced more of these at that
> time.
> 

bugs are on file to get adu data merged into the crash database so we can do things like that more easily.   until then I'm grabbing snaps from the two sources and pasting them together at https://wiki.mozilla.org/CrashKill/Crashr

                    crash-count  3.6b3 adus  3.6b4 adus

20091118-crashdata  21           18435
20091119-crashdata  26           142847
20091120-crashdata  30           207349
20091121-crashdata  20           217975
20091122-crashdata  21           243541
20091123-crashdata  33           294307
20091124-crashdata  24           321004
20091125-crashdata  30           319230
20091126-crashdata  50           313303      11003
20091127-crashdata  54           227788      101832
20091128-crashdata  70           111492      208895
20091129-crashdata  63           80372       262879
20091130-crashdata  77           79695       318380
20091201-crashdata  87           58951       354012
20091202-crashdata  43           47254       377984
20091203-crashdata  42           40100       394451
20091204-crashdata  46           34703       399269
20091205-crashdata  42           29512       375329
20091206-crashdata  45           26259       390387
20091207-crashdata  34           28124       447912
20091208-crashdata  54           26173       460269
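
One way to answer the "more users or more bugs?" question from the table above is to divide each day's crash count by the ADUs. The sketch below does that for three rows. It assumes the crash-count column covers the two betas whose ADUs are listed (the table does not say) and treats a missing 3.6b4 entry as 0; the function name is illustrative.

```python
# (crash-count, 3.6b3 ADUs, 3.6b4 ADUs) copied from three rows of the table above.
rows = {
    "20091122": (21, 243541, 0),
    "20091201": (87, 58951, 354012),
    "20091208": (54, 26173, 460269),
}


def crashes_per_100k_adus(crashes, *adus):
    """Crash count normalized by the combined active daily users."""
    total_adus = sum(adus)
    return crashes / total_adus * 100_000 if total_adus else float("nan")


rates = {
    day: crashes_per_100k_adus(crashes, b3, b4)
    for day, (crashes, b3, b4) in rows.items()
}

for day, rate in rates.items():
    print(f"{day}: {rate:.2f} crashes per 100k ADUs")
```

Under those assumptions the normalized rate for these three rows rises between 11/22 and 12/01 and drops again by 12/08, so user growth alone does not obviously explain the raw counts. The full table would need the same treatment before drawing conclusions.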
(In reply to comment #22)
> 1. dvander suggested stashing the script filename before we call the trace so
> that we can recover it from the crashreport. For example, we could copy it to a
> char[] buffer in ExecuteTree, and then it would be in the minidump. We could
> see if the same script keeps showing up, or if we're really lucky find a test
> case.

Not sure you need to copy the whole string to a stack buffer -- we know how perf-sensitive ExecuteTree/LeaveTree are -- but here's a fun fact: script filenames are GC'ed and shared aggressively, see js_SaveScriptFilename. The char buffer used is an extension of a JSHashEntry, so in the heap, but perhaps you could use a more concise id to track filename from the stack.

> A simpler idea that would work is to add the page that contains EIP to the
> minidump. We could then refine that with tracer knowledge later.

+1000.

/be
I'm in contact with a user who reported a reproducible crash
that looks like this bug. Here's his crash data:
http://crash-stats.mozilla.com/report/index/93852b95-98f9-47a4-9e5b-0b69b2100222

He's very cooperative and has created a test account for us on their server.
I can reproduce the crash (only on Windows though), my crashes:
bp-40101cbf-e63b-4bd6-9b48-6d6392100324	2010-03-25	01:29
bp-c4a8bc74-d38a-40d5-8e1d-109fc2100324	2010-03-25	01:28
bp-87e16b1d-961f-4524-8154-7d01b2100324	2010-03-25	01:28
bp-14631208-05bc-4db0-9160-f23ab2100324	2010-03-24	21:42
bp-aadbeecc-ef7d-4e2a-9d50-adfe42100324	2010-03-24	21:39
bp-b6991f9b-2792-4d21-9749-65ae82100324	2010-03-24	21:35

I can't reproduce it on trunk.  Nor on MacOSX or Linux, with any version.
The user says this crash started with Firefox 3.6, it never occurred with
3.5.x.
User says the crash also occurs on Mac OS X 10.5 and 10.6 with Firefox 3.6.
I tried the STR with no luck (3.6.2, macosx, product build). I will try again with a debug build.
Andreas, can you reproduce the crash?
Still 100% reproducible for me.  Namoroka 3.6.5pre 20100415 on Windows XP.
Keywords: crash
Whiteboard: [sg:critical]
It's #53 in the Firefox 3.6.3 top crash list, with 5118 crashes (past 2 weeks).
http://crash-stats.mozilla.com/topcrasher/byversion/Firefox/3.6.3
blocking1.9.2: --- → ?
Keywords: topcrash
Andreas, what should we do here now that Mats can reproduce?  Would a corefile from Mats help?
I only tried mac. Let me go upstairs and find a windows box and try again there. If that fails too, we should figure out core files.
Mats: can you capture it in a VM and get a snapshot to Andreas?  Alternatively, we could try copilot to your machine, so Andreas or someone else can debug it live.
Attaching Visual Studio or WinDbg and using the save memory feature might work, too.
dvander, I will stop by. We should try this out on windows before resorting to bigger guns.
reproduced
Group: core-security
I have narrowed this down to the assembly generated on line 11240 of jstracer.cpp in the 1.9.2 branch. This code is supposed to index into a typemap vector, but the base address is garbage. I will know more soon.
Okay, I think I see what's going on here. The bogus address is 0x1E, stored in EBX.

>  mov ebx, [ebx + 0xC]
>  add ebx, 0x1E

This line is grabbing a FrameInfo* from the RP stack and adding |sizeof(FrameInfo) + 2|. 0xC/4 is the distance between the trace entry frame and the frame that owns the argsobj. That's 3.

So why is rp[3] NULL? This is an optimized build i.e. no trace spew, so reading the nearest guard jump:

>  006af307  jne 006df3f4
>  ... ... guard code
>  006df418  mov eax, 0x66DE698

Examining this address as a GuardRecord, and then recovering the VMSideExit, reveals the callDepth is 3.

RP uses 0-based indexes, so this is an off-by-one bug - rp[3] would be valid if |callDepth >= 4|. Test case and patch coming.
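
The arithmetic in this analysis can be replayed with a small model. This is only a sketch of the reasoning above, not code from jstracer.cpp: the pointer values in the RP stack are made up, a 32-bit build with 4-byte slots is assumed, and the function names are illustrative.

```python
WORD_SIZE = 4        # assumed 32-bit x86: each RP stack slot holds one pointer
DISPLACEMENT = 0xC   # from `mov ebx, [ebx + 0xC]`
ADDEND = 0x1E        # from `add ebx, 0x1E`, described as sizeof(FrameInfo) + 2
NULL = 0

slot = DISPLACEMENT // WORD_SIZE      # which RP slot the load reads
implied_frameinfo_size = ADDEND - 2   # sizeof(FrameInfo) implied by that addend


def typemap_base(rp, index):
    """Mimic the two instructions: load rp[index], then add the fixed offset."""
    return rp[index] + ADDEND


def slot_is_valid(index, call_depth):
    """RP slots are 0-based, so call_depth frames occupy 0..call_depth-1."""
    return 0 <= index < call_depth


# Hypothetical RP stack at callDepth == 3: three live FrameInfo pointers,
# followed by an unused slot that happens to hold NULL.
rp = [0x0A000000, 0x0A000100, 0x0A000200, NULL]
call_depth = 3

print(f"slot read: rp[{slot}], valid at callDepth {call_depth}: "
      f"{slot_is_valid(slot, call_depth)}")
print(f"resulting typemap base: {typemap_base(rp, slot):#x}")
```

With a NULL in the slot just past the last live frame, the computed base is exactly the 0x1E seen in EBX, and the same read would be in range once callDepth is 4 or more.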
Attached file test case
This bug does not exist on trunk; it happened to be fixed along with bug 495331. The test case does not crash (poisoning memory would do the trick), but you can see the problem because the type guard fails too often:

monitor: exits(16), timeouts(0), type mismatch(0), triggered(16), global mismatch(0), flushed(0)
Assignee: gal → dvander
Status: NEW → ASSIGNED
Attached patch fix
monitor: exits(2), timeouts(0), type mismatch(0), triggered(2), global mismatch(0), flushed(0)
Attachment #439423 - Flags: review?(dmandelin)
Attachment #439423 - Flags: review?(dmandelin) → review+
blocking1.9.2: ? → needed
dvander: Can you please check out Bug 561813? On Mac I get this crash running the trunk and the URL in that bug. See the last bug comment for the link to my crash report.
Whiteboard: [sg:critical] → [sg:critical][critsmash:investigating]
dvander, any progress here?
(In reply to comment #46)
> dvander, any progress here?

This is waiting on approval. I don't know when that happens.
Comment on attachment 439423
fix

a=beltzner for 1.9.2 default only
Attachment #439423 - Flags: approval1.9.2.5? → approval1.9.2.5+
http://hg.mozilla.org/releases/mozilla-1.9.2/rev/de8139bb4aa7
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
blocking1.9.2: needed → .5+
I have seen a few crashes showing up in crash stats with this stack (http://tinyurl.com/2e2sdqv links to the Mac crashes) - I can crash in this stack by loading https://home.eease.adp.com/recruit2/?id=510443&t=2 using Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.3a6pre) Gecko/20100621 Minefield/3.7a6pre. Should I reopen this bug or file a new one?
(In reply to comment #50)
> I have seen a few crashes showing up in crash stats with this stack
> (http://tinyurl.com/2e2sdqv links to the Mac crashes) - I can crash in this
> stack by loading https://home.eease.adp.com/recruit2/?id=510443&t=2 using
> Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.3a6pre)
> Gecko/20100621 Minefield/3.7a6pre. Should I reopen this bug or file a new one?

Since this bug is already patched, let's do a new one.
Bug 573558 is the new bug on file for the crash noted in Comment 50.
(In reply to comment #44)
> crash on load 1.9.2 winxp
> http://www.roadsafetraffic.com/locations.htm
> bp-02c5e6df-1749-4be3-86e2-520822100423
> 
> http://www.srssa.com/contact/
> bp-d7bd2b01-0642-4359-b380-daa142100423

I used these to verify the fix. Both of these still crash in 1.9.2.6 but don't crash in build 1 of 1.9.2.7 on Win XP: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.7) Gecko/20100701 Firefox/3.6.7 (.NET CLR 3.5.30729).
Keywords: verified1.9.2
Group: core-security
Crash Signature: [@ ExecuteTree]