Closed Bug 379484 Opened 13 years ago Closed 12 years ago

move leak box off Firefox page to MozillaTest until it can stay green

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sayrer, Assigned: rhelmer)

Attachments

(4 files)

 
I should add: is it orange because shipping Firefox code is buggy? (if so, it should stay up on the front page)
We need leak numbers on the main tinderbox.  If this tinderbox is unacceptable, we need to close the tree until we have one that is.
That said, somebody should debug why it's actually orange.  Is it because multiple VMs are running on the same machine causing the time to run out?  Is it because Firefox is hanging?  If so, where is it hanging?  Somebody with access to the machine should investigate.

Some other possible solutions: bug 374822, bug 376874.
(In reply to comment #2)
> We need leak numbers on the main tinderbox.  If this tinderbox is unacceptable,
> we need to close the tree until we have one that is.
> 

Let's not play chicken. As of now, we don't consistently have leak numbers on the main tinderbox because the box we have now is orange half the time.
We have perfectly usable leak numbers -- it's just a long cycle time sometimes.
(In reply to comment #5)
> We have perfectly usable leak numbers -- it's just a long cycle time sometimes.
> 

We would still have perfectly usable leak numbers if the 50%-orange box reported to MozillaTest, and we wouldn't have an orange box on the front page at all times.
Assignee: nobody → rhelmer
Status: NEW → ASSIGNED
(In reply to comment #3)
> That said, somebody should debug why it's actually orange.  Is it because
> multiple VMs are running on the same machine causing the time to run out?  Is
> it because Firefox is hanging?  If so, where is it hanging?  Somebody with
> access to the machine should investigate.

I'm going to stop tinderbox when the current cycle is complete, and see if I can reproduce the problem.

> Some other possible solutions: bug 374822, bug 376874.
 
It's a single-CPU machine (it's a VM so more correctly it only has one CPU assigned). Does that rule out bug 376874?
Hm, there are old firefox-bin processes hanging around on this machine. Not doing anything in particular according to strace:

futex(0x9a1e734, FUTEX_WAIT, 2, NULL

Not much memory free, which could certainly be slowing things down.
mmm, borrow pskill.exe from robcee?
(In reply to comment #9)
> mmm, borrow pskill.exe from robcee?
> 

This is tinderbox client on Linux. It's supposed to do this on timeout:

http://mxr.mozilla.org/mozilla/source/tools/tinderbox/build-seamonkey-util.pl#1571

I'm going to run several test-only cycles and see if I can get a hang like this.
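
For readers unfamiliar with the kill-on-timeout behaviour referred to above, here is a minimal Python sketch of the general idea. It is not the tinderbox client's code (that is Perl, at the URL above); the function name `run_with_timeout` and its return shape are hypothetical. It assumes a POSIX system, and it kills the whole process group so that any children the test process forked are not left hanging around.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_secs):
    """Run cmd; if it outlives timeout_secs, kill its whole process group.

    Returns (timed_out, returncode). Hypothetical helper, POSIX only.
    """
    # start_new_session=True makes the child a session and process-group
    # leader, so its process-group id equals its pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        proc.wait(timeout=timeout_secs)
        return False, proc.returncode
    except subprocess.TimeoutExpired:
        # Kill the child and anything it forked into the same group.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the child so no zombie is left behind
        return True, proc.returncode
```

Calling `run_with_timeout(["sleep", "30"], 0.5)` should report a timeout with a negative return code (the signal number), whereas a command that exits promptly returns its normal exit status. A process that moves itself into a different process group would escape this kill, which is one way stale processes can still survive.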
(In reply to comment #10)
> (In reply to comment #9)
> > mmm, borrow pskill.exe from robcee?
> > 
> 
> This is tinderbox client on Linux. 

oh right, duh.
Hm. I am running the tests via tinderbox (so the profile creation/pref settings should all be ok), and every time so far bloatcycle.html never closes after completing all tests. I made sure that browser.dom.window.dump.enabled was set correctly in prefs.js. I'm going to put some debug statements in the test to see what's up.
Have now seen a couple runs where the above does not happen, but when running with --trace-malloc, firefox-bin invokes another copy of firefox-bin (made sure NO_EM_RESTART=1). This has happened every time I've tried, so far (reduced it down to just --trace-malloc).
Actually, it looks like after a while the child process went away and the test completed. Will check the log.
Attachment #263529 - Attachment mime type: text/x-log → text/plain
Looks like the problem in comment #12 happens about half the time. The firefox-bin being invoked twice looks like a different problem.

ajschult helped track that down to mozilla/security/nss/lib/freebl/unix_rand.c:1018 which is trying to fork and run netstat. I've got a stack trace of the child process, attaching.

I asked rhelmer for some locals from that stack:

frame 7
<rhelmer> p bp is $1 = (void **) 0xbff04668
<rhelmer> p *bp is $2 = (void *) 0x0

frame 6
>p depth
<rhelmer> $3 = 1
> p bp
<rhelmer> $4 = (void **) 0x8b8eec0
> p bpdown
<rhelmer> $5 = (void **) 0x7374656e
> p bpdown[0]
<rhelmer> Cannot access memory at address 0x7374656e

So as far as I can tell, we call |calltree| with a pointer to null.  Then the execution works as follows, I think:

First time through the loop:

1047 bpdown = (void**)(0)
1048 (*0xbff04668) = NULL
1049 Test false because RHS is 0.
1051 bpup = 0xbff04668
1052 bp = 0

Second time through the loop:

1047 bpdown = (void**)(*(void**)0)  (whatever that is!)
1048 (*0) = 0xbff04668              (not sure why this works)
1049 Test crashes because bpdown is some random pointer

So it seems like the "simple" thing to do would be to either have calltree bail out if its arg points to null or have backtrace() not call calltree with such an arg.

Now the question is why we have such an arg in the first place...  dbaron, any ideas?
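
To make the "bail out if the arg points to null" idea concrete, here is a small Python model of a guarded frame-pointer walk. This is a sketch only, not the code in nsTraceMalloc.c: memory is modelled as a dict from address to stored word, and `walk_frames`, `memory` and `max_depth` are hypothetical names. It assumes saved frame pointers should always move toward higher addresses, and it stops rather than dereferencing a null, unmapped, or backward-pointing value.

```python
def walk_frames(memory, bp, max_depth=64):
    """Follow saved frame pointers starting at bp.

    memory: dict mapping an address to the word stored there (a model of
            readable stack memory, not real memory).
    Returns the list of frame addresses visited, in walk order.
    """
    frames = []
    while bp and bp in memory and len(frames) < max_depth:
        frames.append(bp)
        bpdown = memory[bp]  # saved caller frame pointer
        # Guard: never read through a null, unmapped, or non-increasing
        # pointer; end the walk here instead.
        if not bpdown or bpdown not in memory or bpdown <= bp:
            break
        bp = bpdown
    return frames
```

With the values from the gdb session, `walk_frames({0xbff04668: 0}, 0xbff04668)` stops after one frame, and a frame whose saved pointer is 0x7374656e (not mapped in the model) also stops cleanly instead of faulting. As an aside, the bytes of 0x7374656e in little-endian order are the ASCII string "nets"; whether that is connected to the netstat fork mentioned earlier is speculation on my part.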
Stacktrace from bloatcycle.html hanging. This test is checked in here:
http://mxr.mozilla.org/mozilla/source/build/bloatcycle.html

Command line was:
firefox-bin -P default resource:///res/bloatcycle.html
bz and ajschultz thought this might help. Also did a little cleanup. 

I think a better solution here would be to use e.g. an onload handler to determine when the pages are done loading instead of a timeout, and to do something more like http://test.bclary.com/bin/quit.js to quit instead of "window.close()" on the parent window.

However, those will require more prefs to be set and more testing, so let's see if this patch helps for right now.
Attachment #263550 - Flags: review?
Attachment #263550 - Flags: review? → review?(bzbarsky)
Attachment #263550 - Flags: review?(bzbarsky) → review+
Landed:
Checking in bloatcycle.html;
/cvsroot/mozilla/build/bloatcycle.html,v  <--  bloatcycle.html
new revision: 1.3; previous revision: 1.2
done
Looks like we've stopped the bleeding at least. I am reassigning this back to the general build alias for now. I'll file a separate bug on the improvements to bloatcycle.html that I suggested, as I'd like to make the same kind of change for a lot of our tests and it could be done more generically.

Leaving the bug open to deal with comment #17.
Assignee: rhelmer → build
Status: ASSIGNED → NEW
Assignee: build → nobody
Boris, any chance you could file a separate bug on comment 17 with a little more context?  (And then we can resolve this one.)
Filed bug 417872.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Assignee: nobody → robert
Component: Testing → Release Engineering
Product: Core → mozilla.org
QA Contact: testing → release
Version: Trunk → other
Product: mozilla.org → Release Engineering