Closed Bug 455791 Opened 13 years ago Closed 13 years ago

regression from 22:38:45 9/15 causing talos intermittent ts failures on win32

Categories

(Core :: General, defect)

PowerPC
Windows XP
defect
Not set
blocker

Tracking

()

RESOLVED FIXED

People

(Reporter: anodelman, Unassigned)

References

Details

Attachments

(1 file)

Proposed regression range: http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2008-09-15+20%3A45%3A00&enddate=2008-09-15+23%3A00%3A00

After 22:38:45 9/15 we've been having intermittent talos failures during ts on win32 mozilla central talos boxes.  The failures occur when the browser takes too long to shut down, causing talos to believe that the browser is frozen and thus killing it and throwing an error.  

The intermittent failures are occurring on all talos mozilla-central win32 machines, including try server talos.  This is 5 separate machines, so machine failure is doubtful.

Bug 455734 filed to investigate possibility that this is talos bustage - but it really looks like browser bustage to me.
(In reply to comment #0)
> Proposed regression range:
> http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2008-09-15+20%3A45%3A00&enddate=2008-09-15+23%3A00%3A00

The only change that should affect generated code on win32 in that range is 
http://hg.mozilla.org/mozilla-central/rev/f51aad9e6a88

It would add some null-checks in some shutdown code and possibly mean that some code that was inlined is no longer inlined.  The expected effect of this would be very small given other things happening at shutdown.  Any idea what magnitude of change in shutdown time we might be talking about?
I would say that it could be a very small time difference.  Once talos receives results from the browser it assumes that the browser has shut down.  It then does a double check for a clean system before starting the next iteration.

There's some question that we could fix this by a well placed sleep in the talos code - but I'm wary of altering something that is showing a failure just to get us green.
Blocks: 454735
Severity: critical → blocker
If my theory about the cause of this bug is correct, then this patch should clean up the intermittent failures.  It gives the browser 5 seconds between test iterations to shut down cleanly - after that it will still call it busted and throw an error.

The danger here is that we could end up covering up slow browser closure issues.  This could warrant the design of a matching test to Tstartup for Tshutdown.
Attachment #339366 - Flags: review?(bhearsum)
Attachment #339366 - Flags: review?(bhearsum) → review?(joduinn)
Comment on attachment 339366 [details] [diff] [review]
[Checked in]give the browser 5 seconds to close before calling it busted

Looks good. 

being precise: the Sleep() time is still the same, the change here is just doing Sleep() within the loop, before every iteration of a test.
Attachment #339366 - Flags: review?(joduinn) → review+
Comment on attachment 339366 [details] [diff] [review]
[Checked in]give the browser 5 seconds to close before calling it busted

Checking in ttest.py;
/cvsroot/mozilla/testing/performance/talos/ttest.py,v  <--  ttest.py
new revision: 1.16; previous revision: 1.15
done
Attachment #339366 - Attachment description: give the browser 5 seconds to close before calling it busted → [Checked in]give the browser 5 seconds to close before calling it busted
Some talos machines were still showing ts failures with builddate
Thu Sep 18 17:14:51 2008, so I've backed out f51aad9e6a88.
That seems to have resolved the problem.  I'll see if I can work out why.

If Ts cycles were crashing on shutdown, would we know?
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Most recent run of qm-plinux-trunk03 is orange with a busted Ts again.  Is this symptomatic of the same problem?   I don't want to re-open the tree if it was closed to debug this, but if this is a different problem, then I think there are people itching to land.  :)
Talos could be masking browser crash on shutdown - if there's a crash dialog it just sees it as an open firefox process and starts cleaning up.  I haven't had much luck recreating this on stage, but I'll keep trying.
I think I've spotted a problem in f51aad9e6a88 that may have caused this.
I'm sorry I took so long to spot it in such a small patch.

-            gfxFontCache::GetCache()->NotifyReleased(this);
-            return 0;
+            NotifyReleased();
          }
          return mRefCnt;

The code is wrong because it returns this->mRefCnt when the object may have been deleted.  I don't yet know why the object may have already been deleted on Windows, but I should fix the code and try that before asking you to spend more time looking at talos.  Thank you for all your efforts and sorry for the trouble.
(In reply to comment #8)
> Most recent run of qm-plinux-trunk03 is orange with a busted Ts again.  Is this
> symptomatic of the same problem?

Looks the same kind of failure from the info in the log, though this seems much more of an outlier than the win32 failures that were happening reasonably frequently.
Duplicate of this bug: 455734
You need to log in before you can comment on or make changes to this bug.