Closed Bug 1099195 Opened 10 years ago Closed 9 years ago

Intermittent B2G debug M12 / opt M9 "command timed out: 7200 seconds elapsed running ['/tools/buildbot/bin/python', 'scripts/scripts/b2g_emulator_unittest.py', '--cfg', 'b2g/emulator_automation_config.py', '--test-suite', 'mochitest', '--this-chunk', '12'

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: RyanVM, Unassigned)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 1 obsolete file)

OK, enough ignoring this. Green runs are finishing in just under 120min. Not surprisingly, we're semi-frequently going over and timing out. Chunk on!
Summary: Intermittent B2G debug M12 "command timed out: 7200 seconds elapsed running ['/tools/buildbot/bin/python', 'scripts/scripts/b2g_emulator_unittest.py', '--cfg', 'b2g/emulator_automation_config.py', '--test-suite', 'mochitest', '--this-chunk', '12', '--tot" → Intermittent B2G debug M12 / opt M9 "command timed out: 7200 seconds elapsed running ['/tools/buildbot/bin/python', 'scripts/scripts/b2g_emulator_unittest.py', '--cfg', 'b2g/emulator_automation_config.py', '--test-suite', 'mochitest', '--this-chunk', '12'
Blocks: 1148843
I noticed the incremental GC slice is usually over due with DEBUG build, and tried to set SliceBudget with smaller CounterReset (200), but it doesn't help much. Probably because there're still many unlimited slices such as the exactGC() from mediaTestCleanup(), and gc reason CC_FORCED.
There's one thing I noticed today. SpecialPower.exactGC() does two iterations of full GC and CC like this: #1 full GC #2 CC #3 full GC #4 CC At #2, it's possible that NeedsGCAfterCC() returns true because sCCollectedWaitingForGC is larger than the threashold and nsJSContext::EndCycleCollectionCallback() then poke a delayed GC with reason CC_WAITING. This delayed GC will always happen after #4, even there's a full GC #3 and the NeedsGCAfterCC() check at #4 returns false. I think this is redundant, GC really takes time in DEBUG build. Kyle, I am not sure whom to ask, but since this is triggered from EndCycleCollectionCallback(), do you think it is reasonable to cancel the delayed GC timer if there's a full GC happened before it? Or do you know whom should I ask for this? Thank you.
Flags: needinfo?(khuey)
nsJSContext::GarbageCollectNow() calls KillGCTimer() and KillShrinkGCBuffersTimer() in the begining, but DOMGCSliceCallback() calls KillShrinkGCBuffersTimer() only when GC_CYCLE_BEGIN...
Flags: needinfo?(khuey) → needinfo?(bugs)
(In reply to Ting-Yu Chou [:ting] from comment #2024) > nsJSContext::GarbageCollectNow() calls KillGCTimer() and > KillShrinkGCBuffersTimer() in the begining, but DOMGCSliceCallback() calls > KillShrinkGCBuffersTimer() only when GC_CYCLE_BEGIN... This is code that smaug or I are most familiar with. Your reasoning in comment 2023 makes sense: we should not have a GC running after CC 4 because of the state of things in CC 2. The start of the second full GC should kill off the full GC timer started in the first CC, because it should be doing a GC_CYCLE_BEGIN. Do you know why that is not happening?
Flags: needinfo?(bugs)
Trying to figure out what exactGC actually is about... that looks a bit silly. But yes, starting a new GC should kill the timer.
(In reply to Andrew McCreight (awayish through 5/5) [:mccr8] from comment #2025) > (In reply to Ting-Yu Chou [:ting] from comment #2024) > > nsJSContext::GarbageCollectNow() calls KillGCTimer() and > > KillShrinkGCBuffersTimer() in the begining, but DOMGCSliceCallback() calls > > KillShrinkGCBuffersTimer() only when GC_CYCLE_BEGIN... > > This is code that smaug or I are most familiar with. Your reasoning in > comment 2023 makes sense: we should not have a GC running after CC 4 because > of the state of things in CC 2. > > The start of the second full GC should kill off the full GC timer started in > the first CC, because it should be doing a GC_CYCLE_BEGIN. Do you know why > that is not happening? Comment 2024 should be the reason. The full GC is initiated from PreciseGCRunnable() by JS::GCForReason() and goes to GCRuntime::collect() (not through nsJSContext::GarbageCollectNow()), so the timer should be canceled in DOMGCSliceCallback() for nsJSContext when receive GC_CYCLE_BEGIN.
Debug M12 still timed out from the Try results of bug 1159810, there could be some other reasons...
Will sheriff have any concerns if we set total chunks larger?
Flags: needinfo?(ryanvm)
It's not that simple, otherwise we would have done it months ago. As demonstrated recently in bug 1156425, we're up against the limit for the number of different jobs we can run on our AWS linux64 instances due to inherent buildbot limitations. We've been banking on switching emulator tests to Taskcluster w/ significantly more chunks as the ultimate fix here, but that's been taking much longer than anticipated. Unfortunately, we're pretty much stuck at the moment, and yes, that's awful.
Flags: needinfo?(ryanvm)
Then what if we extend the timeout (2h) longer as bug 1156425?
Flags: needinfo?(ryanvm)
*shrug* I'm honestly having a difficult time caring much at this point. If you want to go that route, I don't have a strong opinion.
Flags: needinfo?(ryanvm)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #2089) > *shrug* I'm honestly having a difficult time caring much at this point. If > you want to go that route, I don't have a strong opinion. I couldn't find the place set 2h timeout, is it https://dxr.mozilla.org/mozilla-central/source/testing/taskcluster/tasks/tests/b2g_emulator_mochitest.yml?from=b2g_emulator_mochitest.yml&case=true#11?
(In reply to Ting-Yu Chou [:ting] from comment #2090) > I couldn't find the place set 2h timeout, is it > https://dxr.mozilla.org/mozilla-central/source/testing/taskcluster/tasks/ > tests/b2g_emulator_mochitest.yml?from=b2g_emulator_mochitest. > yml&case=true#11? Those are the Taskcluster configs. They won't help for the existing jobs. I think you'll need to specify one for each chunk in https://mxr.mozilla.org/build/source/buildbot-configs/mozilla-tests/b2g_config.py. I'd suggest asking someone from releng for more guidance than that.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #2094) > (In reply to Ting-Yu Chou [:ting] from comment #2090) > > I couldn't find the place set 2h timeout, is it > > https://dxr.mozilla.org/mozilla-central/source/testing/taskcluster/tasks/ > > tests/b2g_emulator_mochitest.yml?from=b2g_emulator_mochitest. > > yml&case=true#11? > > Those are the Taskcluster configs. They won't help for the existing jobs. I > think you'll need to specify one for each chunk in > https://mxr.mozilla.org/build/source/buildbot-configs/mozilla-tests/ > b2g_config.py. I'd suggest asking someone from releng for more guidance than > that. hwine, do you know where sets the 2h timeout for one test chunk? I have tried to ask on IRC #releng but didn't get responses (probably because of different timezone), please let me know if I should ask someone else. Thanks.
Flags: needinfo?(hwine)
I don't think we really want to increase the timeout here, 2h jobs make end-to-end times bad enough. What we should really do is increase the number of chunks.
Say hello to my little friend comment 2086. Here, let me help make it easier to concentrate, by hiding debug M1.
Ugh, so if we have no other choices, we can increase the timeout by adding a 'script_maxtime' property to the entries here: http://hg.mozilla.org/build/buildbot-configs/file/f895d6aaa604/mozilla-tests/b2g_config.py#l320 ...and specify something greater than 7200.
Attached patch patch v1 (obsolete) — Splinter Review
Extend mochitest timeout from 2h to 4h. Please let me know if I should ask someone else for reviewing. Clear NI to hwine.
Flags: needinfo?(hwine)
Attachment #8602482 - Flags: review?(jgriffin)
Comment on attachment 8602482 [details] [diff] [review] patch v1 Review of attachment 8602482 [details] [diff] [review]: ----------------------------------------------------------------- It pains me to do this, but I guess we don't have better alternatives atm. r+
Attachment #8602482 - Flags: review?(jgriffin) → review+
Attached patch patch v2Splinter Review
Added r=jgriffin.
Attachment #8602482 - Attachment is obsolete: true
Not sure have sheriff noticed the checkin-needed, Kanru told me seems there're some problems with the filter, or probably you guys are just too busy. Set NI to make sure it's in the checkin queue. Thank you.
Flags: needinfo?(ryanvm)
Keywords: checkin-needed
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #2171) > Comment on attachment 8603127 [details] [diff] [review] > patch v2 > > *cry* We can still revert this once comment 2086 is addressed.
No longer blocks: HiddenAutomationTests
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: