Closed Bug 1115490 Opened 9 years ago Closed 8 years ago

Intermittent Windows build command timed out: 19800 seconds elapsed, attempting to kill

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
Windows Server 2008
task
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: intermittent-failure)

Attachments

(2 files)

We've just been blowing these off for a while, but who knows, they might turn out to be just a subset of slaves, or so frequent we have to go from an insane maxtime up to a psychotic maxtime.
b-2008-ix-0007 - 5
b-2008-ix-0012 - 2
b-2008-ix-0013 - 4
b-2008-ix-0014 - 3
b-2008-ix-0051 - 1
b-2008-ix-0066 - 1
b-2008-ix-0090 - 1
b-2008-ix-0124 - 1
b-2008-ix-0152 - 1
b-2008-ix-0164 - 1
b-2008-ix-0168 - 1

Couple potential winners of the loser sweepstakes there.
b-2008-ix-0001 - 1
b-2008-ix-0002 - 2
b-2008-ix-0003 - 1
b-2008-ix-0007 - 7
b-2008-ix-0012 - 4
b-2008-ix-0013 - 8
b-2008-ix-0014 - 4
b-2008-ix-0051 - 1
b-2008-ix-0066 - 1
b-2008-ix-0090 - 1
b-2008-ix-0118 - 1
b-2008-ix-0124 - 1
b-2008-ix-0131 - 1
b-2008-ix-0143 - 1
b-2008-ix-0152 - 1
b-2008-ix-0164 - 1
b-2008-ix-0168 - 1
b-2008-ix-0171 - 1
Depends on: 1122975
last few recent builds are all timing out here:

23:17:58     INFO -  c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/obj-firefox/_virtualenv/Scripts/python.exe c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/config/nsinstall.py -t -m 755 'jsapi-tests.exe' '../../../dist/bin'
23:17:58     INFO -  mozmake.EXE[6]: Leaving directory 'c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/obj-firefox/js/src/jsapi-tests'
23:44:50     INFO -  Still linking, 30 minutes passed...
(In reply to Jordan Lund (:jlund) from comment #51)
> last few recent builds are all timing out here:
> 
> 23:17:58     INFO - 
> c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/obj-firefox/
> _virtualenv/Scripts/python.exe
> c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/config/
> nsinstall.py -t -m 755 'jsapi-tests.exe' '../../../dist/bin'
> 23:17:58     INFO -  mozmake.EXE[6]: Leaving directory
> 'c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/obj-firefox/
> js/src/jsapi-tests'
> 23:44:50     INFO -  Still linking, 30 minutes passed...

On a good build/machine, there are only one or two "Still linking, 30 minutes passed..." messages. On the failing ones there are 5 before it gets killed, meaning it's taking at least 2-3 times longer to complete the same task. As you mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1110236#c17 it's worth checking if we're running out of memory and just dying from too much swap. It might also help to check if the CPU is scaling under load - we never did figure out bug 1050508 for the SeaMicro machines. Maybe there's a similar mis-configuration on these machines?
Thanks to jlund a handy script:

(tmp)-bash-4.1$ ./diagnose_intermittent_bug.py 1115490
{'builders': [('WINNT 5.2 mozilla-inbound pgo-build', 18),
              ('WINNT 5.2 fx-team pgo-build', 16),
              ('WINNT 5.2 b2g-inbound pgo-build', 14),
              ('WINNT 5.2 mozilla-aurora build', 5),
              ('WINNT 5.2 mozilla-central nightly', 2),
              ('WINNT 5.2 mozilla-central pgo-build', 2),
              ('WINNT 5.2 mozilla-beta build', 2),
              ('WINNT 5.2 try build', 1),
              ('WINNT 5.2 fx-team build', 1)],
 'repos': [('mozilla-inbound', 18),
           ('fx-team', 17),
           ('b2g-inbound', 14),
           ('mozilla-aurora', 5),
           ('mozilla-central', 4),
           ('mozilla-beta', 2),
           ('try', 1)],
 'slaves': [('b-2008-ix-0007', 11),
            ('b-2008-ix-0013', 8),
            ('b-2008-ix-0012', 4),
            ('b-2008-ix-0014', 4),
            ('b-2008-ix-0002', 4),
            ('b-2008-ix-0152', 2),
            ('b-2008-ix-0118', 2),
            ('b-2008-ix-0168', 2),
            ('b-2008-ix-0005', 2),
            ('b-2008-ix-0004', 2),
            ('b-2008-ix-0003', 2),
            ('b-2008-ix-0001', 2),
            ('b-2008-ix-0111', 1),
            ('b-2008-ix-0150', 1),
            ('b-2008-ix-0171', 1),
            ('b-2008-ix-0076', 1),
            ('b-2008-ix-0132', 1),
            ('b-2008-ix-0131', 1),
            ('b-2008-ix-0051', 1),
            ('b-2008-ix-0090', 1),
            ('b-2008-ix-0096', 1),
            ('b-2008-ix-0006', 1),
            ('b-2008-ix-0143', 1),
            ('b-2008-ix-0162', 1),
            ('b-2008-ix-0066', 1),
            ('b-2008-ix-0164', 1),
            ('b-2008-ix-0124', 1),
            ('b-2008-ix-0123', 1)],
 'times': {'2014-12-24': 1,
           '2014-12-25': 1,
           '2014-12-26': 3,
           '2014-12-27': 1,
           '2014-12-30': 4,
           '2014-12-31': 3,
           '2015-01-02': 1,
           '2015-01-04': 1,
           '2015-01-05': 3,
           '2015-01-06': 3,
           '2015-01-07': 3,
           '2015-01-08': 3,
           '2015-01-09': 6,
           '2015-01-11': 1,
           '2015-01-12': 4,
           '2015-01-13': 1,
           '2015-01-14': 5,
           '2015-01-15': 2,
           '2015-01-16': 1,
           '2015-01-17': 1,
           '2015-01-18': 2,
           '2015-01-19': 1,
           '2015-01-20': 2,
           '2015-01-21': 1,
           '2015-01-23': 1,
           '2015-01-27': 3,
           '2015-01-28': 3}}
Blocks: 1149955
Is there anyone that can start looking at this? In trying to debug/analyse what's going on with bug 1149955 I've spent over a week and most of that is waiting for try builds, which probably 80% of the time have failed.

Its obviously not helped by the 5-hour time for pgo builds, though I realise that's different, but its very frustrating to push to try and then come back 5 hours later only to realise they have failed again.
{'builders': [('WINNT 5.2 mozilla-inbound pgo-build', 40),
              ('WINNT 5.2 b2g-inbound pgo-build', 32),
              ('WINNT 6.1 x86-64 try build', 31),
              ('WINNT 5.2 fx-team pgo-build', 28),
              ('WINNT 6.1 x86-64 mozilla-inbound pgo-build', 8),
              ('WINNT 6.1 x86-64 b2g-inbound pgo-build', 8),
              ('WINNT 5.2 try build', 8),
              ('WINNT 5.2 mozilla-aurora build', 6),
              ('WINNT 6.1 x86-64 fx-team pgo-build', 5),
              ('WINNT 5.2 mozilla-central pgo-build', 5),
              ('WINNT 5.2 mozilla-beta build', 3),
              ('WINNT 6.1 x86-64 mozilla-central nightly', 3),
              ('WINNT 5.2 mozilla-central nightly', 3),
              ('WINNT 6.1 x86-64 mozilla-aurora build', 3),
              ('WINNT 5.2 mozilla-aurora nightly', 2),
              ('WINNT 5.2 mozilla-inbound leak test build', 2),
              ('WINNT 6.1 x86-64 mozilla-central pgo-build', 2),
              ('WINNT 5.2 mozilla-inbound build', 1),
              ('Ubuntu VM 12.04 fx-team debug test mochitest-browser-chrome-2',
               1),
              ('WINNT 6.1 x86-64 mozilla-aurora nightly', 1),
              ('WINNT 5.2 fx-team build', 1),
              ('non-buildbot b2g-device-image test [TC] B2G Flame KK Eng', 1),
              ('b2g_mozilla-inbound_win32_gecko build', 1),
              ('WINNT 5.2 mozilla-aurora leak test build', 1),
              ('WINNT 6.1 x86-64 cypress build', 1),
              ('TB WINNT 5.2 comm-central nightly', 1),
              ('WINNT 6.1 x86-64 try leak test build', 1),
              ('Ubuntu VM 12.04 mozilla-inbound debug test mochitest-other',
               1)],
 'repos': [('mozilla-inbound', 54),
           ('b2g-inbound', 40),
           ('try', 40),
           ('fx-team', 35),
           ('mozilla-central', 13),
           ('mozilla-aurora', 13),
           ('mozilla-beta', 3),
           ('comm-central', 1),
           ('cypress', 1)],
 'slaves': [('b-2008-ix-0007', 11),
            ('b-2008-ix-0015', 10),
            ('b-2008-ix-0013', 8),
            ('b-2008-ix-0014', 6),
            ('b-2008-ix-0097', 5),
            ('b-2008-ix-0012', 5),
            ('b-2008-ix-0118', 4),
            ('b-2008-ix-0110', 4),
            ('b-2008-ix-0068', 4),
            ('b-2008-ix-0002', 4),
            ('b-2008-ix-0039', 3),
            ('b-2008-ix-0139', 3),
            ('b-2008-ix-0168', 3),
            ('b-2008-ix-0161', 3),
            ('b-2008-ix-0027', 3),
            ('b-2008-ix-0171', 3),
            ('b-2008-ix-0140', 3),
            ('b-2008-ix-0152', 3),
            ('b-2008-ix-0183', 2),
            ('b-2008-ix-0033', 2),
            ('b-2008-ix-0133', 2),
            ('b-2008-ix-0132', 2),
            ('b-2008-ix-0131', 2),
            ('b-2008-ix-0134', 2),
            ('b-2008-ix-0108', 2),
            ('b-2008-ix-0109', 2),
            ('b-2008-ix-0029', 2),
            ('b-2008-ix-0105', 2),
            ('b-2008-ix-0106', 2),
            ('b-2008-ix-0115', 2),
            ('b-2008-ix-0116', 2),
            ('b-2008-ix-0111', 2),
            ('b-2008-ix-0162', 2),
            ('b-2008-ix-0048', 2),
            ('b-2008-ix-0076', 2),
            ('b-2008-ix-0093', 2),
            ('b-2008-ix-0094', 2),
            ('b-2008-ix-0098', 2),
            ('b-2008-ix-0142', 2),
            ('b-2008-ix-0143', 2),
            ('b-2008-ix-0141', 2),
            ('b-2008-ix-0066', 2),
            ('b-2008-ix-0018', 2),
            ('b-2008-ix-0154', 2),
            ('b-2008-ix-0017', 2),
            ('b-2008-ix-0150', 2),
            ('b-2008-ix-0006', 2),
            ('b-2008-ix-0005', 2),
            ('b-2008-ix-0004', 2),
            ('b-2008-ix-0003', 2),
            ('b-2008-ix-0001', 2),
            ('b-2008-ix-0184', 1),
            ('b-2008-ix-0182', 1),
            ('b-2008-ix-0030', 1),
            ('b-2008-ix-0035', 1),
            ('b-2008-ix-0147', 1),
            ('b-2008-ix-0038', 1),
            ('b-2008-ix-0137', 1),
            ('b-2008-ix-0135', 1),
            ('b-2008-ix-0020', 1),
            ('b-2008-ix-0023', 1),
            ('b-2008-ix-0022', 1),
            ('b-2008-ix-0028', 1),
            ('b-2008-ix-0102', 1),
            ('b-2008-ix-0101', 1),
            ('b-2008-ix-0149', 1),
            ('b-2008-ix-0056', 1),
            ('b-2008-ix-0051', 1),
            ('b-2008-ix-0053', 1),
            ('b-2008-ix-0059', 1),
            ('b-2008-ec2-0002', 1),
            ('unknown', 1),
            ('b-2008-ix-0163', 1),
            ('b-2008-ix-0164', 1),
            ('b-2008-ix-0042', 1),
            ('b-2008-ix-0041', 1),
            ('b-2008-ix-0040', 1),
            ('b-2008-ix-0047', 1),
            ('b-2008-ix-0046', 1),
            ('b-2008-ix-0049', 1),
            ('b-2008-ix-0173', 1),
            ('b-2008-ix-0074', 1),
            ('b-2008-ix-0179', 1),
            ('b-2008-ix-0071', 1),
            ('tst-linux32-spot-170', 1),
            ('tst-linux32-spot-174', 1),
            ('b-2008-ix-0090', 1),
            ('b-2008-ix-0092', 1),
            ('b-2008-ix-0095', 1),
            ('b-2008-ix-0096', 1),
            ('b-2008-ec2-0020', 1),
            ('b-2008-ix-0148', 1),
            ('b-2008-ix-0019', 1),
            ('b-2008-ix-0156', 1),
            ('b-2008-ix-0158', 1),
            ('b-2008-ix-0016', 1),
            ('b-2008-ix-0088', 1),
            ('b-2008-ix-0124', 1),
            ('b-2008-ix-0123', 1),
            ('b-2008-ix-0086', 1)],
 'times': {'2014-12-24': 1,
           '2014-12-25': 1,
           '2014-12-26': 3,
           '2014-12-27': 1,
           '2014-12-30': 4,
           '2014-12-31': 3,
           '2015-01-02': 1,
           '2015-01-04': 1,
           '2015-01-05': 3,
           '2015-01-06': 3,
           '2015-01-07': 3,
           '2015-01-08': 3,
           '2015-01-09': 6,
           '2015-01-11': 1,
           '2015-01-12': 4,
           '2015-01-13': 1,
           '2015-01-14': 5,
           '2015-01-15': 2,
           '2015-01-16': 1,
           '2015-01-17': 1,
           '2015-01-18': 2,
           '2015-01-19': 1,
           '2015-01-20': 2,
           '2015-01-21': 1,
           '2015-01-23': 1,
           '2015-01-27': 3,
           '2015-01-28': 3,
           '2015-02-01': 1,
           '2015-02-03': 1,
           '2015-02-04': 2,
           '2015-02-05': 1,
           '2015-02-07': 1,
           '2015-02-08': 1,
           '2015-02-14': 1,
           '2015-02-15': 1,
           '2015-02-16': 1,
           '2015-02-19': 2,
           '2015-02-22': 3,
           '2015-02-24': 1,
           '2015-02-25': 2,
           '2015-02-26': 1,
           '2015-02-28': 1,
           '2015-03-01': 1,
           '2015-03-02': 2,
           '2015-03-03': 3,
           '2015-03-04': 1,
           '2015-03-05': 2,
           '2015-03-06': 1,
           '2015-03-07': 2,
           '2015-03-08': 4,
           '2015-03-09': 6,
           '2015-03-12': 1,
           '2015-03-13': 2,
           '2015-03-14': 1,
           '2015-03-16': 1,
           '2015-03-17': 3,
           '2015-03-18': 3,
           '2015-03-26': 1,
           '2015-03-29': 1,
           '2015-03-30': 1,
           '2015-04-03': 1,
           '2015-04-10': 2,
           '2015-04-22': 4,
           '2015-04-28': 3,
           '2015-05-01': 2,
           '2015-05-04': 1,
           '2015-05-06': 4,
           '2015-05-15': 2,
           '2015-05-26': 1,
           '2015-05-27': 2,
           '2015-05-30': 2,
           '2015-05-31': 1,
           '2015-06-01': 3,
           '2015-06-03': 2,
           '2015-06-04': 3,
           '2015-06-10': 3,
           '2015-06-14': 1,
           '2015-06-16': 2,
           '2015-06-18': 1,
           '2015-06-19': 6,
           '2015-07-01': 1,
           '2015-07-02': 1,
           '2015-07-03': 1,
           '2015-07-04': 1,
           '2015-07-06': 1,
           '2015-07-07': 1,
           '2015-07-08': 5,
           '2015-07-12': 1,
           '2015-07-13': 1,
           '2015-07-16': 5,
           '2015-07-20': 1,
           '2015-07-22': 1,
           '2015-07-23': 1,
           '2015-07-24': 1,
           '2015-07-27': 1,
           '2015-07-28': 1,
           '2015-07-29': 1,
           '2015-07-30': 1,
           '2015-08-03': 1,
           '2015-08-04': 2,
           '2015-08-05': 5,
           '2015-08-06': 3}}
Attached file ProcessList.txt
I hit this on a staging build, and I was able to log in to the machine and generated the attached process list. It looks like symbolstore.py might be hung, though there is also a suspicious grep.exe running that doesn't appear to have any arguments. I have some .DMP files as well that I created from the task manager, but I'm not sure if they are useful. :ted, any thoughts on why symbolstore.py might be hanging?

My build hung after these messages:

09:53:56     INFO -  cd dist/test-stage && \
09:53:56     INFO -    zip -rq9D 'c:/builds/moz2_slave/m-cen-w32-00000000000000000000/build/src/obj-firefox/dist/firefox-42.0a1.en-US.win32.common.tests.zip' \
09:53:56     INFO -    * -x \*/.mkdir.done \*.pyc cppunittest\* mochitest\* reftest\* xpcshell\* web-platform\* && \
09:53:56     INFO -  rm -f 'dist/firefox-42.0a1.en-US.win32.'cppunittest'.tests.zip' && zip -rq9D 'c:/builds/moz2_slave/m-cen-w32-00000000000000000000/build/src/obj-firefox/dist/firefox-42.0a1.en-US.win32.'cppunittest'.tests.zip' cppunittest -x \*/.mkdir.done \*.pyc ; rm -f 'dist/firefox-42.0a1.en-US.win32.'mochitest'.tests.zip' && zip -rq9D 'c:/builds/moz2_slave/m-cen-w32-00000000000000000000/build/src/obj-firefox/dist/firefox-42.0a1.en-US.win32.'mochitest'.tests.zip' mochitest -x \*/.mkdir.done \*.pyc ; rm -f 'dist/firefox-42.0a1.en-US.win32.'reftest'.tests.zip' && zip -rq9D 'c:/builds/moz2_slave/m-cen-w32-00000000000000000000/build/src/obj-firefox/dist/firefox-42.0a1.en-US.win32.'reftest'.tests.zip' reftest -x \*/.mkdir.done \*.pyc ; rm -f 'dist/firefox-42.0a1.en-US.win32.'xpcshell'.tests.zip' && zip -rq9D 'c:/builds/moz2_slave/m-cen-w32-00000000000000000000/build/src/obj-firefox/dist/firefox-42.0a1.en-US.win32.'xpcshell'.tests.zip' xpcshell -x \*/.mkdir.done \*.pyc ; rm -f 'dist/firefox-42.0a1.en-US.win32.'web-platform'.tests.zip' && zip -rq9D 'c:/builds/moz2_slave/m-cen-w32-00000000000000000000/build/src/obj-firefox/dist/firefox-42.0a1.en-US.win32.'web-platform'.tests.zip' web-platform -x \*/.mkdir.done \*.pyc ;
09:53:56     INFO -  mozmake.EXE[1]: Leaving directory 'c:/builds/moz2_slave/m-cen-w32-00000000000000000000/build/src/obj-firefox'
09:53:56     INFO -  mozmake.EXE: Leaving directory 'c:/builds/moz2_slave/m-cen-w32-00000000000000000000/build/src/obj-firefox'

It's possible there are other timeouts in this bug that are unrelated.
Flags: needinfo?(ted)
I landed a fix for symbolstore.py hangs in bug 1164816, which was after mshal's comment 215 (sorry for the delay).

Looking at a sampling of other logs, the hangs appear to be all over the place.
Flags: needinfo?(ted)
Whether it's a temporary increase while someone figures out why we're so slow or just a permanent increase, we need an increased maxtime: it's turned into a daily or multiple-times-per-day adventure to figure out whether the m-c nightlies died before or after they uploaded, and thus whether they must be retriggered to keep people from yelling or must not be retriggered to keep even more people from yelling.
If this bug is responsible for bug 1223104, as Wes believes, I believe its importance should be escalated. There have been no win64 Nightly updates since November 5th, leaving a large segment of Nightly users (who might not think to manually update via archive.mozilla.org) on an outdated Nightly!
This patch adds an extra 7200s to the Windows max runtime. Hopefully that's enough to work around the problem for now. (And hopefully this is the right place to patch this...)
Attachment #8685514 - Flags: review?(jlund)
Comment on attachment 8685514 [details] [diff] [review]
Increase the max runtime to 27000s

Review of attachment 8685514 [details] [diff] [review]:
-----------------------------------------------------------------

this will bump the windows build max allowable run time which looks like what we want for this bug. As opposed to the max no-output timeout (script_timeout).


I should note that this only will change things for trunk based branches and aurora. beta/release/esr don't use mh to build (yet!). Again, judging by the 'Repository breakdown:', this too is all you care about.
Attachment #8685514 - Flags: review?(jlund) → review+
Blocks: 1224298
No longer blocks: 1224298
Three months since last occurrence. Probably related to moving Windows builders to AWS.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: