Closed Bug 847868 Opened 9 years ago Closed 9 years ago

WinXP and Win7 test slaves backed up on Try

Categories

(Release Engineering :: General, defect)

x86
Windows XP
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Unassigned)

References

Details

See bottom graph on:
http://builddata.pub.build.mozilla.org/reports/pending/pending.html

Even though we're at the quiet time of day there are still a lot of WinXP/Win7 jobs pending.

win7 (433)
  433 try
winxp (298)
  298 try

What's changed recently? WinXP/Win7 jobs are not normally the longest pole for Try end-to-end times, but they are at the moment.

Do some Windows machines need a kick? I can only see ~6-8 windows machines on http://build.mozilla.org/builds/last-job-per-slave.html that look like they may have hung.

Catlee, can you find someone to look at this? :-)
Flags: needinfo?(catlee)
https://build.mozilla.org/builds/pending/running.html shows that we're consistently running quite a few winxp/win7 jobs.

Could it be that something has regressed test times recently?

We've also fixed branch prioritization so now try really is prioritized lower than most other branches.
Flags: needinfo?(catlee)
https://secure.pub.build.mozilla.org/buildapi/running is lying about some of the test runs. It claims some jobs have been running for more than a day, but the machines have since moved on to other work.

We had some networking problems over the weekend, so perhaps some machines are wedged because of that.
per irc w/avih:

He pushed https://tbpl.mozilla.org/?tree=Try&rev=88889739736c to try:

1) builds for all OS went fine
2) tests for osx and linux went fine
3) tests for win7 and winXP are backlogged. 

Adding info here to help debugging....
Well, I'll note that on https://tbpl.mozilla.org/?tree=Try&rev=95f02e4036d4, the OSX tests (other than 10.8, which was a bit better) took around 10 hours to start. Windows was worse, but Mac was pretty horrible too. And Fedora32 took around 12 hours.

17 hours later, my Try (pushed @ 10:30pm) is still waiting for Win7 results:
https://tbpl.mozilla.org/?tree=Try&rev=0d841f84764a

WinXP ran at around 9am (~11.5 hours after push)
(In reply to Randell Jesup [:jesup] from comment #4)
> Well, I'll note that on https://tbpl.mozilla.org/?tree=Try&rev=95f02e4036d4,
> the OSX tests (other than 10.8, which was a bit better) took around 10 hours
> to start.  Windows was worse, but mac was pretty horrible.  And Fedora32
> took around 12 hours
> 
> 17 hours later, my Try (pushed @ 10:30pm) is still waiting for Win7 results:
> https://tbpl.mozilla.org/?tree=Try&rev=0d841f84764a
> 
> WinXP ran at around 9am (~11.5 hours after push)

(after mid-air)....per irc w/jesup:

rjesup pushed https://tbpl.mozilla.org/?tree=Try&rev=0d841f84764a to try last night @ Mon Mar 4 22:03:41 2013 PST: 

1) builds for all OS completed
2) tests for osx and linux completed
3) tests for win7 and winXP are backlogged.
We actually publish the pending jobs, in a table that's sortable by clicking on the "submitted at" column header, at https://secure.pub.build.mozilla.org/buildapi/pending - you don't really need IRC to find out that the current Win7 backlog is 18.5 hours and the current WinXP backlog is 14.5 hours. You'll get it anyway, but you don't *need* it ;)
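For anyone who wants to compute the same number rather than eyeball the sorted table: the backlog per platform is just the age of its oldest still-pending job. A minimal sketch, assuming a hypothetical list of (platform, submitted_at) pairs; the real buildapi response format isn't shown in this bug, so the data structure and the sample values below are made up for illustration.

```python
from datetime import datetime, timedelta


def backlog_hours(pending, now):
    """Return {platform: age in hours of the oldest pending job}."""
    oldest = {}
    for platform, submitted_at in pending:
        # Keep the earliest submission time seen for each platform.
        if platform not in oldest or submitted_at < oldest[platform]:
            oldest[platform] = submitted_at
    return {p: (now - t).total_seconds() / 3600 for p, t in oldest.items()}


# Hypothetical sample data, not taken from the real pending page.
now = datetime(2013, 3, 5, 16, 30)
pending = [
    ("win7", now - timedelta(hours=18, minutes=30)),
    ("win7", now - timedelta(hours=2)),
    ("winxp", now - timedelta(hours=14, minutes=30)),
]
print(backlog_hours(pending, now))
```

Sorting by "submitted at" on the page does the same thing by hand: the first row for each platform is the oldest job.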
(In reply to Phil Ringnalda (:philor) from comment #6)
> ... that the current Win7 backlog is 18.5 hours ...

20 hours now. Either jobs aren't being dequeued at all, or they are being dequeued more slowly than new jobs are queued.

Anything changed recently which might be causing this?
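To make the "dequeues slower than jobs are queued" case concrete, here is a toy model of a pending queue with constant hourly rates. The rates are invented for illustration; only the starting count of 433 comes from the win7 pending figure in the original report.

```python
def backlog_after(initial_pending, queued_per_hour, dequeued_per_hour, hours):
    """Pending jobs after `hours`, assuming constant rates; never below zero."""
    pending = initial_pending
    for _ in range(hours):
        pending = max(0, pending + queued_per_hour - dequeued_per_hour)
    return pending


# Hypothetical rates: the backlog grows whenever queued > dequeued.
growing = backlog_after(433, queued_per_hour=40, dequeued_per_hour=30, hours=6)
draining = backlog_after(433, queued_per_hour=20, dequeued_per_hour=60, hours=6)
print(growing, draining)
```

Even a small deficit in dequeue rate keeps adding to the backlog for as long as it lasts, which would fit a backlog that keeps growing through the quiet time of day.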
(In reply to Avi Halachmi (:avih) from comment #7)
> (In reply to Phil Ringnalda (:philor) from comment #6)
> > ... that the current Win7 backlog is 18.5 hours ...
> 
> 20 hours now. It either isn't dequeued, or dequeues slower than jobs are
> queued.
> 
> Anything changed recently which might be causing this?

Nothing that we've found at this point, but we're definitely looking into it!
We're back down to more usual levels of pending jobs on try now. I'm still not certain of the cause of this spike in wait time.
Severity: critical → normal
The tree closure yesterday gave us a chance to catch up.
It doesn't help too much that our average test runtime is greatest on Windows:
http://brasstacks.mozilla.com/gofaster/#/executiontime/test
(In reply to Ed Morley [:edmorley UTC+0] from comment #11)
> It doesn't help too much that our average test runtime is greatest on
> Windows:
> http://brasstacks.mozilla.com/gofaster/#/executiontime/test

I looked at xpcshell tests in bug 617503. That work only scratched the surface. There's probably a lot more that can be investigated / done to speed them up.
Any actions left for this bug? Wait times are better now.

I'm going to open a new bug to add the rev3 minis that we have scavenged.
We've also moved a handful of win7 and winxp staging machines into production.
We've cleaned up a few machines that were in limbo.
This morning we deployed a _dumbwin32proc.py that will allow the XP slaves to cancel builds rather than running them all the way through.

From this week (Tue, 12 Mar 2013):
Wait: 43605/82.68% (testpool)
xp: 4413
  0:     3207    72.67%
 15:      711    16.11%
 30:      152     3.44%
 45:       40     0.91%
 60:       79     1.79%
 75:       27     0.61%
 90+:      197     4.46%

From last week (Wed, 06 Mar 2013):
Wait: 42168/74.70% (testpool)
xp: 4417
  0:     2618    59.27%
 15:      677    15.33%
 30:      181     4.10%
 45:       83     1.88%
 60:       43     0.97%
 75:       41     0.93%
 90+:      774    17.52%

win7: 4404
  0:     2423    55.02%
 15:      829    18.82%
 30:      268     6.09%
 45:      152     3.45%
 60:       38     0.86%
 75:       46     1.04%
 90+:      648    14.71%
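The percentages in these reports can be re-derived from the raw counts, which is handy for comparing weeks. A small sketch using the xp numbers above; the bucket labels are copied as-is from the report, and the function name is just for illustration.

```python
def bucket_percentages(buckets):
    """Return each bucket's share of the total, as a percentage to 2 decimals."""
    total = sum(buckets.values())
    return {label: round(100.0 * n / total, 2) for label, n in buckets.items()}


# Counts copied from the xp sections of the two reports above.
xp_this_week = {"0": 3207, "15": 711, "30": 152, "45": 40,
                "60": 79, "75": 27, "90+": 197}
xp_last_week = {"0": 2618, "15": 677, "30": 181, "45": 83,
                "60": 43, "75": 41, "90+": 774}

this_pct = bucket_percentages(xp_this_week)
last_pct = bucket_percentages(xp_last_week)
print("xp 0 bucket:  ", last_pct["0"], "->", this_pct["0"])
print("xp 90+ bucket:", last_pct["90+"], "->", this_pct["90+"])
```

The two buckets worth watching are the first and the last: for xp, the share of jobs in the "0" bucket went up while the "90+" share dropped to roughly a quarter of what it was.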
(In reply to Armen Zambrano G. [:armenzg] from comment #13)
> We've deployed this morning a _dumbwin32proc.py which will allow the XP
> slaves to cancel builds rather than run all the way.

Ah great :-)
(In reply to Armen Zambrano G. [:armenzg] (Release Enginerring) from comment #13)
> Any actions left for this bug? Wait times are better now.
> 
> I'm going to open a new one to add those rev3 minis that we have scavenged.
> We've also added a handful of win7 and winxp staging machines as production.
> We've cleaned few machines that were in limbo.
> We've deployed this morning a _dumbwin32proc.py which will allow the XP
> slaves to cancel builds rather than run all the way.
> 
> From this week (Tue, 12 Mar 2013):
> Wait: 43605/82.68% (testpool)
> xp: 4413
>   0:     3207    72.67%
>  15:      711    16.11%
>  30:      152     3.44%
>  45:       40     0.91%
>  60:       79     1.79%
>  75:       27     0.61%
>  90+:      197     4.46%
> 
> From last week (Wed, 06 Mar 2013):
> Wait: 42168/74.70% (testpool)
> xp: 4417
>   0:     2618    59.27%
>  15:      677    15.33%
>  30:      181     4.10%
>  45:       83     1.88%
>  60:       43     0.97%
>  75:       41     0.93%
>  90+:      774    17.52%
> 
> win7: 4404
>   0:     2423    55.02%
>  15:      829    18.82%
>  30:      268     6.09%
>  45:      152     3.45%
>  60:       38     0.86%
>  75:       46     1.04%
>  90+:      648    14.71%

edmorley: We're still getting consistently good wait times... anything left to do here or can we close this bug as FIXED?
Flags: needinfo?(emorley)
Looks fine to me now, thank you :-)
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(emorley)
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General