
Investigate why Windows xpcshell tests take so long to run on the tinderboxen

Status: NEW, Unassigned
Component: Testing :: XPCShell Harness
Reported: 5 years ago; last modified: 3 years ago
Reporter: Away for a while
Firefox tracking flags: (Not tracked)
Whiteboard: [buildfaster:p2][testing]
Attachments: 1

The backstory is in bug 617503 - dropping IPv6 took Win7 from 70 minutes to 50 (at that time; it's back pretty close to 70 now), but running locally on worse hardware was taking 7 minutes instead of 50. The same thing probably affects reftests (and crashtests, though crashtests are fast enough that it's only 5 minutes vs 15 minutes, rather than the noticeable 20 minutes vs 45 minutes for reftest and 30 minutes vs 70 minutes for xpcshell between Linux32 and Win7).
IMO the first step is to figure out which tests are taking so long. My first step when tackling bug 617503 was to create a spreadsheet of the test times. A badly written geolocation test which took a huge amount of time (and didn't even work) immediately stood out. It's quite possible something like that has snuck in again.
(Reporter) Comment 3, 5 years ago:
(In reply to comment #2)
> IMO the first step is to figure out which tests are taking so long. My first
> step when tackling bug 617503 was to create a spreadsheet of the test times. A
> badly written geolocation test which took a huge amount of time (and didn't
> even work) immediately stood out. It's quite possible something like that has
> snuck in again.

Yeah, I can definitely believe that!
Created attachment 657325: Script for parsing xpcshell log and getting test times

Here's the script I used last time for parsing the xpcshell log and generating a csv of test times. I'd be surprised if it didn't still work.
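The attachment itself isn't shown here, but a minimal sketch of the same idea follows. The log-line format it matches is an assumption for illustration, not the actual tinderbox format, and the output matches the `name,milliseconds` CSV shape seen in the times quoted below.

```python
# Sketch of a parser that extracts per-test run times from an xpcshell log
# into a CSV. NOTE: the log-line format below is a hypothetical example
# ("TEST-PASS | dir/test_foo.js | took 30121ms"); real harness output differs.
import csv
import re

LINE_RE = re.compile(
    r"^TEST-(?:PASS|UNEXPECTED-FAIL) \| (?P<name>\S+) \| took (?P<ms>\d+)ms"
)

def parse_times(lines):
    """Yield (test_name, milliseconds) pairs from matching log lines."""
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            yield m.group("name"), int(m.group("ms"))

def write_csv(log_lines, out):
    """Write test times to `out` as CSV, slowest tests first."""
    writer = csv.writer(out)
    for name, ms in sorted(parse_times(log_lines), key=lambda t: -t[1]):
        writer.writerow([name, ms])

# Usage (hypothetical): write_csv(open("xpcshell.log"), open("times.csv", "w"))
```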
If you consider 7 minutes to be a normal time for xpcshell tests, then we're slow on all platforms, not only Windows.
Taking http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32/1346421170/mozilla-inbound_win7_test-xpcshell-bm23-tests1-windows-build1639.txt.gz, there are 1594 tests.
704 take more than 1s
460 more than 2s
336 more than 3s
220 more than 5s
96 more than 10s
6 more than 30s

These 6 are (times in milliseconds):
toolkit/components/passwordmgr/test/unit/test_storage_mozStorage_3.js,30121.000m
unit/test_partial.js,30850.000m
unit/test_errorhandler.js,37265.000m
toolkit/mozapps/extensions/test/xpcshell-unpack/test_blocklistchange.js,48786.000m
toolkit/mozapps/extensions/test/xpcshell/test_blocklistchange.js,49542.000m
dom/indexedDB/test/unit/test_overlapping_transactions.js,59310.000m
The log I just looked at took 69 minutes overall, 65 in the test-running buildstep, but there's a little copying of files in that step, so rather than 7 minutes I'd be perfectly happy with 11 or even 12. (12 is, probably not coincidentally, how long the 10.8 run I just looked at took.)
cumulative duration of all tests that run in less than:
1s -> 3:35
2s -> 9:35
3s -> 14:42
5s -> 21:57
10s -> 36:24
30s -> 60:43

Looks like it's a global problem
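The count and cumulative-time breakdowns above are straightforward to reproduce from a list of per-test times such as the attachment's CSV output. A minimal sketch, with made-up sample data:

```python
def count_over(times_ms, threshold_ms):
    """Number of tests whose run time exceeds threshold_ms."""
    return sum(1 for t in times_ms if t > threshold_ms)

def cumulative_under(times_ms, threshold_ms):
    """Total time (ms) spent in tests that each finished under threshold_ms."""
    return sum(t for t in times_ms if t < threshold_ms)

def fmt(ms):
    """Render milliseconds as M:SS, matching the figures quoted above."""
    total_s = int(ms) // 1000
    return f"{total_s // 60}:{total_s % 60:02d}"

# Hypothetical sample run times in milliseconds:
times = [500, 800, 1500, 2500, 4000, 12000, 35000]
for s in (1, 2, 3, 5, 10, 30):
    print(f"{count_over(times, s * 1000)} more than {s}s; "
          f"tests under {s}s -> {fmt(cumulative_under(times, s * 1000))}")
```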
(In reply to Mike Hommey [:glandium] from comment #8)
> cumulative duration of all tests that run in less than:
> 1s -> 3:35
> 2s -> 9:35
> 3s -> 14:42
> 5s -> 21:57
> 10s -> 36:24
> 30s -> 60:43
> 
> Looks like it's a global problem

Last year I recall finding some issues with disk I/O and some longer-running tests. See: https://bugzilla.mozilla.org/show_bug.cgi?id=675363

Unfortunately we didn't find any easy solutions to that problem. :(
(In reply to Mike Hommey [:glandium] from comment #8)
> cumulative duration of all tests that run in less than:
> 1s -> 3:35
> 2s -> 9:35
> 3s -> 14:42
> 5s -> 21:57
> 10s -> 36:24
> 30s -> 60:43
> 
> Looks like it's a global problem

FWIW, mochitests show this same sort of distribution too.  A couple of tests are responsible for a disproportionate amount of the testing time.
On Linux:
(http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1346429177/mozilla-inbound_fedora_test-xpcshell-bm24-tests1-linux-build1979.txt.gz)
1535 tests take 33:07
454 tests taking > 1s cumulate to 28:14
239 tests taking > 2s cumulate to 23:15
173 tests taking > 3s cumulate to 20:35
93 tests taking > 5s cumulate to 15:25
28 tests taking > 10s cumulate to 8:06
3 tests taking > 30s cumulate to 1:38
On OSX:
(http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64/1346425731/mozilla-inbound_snowleopard_test-xpcshell-bm11-tests1-macosx-build1889.txt.gz)
1492 tests take 13:06
126 tests taking > 1s cumulate to 6:50
66 tests taking > 2s cumulate to 5:26
43 tests taking > 3s cumulate to 4:31
31 tests taking > 5s cumulate to 3:44
4 tests taking > 10s cumulate to 0:56
Do these times match times when the tests are run locally?
What are the next steps here? 

Seems RelEng hasn't been very involved to this point. Should this bug live under Testing :: XPCShell Harness instead?
Component: Release Engineering → Release Engineering: Automation (General)
OS: Mac OS X → Windows 7
QA Contact: catlee
Whiteboard: [buildfaster:p2][testing]
I don't know if the discussion is on this bug, but we're pretty sure the root cause is just slow I/O.
(In reply to Chris AtLee [:catlee] from comment #13)
> Do these times match times when the tests are run locally?
Question still unanswered, afaict?

(In reply to Chris Cooper [:coop] from comment #14)
> What are the next steps here? 
> 
> Seems releng hasn't been very involved to this point. Should this bug live
> under Testing:XPCShell Harness instead?
It's been ~2 months since the last comment and ~4 months since this bug was filed. Based on comments 2 and 3, I'm kicking this over to Testing :: XPCShell Harness for investigation by someone who understands the test suite and can check whether anything inappropriate was added to it, causing the regression.

As usual, if it would help for RelEng to grant you access to a loaner machine from production, so you can do side-by-side debugging with local runs, please file a dependent bug in mozilla.org:Release Engineering.


(In reply to Ted Mielczarek [:ted.mielczarek] from comment #15)
> I don't know if the discussion is on this bug, but we're pretty sure the
> root cause is just slow I/O.

I don't see any discussion of slow hardware I/O in the earlier comments, so if you could add that info to this bug, it would be helpful.
Component: Release Engineering: Automation (General) → XPCShell Harness
Product: mozilla.org → Testing
QA Contact: catlee
Version: other → unspecified
Sorry, I made that comment and failed to actually look up where that discussion had taken place. Most of it is in bug 675363, which was spun off from bug 617503.