Closed Bug 1428790 Opened 3 years ago Closed 3 years ago

Intermittent command timed out: 7200 seconds elapsed running ['/tools/buildbot/bin/python', 'scripts/scripts/talos_script.py', '--suite', 'h2-e10s', '--add-option', '--webServer,localhost', '--branch-name', 'Mozilla-Inbound-Non-PGO', '--cfg'

Categories

(Testing :: Talos, defect, P5)

Version 3
defect

Tracking

(firefox58 fixed, firefox59 fixed)

RESOLVED FIXED
mozilla59
Tracking Status
firefox58 --- fixed
firefox59 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: rwood)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell disabled])

Attachments

(1 file)

Since this bug was opened (in the last 3 days) there are 77 failures.
They occur mostly on Linux x64 and the rest of them on OS X 10.10, Windows 7, Windows10 64. The affected builds type are debug, opt, pgo.
Recent log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=154685960&lineNumber=8278
and a relevant part of it:

8278
command timed out: 7200 seconds elapsed running ['/tools/buildbot/bin/python', 'scripts/scripts/talos_script.py', '--suite', 'h2-e10s', '--add-option', '--webServer,localhost', '--branch-name', 'Mozilla-Inbound-Non-PGO', '--cfg', 'talos/linux_config.py', '--download-symbols', 'ondemand', '--use-talos-json', '--blob-upload-branch', 'Mozilla-Inbound-Non-PGO'], attempting to kill
8279
process killed by signal 9
8280
program finished with exit code -1
8281
elapsedTime=7200.009734
8282
========= master_lag: 0.13 =========
:rwood, can you please have a look at this?
Flags: needinfo?(rwood)
Whiteboard: [stockwell needswork]
The log seems to indicate that the h2 talos spends a lot of time copying profiles. Note the timestamps:

12:54:37     INFO -  Cloning profile located at /home/cltbld/.mozilla/profiles/simple
12:54:37     INFO -  => minidumps
12:54:37     INFO -  => bookmarkbackups
12:54:37     INFO -  => safebrowsing
12:54:37     INFO -  => OfflineCache
12:54:37     INFO -  => sessionstore-backups
12:54:37     INFO -  => crashes
12:54:37     INFO -  => crashes/events
12:54:37     INFO -  => browser-extension-data
12:54:37     INFO -  => browser-extension-data/screenshots@mozilla.org
12:54:37     INFO -  => datareporting
12:54:37     INFO -  => datareporting/archived
12:54:37     INFO -  => datareporting/archived/2017-11
12:54:38     INFO -  => datareporting/archived/2017-10
12:54:38     INFO -  => datareporting/archived/2018-01
12:54:38     INFO -  => datareporting/archived/2017-09
12:54:38     INFO -  => datareporting/archived/2017-12
12:54:38     INFO -  => startupCache
12:54:38     INFO -  => gmp
12:54:38     INFO -  => gmp/Linux_x86_64-gcc3
12:54:38     INFO -  => storage
12:54:38     INFO -  => storage/permanent
12:54:38     INFO -  => storage/permanent/chrome
12:54:38     INFO -  => storage/permanent/chrome/idb
12:54:38     INFO -  => storage/permanent/chrome/idb/3561288849sdhlie.files
12:54:38     INFO -  => storage/permanent/chrome/idb/2918063365piupsah.files
12:54:38     INFO -  => storage/temporary
12:54:38     INFO -  => storage/default
12:54:38     INFO -  => storage/default/https+++ir.ebaystatic.com
12:54:38     INFO -  => storage/default/https+++ir.ebaystatic.com/idb
12:54:38     INFO -  => storage/default/https+++ir.ebaystatic.com/idb/12183338011.files
12:54:38     INFO -  => storage/default/https+++tpc.googlesyndication.com
12:54:38     INFO -  => storage/default/https+++tpc.googlesyndication.com/idb
12:54:38     INFO -  => storage/default/https+++tpc.googlesyndication.com/idb/12183338011.files
12:54:38     INFO -  => storage/default/https+++ad.doubleclick.net
12:54:38     INFO -  => storage/default/https+++ad.doubleclick.net/idb
12:54:38     INFO -  => storage/default/https+++ad.doubleclick.net/idb/12183338011.files
12:54:38     INFO -  => storage/default/https+++www.youtube.com
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache/morgue
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/54
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/56
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/35
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/84
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/194
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/22
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/118
12:54:38     INFO -  => storage/default/https+++www.youtube.com/idb
12:54:38     INFO -  => storage/default/https+++www.youtube.com/idb/3211250388sbwdpsunsohintoatciif.files
12:54:38     INFO -  => storage/default/https+++images-na.ssl-images-amazon.com
12:54:38     INFO -  => storage/default/https+++images-na.ssl-images-amazon.com/idb
12:54:38     INFO -  => storage/default/https+++images-na.ssl-images-amazon.com/idb/12183338011.files
12:54:38     INFO -  => gmp-gmpopenh264
12:54:38     INFO -  => gmp-gmpopenh264/1.6
12:54:38     INFO -  => gmp-gmpopenh264/1.7.1
12:54:38     INFO -  => saved-telemetry-pings
12:54:39     INFO -  => cache2
12:54:39     INFO -  => cache2/entries
13:20:03     INFO -  => cache2/doomed
13:20:04     INFO -  => thumbnails


It looks like copying cache2/entries takes 20+ minutes, and this happens 4 times during a single run of h2. All the h2 seem to take close to 2 hours and will randomly go over 2 hours, resulting in this failure. This "simple" profile doesn't seem very simple, and might have gotten polluted somewhere along the way.
yeah, windows is slow by far- possibly we need to determine how to only unpack the profile once.  :rwood, could you look into this to see if we can realistically only unpack/copy the profile once?
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #4)
> yeah, windows is slow by far- possibly we need to determine how to only
> unpack the profile once.  :rwood, could you look into this to see if we can
> realistically only unpack/copy the profile once?

Sure I'll have a look, thx
Flags: needinfo?(rwood)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #4)
> yeah, windows is slow by far- possibly we need to determine how to only
> unpack the profile once.  :rwood, could you look into this to see if we can
> realistically only unpack/copy the profile once?

I spent some time looking into this. Problems with this approach (copying the profile once):

- we currently blow away the temporary folder used by talos which includes the profile; if we want to reuse the profile we'll need to keep the temporary folder around
- when running tp6 we have to install a mitmproxy CA certificate into the profile and set up a Firefox proxy; right now we are blowing away the profile so we don't have to remove those. However if we want to reuse the same profile we'll need to write code to remove the CA certificate and also remove the proxy
- we set browser prefs for the tppageloader options; if we aren't blowing away the profile then we'll have to make sure we remove/reset any of those browser prefs after the test run
- I'm a bit concerned about running tests using the same profile and the impact that may have on tests further down the chain

IMO I don't think just copying the profile once is the way to go here. Problem is I don't have another solution in mind ATM.
Can we use a slimmer profile then? It seems like most of the time is spent copying cache entries in the profile, if we can reduce the cache size in the canned profile it will likely help quite a bit.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #10)
> Can we use a slimmer profile then? It seems like most of the time is spent
> copying cache entries in the profile, if we can reduce the cache size in the
> canned profile it will likely help quite a bit.

The profile is heavy/large on purpose to test using a heavy user profile, so that would default the purpose really IMO. We may need to just disable this on Windows as it's fine on the other platforms.

:tarek, what do you think? I think we need to disable this on Windows, or as Kartikaya suggests use a reduced heavy profile (however is there a point to the test then?). Thanks!
Flags: needinfo?(tarek)
This is already disabled on Windows. It's failing on linux64 mostly, now.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #12)
> This is already disabled on Windows. It's failing on linux64 mostly, now.

Ah right yes, I saw the 'h2' symbol on inbound Win 10 - turns out mochitest headless also uses the same 'h2' symbol
Talos h1 (heavy 1) is still running on Windows but not h2
Yeah I think we could skip windows altogether since copying around files is notoriously slow there
Flags: needinfo?(tarek)
we need to skip linux also
Talos h1 has the same issue on Win10 (Bug 1428794). Alright I'm going to disable h1 and h2 on Windows and Linux, at least until a fix can be found...
Duplicate of this bug: 1428794
Comment on attachment 8943313 [details]
Bug 1428790 - Disable talos heavy user profile tests on windows and linux;

https://reviewboard.mozilla.org/r/213624/#review219342

ok!
Attachment #8943313 - Flags: review?(jmaher) → review+
Comment on attachment 8943313 [details]
Bug 1428790 - Disable talos heavy user profile tests on windows and linux;

https://reviewboard.mozilla.org/r/213624/#review219348

::: taskcluster/ci/test/talos.yml:216
(Diff revision 1)
>      try-name: h1
>      treeherder-symbol: tc-T(h1)
>      max-run-time: 5400
>      run-on-projects:
>          by-test-platform:
>              .*-qr/.*: ['mozilla-central', 'try']

You'll need to drop this -qr line as well. Same in h2. Otherwise platforms like linux64-qr will match both the .*-qr/.* line and linux.* and the python code will complain about it
Comment on attachment 8943313 [details]
Bug 1428790 - Disable talos heavy user profile tests on windows and linux;

https://reviewboard.mozilla.org/r/213624/#review219348

Just discovered that on my try push, thanks!
Pushed by rwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/89cb08844b17
Disable talos heavy user profile tests on windows and linux; r=jmaher
https://hg.mozilla.org/mozilla-central/rev/89cb08844b17
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Assignee: nobody → rwood
https://hg.mozilla.org/releases/mozilla-release/rev/a3c6c687eba1
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]
Duplicate of this bug: 1429310
Blocks: 1432430
You need to log in before you can comment on or make changes to this bug.