Intermittent command timed out: 7200 seconds elapsed running ['/tools/buildbot/bin/python', 'scripts/scripts/talos_script.py', '--suite', 'h2-e10s', '--add-option', '--webServer,localhost', '--branch-name', 'Mozilla-Inbound-Non-PGO', '--cfg'

RESOLVED FIXED in Firefox 58

Status

defect
P5
normal
RESOLVED FIXED
2 years ago
Last year

People

(Reporter: intermittent-bug-filer, Assigned: rwood)

Tracking

({intermittent-failure})

Version 3
mozilla59
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox58 fixed, firefox59 fixed)

Details

(Whiteboard: [stockwell disabled])

Attachments

(1 attachment)

Comment hidden (Intermittent Failures Robot)
Since this bug was opened (in the last 3 days) there are 77 failures.
They occur mostly on Linux x64 and the rest of them on OS X 10.10, Windows 7, Windows10 64. The affected builds type are debug, opt, pgo.
Recent log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=154685960&lineNumber=8278
and a relevant part of it:

8278
command timed out: 7200 seconds elapsed running ['/tools/buildbot/bin/python', 'scripts/scripts/talos_script.py', '--suite', 'h2-e10s', '--add-option', '--webServer,localhost', '--branch-name', 'Mozilla-Inbound-Non-PGO', '--cfg', 'talos/linux_config.py', '--download-symbols', 'ondemand', '--use-talos-json', '--blob-upload-branch', 'Mozilla-Inbound-Non-PGO'], attempting to kill
8279
process killed by signal 9
8280
program finished with exit code -1
8281
elapsedTime=7200.009734
8282
========= master_lag: 0.13 =========
:rwood, can you please have a look at this?
Flags: needinfo?(rwood)
Whiteboard: [stockwell needswork]
The log seems to indicate that the h2 talos spends a lot of time copying profiles. Note the timestamps:

12:54:37     INFO -  Cloning profile located at /home/cltbld/.mozilla/profiles/simple
12:54:37     INFO -  => minidumps
12:54:37     INFO -  => bookmarkbackups
12:54:37     INFO -  => safebrowsing
12:54:37     INFO -  => OfflineCache
12:54:37     INFO -  => sessionstore-backups
12:54:37     INFO -  => crashes
12:54:37     INFO -  => crashes/events
12:54:37     INFO -  => browser-extension-data
12:54:37     INFO -  => browser-extension-data/screenshots@mozilla.org
12:54:37     INFO -  => datareporting
12:54:37     INFO -  => datareporting/archived
12:54:37     INFO -  => datareporting/archived/2017-11
12:54:38     INFO -  => datareporting/archived/2017-10
12:54:38     INFO -  => datareporting/archived/2018-01
12:54:38     INFO -  => datareporting/archived/2017-09
12:54:38     INFO -  => datareporting/archived/2017-12
12:54:38     INFO -  => startupCache
12:54:38     INFO -  => gmp
12:54:38     INFO -  => gmp/Linux_x86_64-gcc3
12:54:38     INFO -  => storage
12:54:38     INFO -  => storage/permanent
12:54:38     INFO -  => storage/permanent/chrome
12:54:38     INFO -  => storage/permanent/chrome/idb
12:54:38     INFO -  => storage/permanent/chrome/idb/3561288849sdhlie.files
12:54:38     INFO -  => storage/permanent/chrome/idb/2918063365piupsah.files
12:54:38     INFO -  => storage/temporary
12:54:38     INFO -  => storage/default
12:54:38     INFO -  => storage/default/https+++ir.ebaystatic.com
12:54:38     INFO -  => storage/default/https+++ir.ebaystatic.com/idb
12:54:38     INFO -  => storage/default/https+++ir.ebaystatic.com/idb/12183338011.files
12:54:38     INFO -  => storage/default/https+++tpc.googlesyndication.com
12:54:38     INFO -  => storage/default/https+++tpc.googlesyndication.com/idb
12:54:38     INFO -  => storage/default/https+++tpc.googlesyndication.com/idb/12183338011.files
12:54:38     INFO -  => storage/default/https+++ad.doubleclick.net
12:54:38     INFO -  => storage/default/https+++ad.doubleclick.net/idb
12:54:38     INFO -  => storage/default/https+++ad.doubleclick.net/idb/12183338011.files
12:54:38     INFO -  => storage/default/https+++www.youtube.com
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache/morgue
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/54
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/56
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/35
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/84
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/194
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/22
12:54:38     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/118
12:54:38     INFO -  => storage/default/https+++www.youtube.com/idb
12:54:38     INFO -  => storage/default/https+++www.youtube.com/idb/3211250388sbwdpsunsohintoatciif.files
12:54:38     INFO -  => storage/default/https+++images-na.ssl-images-amazon.com
12:54:38     INFO -  => storage/default/https+++images-na.ssl-images-amazon.com/idb
12:54:38     INFO -  => storage/default/https+++images-na.ssl-images-amazon.com/idb/12183338011.files
12:54:38     INFO -  => gmp-gmpopenh264
12:54:38     INFO -  => gmp-gmpopenh264/1.6
12:54:38     INFO -  => gmp-gmpopenh264/1.7.1
12:54:38     INFO -  => saved-telemetry-pings
12:54:39     INFO -  => cache2
12:54:39     INFO -  => cache2/entries
13:20:03     INFO -  => cache2/doomed
13:20:04     INFO -  => thumbnails


It looks like copying cache2/entries takes 20+ minutes, and this happens 4 times during a single run of h2. All the h2 seem to take close to 2 hours and will randomly go over 2 hours, resulting in this failure. This "simple" profile doesn't seem very simple, and might have gotten polluted somewhere along the way.
yeah, windows is slow by far- possibly we need to determine how to only unpack the profile once.  :rwood, could you look into this to see if we can realistically only unpack/copy the profile once?
Assignee

Comment 5

2 years ago
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #4)
> yeah, windows is slow by far- possibly we need to determine how to only
> unpack the profile once.  :rwood, could you look into this to see if we can
> realistically only unpack/copy the profile once?

Sure I'll have a look, thx
Flags: needinfo?(rwood)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Assignee

Comment 9

2 years ago
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #4)
> yeah, windows is slow by far- possibly we need to determine how to only
> unpack the profile once.  :rwood, could you look into this to see if we can
> realistically only unpack/copy the profile once?

I spent some time looking into this. Problems with this approach (copying the profile once):

- we currently blow away the temporary folder used by talos which includes the profile; if we want to reuse the profile we'll need to keep the temporary folder around
- when running tp6 we have to install a mitmproxy CA certificate into the profile and set up a Firefox proxy; right now we are blowing away the profile so we don't have to remove those. However if we want to reuse the same profile we'll need to write code to remove the CA certificate and also remove the proxy
- we set browser prefs for the tppageloader options; if we aren't blowing away the profile then we'll have to make sure we remove/reset any of those browser prefs after the test run
- I'm a bit concerned about running tests using the same profile and the impact that may have on tests further down the chain

IMO I don't think just copying the profile once is the way to go here. Problem is I don't have another solution in mind ATM.
Can we use a slimmer profile then? It seems like most of the time is spent copying cache entries in the profile, if we can reduce the cache size in the canned profile it will likely help quite a bit.
Assignee

Comment 11

2 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #10)
> Can we use a slimmer profile then? It seems like most of the time is spent
> copying cache entries in the profile, if we can reduce the cache size in the
> canned profile it will likely help quite a bit.

The profile is heavy/large on purpose to test using a heavy user profile, so that would default the purpose really IMO. We may need to just disable this on Windows as it's fine on the other platforms.

:tarek, what do you think? I think we need to disable this on Windows, or as Kartikaya suggests use a reduced heavy profile (however is there a point to the test then?). Thanks!
Flags: needinfo?(tarek)
This is already disabled on Windows. It's failing on linux64 mostly, now.
Assignee

Comment 13

2 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #12)
> This is already disabled on Windows. It's failing on linux64 mostly, now.

Ah right yes, I saw the 'h2' symbol on inbound Win 10 - turns out mochitest headless also uses the same 'h2' symbol
Assignee

Comment 14

2 years ago
Talos h1 (heavy 1) is still running on Windows but not h2
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Yeah I think we could skip windows altogether since copying around files is notoriously slow there
Flags: needinfo?(tarek)
we need to skip linux also
Assignee

Comment 19

Last year
Talos h1 has the same issue on Win10 (Bug 1428794). Alright I'm going to disable h1 and h2 on Windows and Linux, at least until a fix can be found...
Assignee

Updated

Last year
Duplicate of this bug: 1428794
Comment hidden (mozreview-request)

Comment 22

Last year
mozreview-review
Comment on attachment 8943313 [details]
Bug 1428790 - Disable talos heavy user profile tests on windows and linux;

https://reviewboard.mozilla.org/r/213624/#review219342

ok!
Attachment #8943313 - Flags: review?(jmaher) → review+
Comment on attachment 8943313 [details]
Bug 1428790 - Disable talos heavy user profile tests on windows and linux;

https://reviewboard.mozilla.org/r/213624/#review219348

::: taskcluster/ci/test/talos.yml:216
(Diff revision 1)
>      try-name: h1
>      treeherder-symbol: tc-T(h1)
>      max-run-time: 5400
>      run-on-projects:
>          by-test-platform:
>              .*-qr/.*: ['mozilla-central', 'try']

You'll need to drop this -qr line as well. Same in h2. Otherwise platforms like linux64-qr will match both the .*-qr/.* line and linux.* and the python code will complain about it
Assignee

Comment 24

Last year
mozreview-review-reply
Comment on attachment 8943313 [details]
Bug 1428790 - Disable talos heavy user profile tests on windows and linux;

https://reviewboard.mozilla.org/r/213624/#review219348

Just discovered that on my try push, thanks!
Comment hidden (mozreview-request)

Comment 26

Last year
Pushed by rwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/89cb08844b17
Disable talos heavy user profile tests on windows and linux; r=jmaher
https://hg.mozilla.org/mozilla-central/rev/89cb08844b17
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Assignee: nobody → rwood
Comment hidden (Intermittent Failures Robot)

Comment 29

Last year
bugherderuplift
https://hg.mozilla.org/releases/mozilla-release/rev/a3c6c687eba1
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]
Assignee

Updated

Last year
Duplicate of this bug: 1429310
Comment hidden (Intermittent Failures Robot)
Blocks: 1432430
Comment hidden (Intermittent Failures Robot)
You need to log in before you can comment on or make changes to this bug.