Closed Bug 1428790 Opened 7 years ago Closed 7 years ago

Intermittent command timed out: 7200 seconds elapsed running ['/tools/buildbot/bin/python', 'scripts/scripts/talos_script.py', '--suite', 'h2-e10s', '--add-option', '--webServer,localhost', '--branch-name', 'Mozilla-Inbound-Non-PGO', '--cfg'

Categories

(Testing :: Talos, defect, P5)

Version 3
defect

Tracking

(firefox58 fixed, firefox59 fixed)

RESOLVED FIXED
mozilla59
Tracking Status
firefox58 --- fixed
firefox59 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: rwood)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell disabled])

Attachments

(1 file)

Since this bug was opened (in the last 3 days) there are 77 failures. They occur mostly on Linux x64 and the rest of them on OS X 10.10, Windows 7, Windows10 64. The affected builds type are debug, opt, pgo. Recent log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=154685960&lineNumber=8278 and a relevant part of it: 8278 command timed out: 7200 seconds elapsed running ['/tools/buildbot/bin/python', 'scripts/scripts/talos_script.py', '--suite', 'h2-e10s', '--add-option', '--webServer,localhost', '--branch-name', 'Mozilla-Inbound-Non-PGO', '--cfg', 'talos/linux_config.py', '--download-symbols', 'ondemand', '--use-talos-json', '--blob-upload-branch', 'Mozilla-Inbound-Non-PGO'], attempting to kill 8279 process killed by signal 9 8280 program finished with exit code -1 8281 elapsedTime=7200.009734 8282 ========= master_lag: 0.13 ========= :rwood, can you please have a look at this?
Flags: needinfo?(rwood)
Whiteboard: [stockwell needswork]
The log seems to indicate that the h2 talos spends a lot of time copying profiles. Note the timestamps: 12:54:37 INFO - Cloning profile located at /home/cltbld/.mozilla/profiles/simple 12:54:37 INFO - => minidumps 12:54:37 INFO - => bookmarkbackups 12:54:37 INFO - => safebrowsing 12:54:37 INFO - => OfflineCache 12:54:37 INFO - => sessionstore-backups 12:54:37 INFO - => crashes 12:54:37 INFO - => crashes/events 12:54:37 INFO - => browser-extension-data 12:54:37 INFO - => browser-extension-data/screenshots@mozilla.org 12:54:37 INFO - => datareporting 12:54:37 INFO - => datareporting/archived 12:54:37 INFO - => datareporting/archived/2017-11 12:54:38 INFO - => datareporting/archived/2017-10 12:54:38 INFO - => datareporting/archived/2018-01 12:54:38 INFO - => datareporting/archived/2017-09 12:54:38 INFO - => datareporting/archived/2017-12 12:54:38 INFO - => startupCache 12:54:38 INFO - => gmp 12:54:38 INFO - => gmp/Linux_x86_64-gcc3 12:54:38 INFO - => storage 12:54:38 INFO - => storage/permanent 12:54:38 INFO - => storage/permanent/chrome 12:54:38 INFO - => storage/permanent/chrome/idb 12:54:38 INFO - => storage/permanent/chrome/idb/3561288849sdhlie.files 12:54:38 INFO - => storage/permanent/chrome/idb/2918063365piupsah.files 12:54:38 INFO - => storage/temporary 12:54:38 INFO - => storage/default 12:54:38 INFO - => storage/default/https+++ir.ebaystatic.com 12:54:38 INFO - => storage/default/https+++ir.ebaystatic.com/idb 12:54:38 INFO - => storage/default/https+++ir.ebaystatic.com/idb/12183338011.files 12:54:38 INFO - => storage/default/https+++tpc.googlesyndication.com 12:54:38 INFO - => storage/default/https+++tpc.googlesyndication.com/idb 12:54:38 INFO - => storage/default/https+++tpc.googlesyndication.com/idb/12183338011.files 12:54:38 INFO - => storage/default/https+++ad.doubleclick.net 12:54:38 INFO - => storage/default/https+++ad.doubleclick.net/idb 12:54:38 INFO - => storage/default/https+++ad.doubleclick.net/idb/12183338011.files 12:54:38 INFO - => storage/default/https+++www.youtube.com 12:54:38 INFO - => storage/default/https+++www.youtube.com/cache 12:54:38 INFO - => storage/default/https+++www.youtube.com/cache/morgue 12:54:38 INFO - => storage/default/https+++www.youtube.com/cache/morgue/54 12:54:38 INFO - => storage/default/https+++www.youtube.com/cache/morgue/56 12:54:38 INFO - => storage/default/https+++www.youtube.com/cache/morgue/35 12:54:38 INFO - => storage/default/https+++www.youtube.com/cache/morgue/84 12:54:38 INFO - => storage/default/https+++www.youtube.com/cache/morgue/194 12:54:38 INFO - => storage/default/https+++www.youtube.com/cache/morgue/22 12:54:38 INFO - => storage/default/https+++www.youtube.com/cache/morgue/118 12:54:38 INFO - => storage/default/https+++www.youtube.com/idb 12:54:38 INFO - => storage/default/https+++www.youtube.com/idb/3211250388sbwdpsunsohintoatciif.files 12:54:38 INFO - => storage/default/https+++images-na.ssl-images-amazon.com 12:54:38 INFO - => storage/default/https+++images-na.ssl-images-amazon.com/idb 12:54:38 INFO - => storage/default/https+++images-na.ssl-images-amazon.com/idb/12183338011.files 12:54:38 INFO - => gmp-gmpopenh264 12:54:38 INFO - => gmp-gmpopenh264/1.6 12:54:38 INFO - => gmp-gmpopenh264/1.7.1 12:54:38 INFO - => saved-telemetry-pings 12:54:39 INFO - => cache2 12:54:39 INFO - => cache2/entries 13:20:03 INFO - => cache2/doomed 13:20:04 INFO - => thumbnails It looks like copying cache2/entries takes 20+ minutes, and this happens 4 times during a single run of h2. All the h2 seem to take close to 2 hours and will randomly go over 2 hours, resulting in this failure. This "simple" profile doesn't seem very simple, and might have gotten polluted somewhere along the way.
yeah, windows is slow by far- possibly we need to determine how to only unpack the profile once. :rwood, could you look into this to see if we can realistically only unpack/copy the profile once?
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #4) > yeah, windows is slow by far- possibly we need to determine how to only > unpack the profile once. :rwood, could you look into this to see if we can > realistically only unpack/copy the profile once? Sure I'll have a look, thx
Flags: needinfo?(rwood)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #4) > yeah, windows is slow by far- possibly we need to determine how to only > unpack the profile once. :rwood, could you look into this to see if we can > realistically only unpack/copy the profile once? I spent some time looking into this. Problems with this approach (copying the profile once): - we currently blow away the temporary folder used by talos which includes the profile; if we want to reuse the profile we'll need to keep the temporary folder around - when running tp6 we have to install a mitmproxy CA certificate into the profile and set up a Firefox proxy; right now we are blowing away the profile so we don't have to remove those. However if we want to reuse the same profile we'll need to write code to remove the CA certificate and also remove the proxy - we set browser prefs for the tppageloader options; if we aren't blowing away the profile then we'll have to make sure we remove/reset any of those browser prefs after the test run - I'm a bit concerned about running tests using the same profile and the impact that may have on tests further down the chain IMO I don't think just copying the profile once is the way to go here. Problem is I don't have another solution in mind ATM.
Can we use a slimmer profile then? It seems like most of the time is spent copying cache entries in the profile, if we can reduce the cache size in the canned profile it will likely help quite a bit.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #10) > Can we use a slimmer profile then? It seems like most of the time is spent > copying cache entries in the profile, if we can reduce the cache size in the > canned profile it will likely help quite a bit. The profile is heavy/large on purpose to test using a heavy user profile, so that would default the purpose really IMO. We may need to just disable this on Windows as it's fine on the other platforms. :tarek, what do you think? I think we need to disable this on Windows, or as Kartikaya suggests use a reduced heavy profile (however is there a point to the test then?). Thanks!
Flags: needinfo?(tarek)
This is already disabled on Windows. It's failing on linux64 mostly, now.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #12) > This is already disabled on Windows. It's failing on linux64 mostly, now. Ah right yes, I saw the 'h2' symbol on inbound Win 10 - turns out mochitest headless also uses the same 'h2' symbol
Talos h1 (heavy 1) is still running on Windows but not h2
Yeah I think we could skip windows altogether since copying around files is notoriously slow there
Flags: needinfo?(tarek)
we need to skip linux also
Talos h1 has the same issue on Win10 (Bug 1428794). Alright I'm going to disable h1 and h2 on Windows and Linux, at least until a fix can be found...
Comment on attachment 8943313 [details] Bug 1428790 - Disable talos heavy user profile tests on windows and linux; https://reviewboard.mozilla.org/r/213624/#review219342 ok!
Attachment #8943313 - Flags: review?(jmaher) → review+
Comment on attachment 8943313 [details] Bug 1428790 - Disable talos heavy user profile tests on windows and linux; https://reviewboard.mozilla.org/r/213624/#review219348 ::: taskcluster/ci/test/talos.yml:216 (Diff revision 1) > try-name: h1 > treeherder-symbol: tc-T(h1) > max-run-time: 5400 > run-on-projects: > by-test-platform: > .*-qr/.*: ['mozilla-central', 'try'] You'll need to drop this -qr line as well. Same in h2. Otherwise platforms like linux64-qr will match both the .*-qr/.* line and linux.* and the python code will complain about it
Comment on attachment 8943313 [details] Bug 1428790 - Disable talos heavy user profile tests on windows and linux; https://reviewboard.mozilla.org/r/213624/#review219348 Just discovered that on my try push, thanks!
Pushed by rwood@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/89cb08844b17 Disable talos heavy user profile tests on windows and linux; r=jmaher
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Assignee: nobody → rwood
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]
Blocks: 1432430
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: