Closed Bug 1415858 Opened 7 years ago Closed 7 years ago

Intermittent command timed out: 7200 seconds elapsed, attempting to kill in tp6_facebook_heavy,tp6_amazon_heavy after start

Categories

(Testing :: Talos, defect, P5)

Version 3
defect

Tracking

(firefox58 fixed, firefox59 fixed)

RESOLVED FIXED
mozilla59
Tracking Status
firefox58 --- fixed
firefox59 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: jmaher)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell disabled])

Attachments

(3 files)

Filed by: archaeopteryx [at] coole-files.de https://treeherder.mozilla.org/logviewer.html#?job_id=143198621&repo=autoland https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-win32/autoland_win7_ix_test-h2-e10s-bm111-tests1-windows-build397.txt.gz 19:30:14 INFO - TEST-INFO | 3492: exit 0 19:31:48 INFO - TEST-OK | tp6_amazon_heavy | took 1909961ms 19:31:48 INFO - TEST-START | tp6_facebook_heavy 19:31:48 INFO - Initialising browser for tp6_facebook_heavy test... 19:32:22 INFO - Local copy of 'simple' is fresh enough 19:32:22 INFO - 0 days old command timed out: 7200 seconds elapsed, attempting to kill program finished with exit code 1 elapsedTime=7208.629000 ========= master_lag: 1.10 ========= ========= Finished 'c:/mozilla-build/python27/python -u ...' warnings (results: 1, elapsed: 2 hrs, 9 secs) (at 2017-11-08 19:58:55.971795) =========
There are 37 failures in the last 7 days. The occur on Windows 7 and the affected build types are opt and pgo. A recent log example: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=144088285&lineNumber=1687 :rwood, Can you please take a look?
Flags: needinfo?(rwood)
Whiteboard: [stockwell needswork]
adding :tarek to this bug as he helped develop the heavy profile and integration to talos- he might have some ideas.
I think the tp6 heavy tests are just taking alot longer to run and we're running into the default talos ttest timeout of 7200 seconds: https://searchfox.org/mozilla-central/rev/fe75164c55321e011d9d13b6d05c1e00d0a640be/testing/talos/talos/ttest.py#168
Flags: needinfo?(rwood)
TEST-OK | tp6_amazon_heavy | took 1990367ms that is really long! we should split this up then, or figure out why with the heavy profile this takes so long- maybe we can analyze this in more detail over time. Kyle, is there a query we can do on the tp6_*_heavy test names to get a list of runtimes?
Flags: needinfo?(klahnakoski)
Oh but hang out, this is strange, 30 minutes elapsed time in between checking how old the profile is, and launching Firefox... yeah something wrong here: 04:28:55 INFO - TEST-START | tp6_google_heavy 04:28:55 INFO - Initialising browser for tp6_google_heavy test... 04:32:09 INFO - Local copy of 'simple' is fresh enough 04:32:09 INFO - 3 days old 05:02:09 INFO - Application command: C:\slave\test\build\application\firefox\firefox http://localhost:49901/getInfo.html -profile c:\users\cltbld~1.000\appdata\local\temp\tmpjby9kz\profile 05:02:10 INFO - TEST-INFO | started process 4044 (C:\slave\test\build\application\firefox\firefox http://localhost:49901/getInfo.html) 05:02:55 INFO - TEST-INFO | 4044: exit 0 05:02:55 INFO - Browser initialized.
odd that it is fresh enough but 3 days old- maybe there is a bug in that logic. 30 minutes is insane- I know it took me about 15 minutes to download and extract a profile locally, I blamed it on my "high" speed internet connection I have at home.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #6) > odd that it is fresh enough but 3 days old- maybe there is a bug in that > logic. 30 minutes is insane- I know it took me about 15 minutes to download > and extract a profile locally, I blamed it on my "high" speed internet > connection I have at home. It only updates if the profile is > 7 days old, so it isn't even downloading at all... don't know what's happening in that mysterious 30 minutes... https://searchfox.org/mozilla-central/rev/fe75164c55321e011d9d13b6d05c1e00d0a640be/testing/talos/talos/heavy.py#106
:tarek any ideas?
Flags: needinfo?(tarek)
Once the profile is downloaded, it's extracted and then for each run it's cloned in a temp location. The cloning happend every time since we want to run the test with a vanilla heavy profile, but 30mn sounds insane. I will take a deeper look when I start working tomorrow.
In any case I think we should add more LOG.info() calls in ffsetup after https://searchfox.org/mozilla-central/source/testing/talos/talos/ffsetup.py#101 So we get some feedback about the cloning step, etc
Assignee: nobody → tarek
Flags: needinfo?(tarek)
Comment on attachment 8928464 [details] Bug 1415858 - Adding logs for Talos cloning step https://reviewboard.mozilla.org/r/199718/#review204840 ::: testing/mozbase/mozprofile/tests/test_clone_cleanup.py:41 (Diff revision 1) > def test_restore_true(self): > + counter = [0] > + > + def _feedback(dir, content): > + counter[0] += 1 > + return [] I don't understand this method, can you describe it with a comment? ::: testing/talos/talos/ffsetup.py:109 (Diff revision 1) > + > + def _feedback(directory, content): > + sub = directory.split(profile_path)[-1].lstrip("/") > + if sub: > + LOG.info("=> %s" % sub) > + return [] sam here, I see _feedback and it is related to ignore, but always returns []
Attachment #8928464 - Flags: review?(jmaher) → review-
I could not reproduce the issue on my laptop. Once the current patch has landed we should have more info on where the tiome is spent. Maybe there's a disk issue on the slaves when the profile is copied in the tmp space
(In reply to Tarek Ziadé (:tarek) from comment #15) > I could not reproduce the issue on my laptop. Once the current patch has > landed we should have more info on where the tiome is spent. Maybe there's a > disk issue on the slaves when the profile is copied in the tmp space Thanks Tarek!
Comment on attachment 8928464 [details] Bug 1415858 - Adding logs for Talos cloning step https://reviewboard.mozilla.org/r/199718/#review204892 thanks, lets see if this helps surface the area where we are hanging.
Attachment #8928464 - Flags: review?(jmaher) → review+
Pushed by tziade@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ecf4310e41f6 Adding logs for Talos cloning step r=jmaher
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Alright so we have more info, see https://treeherder.mozilla.org/logviewer.html#?job_id=145055559&repo=autoland&lineNumber=2638 Ity takes 30 minutes to copy the cache2/entries subfolder, which is the heaviest folder in the profile $ du -ch ~/.mozilla/profiles/simple/cache2/entries/ 635M /Users/tarek/.mozilla/profiles/simple/cache2/entries/ 635M total $ ls ~/.mozilla/profiles/simple/cache2/entries/ | wc -l 38908 It works but it's slow. There's nothing we can do at the code level, Python relies on the system to copy the files and we can't speed it up from the code. I guess the next step is to investigate on the hardware/vm setup e.g. is the tmp folder on the same disk? can we use a RAM disk ? etc. Joel, do you know who we should ping for help?
Flags: needinfo?(jmaher)
maybe have cache2/entries compressed in a sub-tarball when it's copied and then extracted in the target foilder can speed up things, to avoid 40k operations. I will investigate.
According to my tests on darwin, directly extracting a the tarball reduces the time by 1.5 to 2 when working on the same disk. But it's very fast compared to the windows slave (from 10 seconds to 20 seconds) Can someone try my script on windows ? (adding it here) The change would simply mean keeping the downloaded tarball in ~/.mozilla/profile and extend .clone() so it can deal with tarballs. That would reduce the time to 15mn, which still sounds like amazingly slow.
Attached file comparing.py
Python script to compare the speed of extracting on the fly instead of file-to-file copy. It makes the assumption that you have a tarball of the cache2/entries directory, from the latest profile tarball
on first thought, possibly we don't need this on windows7, the reported numbers for win7 vs 10 follow the same pattern over the last 30 days. Both win7 and win10 run on the same physical hardware (IX machines): https://wiki.mozilla.org/Buildbot/Talos/Misc#Hardware_Profile_of_machines_used_in_automation there could be configuration differences. Typically win10 takes 30-40 minutes to complete, and win7 takes 100+ minutes to complete- there is obviously a problem with fileIO on win7 (and we see this with unittests as well). I don't think there is anything to do, these are a single disk instance. My theory is we should just not run heavy tests on win7, I agree that unpacking a profile is faster than copying, but at some point, we are spending 1+ hour/push managing this for little return- when we get faster data on all our other platforms. :rwood, what are your thoughts?
Flags: needinfo?(rwood)
Flags: needinfo?(klahnakoski)
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #25) > on first thought, possibly we don't need this on windows7, the reported > numbers for win7 vs 10 follow the same pattern over the last 30 days. Both > win7 and win10 run on the same physical hardware (IX machines): > https://wiki.mozilla.org/Buildbot/Talos/ > Misc#Hardware_Profile_of_machines_used_in_automation > > there could be configuration differences. Typically win10 takes 30-40 > minutes to complete, and win7 takes 100+ minutes to complete- there is > obviously a problem with fileIO on win7 (and we see this with unittests as > well). I don't think there is anything to do, these are a single disk > instance. > > My theory is we should just not run heavy tests on win7, I agree that > unpacking a profile is faster than copying, but at some point, we are > spending 1+ hour/push managing this for little return- when we get faster > data on all our other platforms. > > :rwood, what are your thoughts? Wow, yeah in that case I agree, it's not really worth over 2x the resource time to run the heavy tests on Win 7.
Flags: needinfo?(rwood)
Maybe we could turn off the heavy tests on Win 7 except leave them available on Win 7 for try?
Attachment #8929147 - Flags: review?(rwood) → review+
Pushed by jmaher@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/c9bfe1a78bbb disable h2 on win7 except for try due to slow fileIO. r=rwood
Blocks: 1207900
Assignee: tarek → nobody
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Assignee: nobody → jmaher
Summary: Intermittent command timed out: 7200 seconds elapsed, attempting to kill in tp6_facebook_heavy after start → Intermittent command timed out: 7200 seconds elapsed, attempting to kill in tp6_facebook_heavy,tp6_amazon_heavy after start
Whiteboard: [stockwell needswork] → [stockwell disabled]
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=154354580&repo=autoland&lineNumber=11117 05:39:08 INFO - TEST-START | tp6_facebook_heavy 05:39:08 INFO - Initialising browser for tp6_facebook_heavy test... 05:39:08 INFO - /builds/slave/test/build/venv/lib/python2.7/site-packages/urllib3/util/ssl_.py:137: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings 05:39:08 INFO - InsecurePlatformWarning 05:39:09 INFO - /builds/slave/test/build/venv/lib/python2.7/site-packages/urllib3/util/ssl_.py:137: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings 05:39:09 INFO - InsecurePlatformWarning 05:39:09 INFO - /builds/slave/test/build/venv/lib/python2.7/site-packages/urllib3/util/ssl_.py:137: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings 05:39:09 INFO - InsecurePlatformWarning 05:39:15 INFO - Local copy of 'simple' is fresh enough 05:39:15 INFO - 1 days old 05:39:15 INFO - Cloning profile located at /home/cltbld/.mozilla/profiles/simple 05:39:15 INFO - => safebrowsing 05:39:15 INFO - => OfflineCache 05:39:15 INFO - => minidumps 05:39:15 INFO - => gmp-gmpopenh264 05:39:15 INFO - => gmp-gmpopenh264/1.7.1 05:39:15 INFO - => gmp-gmpopenh264/1.6 05:39:15 INFO - => startupCache 05:39:15 INFO - => thumbnails 05:39:15 INFO - => bookmarkbackups 05:39:15 INFO - => sessionstore-backups 05:39:15 INFO - => storage 05:39:15 INFO - => storage/default 05:39:15 INFO - => storage/default/https+++images-na.ssl-images-amazon.com 05:39:15 INFO - => storage/default/https+++images-na.ssl-images-amazon.com/idb 05:39:15 INFO - => storage/default/https+++images-na.ssl-images-amazon.com/idb/12183338011.files 05:39:15 INFO - => storage/default/https+++tpc.googlesyndication.com 05:39:15 INFO - => storage/default/https+++tpc.googlesyndication.com/idb 05:39:15 INFO - => storage/default/https+++tpc.googlesyndication.com/idb/12183338011.files 05:39:15 INFO - => storage/default/https+++ir.ebaystatic.com 05:39:15 INFO - => storage/default/https+++ir.ebaystatic.com/idb 05:39:15 INFO - => storage/default/https+++ir.ebaystatic.com/idb/12183338011.files 05:39:15 INFO - => storage/default/https+++ad.doubleclick.net 05:39:15 INFO - => storage/default/https+++ad.doubleclick.net/idb 05:39:15 INFO - => storage/default/https+++ad.doubleclick.net/idb/12183338011.files 05:39:15 INFO - => storage/default/https+++www.youtube.com 05:39:15 INFO - => storage/default/https+++www.youtube.com/idb 05:39:15 INFO - => storage/default/https+++www.youtube.com/idb/3211250388sbwdpsunsohintoatciif.files 05:39:15 INFO - => storage/default/https+++www.youtube.com/cache 05:39:15 INFO - => storage/default/https+++www.youtube.com/cache/morgue 05:39:15 INFO - => storage/default/https+++www.youtube.com/cache/morgue/118 05:39:15 INFO - => storage/default/https+++www.youtube.com/cache/morgue/194 05:39:15 INFO - => storage/default/https+++www.youtube.com/cache/morgue/22 05:39:15 INFO - => storage/default/https+++www.youtube.com/cache/morgue/84 05:39:15 INFO - => storage/default/https+++www.youtube.com/cache/morgue/56 05:39:15 INFO - => storage/default/https+++www.youtube.com/cache/morgue/54 05:39:15 INFO - => storage/temporary 05:39:15 INFO - => storage/permanent 05:39:15 INFO - => storage/permanent/chrome 05:39:15 INFO - => storage/permanent/chrome/idb 05:39:15 INFO - => storage/permanent/chrome/idb/3561288849sdhlie.files 05:39:15 INFO - => storage/permanent/chrome/idb/2918063365piupsah.files 05:39:16 INFO - => crashes 05:39:16 INFO - => crashes/events 05:39:16 INFO - => gmp 05:39:16 INFO - => gmp/Linux_x86_64-gcc3 05:39:16 INFO - => browser-extension-data 05:39:16 INFO - => browser-extension-data/screenshots@mozilla.org 05:39:16 INFO - => saved-telemetry-pings 05:39:16 INFO - => datareporting 05:39:16 INFO - => datareporting/archived 05:39:16 INFO - => datareporting/archived/2017-10 05:39:16 INFO - => datareporting/archived/2017-12 05:39:17 INFO - => datareporting/archived/2018-01 05:39:17 INFO - => datareporting/archived/2017-09 05:39:17 INFO - => datareporting/archived/2017-11 05:39:17 INFO - => cache2 05:39:17 INFO - => cache2/doomed 05:39:17 INFO - => cache2/entries command timed out: 7200 seconds elapsed running ['/tools/buildbot/bin/python', 'scripts/scripts/talos_script.py', '--suite', 'h2-e10s', '--add-option', '--webServer,localhost', '--branch-name', 'Autoland-Non-PGO', '--cfg', 'talos/linux_config.py', '--download-symbols', 'ondemand', '--use-talos-json', '--blob-upload-branch', 'Autoland-Non-PGO'], attempting to kill process killed by signal 9 program finished with exit code -1 elapsedTime=7200.011426 ========= master_lag: 0.19 ========= ========= Finished '/tools/buildbot/bin/python scripts/scripts/talos_script.py ...' failed (results: 2, elapsed: 2 hrs, 0 secs) (at 2018-01-05 06:00:19.537689) =========
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
There have been 47 failures in the last 7 days. The most failures are occurring on the Linux x64 platform, but there also some occurring on linux64-qr and two on OS X 10.10. The most affected build type is opt,but there are also some failures occurring on pgo. A recent log file: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=156985504&lineNumber=10653 And a relevant part of it: 16:16:32 INFO - 127.0.0.1:35513: ProtocolException('Error in HTTP connection: TcpDisconnect("[(\'SSL routines\', \'ssl3_write_pending\', \'bad write retry\')]",)',) 16:16:32 INFO - 127.0.0.1:35513: HTTP/2 connection terminated by client: error code: 2, last stream id: 0, additional data: None 16:16:32 INFO - 127.0.0.1:35513: TcpDisconnect("[('SSL routines', 'ssl3_write_pending', 'bad write retry')]",) 16:16:32 INFO - 127.0.0.1:35997: CONNECT www.amazon.com:443 16:16:32 INFO - << Error in HTTP connection: TcpDisconnect("(32, 'EPIPE')",) 16:16:32 INFO - 127.0.0.1:35513: clientdisconnect 16:16:32 INFO - 127.0.0.1:35997: clientdisconnect 16:16:33 INFO - TEST-INFO | 20692: exit 0 16:16:45 INFO - TEST-OK | tp6_amazon_heavy | took 2250061ms 16:16:45 INFO - TEST-START | tp6_facebook_heavy 16:16:45 INFO - Initialising browser for tp6_facebook_heavy test... 16:16:45 INFO - /builds/slave/test/build/venv/lib/python2.7/site-packages/urllib3/util/ssl_.py:137: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings 16:16:45 INFO - InsecurePlatformWarning 16:16:45 INFO - /builds/slave/test/build/venv/lib/python2.7/site-packages/urllib3/util/ssl_.py:137: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings 16:16:45 INFO - InsecurePlatformWarning 16:16:46 INFO - /builds/slave/test/build/venv/lib/python2.7/site-packages/urllib3/util/ssl_.py:137: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings 16:16:46 INFO - InsecurePlatformWarning 16:16:48 INFO - Local copy of 'simple' is fresh enough 16:16:48 INFO - 6 days old 16:16:48 INFO - Cloning profile located at /home/cltbld/.mozilla/profiles/simple 16:16:49 INFO - => OfflineCache 16:16:49 INFO - => datareporting 16:16:49 INFO - => datareporting/archived 16:16:49 INFO - => datareporting/archived/2017-10 16:16:49 INFO - => datareporting/archived/2017-09 16:16:49 INFO - => datareporting/archived/2017-11 16:16:49 INFO - => datareporting/archived/2017-12 16:16:49 INFO - => datareporting/archived/2018-01 16:16:49 INFO - => gmp 16:16:49 INFO - => gmp/Linux_x86_64-gcc3 16:16:49 INFO - => crashes 16:16:49 INFO - => crashes/events 16:16:49 INFO - => bookmarkbackups 16:16:49 INFO - => cache2 16:16:50 INFO - => cache2/entries command timed out: 7200 seconds elapsed running ['/tools/buildbot/bin/python', 'scripts/scripts/talos_script.py', '--suite', 'h2-e10s', '--add-option', '--webServer,localhost', '--branch-name', 'Firefox', '--cfg', 'talos/linux_config.py', '--download-symbols', 'ondemand', '--use-talos-json', '--blob-upload-branch', 'Firefox'], attempting to kill process killed by signal 9 program finished with exit code -1 elapsedTime=7200.120554 ========= master_lag: 0.19 ========= ========= Finished '/tools/buildbot/bin/python scripts/scripts/talos_script.py ...' failed (results: 2, elapsed: 2 hrs, 0 secs) (at 2018-01-17 16:27:58.478040) ========= :rwood As you are the triage owner of this component, could you please take a look at this?
Flags: needinfo?(rwood)
Whiteboard: [stockwell disabled] → [stockwell needswork]
The talos heavy user profile tests have been disabled on Windows & Linux in Bug 1428790.
Flags: needinfo?(rwood)
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Whiteboard: [stockwell needswork] → [stockwell disabled]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: