Closed Bug 1415858 Opened 6 years ago Closed 5 years ago

Intermittent command timed out: 7200 seconds elapsed, attempting to kill in tp6_facebook_heavy,tp6_amazon_heavy after start

Categories

(Testing :: Talos, defect, P5)

Version 3
defect

Tracking

(firefox58 fixed, firefox59 fixed)

RESOLVED FIXED
mozilla59
Tracking Status
firefox58 --- fixed
firefox59 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: jmaher)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell disabled])

Attachments

(3 files)

Filed by: archaeopteryx [at] coole-files.de

https://treeherder.mozilla.org/logviewer.html#?job_id=143198621&repo=autoland

https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-win32/autoland_win7_ix_test-h2-e10s-bm111-tests1-windows-build397.txt.gz

19:30:14     INFO -  TEST-INFO | 3492: exit 0
19:31:48     INFO -  TEST-OK | tp6_amazon_heavy | took 1909961ms
19:31:48     INFO -  TEST-START | tp6_facebook_heavy
19:31:48     INFO -  Initialising browser for tp6_facebook_heavy test...
19:32:22     INFO -  Local copy of 'simple' is fresh enough
19:32:22     INFO -  0 days old

command timed out: 7200 seconds elapsed, attempting to kill
program finished with exit code 1
elapsedTime=7208.629000
========= master_lag: 1.10 =========
========= Finished 'c:/mozilla-build/python27/python -u ...' warnings (results: 1, elapsed: 2 hrs, 9 secs) (at 2017-11-08 19:58:55.971795) =========
There are 37 failures in the last 7 days.
The occur on Windows 7 and the affected build types are opt and pgo.

A recent log example: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=144088285&lineNumber=1687

:rwood, Can you please take a look?
Flags: needinfo?(rwood)
Whiteboard: [stockwell needswork]
adding :tarek to this bug as he helped develop the heavy profile and integration to talos- he might have some ideas.
I think the tp6 heavy tests are just taking alot longer to run and we're running into the default talos ttest timeout of 7200 seconds:

https://searchfox.org/mozilla-central/rev/fe75164c55321e011d9d13b6d05c1e00d0a640be/testing/talos/talos/ttest.py#168
Flags: needinfo?(rwood)
TEST-OK | tp6_amazon_heavy | took 1990367ms

that is really long!  we should split this up then, or figure out why with the heavy profile this takes so long- maybe we can analyze this in more detail over time.

Kyle, is there a query we can do on the tp6_*_heavy test names to get a list of runtimes?
Flags: needinfo?(klahnakoski)
Oh but hang out, this is strange, 30 minutes elapsed time in between checking how old the profile is, and launching Firefox... yeah something wrong here:

04:28:55     INFO -  TEST-START | tp6_google_heavy
04:28:55     INFO -  Initialising browser for tp6_google_heavy test...
04:32:09     INFO -  Local copy of 'simple' is fresh enough
04:32:09     INFO -  3 days old
05:02:09     INFO -  Application command: C:\slave\test\build\application\firefox\firefox  http://localhost:49901/getInfo.html -profile c:\users\cltbld~1.000\appdata\local\temp\tmpjby9kz\profile
05:02:10     INFO -  TEST-INFO | started process 4044 (C:\slave\test\build\application\firefox\firefox  http://localhost:49901/getInfo.html)
05:02:55     INFO -  TEST-INFO | 4044: exit 0
05:02:55     INFO -  Browser initialized.
odd that it is fresh enough but 3 days old- maybe there is a bug in that logic.  30 minutes is insane- I know it took me about 15 minutes to download and extract a profile locally, I blamed it on my "high" speed internet connection I have at home.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #6)
> odd that it is fresh enough but 3 days old- maybe there is a bug in that
> logic.  30 minutes is insane- I know it took me about 15 minutes to download
> and extract a profile locally, I blamed it on my "high" speed internet
> connection I have at home.

It only updates if the profile is > 7 days old, so it isn't even downloading at all... don't know what's happening in that mysterious 30 minutes...

https://searchfox.org/mozilla-central/rev/fe75164c55321e011d9d13b6d05c1e00d0a640be/testing/talos/talos/heavy.py#106
:tarek any ideas?
Flags: needinfo?(tarek)
Once the profile is downloaded, it's extracted and then for each run it's cloned in a temp location. The cloning happend every time since we want to run the test with a vanilla heavy profile, but 30mn sounds insane. 

I will take a deeper look when I start working tomorrow.
In any case I think we should add more LOG.info() calls in ffsetup after 

https://searchfox.org/mozilla-central/source/testing/talos/talos/ffsetup.py#101

So we get some feedback about the cloning step, etc
Assignee: nobody → tarek
Flags: needinfo?(tarek)
Comment on attachment 8928464 [details]
Bug 1415858 - Adding logs for Talos cloning step

https://reviewboard.mozilla.org/r/199718/#review204840

::: testing/mozbase/mozprofile/tests/test_clone_cleanup.py:41
(Diff revision 1)
>      def test_restore_true(self):
> +        counter = [0]
> +
> +        def _feedback(dir, content):
> +            counter[0] += 1
> +            return []

I don't understand this method, can you describe it with a comment?

::: testing/talos/talos/ffsetup.py:109
(Diff revision 1)
> +
> +        def _feedback(directory, content):
> +            sub = directory.split(profile_path)[-1].lstrip("/")
> +            if sub:
> +                LOG.info("=> %s" % sub)
> +            return []

sam here, I see _feedback and it is related to ignore, but always returns []
Attachment #8928464 - Flags: review?(jmaher) → review-
I could not reproduce the issue on my laptop. Once the current patch has landed we should have more info on where the tiome is spent. Maybe there's a disk issue on the slaves when the profile is copied in the tmp space
(In reply to Tarek Ziadé (:tarek) from comment #15)
> I could not reproduce the issue on my laptop. Once the current patch has
> landed we should have more info on where the tiome is spent. Maybe there's a
> disk issue on the slaves when the profile is copied in the tmp space

Thanks Tarek!
Comment on attachment 8928464 [details]
Bug 1415858 - Adding logs for Talos cloning step

https://reviewboard.mozilla.org/r/199718/#review204892

thanks, lets see if this helps surface the area where we are hanging.
Attachment #8928464 - Flags: review?(jmaher) → review+
Pushed by tziade@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ecf4310e41f6
Adding logs for Talos cloning step r=jmaher
https://hg.mozilla.org/mozilla-central/rev/ecf4310e41f6
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Alright so we have more info, see 

https://treeherder.mozilla.org/logviewer.html#?job_id=145055559&repo=autoland&lineNumber=2638

Ity takes 30 minutes to copy the cache2/entries subfolder, which is the heaviest folder in the profile

$ du -ch ~/.mozilla/profiles/simple/cache2/entries/
635M	/Users/tarek/.mozilla/profiles/simple/cache2/entries/
635M	total
$ ls ~/.mozilla/profiles/simple/cache2/entries/ | wc -l
   38908


It works but it's slow. There's nothing we can do at the code level, Python relies on the system to copy the files and we can't speed it up from the code.

I guess the next step is to investigate on the hardware/vm setup e.g. is the tmp folder on the same disk? can we use a RAM disk ? etc. 

Joel, do you know who we should ping for help?
Flags: needinfo?(jmaher)
maybe have cache2/entries compressed in a sub-tarball when it's copied and then extracted in the target foilder can speed up things, to avoid 40k operations. I will investigate.
According to my tests on darwin, directly extracting a the tarball reduces the time by 1.5 to 2 when working on the same disk. But it's very fast compared to the windows slave (from 10 seconds to 20 seconds)

Can someone try my script on windows ? (adding it here)

The change would simply mean keeping the downloaded tarball in ~/.mozilla/profile and extend .clone() so it can deal with tarballs.

That would reduce the time to 15mn, which still sounds like amazingly slow.
Attached file comparing.py
Python script to compare the speed of extracting on the fly instead of file-to-file copy. It makes the assumption that you have a tarball of the cache2/entries directory, from the latest profile tarball
on first thought, possibly we don't need this on windows7, the reported numbers for win7 vs 10 follow the same pattern over the last 30 days.  Both win7 and win10 run on the same physical hardware (IX machines):
https://wiki.mozilla.org/Buildbot/Talos/Misc#Hardware_Profile_of_machines_used_in_automation

there could be configuration differences.  Typically win10 takes 30-40 minutes to complete, and win7 takes 100+ minutes to complete- there is obviously a problem with fileIO on win7 (and we see this with unittests as well).  I don't think there is anything to do, these are a single disk instance.

My theory is we should just not run heavy tests on win7, I agree that unpacking a profile is faster than copying, but at some point, we are spending 1+ hour/push managing this for little return- when we get faster data on all our other platforms.

:rwood, what are your thoughts?
Flags: needinfo?(rwood)
Flags: needinfo?(klahnakoski)
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #25)
> on first thought, possibly we don't need this on windows7, the reported
> numbers for win7 vs 10 follow the same pattern over the last 30 days.  Both
> win7 and win10 run on the same physical hardware (IX machines):
> https://wiki.mozilla.org/Buildbot/Talos/
> Misc#Hardware_Profile_of_machines_used_in_automation
> 
> there could be configuration differences.  Typically win10 takes 30-40
> minutes to complete, and win7 takes 100+ minutes to complete- there is
> obviously a problem with fileIO on win7 (and we see this with unittests as
> well).  I don't think there is anything to do, these are a single disk
> instance.
> 
> My theory is we should just not run heavy tests on win7, I agree that
> unpacking a profile is faster than copying, but at some point, we are
> spending 1+ hour/push managing this for little return- when we get faster
> data on all our other platforms.
> 
> :rwood, what are your thoughts?

Wow, yeah in that case I agree, it's not really worth over 2x the resource time to run the heavy tests on Win 7.
Flags: needinfo?(rwood)
Maybe we could turn off the heavy tests on Win 7 except leave them available on Win 7 for try?
Attachment #8929147 - Flags: review?(rwood) → review+
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c9bfe1a78bbb
disable h2 on win7 except for try due to slow fileIO. r=rwood
Assignee: tarek → nobody
https://hg.mozilla.org/mozilla-central/rev/c9bfe1a78bbb
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
Assignee: nobody → jmaher
Summary: Intermittent command timed out: 7200 seconds elapsed, attempting to kill in tp6_facebook_heavy after start → Intermittent command timed out: 7200 seconds elapsed, attempting to kill in tp6_facebook_heavy,tp6_amazon_heavy after start
Whiteboard: [stockwell needswork] → [stockwell disabled]
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=154354580&repo=autoland&lineNumber=11117

05:39:08     INFO -  TEST-START | tp6_facebook_heavy
05:39:08     INFO -  Initialising browser for tp6_facebook_heavy test...
05:39:08     INFO -  /builds/slave/test/build/venv/lib/python2.7/site-packages/urllib3/util/ssl_.py:137: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
05:39:08     INFO -    InsecurePlatformWarning
05:39:09     INFO -  /builds/slave/test/build/venv/lib/python2.7/site-packages/urllib3/util/ssl_.py:137: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
05:39:09     INFO -    InsecurePlatformWarning
05:39:09     INFO -  /builds/slave/test/build/venv/lib/python2.7/site-packages/urllib3/util/ssl_.py:137: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
05:39:09     INFO -    InsecurePlatformWarning
05:39:15     INFO -  Local copy of 'simple' is fresh enough
05:39:15     INFO -  1 days old
05:39:15     INFO -  Cloning profile located at /home/cltbld/.mozilla/profiles/simple
05:39:15     INFO -  => safebrowsing
05:39:15     INFO -  => OfflineCache
05:39:15     INFO -  => minidumps
05:39:15     INFO -  => gmp-gmpopenh264
05:39:15     INFO -  => gmp-gmpopenh264/1.7.1
05:39:15     INFO -  => gmp-gmpopenh264/1.6
05:39:15     INFO -  => startupCache
05:39:15     INFO -  => thumbnails
05:39:15     INFO -  => bookmarkbackups
05:39:15     INFO -  => sessionstore-backups
05:39:15     INFO -  => storage
05:39:15     INFO -  => storage/default
05:39:15     INFO -  => storage/default/https+++images-na.ssl-images-amazon.com
05:39:15     INFO -  => storage/default/https+++images-na.ssl-images-amazon.com/idb
05:39:15     INFO -  => storage/default/https+++images-na.ssl-images-amazon.com/idb/12183338011.files
05:39:15     INFO -  => storage/default/https+++tpc.googlesyndication.com
05:39:15     INFO -  => storage/default/https+++tpc.googlesyndication.com/idb
05:39:15     INFO -  => storage/default/https+++tpc.googlesyndication.com/idb/12183338011.files
05:39:15     INFO -  => storage/default/https+++ir.ebaystatic.com
05:39:15     INFO -  => storage/default/https+++ir.ebaystatic.com/idb
05:39:15     INFO -  => storage/default/https+++ir.ebaystatic.com/idb/12183338011.files
05:39:15     INFO -  => storage/default/https+++ad.doubleclick.net
05:39:15     INFO -  => storage/default/https+++ad.doubleclick.net/idb
05:39:15     INFO -  => storage/default/https+++ad.doubleclick.net/idb/12183338011.files
05:39:15     INFO -  => storage/default/https+++www.youtube.com
05:39:15     INFO -  => storage/default/https+++www.youtube.com/idb
05:39:15     INFO -  => storage/default/https+++www.youtube.com/idb/3211250388sbwdpsunsohintoatciif.files
05:39:15     INFO -  => storage/default/https+++www.youtube.com/cache
05:39:15     INFO -  => storage/default/https+++www.youtube.com/cache/morgue
05:39:15     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/118
05:39:15     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/194
05:39:15     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/22
05:39:15     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/84
05:39:15     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/56
05:39:15     INFO -  => storage/default/https+++www.youtube.com/cache/morgue/54
05:39:15     INFO -  => storage/temporary
05:39:15     INFO -  => storage/permanent
05:39:15     INFO -  => storage/permanent/chrome
05:39:15     INFO -  => storage/permanent/chrome/idb
05:39:15     INFO -  => storage/permanent/chrome/idb/3561288849sdhlie.files
05:39:15     INFO -  => storage/permanent/chrome/idb/2918063365piupsah.files
05:39:16     INFO -  => crashes
05:39:16     INFO -  => crashes/events
05:39:16     INFO -  => gmp
05:39:16     INFO -  => gmp/Linux_x86_64-gcc3
05:39:16     INFO -  => browser-extension-data
05:39:16     INFO -  => browser-extension-data/screenshots@mozilla.org
05:39:16     INFO -  => saved-telemetry-pings
05:39:16     INFO -  => datareporting
05:39:16     INFO -  => datareporting/archived
05:39:16     INFO -  => datareporting/archived/2017-10
05:39:16     INFO -  => datareporting/archived/2017-12
05:39:17     INFO -  => datareporting/archived/2018-01
05:39:17     INFO -  => datareporting/archived/2017-09
05:39:17     INFO -  => datareporting/archived/2017-11
05:39:17     INFO -  => cache2
05:39:17     INFO -  => cache2/doomed
05:39:17     INFO -  => cache2/entries

command timed out: 7200 seconds elapsed running ['/tools/buildbot/bin/python', 'scripts/scripts/talos_script.py', '--suite', 'h2-e10s', '--add-option', '--webServer,localhost', '--branch-name', 'Autoland-Non-PGO', '--cfg', 'talos/linux_config.py', '--download-symbols', 'ondemand', '--use-talos-json', '--blob-upload-branch', 'Autoland-Non-PGO'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=7200.011426
========= master_lag: 0.19 =========
========= Finished '/tools/buildbot/bin/python scripts/scripts/talos_script.py ...' failed (results: 2, elapsed: 2 hrs, 0 secs) (at 2018-01-05 06:00:19.537689) =========
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
There have been 47 failures in the last 7 days.
The most failures are occurring on the Linux x64 platform, but there also some occurring on linux64-qr and two on OS X 10.10.
The most affected build type is opt,but there are also some failures occurring on pgo.

A recent log file: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=156985504&lineNumber=10653

And a relevant part of it:
16:16:32     INFO -  127.0.0.1:35513: ProtocolException('Error in HTTP connection: TcpDisconnect("[(\'SSL routines\', \'ssl3_write_pending\', \'bad write retry\')]",)',)
16:16:32     INFO -  127.0.0.1:35513: HTTP/2 connection terminated by client: error code: 2, last stream id: 0, additional data: None
16:16:32     INFO -  127.0.0.1:35513: TcpDisconnect("[('SSL routines', 'ssl3_write_pending', 'bad write retry')]",)
16:16:32     INFO -  127.0.0.1:35997: CONNECT www.amazon.com:443
16:16:32     INFO -   << Error in HTTP connection: TcpDisconnect("(32, 'EPIPE')",)
16:16:32     INFO -  127.0.0.1:35513: clientdisconnect
16:16:32     INFO -  127.0.0.1:35997: clientdisconnect
16:16:33     INFO -  TEST-INFO | 20692: exit 0
16:16:45     INFO -  TEST-OK | tp6_amazon_heavy | took 2250061ms
16:16:45     INFO -  TEST-START | tp6_facebook_heavy
16:16:45     INFO -  Initialising browser for tp6_facebook_heavy test...
16:16:45     INFO -  /builds/slave/test/build/venv/lib/python2.7/site-packages/urllib3/util/ssl_.py:137: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
16:16:45     INFO -    InsecurePlatformWarning
16:16:45     INFO -  /builds/slave/test/build/venv/lib/python2.7/site-packages/urllib3/util/ssl_.py:137: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
16:16:45     INFO -    InsecurePlatformWarning
16:16:46     INFO -  /builds/slave/test/build/venv/lib/python2.7/site-packages/urllib3/util/ssl_.py:137: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
16:16:46     INFO -    InsecurePlatformWarning
16:16:48     INFO -  Local copy of 'simple' is fresh enough
16:16:48     INFO -  6 days old
16:16:48     INFO -  Cloning profile located at /home/cltbld/.mozilla/profiles/simple
16:16:49     INFO -  => OfflineCache
16:16:49     INFO -  => datareporting
16:16:49     INFO -  => datareporting/archived
16:16:49     INFO -  => datareporting/archived/2017-10
16:16:49     INFO -  => datareporting/archived/2017-09
16:16:49     INFO -  => datareporting/archived/2017-11
16:16:49     INFO -  => datareporting/archived/2017-12
16:16:49     INFO -  => datareporting/archived/2018-01
16:16:49     INFO -  => gmp
16:16:49     INFO -  => gmp/Linux_x86_64-gcc3
16:16:49     INFO -  => crashes
16:16:49     INFO -  => crashes/events
16:16:49     INFO -  => bookmarkbackups
16:16:49     INFO -  => cache2
16:16:50     INFO -  => cache2/entries

command timed out: 7200 seconds elapsed running ['/tools/buildbot/bin/python', 'scripts/scripts/talos_script.py', '--suite', 'h2-e10s', '--add-option', '--webServer,localhost', '--branch-name', 'Firefox', '--cfg', 'talos/linux_config.py', '--download-symbols', 'ondemand', '--use-talos-json', '--blob-upload-branch', 'Firefox'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=7200.120554
========= master_lag: 0.19 =========
========= Finished '/tools/buildbot/bin/python scripts/scripts/talos_script.py ...' failed (results: 2, elapsed: 2 hrs, 0 secs) (at 2018-01-17 16:27:58.478040) =========

:rwood As you are the triage owner of this component, could you please take a look at this?
Flags: needinfo?(rwood)
Whiteboard: [stockwell disabled] → [stockwell needswork]
The talos heavy user profile tests have been disabled on Windows & Linux in Bug 1428790.
Flags: needinfo?(rwood)
Status: REOPENED → RESOLVED
Closed: 6 years ago5 years ago
Resolution: --- → FIXED
Whiteboard: [stockwell needswork] → [stockwell disabled]
You need to log in before you can comment on or make changes to this bug.