Open Bug 1534396 Opened 1 year ago Updated 1 year ago

wpt on windows7 has a lot of overhead

Categories

(Testing :: General, defect, P3)

Version 3
defect

Tracking

(Not tracked)

People

(Reporter: jmaher, Unassigned)

References

Details

:bwc pointed out to me while running a single test on try that wpt takes 2-3 minutes except for windows7 which takes ~13 minutes- that is 10 minutes of overhead to account for.

here is a try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=9f5167f22a7070dd678765c988f60b40e2cb965f

a windows10 pgo log:
https://taskcluster-artifacts.net/RSdpiv8LSUGaS4SYj4veBA/0/public/logs/live_backing.log

a windows 7 pgo log:
https://taskcluster-artifacts.net/GNW3pe2wT7KmjvdFnXF4RQ/0/public/logs/live_backing.log

in the windows7 log, here is what I think is the 10 minutes of overhead:
19:08:40 INFO - Downloading and extracting to Z:\task_1552329770\build\tests these dirs mach, bin/, config/, mozbase/, marionette/, tools/, web-platform/, mozpack/, mozbuild/ from https://queue.taskcluster.net/v1/task/J-JZoSHoSdOtfYyV3JKSkQ/artifacts/public/build/target.web-platform.tests.tar.gz
19:08:40 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/J-JZoSHoSdOtfYyV3JKSkQ/artifacts/public/build/target.web-platform.tests.tar.gz'}, attempt #1
19:08:40 INFO - Fetch https://queue.taskcluster.net/v1/task/J-JZoSHoSdOtfYyV3JKSkQ/artifacts/public/build/target.web-platform.tests.tar.gz into memory
19:09:12 INFO - Content-Length response header: 54180620
19:09:12 INFO - Bytes received: 54180620
19:13:56 INFO - Adding 'mozinfo.json' for extraction from common.tests archive
19:13:56 INFO - Downloading and extracting to Z:\task_1552329770\build\tests these dirs mach, bin/, config/, mozbase/, marionette/, tools/, web-platform/, mozpack/, mozbuild/, mozinfo.json from https://queue.taskcluster.net/v1/task/J-JZoSHoSdOtfYyV3JKSkQ/artifacts/public/build/target.common.tests.tar.gz
19:13:56 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/J-JZoSHoSdOtfYyV3JKSkQ/artifacts/public/build/target.common.tests.tar.gz'}, attempt #1
19:13:56 INFO - Fetch https://queue.taskcluster.net/v1/task/J-JZoSHoSdOtfYyV3JKSkQ/artifacts/public/build/target.common.tests.tar.gz into memory
19:14:28 INFO - Content-Length response header: 22486284
19:14:28 INFO - Bytes received: 22486284
19:18:14 INFO - Downloading https://queue.taskcluster.net/v1/task/J-JZoSHoSdOtfYyV3JKSkQ/artifacts/public/build/target.zip to Z:\task_1552329770\build\target.zip
19:18:14 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/J-JZoSHoSdOtfYyV3JKSkQ/artifacts/public/build/target.zip', 'file_name': 'Z:\task_1552329770\build\target.zip'}, attempt #1
19:18:49 INFO - Downloaded 67029006 bytes.

that seems to be consistent for multiple logs- assuming this is a common case, we could save a LOT of time in automation.

I have noticed a similar "pause" in windows-aarch64 laptops. afaik, this time is spent extracting files from the archive (disk i/o?)

this is the same for mochitest/xpcshell (I assume others as well) - probably a 32 bit issue (win/aarch64 runs the harness in 32 bit x86 mode)

there might not be much to do here; maybe there are shortcuts we can take in how we unpack, download

Is this still a concern? I checked the windows7-32-shippable/opt and windows10-64-shippable/opt logs, and the overhead seems much more reasonable now. See logs below for an example run of web-platform-tests-1.

windows7-32-shippable: https://taskcluster-artifacts.net/CyNa0fVjS0SNEStDF28AEA/0/public/logs/live_backing.log

18:00:14     INFO - Downloading packages: [u'target.web-platform.tests.tar.gz', u'target.common.tests.tar.gz'] for test suite categories: ['web-platform']
18:00:14     INFO - Downloading and extracting to Z:\task_1556558337\build\tests these dirs mach, bin/*, config/*, mozbase/*, marionette/*, tools/*, web-platform/*, mozpack/*, mozbuild/* from https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.web-platform.tests.tar.gz
18:00:14     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.web-platform.tests.tar.gz'}, attempt #1
18:00:14     INFO - Fetch https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.web-platform.tests.tar.gz into memory
18:00:15     INFO - Content-Length response header: 54812554
18:00:15     INFO - Bytes received: 54812554
18:05:31     INFO - Adding 'mozinfo.json' for extraction from common.tests archive
18:05:31     INFO - Downloading and extracting to Z:\task_1556558337\build\tests these dirs mach, bin/*, config/*, mozbase/*, marionette/*, tools/*, web-platform/*, mozpack/*, mozbuild/*, mozinfo.json from https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.common.tests.tar.gz

Time taken is ~5 minutes. My assumption is that between the conclusion of download (Bytes received) and the adding of mozinfo.json is the extraction time. Granted, it should not take 5 minutes, but perhaps one of the task overhead items mentioned (change compression algorithm) may help.

windoss10-64-shippable: https://taskcluster-artifacts.net/DypAqM5IQyqAtFmzO8pnHg/0/public/logs/live_backing.log

18:07:00     INFO - Downloading packages: [u'target.web-platform.tests.tar.gz', u'target.common.tests.tar.gz'] for test suite categories: ['web-platform']
18:07:00     INFO - Downloading and extracting to Z:\task_1556558871\build\tests these dirs mach, bin/*, config/*, mozbase/*, marionette/*, tools/*, web-platform/*, mozpack/*, mozbuild/* from https://queue.taskcluster.net/v1/task/bRa-eMgLQlewSJNmtN1-MA/artifacts/public/build/target.web-platform.tests.tar.gz
18:07:00     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/bRa-eMgLQlewSJNmtN1-MA/artifacts/public/build/target.web-platform.tests.tar.gz'}, attempt #1
18:07:00     INFO - Fetch https://queue.taskcluster.net/v1/task/bRa-eMgLQlewSJNmtN1-MA/artifacts/public/build/target.web-platform.tests.tar.gz into memory
18:07:30     INFO - Content-Length response header: 54812496
18:07:30     INFO - Bytes received: 54812496
18:08:02     INFO - Adding 'mozinfo.json' for extraction from common.tests archive
18:08:02     INFO - Downloading and extracting to Z:\task_1556558871\build\tests these dirs mach, bin/*, config/*, mozbase/*, marionette/*, tools/*, web-platform/*, mozpack/*, mozbuild/*, mozinfo.json from https://queue.taskcluster.net/v1/task/bRa-eMgLQlewSJNmtN1-MA/artifacts/public/build/target.common.tests.tar.gz

Interestingly, on windows10-64 the issue is much less pronounced - just about 30 seconds. The difference in time between windows7 and windows10 is definitely something that is worth investigating.

from the win7 shippable log above:

18:00:14     INFO - Downloaded 1265 bytes.
18:00:14     INFO - Reading from file Z:\task_1556558337\build\target.test_packages.json
18:00:14     INFO - Downloading packages: [u'target.web-platform.tests.tar.gz', u'target.common.tests.tar.gz'] for test suite categories: ['web-platform']
18:00:14     INFO - Downloading and extracting to Z:\task_1556558337\build\tests these dirs mach, bin/*, config/*, mozbase/*, marionette/*, tools/*, web-platform/*, mozpack/*, mozbuild/* from https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.web-platform.tests.tar.gz
18:00:14     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.web-platform.tests.tar.gz'}, attempt #1
18:00:14     INFO - Fetch https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.web-platform.tests.tar.gz into memory
18:00:15     INFO - Content-Length response header: 54812554
18:00:15     INFO - Bytes received: 54812554
18:05:31     INFO - Adding 'mozinfo.json' for extraction from common.tests archive
18:05:31     INFO - Downloading and extracting to Z:\task_1556558337\build\tests these dirs mach, bin/*, config/*, mozbase/*, marionette/*, tools/*, web-platform/*, mozpack/*, mozbuild/*, mozinfo.json from https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.common.tests.tar.gz
18:05:31     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.common.tests.tar.gz'}, attempt #1
18:05:31     INFO - Fetch https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.common.tests.tar.gz into memory
18:05:33     INFO - Content-Length response header: 22811704
18:05:33     INFO - Bytes received: 22811704
18:09:49     INFO - Downloading https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.zip to Z:\task_1556558337\build\target.zip
18:09:49     INFO - retry: Calling _download_file with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.zip', 'file_name': u'Z:\\task_1556558337\\build\\target.zip'}, attempt #1
18:09:51     INFO - Downloaded 68605548 bytes.

if you look further, we don't start the next downloading line until 4 minutes later, I assume that is related to downloading wpt.tar.gz and common-tests.tar.gz;

comparing that to win10, we do not see an extra 4 minute jump there.

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #5)

from the win7 shippable log above:

18:00:14     INFO - Downloaded 1265 bytes.
18:00:14     INFO - Reading from file Z:\task_1556558337\build\target.test_packages.json
18:00:14     INFO - Downloading packages: [u'target.web-platform.tests.tar.gz', u'target.common.tests.tar.gz'] for test suite categories: ['web-platform']
18:00:14     INFO - Downloading and extracting to Z:\task_1556558337\build\tests these dirs mach, bin/*, config/*, mozbase/*, marionette/*, tools/*, web-platform/*, mozpack/*, mozbuild/* from https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.web-platform.tests.tar.gz
18:00:14     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.web-platform.tests.tar.gz'}, attempt #1
18:00:14     INFO - Fetch https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.web-platform.tests.tar.gz into memory
18:00:15     INFO - Content-Length response header: 54812554
18:00:15     INFO - Bytes received: 54812554
18:05:31     INFO - Adding 'mozinfo.json' for extraction from common.tests archive
18:05:31     INFO - Downloading and extracting to Z:\task_1556558337\build\tests these dirs mach, bin/*, config/*, mozbase/*, marionette/*, tools/*, web-platform/*, mozpack/*, mozbuild/*, mozinfo.json from https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.common.tests.tar.gz
18:05:31     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.common.tests.tar.gz'}, attempt #1
18:05:31     INFO - Fetch https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.common.tests.tar.gz into memory
18:05:33     INFO - Content-Length response header: 22811704
18:05:33     INFO - Bytes received: 22811704
18:09:49     INFO - Downloading https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.zip to Z:\task_1556558337\build\target.zip
18:09:49     INFO - retry: Calling _download_file with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/PSe05NHCSmSUzDB0cuSHsQ/artifacts/public/build/target.zip', 'file_name': u'Z:\\task_1556558337\\build\\target.zip'}, attempt #1
18:09:51     INFO - Downloaded 68605548 bytes.

if you look further, we don't start the next downloading line until 4 minutes later, I assume that is related to downloading wpt.tar.gz and common-tests.tar.gz;

comparing that to win10, we do not see an extra 4 minute jump there.

I did make that observation by the way, a bit further down in the comment.

I've added this to the list of task overhead items I could work on.

:q and :grenade - I was directed to you by jmaher. Been looking into this bug for a few days now, attempting to replicate in a Windows 7 virtual machine, and so far I have been unsuccessful.

To refresh, this is a typical log from a windows7-32 run of web-platform-tests:

16:56:33     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/M0fwvSGYT1-dNFtrLymBNg/artifacts/public/build/target.web-platform.tests.tar.gz'}, attempt #1
16:56:33     INFO - Fetch https://queue.taskcluster.net/v1/task/M0fwvSGYT1-dNFtrLymBNg/artifacts/public/build/target.web-platform.tests.tar.gz into memory
16:57:05     INFO - Content-Length response header: 55032001
16:57:05     INFO - Bytes received: 55032001
17:01:42     INFO - Adding 'mozinfo.json' for extraction from common.tests archive
17:01:42     INFO - Downloading and extracting to Z:\task_1558018818\build\tests these dirs mach, bin/*, config/*, mozbase/*, marionette/*, tools/*, web-platform/*, mozpack/*, mozbuild/*, mozinfo.json from https://queue.taskcluster.net/v1/task/M0fwvSGYT1-dNFtrLymBNg/artifacts/public/build/target.common.tests.tar.gz

Note that, after finishing download of target.web-platform.tests.tar.gz, it takes anywhere between 4-5 minutes for the next log item to appear, which happens to be Adding 'mozinfo.json' for extraction from common.tests archive.

It is possible that some log lines are missing, and there are tasks actively executing in reality, but my hunch is that something about the way Windows 7 manages compressed files and/or how mozinfo executes is causing this large delay.

Importantly, similar delay is not observable on Windows 10; there is a delay, but it is a lot more manageable at approximately 30 seconds.

Would either of you know more about the Windows configuration that is used in the CI environment that might explain this delay? Perhaps the disk configuration is different in CI?

Flags: needinfo?(rthijssen)
Flags: needinfo?(q)

Here's a try push that narrows down the issue a bit more:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=ea4dcaa5292e8b28a3598ffbf5089fccd52ebf09

It looks like the big difference between windows 7 and windows 10 is the performance of tarfile extractall() at:

https://hg.mozilla.org/try/rev/82f7ebbe43e8fe7558435304952589a000c48c47#l1.34

It might be interesting to compare extraction of .zip, bz2, etc archives.

i don't really have a good explanation other than windows 7 32 bit is slower than everything else we run. processes that highlight this include compilation, compression and io intensive things like cloning mozilla-central.

the underlying hardware is the same, we use ec2 instance types that are 64 bit capable. we just run a 32 bit os on them.
we've played with a lot of configuration settings like page files, process priorities, etc, but windows 7, 32 bit has always just been slow.

windows 7, 32 bit, is also not able to make use of more than 3gb of ram, so it doesn't matter how much ram the instance type is allocated, the os only utilises the first 3gb.

Flags: needinfo?(rthijssen)

For archive extraction bz2 is almost always super slow (but the archives are smaller, so you can win a little on total time in the right circumstances; I doubt that's the case here). If we can get zstd working for the decompression that could be a big win, since (at least on other platforms) it has very good decompress performance and smaller transfer size than gz.

(I also note that it would be ideal here to use a hg clone rather than this archive, but I think that also had performance problems on Windows and afaik the work on it has stalled).

Oh and xz might also be better than gz. xz and zstd are the ones that I'd try.

I have performed a comparison using the MINGW32 shell provided by mozilla-build, on windows 7.

OS: Windows 7 64bit
vCPU: 2
RAM: 2048MB
GPU RAM: 176MB
Disk free space: 0.9GB

Hardware is as follows:
CPU: Intel i7-7567U
Model: Macbook Pro 2017

bz2

real: 7m38.072s
user: 0m41.843s
sys: 2m8.062s

gz

real: 3m52.160s
user: 0m14.155s
sys: 1m3.546s

xz

real: 7m55.996s
user: 0m29.280s
sys: 2m8.343s

Looking at this, it still seems like gz is the best compression method out of the three for Windows 7 at least.

one small differences is our automation is on windows7 32 bit, not 64 bit

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #13)

one small differences is our automation is on windows7 32 bit, not 64 bit

I understand, though with the VM being allocated only 2GB of RAM would it make a significant difference? I was under impression that even if it did, 64bit with less than 2GB RAM would be slightly slower than comparable 32bit.

Flags: needinfo?(q)
You need to log in before you can comment on or make changes to this bug.