Windows WPT runs take ~3m to `tarfile.extractall()` for target.web-platform.tests.tar.gz
Categories: Testing :: web-platform-tests (defect, P3)
Tracking: (Not tracked)
People: Reporter: jgilbert; Assigned: jgilbert
- [task 2023-04-24T22:31:40.946Z] 22:31:40 INFO - Downloading and extracting to Z:\task_168237204151197\build\tests these dirs * from https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/RsVUtrJWSeyW09f-fonghA/artifacts/public/build/target.web-platform.tests.tar.gz
- [task 2023-04-24T22:31:40.946Z] 22:31:40 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': 'https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/RsVUtrJWSeyW09f-fonghA/artifacts/public/build/target.web-platform.tests.tar.gz'}, attempt #1
- [task 2023-04-24T22:31:40.955Z] 22:31:40 INFO - Fetch https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/RsVUtrJWSeyW09f-fonghA/artifacts/public/build/target.web-platform.tests.tar.gz into memory
- [task 2023-04-24T22:31:42.134Z] 22:31:42 INFO - Content-Length response header: 68244870
- [task 2023-04-24T22:31:42.134Z] 22:31:42 INFO - Bytes received: 68244870
- [task 2023-04-24T22:35:21.026Z] 22:35:21 INFO - Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/RsVUtrJWSeyW09f-fonghA/artifacts/public/build/target.zip to Z:\task_168237204151197\build\target.zip
[...]
- [task 2023-04-24T22:35:24.011Z] 22:35:24 INFO - Fetch https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/RsVUtrJWSeyW09f-fonghA/artifacts/public/build/target.crashreporter-symbols.zip into memory
- [task 2023-04-24T22:35:37.488Z] 22:35:37 INFO - Content-Length response header: 383002553
- [task 2023-04-24T22:35:37.488Z] 22:35:37 INFO - Bytes received: 383002553
- [task 2023-04-24T22:35:56.293Z] 22:35:56 INFO - mkdir: Z:\task_168237204151197\build\blobber_upload_dir
[...]
- [task 2023-04-24T22:35:56.305Z] 22:35:56 INFO - Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/DQccezDCQmaaAyQ09BN4Jg/artifacts/public/tests-by-manifest.json.gz to Z:/task_168237204151197/fetches\tests-by-manifest.json.gz
But on Linux, the same operation appears to take only 20s:
- [task 2023-04-24T22:28:52.467Z] 22:28:52 INFO - Downloading and extracting to /builds/worker/workspace/build/tests these dirs * from https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/QVFUnWj8TBecHp-KMKgwng/artifacts/public/build/target.web-platform.tests.tar.gz
- [task 2023-04-24T22:28:52.467Z] 22:28:52 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': 'https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/QVFUnWj8TBecHp-KMKgwng/artifacts/public/build/target.web-platform.tests.tar.gz'}, attempt #1
- [task 2023-04-24T22:28:52.467Z] 22:28:52 INFO - Fetch https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/QVFUnWj8TBecHp-KMKgwng/artifacts/public/build/target.web-platform.tests.tar.gz into memory
- [task 2023-04-24T22:28:53.265Z] 22:28:53 INFO - Content-Length response header: 68244856
- [task 2023-04-24T22:28:53.265Z] 22:28:53 INFO - Bytes received: 68244856
- [task 2023-04-24T22:29:13.960Z] 22:29:13 INFO - Downloading and extracting to /builds/worker/workspace/build/tests these dirs mach, bin/, config/, extensions/, mozbase/, marionette/, tools/, web-platform/, mozpack/, mozbuild/* from https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/QVFUnWj8TBecHp-KMKgwng/artifacts/public/build/target.condprof.tests.tar.gz
Comment 1•2 years ago
Fortunately, there's pretty much only one place "Bytes received" is present in tree:
https://searchfox.org/mozilla-central/rev/0ffaecaa075887ab07bf4c607c61ea2faa81b172/testing/mozharness/mozharness/base/script.py#513
Fortunately, there's only one caller:
compressed_file = self.retry(
self.fetch_url_into_memory, kwargs={"url": url}, **retry_args
)
# 2) We're guaranteed to have downloaded the file with error_level=FATAL
# Let's unpack the file
function, kwargs = _determine_extraction_method_and_kwargs(url)
try:
function(**kwargs)
except zipfile.BadZipfile:
# Dump the exception and exit
self.exception(level=FATAL)
This eventually ends up calling:
def _safe_extract(tar, path=".", *, numeric_owner=False):
def _files(tar, path):
for member in tar:
_validate_tar_member(member, path)
yield member
tar.extractall(path, members=_files(tar, path), numeric_owner=numeric_owner)
So we need a drop-in replacement for this.
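A minimal sketch of such a replacement, assuming the tarball is already open for reading. Note that it skips the per-member path validation that `_safe_extract` performs; a real drop-in must keep that check. The function name `extractall_faster` matches the proof-of-concept in the next comment, but the details here are illustrative:

```python
import os
import tarfile
from concurrent.futures import ThreadPoolExecutor


def extractall_faster(tar, path=".", max_workers=12):
    """Index members serially, then write regular files from a thread pool.

    Security note: the _validate_tar_member() check from mozharness's
    _safe_extract is omitted here for brevity; keep it in a real replacement.
    """
    file_jobs = []
    for member in tar:
        if member.isfile():
            # Decompression is inherently serial, so read the bytes now and
            # defer only the filesystem writes (the slow part on Windows).
            file_jobs.append((member.name, tar.extractfile(member).read()))
        else:
            # Directories, symlinks, etc. are cheap; extract them up front.
            tar.extract(member, path)

    def write_one(job):
        name, data = job
        dest = os.path.join(path, name)
        os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
        with open(dest, "wb") as f:
            f.write(data)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces completion and surfaces any worker exceptions.
        list(pool.map(write_one, file_jobs))
```

The idea is that CPython's `extractall()` issues every open/write/close serially, and on these workers each `open()` is expensive; fanning the writes out to threads overlaps that latency (the GIL is released during file I/O). The cost is holding the decompressed contents in memory, which is fine for a ~267MB payload.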
Comment 2•2 years ago
I have an untar.py proof-of-concept that extracts the .tar.gz in ~12s, down from ~35s.
It's ~110K files, and much of the time is spent in what looks like pretty bare-metal open() OS/kernel code.
That's about as far as I can push it; the 'write files' share of the time is now less than the time it takes to delete the resulting directory, which suggests we're running out of room for improvement. :)
PS C:\dev\mozilla> py .\untar.py .\target.web-platform.tests.tar.gz
target.web-platform.tests.tar.gz
Deleting target.web-platform.tests...
Deleted target.web-platform.tests in 9.800s.
Via extractall_faster...
Indexing tar...
Indexed 110K items in 2.207s.
Extracting 110K items...
Read 267MB in 1.523s
via ThreadPoolExecutor+write_all_files...
Via 100 pool workers...
1101 chunks
1/110060 (1/s)...
101/110060 (255/s)...
19301/110060 (58464/s)...
20101/110060 (2449/s)...
25001/110060 (10030/s)...
33501/110060 (24247/s)...
33601/110060 (238/s)...
34801/110060 (1355/s)...
54301/110060 (20942/s)...
54401/110060 (142/s)...
56801/110060 (4103/s)...
73701/110060 (9774/s)...
Wrote in 8.377s. (13.1K/s, 31.8MB/s)
Extracted after 12.166s.
Comment 3•2 years ago
[task 2023-04-27T07:12:13.198Z] 07:12:13 INFO - Downloading and extracting to Z:\task_168257263320551\build\tests these dirs * from https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/YrwPBA8FQAWk7e-kdHO_1Q/artifacts/public/build/target.web-platform.tests.tar.gz
[task 2023-04-27T07:12:13.198Z] 07:12:13 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': 'https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/YrwPBA8FQAWk7e-kdHO_1Q/artifacts/public/build/target.web-platform.tests.tar.gz'}, attempt #1
[task 2023-04-27T07:12:13.198Z] 07:12:13 INFO - Fetch https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/YrwPBA8FQAWk7e-kdHO_1Q/artifacts/public/build/target.web-platform.tests.tar.gz into memory
[task 2023-04-27T07:12:14.348Z] 07:12:14 INFO - Content-Length response header: 68523167
[task 2023-04-27T07:12:14.348Z] 07:12:14 INFO - Bytes received: 68523167
[task 2023-04-27T07:16:40.490Z] Extracted 110070 items into Z:\task_168257263320551\build\tests/ in 211.045s. (tar.extractall)
[task 2023-04-27T07:16:40.490Z] Indexed 110K items in 0.000s.
[task 2023-04-27T07:16:40.490Z] Extracting 110K items...
[task 2023-04-27T07:16:43.938Z] Read 267MB in 3.450s
[task 2023-04-27T07:16:44.391Z] 1101 chunks
[task 2023-04-27T07:16:44.391Z] via ThreadPoolExecutor+write_all_files...
[task 2023-04-27T07:16:44.391Z] Via 12 pool workers...
[task 2023-04-27T07:16:44.688Z] 2601/110070 (8609/s)...
[task 2023-04-27T07:16:45.000Z] 5901/110070 (10917/s)...
[task 2023-04-27T07:16:46.470Z] 8001/110070 (1427/s)...
[task 2023-04-27T07:16:46.783Z] 11201/110070 (10397/s)...
[task 2023-04-27T07:16:47.096Z] 14801/110070 (11096/s)...
[task 2023-04-27T07:16:49.284Z] 17301/110070 (1146/s)...
[task 2023-04-27T07:16:49.612Z] 20801/110070 (10659/s)...
[task 2023-04-27T07:16:49.909Z] 24201/110070 (11085/s)...
[task 2023-04-27T07:16:52.629Z] 26201/110070 (737/s)...
[task 2023-04-27T07:16:52.957Z] 29501/110070 (10346/s)...
[task 2023-04-27T07:16:53.254Z] 33201/110070 (12231/s)...
[task 2023-04-27T07:16:55.145Z] 35101/110070 (1005/s)...
[task 2023-04-27T07:16:55.458Z] 37501/110070 (7451/s)...
[task 2023-04-27T07:16:55.787Z] 41301/110070 (11944/s)...
[task 2023-04-27T07:16:57.428Z] 43101/110070 (1095/s)...
[task 2023-04-27T07:16:57.741Z] 46301/110070 (10038/s)...
[task 2023-04-27T07:16:58.038Z] 49701/110070 (11289/s)...
[task 2023-04-27T07:16:59.241Z] 51501/110070 (1511/s)...
[task 2023-04-27T07:16:59.554Z] 54401/110070 (9171/s)...
[task 2023-04-27T07:16:59.867Z] 58601/110070 (13447/s)...
[task 2023-04-27T07:17:01.117Z] 59401/110070 (636/s)...
[task 2023-04-27T07:17:01.430Z] 62301/110070 (9378/s)...
[task 2023-04-27T07:17:01.742Z] 65601/110070 (10783/s)...
[task 2023-04-27T07:17:03.289Z] 67701/110070 (1355/s)...
[task 2023-04-27T07:17:03.587Z] 70601/110070 (9509/s)...
[task 2023-04-27T07:17:03.900Z] 74701/110070 (13578/s)...
[task 2023-04-27T07:17:05.181Z] 75601/110070 (697/s)...
[task 2023-04-27T07:17:05.494Z] 78501/110070 (9437/s)...
[task 2023-04-27T07:17:05.807Z] 81701/110070 (10393/s)...
[task 2023-04-27T07:17:07.667Z] 84201/110070 (1338/s)...
[task 2023-04-27T07:17:07.980Z] 87101/110070 (9540/s)...
[task 2023-04-27T07:17:08.293Z] 89901/110070 (8945/s)...
[task 2023-04-27T07:17:10.184Z] 92201/110070 (1213/s)...
[task 2023-04-27T07:17:10.481Z] 95101/110070 (9505/s)...
[task 2023-04-27T07:17:10.793Z] 97901/110070 (9188/s)...
[task 2023-04-27T07:17:12.920Z] 100601/110070 (1266/s)...
[task 2023-04-27T07:17:13.233Z] 103101/110070 (8195/s)...
[task 2023-04-27T07:17:13.531Z] 106101/110070 (9948/s)...
[task 2023-04-27T07:17:15.469Z] 108601/110070 (1292/s)...
[task 2023-04-27T07:17:15.515Z] Wrote in 31.580s. (3485/s, 8459KB/s)
[task 2023-04-27T07:17:15.625Z] Extracted 110070 items into Z:\task_168257263320551\build\tests/ in 35.130s. (fastar.extractall_faster)
Before: 3m41s
After: 35s
No change on e.g. Linux because it takes a different code path, but extraction only takes ~18s there, so it's fine.
Comment 4•2 years ago
Re-running the test, it seems like the run time is almost bimodal:
[task 2023-04-28T01:50:25.976Z] Wrote in 59.218s. (1858/s, 4511KB/s)
[task 2023-04-28T01:54:10.116Z] Wrote in 310.565s. (354/s, 860KB/s)
[task 2023-04-28T01:50:00.389Z] Wrote in 61.229s. (1797/s, 4363KB/s)
[task 2023-04-28T01:50:06.406Z] Wrote in 59.800s. (1840/s, 4467KB/s)
[task 2023-04-27T17:54:18.464Z] Wrote in 295.211s. (372/s, 904KB/s)
Looking again, my comparison earlier was potentially flawed, since I didn't delete the directory between the two commands, so the _faster one might have had an advantage.
I'm triggering more tests to get more data.
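To remove that confound, each timing run should start from a freshly deleted destination directory. A small harness along these lines (names hypothetical) makes the comparison fair for either extraction function:

```python
import shutil
import tarfile
import time
from pathlib import Path


def bench(extract_fn, tarball, dest):
    """Time one extraction into a freshly-recreated `dest` directory."""
    dest = Path(dest)
    if dest.exists():
        # Start cold: no pre-created directories or warm filesystem entries
        # left over from a previous run.
        shutil.rmtree(dest)
    dest.mkdir(parents=True)
    start = time.perf_counter()
    with tarfile.open(tarball) as tar:
        extract_fn(tar, str(dest))
    return time.perf_counter() - start
```

Comparing `bench(lambda t, p: t.extractall(p), ...)` against the faster variant, interleaved over several repetitions on the same worker, also helps average out the disk-throughput variance seen above.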
Updated•2 years ago
Comment 5•2 years ago
I wonder if we are running into variance due to worker history and burst-vs-base rates.
For example, for Azure "premium ssd": https://learn.microsoft.com/en-us/azure/virtual-machines/disks-types#premium-ssd-size
Its "P10" 128GB disk can do 3,500 IOPS in burst, but only 500 IOPS at base rate. Suspiciously, this roughly matches the bimodal character of the run lengths if we look at items/s, which clusters here at ~1800/s and ~360/s.
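A back-of-the-envelope check of that theory, treating each file write as costing on the order of one to two I/O operations (an assumption, not a measured figure):

```python
# Azure "P10" premium SSD limits, from the sizing table linked above.
BURST_IOPS = 3_500
BASE_IOPS = 500
FILES = 110_060  # items in target.web-platform.tests.tar.gz

burst_time = FILES / BURST_IOPS  # lower bound at ~1 IOP per file
base_time = FILES / BASE_IOPS

print(f"burst: ~{burst_time:.0f}s, base: ~{base_time:.0f}s")
# Observed write rates (~1800/s fast, ~360/s slow) are roughly half these
# limits, consistent with each file costing about two I/O operations
# (e.g. create + write). Either way, the fast:slow ratio matches 3500:500.
```

So the burst-vs-base hypothesis is at least the right order of magnitude for both modes.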
Regardless, many of these runs are ~2m faster now with the patch, but unfortunately they are also sometimes ~2m slower. Tuning the various parallelism parameters might help the slow case, but I want to make sure there's no easier solution. (Ramdisks? It looks like these with-gpu VM configs have a ton of RAM.)
It looks like (according to the internet) it can generally be tricky to get good disk performance on Azure, because of latency overhead when writing to a disk that's "secretly" auto-serialized (managed?) to blobs over the network, and that we therefore ideally want to stick to "temporary disks" for ephemeral data like this.
Does any of this track? I'm outside my expertise once we get into Azure stuff. :)
Comment 6•2 years ago
There are standard and premium tiers; our main test machines use standard disks (not premium), which means HDD rather than SSD and much lower IOPS. These are not temporary disks; we have 2 disks:
c:\ (os, preinstalled tools, user profile)
z:\ (task specific data, like download, extract, logs, etc.)
So some stuff runs from c:\mozilla-build, like python.exe, but both c:\ and z:\ are the same type of disk.
Moving up to SSD (which is premium) would be costly; we do that for the -source machines, which have a smaller pool and a small number of jobs that run there.
It might make sense to land what you have to optimize a bit; it will possibly be a lot faster on a -source machine. You could hack that by doing:
./mach try fuzzy --worker-override='win11-64-2009=gecko-t/win11-64-2009-source-alpha' -q 'test-windows ...'
The -alpha is because we haven't migrated win10 -> win11 for the -source jobs. If you want GPU, you might be able to use:
./mach try fuzzy --worker-override='win11-64-2009-gpu=gecko-t/win11-64-2009-ssd-gpu' -q 'test-windows ...'
Comment 7•2 years ago
We talked some on Matrix about other options here, like (third-party software) ramdisks and Azure "Temporary Disks".
[jmaher:] we are going to experiment with the temp drive (which is SSD), might take a few days
Comment 8•2 years ago
The severity field is not set for this bug.
:jgraham, could you have a look please?
For more information, please visit BugBot documentation.
Updated•2 years ago
Comment 9•2 years ago
jmaher says:
Not yet; a lot of other projects have been completed, and I know it is still on the todo list. It is a P2 right now.
Comment 12•1 year ago
Okay! It just seems like a pretty solid cost win for many Windows test runs if we can drop 3m off each job's run time!
Comment 13•1 year ago
I know there have been recent changes to the disks on Azure; I'm not sure if those are helping out here or not.
:jmoss, could you share what disk-related changes landed on the Azure testers in the last month or so?
Comment 14•1 year ago
:jmaher :jgilbert
Azure Virtual Machines allow for different types of disk layout. More information about each type of disk here. We have migrated from using managed disks to using ephemeral disks to reduce cost and increase performance.
Prior to March 2024, each Azure Virtual Machine was using the following for data storage:
OS Disk: Located at C:, which was a managed disk where we installed base software.
Data Disk: A managed disk that had 2 partitions, Z:\ and Y:. Z:\ is where tasksDir for Generic Worker was set and Y:\ was where each task stored cached hg and pip packages.
Temporary Disk: Located on D:, which was only used to store pagefile.
After March 2024, we updated the gecko-t/win10*,gecko-t/win11*, and gecko-t/win2022* pools with the following for data storage:
OS Disk: Located at C:, which was a managed disk where we installed base software.
Data Disk: No longer used.
Temporary Disk: Located on D:\ which is where tasksDir is set to, along with cached hg and pip packages.
After June 2024, we updated the gecko-t/win10*,gecko-t/win11*, and gecko-t/win2022* pools with the following for data storage:
OS Disk: Located at C:, which is no longer a managed disk, but rather an ephemeral OS disk that has base software installed.
Data Disk: No longer used.
Temporary Disk: Located on D:\ which is where tasksDir is set to, along with cached hg and pip packages.
Prior to March 2024, we were setting specific worker pools to have different skus on the data disk (more info here based on the worker pool). For example, worker pools with -ssd or -source were often set to Premium SSD SKU.
We now have it so that the worker pools named gecko-t/win10*,gecko-t/win11*, and gecko-t/win2022* are all running ephemeral OS disks to strike a balance between cost and performance.
With all of that said, I would imagine we'd now see a better improvement in tarfile.extractall() performance. :jgilbert, would you be able to validate that all of these data disk adjustments in the last few months have improved performance?
Comment 15•1 year ago
Awesome, thanks, I'll take a look!
Comment 16•1 year ago
(In reply to Kelsey Gilbert [:jgilbert] from comment #15)
Awesome, thanks, I'll take a look!
Hi! Are there any updates to this bug? Are we still seeing performance issues?