Closed Bug 1611894 Opened 4 years ago Closed 4 years ago

Fetch tasks take ~30 min to extract tests.common.tar.gz on Windows 10 x64 2017 Ref HW opt

Categories

(Testing :: Raptor, defect, P3)

Version 3
defect

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: Bebe, Unassigned)

References

Details

Test time out because a fetch task takes 30 min to download a 11 MB file

[task 2020-01-28T01:54:26.926Z] 01:54:26     INFO - Fetch https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AW10S9PlT5a-_ecD579uYQ/artifacts/public/build/target.common.tests.tar.gz into memory
[task 2020-01-28T01:54:27.564Z] 01:54:27     INFO - Content-Length response header: 24317696
[task 2020-01-28T01:54:27.564Z] 01:54:27     INFO - Bytes received: 24317696
[task 2020-01-28T02:25:28.624Z] 02:25:28     INFO - Downloading and extracting to C:\Users\task_1580170351\build\tests these dirs * from https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AW10S9PlT5a-_ecD579uYQ/artifacts/public/build/target.raptor.tests.tar.gz
[task 2020-01-28T02:25:28.624Z] 02:25:28     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AW10S9PlT5a-_ecD579uYQ/artifacts/public/build/target.raptor.tests.tar.gz'}, attempt #1
[task 2020-01-28T02:25:28.624Z] 02:25:28     INFO - Fetch https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AW10S9PlT5a-_ecD579uYQ/artifacts/public/build/target.raptor.tests.tar.gz into memory
[task 2020-01-28T02:25:29.368Z] 02:25:29     INFO - Content-Length response header: 11600882
[task 2020-01-28T02:25:29.368Z] 02:25:29     INFO - Bytes received: 11600882
[task 2020-01-28T02:28:29.302Z] 02:28:29     INFO - Downloading and extracting to C:\Users\task_1580170351\build\tests these dirs * from https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AW10S9PlT5a-_ecD579uYQ/artifacts/public/build/target.condprof.tests.tar.gz
[task 2020-01-28T02:28:29.302Z] 02:28:29     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AW10S9PlT5a-_ecD579uYQ/artifacts/public/build/target.condprof.tests.tar.gz'}, attempt #1
[task 2020-01-28T02:28:29.303Z] 02:28:29     INFO - Fetch https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AW10S9PlT5a-_ecD579uYQ/artifacts/public/build/target.condprof.tests.tar.gz into memory
[task 2020-01-28T02:28:29.805Z] 02:28:29     INFO - Content-Length response header: 5245058
[task 2020-01-28T02:28:29.805Z] 02:28:29     INFO - Bytes received: 5245058

Can we get someone from CI to take a look overt this?

Flags: needinfo?(jmaher)
Flags: needinfo?(bob)
Blocks: 1502032

I don't think the issue is that it took 30 minutes to fetch the target.common.tests.tar.gz since it looks like it already finished downloading it and then waited before starting to fetch target.raptor.tests.tar.gz.

[task 2020-01-28T01:54:27.564Z] 01:54:27     INFO - Bytes received: 24317696
[task 2020-01-28T02:25:28.624Z] 02:25:28     INFO - Downloading and extracting to C:\Users\task_1580170351\build\tests these dirs * from https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AW10S9PlT5a-_ecD579uYQ/artifacts/public/build/target.raptor.tests.tar.gz

This makes me think we've got a sleep somewhere... I'm not familiar with win64-aarch64 or the fetch code. After spending a little time digging around, nothing stands out to me. egao might have a better idea.

Edwin: If you don't have time or don't know why this might be happening, please go ahead NI me again and I'll spend some more time on it.

Flags: needinfo?(bob) → needinfo?(egao)

Just to be clear, I think this is the windows10-640-reference hardware which is not windows10-aarch64.

With that said, I do know that there is one other case of windows specific issue relating to file download/decompression, which is on windows7-32 web-platform-tests.

I'm not sure what I can do here - this seems to be managed by releng. :markco, do you have any clues as to why this download is taking so long?

[task 2020-01-28T01:54:26.925Z] 01:54:26     INFO - Downloading and extracting to C:\Users\task_1580170351\build\tests these dirs * from https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AW10S9PlT5a-_ecD579uYQ/artifacts/public/build/target.common.tests.tar.gz
[task 2020-01-28T01:54:26.926Z] 01:54:26     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AW10S9PlT5a-_ecD579uYQ/artifacts/public/build/target.common.tests.tar.gz'}, attempt #1
[task 2020-01-28T01:54:26.926Z] 01:54:26     INFO - Fetch https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AW10S9PlT5a-_ecD579uYQ/artifacts/public/build/target.common.tests.tar.gz into memory
[task 2020-01-28T01:54:27.564Z] 01:54:27     INFO - Content-Length response header: 24317696
[task 2020-01-28T01:54:27.564Z] 01:54:27     INFO - Bytes received: 24317696
[task 2020-01-28T02:25:28.624Z] 02:25:28     INFO - Downloading and extracting to C:\Users\task_1580170351\build\tests these dirs * from https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AW10S9PlT5a-_ecD579uYQ/artifacts/public/build/target.raptor.tests.tar.gz

I've also retriggered the job a couple more times to see if this is a consistently reproducible issue:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=df59b74d33d74c9e4fda377616b20c40da832a71&searchStr=2017%2Ctp6-c-30

Flags: needinfo?(egao) → needinfo?(mark.cornmesser)

I can confirm this issue is on Windows 10 x64 2017 Ref HW opt hardware

Flags: needinfo?(mark.cornmesser) → needinfo?(mcornmesser)

Update, the retriggers showed that subsequent runs were fine, so I think it was just an infra glitch.

Flags: needinfo?(mcornmesser)

It's not just a glitch: You can find many such failures in bug 1502032 and bug 1589796.

Blocks: 1589796
Priority: -- → P2
Assignee: nobody → fstrugariu
Status: NEW → ASSIGNED
Flags: needinfo?(jmaher)

I tried running with extra logging on try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b6c44f351fb3ce4676ce1b18a3d295de075bf503
I did not reproduce a 30 minute download, but noted that the common tests download-and-extract generally takes at least a minute.

In this case it took about 9 minutes:
https://treeherder.mozilla.org/logviewer.html#?job_id=287252769&repo=try

[task 2020-02-02T21:09:42.220Z] 21:09:42     INFO - Downloading packages: [u'target.common.tests.tar.gz', u'target.raptor.tests.tar.gz', u'target.condprof.tests.tar.gz'] for test suite categories: [u'common', u'condprof', u'raptor']
[task 2020-02-02T21:09:42.220Z] 21:09:42     INFO - Downloading and extracting to C:\Users\task_1580668548\build\tests these dirs * from https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/FpDgSkyuQcCmeyc8RoTaiQ/artifacts/public/build/target.common.tests.tar.gz
[task 2020-02-02T21:09:42.220Z] 21:09:42     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/FpDgSkyuQcCmeyc8RoTaiQ/artifacts/public/build/target.common.tests.tar.gz'}, attempt #1
[task 2020-02-02T21:09:42.221Z] 21:09:42     INFO - Fetch https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/FpDgSkyuQcCmeyc8RoTaiQ/artifacts/public/build/target.common.tests.tar.gz into memory
[task 2020-02-02T21:09:43.159Z] 21:09:43     INFO - Content-Length response header: 24332881
[task 2020-02-02T21:09:43.159Z] 21:09:43     INFO - Bytes received: 24332881
[task 2020-02-02T21:09:43.159Z] 21:09:43     INFO - Content-Encoding is not "gzip"
[task 2020-02-02T21:09:43.174Z] 21:09:43     INFO - fetch_url_into_memory returns BytesIO
[task 2020-02-02T21:09:43.183Z] 21:09:43     INFO - Mimetype: application/x-gzip
[task 2020-02-02T21:09:43.183Z] 21:09:43     INFO - download_unpack determined extraction method
[task 2020-02-02T21:18:28.025Z] 21:18:28     INFO - deflate() done
[task 2020-02-02T21:18:28.077Z] 21:18:28     INFO - download_unpack extraction complete

It looks like deflate() is the culprit, particularly when run on target.common.tests.tar.gz, on Windows ref-hw.
https://searchfox.org/mozilla-central/rev/3a0a8e2762821c6afc1d235b3eb3dde63ad3b01a/testing/mozharness/mozharness/base/script.py#667

Is the deflate code being called for each and every test job or has Raptor a specific path? If it's for all jobs we should move this bug to mozharness, right?

Flags: needinfo?(gbrown)

The deflate() code is used by many test tasks. But it only seems to be problematic on the Window Ref HW platform, which is almost exclusively used for Raptor tests. So, mozharness/windows-hw/raptor? I don't have a strong opinion on the best bug component at this time.

Flags: needinfo?(gbrown)

Interesting is bug 1609295 here where Firefox sometimes doesn't even startup. Maybe something blocks applications from being launched at all, which finally end-up in such timeouts. Or it's very slow network or disk I/O?

I wonder if there are specific machines only where this happens, or if it's across all of them. Florin, could you have a look?

Flags: needinfo?(fstrugariu)

Windows 10 x64 2017 Ref HW is a VERY slow machine.
We have a lot of intermittent failures on this hardware. Checked the nodes and we see intermittent on all the available nodes.
As this specific failure does not generate a error is hard to identify it in the Application timed out. list.

Can we set a timeout or something similar to this download ?

Flags: needinfo?(fstrugariu)

How much memory do those machines have? I'm asking because any download of a test package ends-up in memory, and not on disk. Maybe we face low memory conditions here, which involves a lot of swapping on disk when extracting the file?

(In reply to Geoff Brown [:gbrown] from comment #11)

The deflate() code is used by many test tasks. But it only seems to be problematic on the Window Ref HW platform, which is almost exclusively used for Raptor tests. So, mozharness/windows-hw/raptor? I don't have a strong opinion on the best bug component at this time.

Just for reference here is the code:
https://searchfox.org/mozilla-central/rev/3a0a8e2762821c6afc1d235b3eb3dde63ad3b01a/testing/mozharness/mozharness/base/script.py#667

Note that this internally uses the tarfile Python package for extracting all the data. First this is bad because you never should call extractall() on files from unknown sources (see the warning in the Python docs). Instead we should retrieve the members first, and only extract those, which do not escape the current working directory.

By using extract() for each member we could also have a (temporary) log.debug() to show progress in case it is wanted. That way we could see if it is a specific file the process hangs on, or all the files.

As discussed with Florin on IRC I'm taking this bug. I only have to wait for the loaner as requested on bug 1613613.

Assignee: fstrugariu → hskupin
Priority: P2 → P1

All of the very-slow extractions that I have seen happen on target.common.tests.tar.gz. One feature of target.common.tests.tar.gz is that it contains many thousands of files, including tens of thousands of test files for jsreftest and jittest -- none of which are needed for raptor. I made a very rough attempt at removing the jsreftest/jittest files from the common archive in

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ebc7e4d51eb3623700f46e7dde6f550f37aab14e

I think all of the common archive extractions completed in less than 1 minute.

Separating jsreftest/jittest files from the common archive would be (at least slightly) beneficial to all tasks that download the common archive, but I am not sure of complications (would probably need jsreftest to run with common+reftest+jsreftest archives?).

Please note that you can also specify which folders to extract from common.tests.tar.gz:

https://searchfox.org/mozilla-central/rev/a1592902acabf9bded973067133baaac1457f3d3/testing/mozharness/mozharness/mozilla/testing/firefox_ui_tests.py#153-165

Whatever makes sense in a short term. I could believe that splitting out jsreftest and jittest would involve more work.

Great idea - thanks!

Geoff, how should we proceed here? Are you going on an update for your patch or should I do the file/folder selection as mentioned in my last comment?

Flags: needinfo?(gbrown)
Summary: Fetch tasks take ~30 min to download a file on Windows 10 x64 2017 Ref HW opt → Fetch tasks take ~30 min to extract tests.common.tar.gz on Windows 10 x64 2017 Ref HW opt

Please do the file/folder selection.

Flags: needinfo?(gbrown)

Florin, do we know how often that actually happened in the past 2 weeks? Where have all the failures been classified against?

Flags: needinfo?(fstrugariu)

I don't know how do generate that data as no error massage or classification is generated for this error

As you can see form the stacktrace there is no error generated for the hang...

[task 2020-01-28T02:25:29.368Z] 02:25:29     INFO - Bytes received: 11600882
[task 2020-01-28T02:28:29.302Z] 02:28:29     INFO - Downloading and extracting to C:\Users\task_1580170351\build\tests these dirs * from https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AW10S9PlT5a-_ecD579uYQ/artifacts/public/build/target.condprof.tests.tar.gz
Flags: needinfo?(fstrugariu) → needinfo?(hskupin)

I haven't seen this failure for a while. As such I will unassign myself for now.

Status: ASSIGNED → NEW
Flags: needinfo?(hskupin)
Priority: P1 → P3
Assignee: hskupin → nobody
No longer blocks: 1612795

I haven't seen this failure at all in the last time. Lets close as incomplete for now.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.