Open Bug 1809652 Opened 2 years ago Updated 20 days ago

Intermittent [taskcluster:error] Task aborted - max run time exceeded | check log for raptor-benchmark Info: Cloning the benchmarks to <path>/performance-tests/perf-automation

Categories

(Testing :: Performance, defect, P5)

defect

Tracking

(Not tracked)

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [retriggered])

Attachments

(2 obsolete files)

Filed by: imoraru [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=401918676&repo=mozilla-central
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/EMUkgH7qQE2eAUwohKk7hw/runs/0/artifacts/public/logs/live_backing.log


[task 2023-01-10T11:34:58.716Z] 11:34:58     INFO -  condprof INFO | Creating metadata...
[task 2023-01-10T11:34:58.719Z] 11:34:58     INFO -  condprof INFO | Saving metadata file in /var/folders/wz/g2zb5lgn6737640bv4sq9x6h000014/T/tmpu024_k14/profile/condprofile.json
[task 2023-01-10T11:34:58.719Z] 11:34:58     INFO -  condprof INFO | Profile at /var/folders/wz/g2zb5lgn6737640bv4sq9x6h000014/T/tmpu024_k14/profile.
[task 2023-01-10T11:34:58.719Z] 11:34:58     INFO -  Done.
[task 2023-01-10T11:34:58.860Z] 11:34:58     INFO -  raptor-perftest Info: Created a conditioned-profile copy: /var/folders/wz/g2zb5lgn6737640bv4sq9x6h000014/T/tmptchxz_6y/profile
[task 2023-01-10T11:34:58.860Z] 11:34:58     INFO -  raptor-perftest Info: Merging profile: /opt/worker/tasks/task_167334929937991/build/tests/raptor/raptor/profile_data/base
[task 2023-01-10T11:34:58.865Z] 11:34:58     INFO -  raptor-perftest Info: Merging profile: /opt/worker/tasks/task_167334929937991/build/tests/raptor/raptor/profile_data/common
[task 2023-01-10T11:34:58.869Z] 11:34:58     INFO -  raptor-perftest Info: Merging profile: /opt/worker/tasks/task_167334929937991/build/tests/raptor/raptor/profile_data/perf
[task 2023-01-10T11:34:58.874Z] 11:34:58     INFO -  raptor-perftest Info: Merging profile: /opt/worker/tasks/task_167334929937991/build/tests/raptor/raptor/profile_data/raptor
[task 2023-01-10T11:34:58.876Z] 11:34:58     INFO -  raptor-perftest Info: Browser preferences: {'fission.autostart': True}
[task 2023-01-10T11:34:58.877Z] 11:34:58     INFO -  raptor-perftest Info: Local browser profile: /var/folders/wz/g2zb5lgn6737640bv4sq9x6h000014/T/tmptchxz_6y/profile
[task 2023-01-10T11:34:58.877Z] 11:34:58     INFO -  raptor-browsertime Info: Removing mozprofile delimiters from browser profile
[task 2023-01-10T11:34:58.877Z] 11:34:58     INFO -  raptor-browsertime Info: cwd: '/opt/worker/tasks/task_167334929937991/build'
[task 2023-01-10T11:34:58.877Z] 11:34:58     INFO -  raptor-browsertime Info: browsertime_node: /opt/worker/tasks/task_167334929937991/fetches/node/bin/node
[task 2023-01-10T11:34:58.878Z] 11:34:58     INFO -  raptor-browsertime Info: browsertime_node: os.stat_result(st_mode=33261, st_ino=1660386008, st_dev=16777221, st_nlink=1, st_uid=36, st_gid=20, st_size=82563760, st_atime=1673350360, st_mtime=1670920349, st_ctime=1673350360)
[task 2023-01-10T11:34:58.878Z] 11:34:58     INFO -  raptor-browsertime Info: browsertime_browsertimejs: /opt/worker/tasks/task_167334929937991/fetches/browsertime/node_modules/browsertime/bin/browsertime.js
[task 2023-01-10T11:34:58.878Z] 11:34:58     INFO -  raptor-browsertime Info: browsertime_browsertimejs: os.stat_result(st_mode=33261, st_ino=1660383631, st_dev=16777221, st_nlink=1, st_uid=36, st_gid=20, st_size=6152, st_atime=1673350359, st_mtime=1671858889, st_ctime=1673350360)
[task 2023-01-10T11:34:58.879Z] 11:34:58     INFO -  raptor-browsertime Info: browsertime_geckodriver: /opt/worker/tasks/task_167334929937991/fetches/geckodriver
[task 2023-01-10T11:34:58.879Z] 11:34:58     INFO -  raptor-browsertime Info: browsertime_geckodriver: os.stat_result(st_mode=33261, st_ino=1660383337, st_dev=16777221, st_nlink=1, st_uid=36, st_gid=20, st_size=5366252, st_atime=1673350350, st_mtime=1672807224, st_ctime=1673350350)
[task 2023-01-10T11:34:58.879Z] 11:34:58     INFO -  raptor-browsertime Info: browsertime_chromedriver: /opt/worker/tasks/task_167334929937991/fetches/{}chromedriver
[task 2023-01-10T11:34:58.880Z] 11:34:58     INFO -  raptor-browsertime Info: browsertime_chromedriver: [Errno 2] No such file or directory: '/opt/worker/tasks/task_167334929937991/fetches/{}chromedriver'
[task 2023-01-10T11:34:58.880Z] 11:34:58     INFO -  raptor-browsertime Info: browsertime_user_args: []
[task 2023-01-10T11:34:58.880Z] 11:34:58     INFO -  raptor-browsertime Info: browsertime_user_args: stat: path should be string, bytes, os.PathLike or integer, not list
[task 2023-01-10T11:34:58.880Z] 11:34:58     INFO -  raptor-perftest Info: starting test: wasm-misc-optimizing
[task 2023-01-10T11:34:58.881Z] 11:34:58     INFO -  raptor-browsertime Info: setting test-specific Firefox preferences
[task 2023-01-10T11:34:58.881Z] 11:34:58     INFO -  raptor-browsertime Info: Removing mozprofile delimiters from browser profile
[task 2023-01-10T11:34:58.889Z] 11:34:58     INFO -  raptor-benchmark Info: Cloning the benchmarks to /var/folders/wz/g2zb5lgn6737640bv4sq9x6h000014/T/tmpi6msigff/performance-tests/perf-automation
[task 2023-01-10T11:34:58.893Z] 11:34:58     INFO -  Cloning into '/private/var/folders/wz/g2zb5lgn6737640bv4sq9x6h000014/T/tmpi6msigff/performance-tests/perf-automation'...
[taskcluster:error] Aborting task...
[taskcluster:warn 2023-01-10T11:47:29.116Z] no such process
[taskcluster 2023-01-10T11:47:29.116Z] === Task Finished ===
[taskcluster 2023-01-10T11:47:29.116Z] Task Duration: 15m0.172843s
[taskcluster 2023-01-10T11:47:29.268Z] Uploading artifact public/logs/localconfig.json from file logs/localconfig.json with content encoding "gzip", mime type "application/json" and expiry 2023-04-10T09:40:10.629Z
[taskcluster 2023-01-10T11:47:29.522Z] Uploading redirect artifact public/logs/live.log to URL https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/EMUkgH7qQE2eAUwohKk7hw/runs/0/artifacts/public%2Flogs%2Flive_backing.log with mime type "text/plain; charset=utf-8" and expiry 2023-04-10T09:40:10.629Z
[taskcluster:error] Task aborted - max run time exceeded
Blocks: 1589796
Assignee: nobody → afinder

Note that the problem here is with cloning the perf-automation repository:

[task 2023-01-10T11:34:58.889Z] 11:34:58     INFO -  raptor-benchmark Info: Cloning the benchmarks to /var/folders/wz/g2zb5lgn6737640bv4sq9x6h000014/T/tmpi6msigff/performance-tests/perf-automation
[task 2023-01-10T11:34:58.893Z] 11:34:58     INFO -  Cloning into '/private/var/folders/wz/g2zb5lgn6737640bv4sq9x6h000014/T/tmpi6msigff/performance-tests/perf-automation'...
[taskcluster:error] Aborting task...
[taskcluster:warn 2023-01-10T11:47:29.116Z] no such process
[taskcluster 2023-01-10T11:47:29.116Z] === Task Finished ===

Why is it stalled for nearly 13 minutes?
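One way to keep a stalled clone from consuming the whole 15-minute task budget would be to bound the subprocess with its own timeout. This is only a sketch of the pattern, not the actual raptor code; run_with_timeout and the 120-second budget are hypothetical:

```python
import subprocess

def run_with_timeout(cmd, timeout):
    """Run a command; return True on success, False on timeout or failure."""
    try:
        subprocess.run(cmd, check=True, timeout=timeout)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # A stalled "git clone" is killed here instead of running
        # until the task hits its max run time.
        return False

# e.g. run_with_timeout(["git", "clone", "--depth", "1", repo_url, dest], 120)
```

Failing fast like this would also produce a clear harness-level error instead of the opaque taskcluster abort seen above.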

Component: General → Performance
Summary: Intermittent [taskcluster:error] Task aborted - max run time exceeded → Intermittent [taskcluster:error] Task aborted - max run time exceeded | check log for raptor-benchmark Info: Cloning the benchmarks to <path>/performance-tests/perf-automation

First occurrence so far: range of retriggers and backfills

Whiteboard: [retriggered]
Attachment #9313237 - Attachment description: Bug 1809652 Intermittent [taskcluster:error] Task aborted - max run time exceeded | extended run-time for benchmark tests r=#perftest → Bug 1809652 Extended max-run-time for benchmark tests r=#perftest
Attachment #9313237 - Attachment description: Bug 1809652 Extended max-run-time for benchmark tests r=#perftest → Bug 1809652 Extend max-run-time for benchmark tests r=#perftest

Hi Nazim! Based on a try push for a patch that extends the max run time for the browsertime benchmark tests, the log file suggests there might be an issue in the extra-profiler run that causes this timeout, hinted at by the following line from the try push job's log: [task 2023-01-20T14:42:40.409Z] 14:42:40 INFO - raptor-browsertime Info: Browsertime process for extra profiler run timed out after 2021 seconds. Do you have a suggestion for a potential fix, or for how we can proceed with the investigation?

Flags: needinfo?(canaltinova)

Yes, it looks like there is a problem right after (or while) we stop the profiler. Inside the browsertime codebase we stop the profiler here: https://github.com/sitespeedio/browsertime/blob/7aa74245a5603a346a745c55dddddcb02d56cfd2/lib/firefox/geckoProfiler.js#L160-L164 and then start gzipping the profile data: https://github.com/sitespeedio/browsertime/blob/7aa74245a5603a346a745c55dddddcb02d56cfd2/lib/firefox/geckoProfiler.js#L171-L176. It's not 100% clear to me which one is failing, because we don't have any logs right after we stop it, but I suspect we are timing out in the latter part: it happens right after stopping the profiler and has no timeout check.

I think we should add some more logs, and add a timeout for that gzipping part in case it takes longer (just like this one). I can send a browsertime PR for that.
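As a rough illustration of that proposal (the real fix would live in browsertime's JavaScript; the 60-second budget and all names here are assumptions), the gzip step can be handed to a worker thread so the caller can bound how long it waits:

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

GZIP_TIMEOUT = 60  # assumed budget in seconds, not a browsertime constant

def gzip_profile(src, dst):
    # Compress the collected profile file to dst.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)

def gzip_with_timeout(src, dst, timeout=GZIP_TIMEOUT):
    """Gzip src to dst, but give up after `timeout` seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        pool.submit(gzip_profile, src, dst).result(timeout=timeout)
        return True
    except FutureTimeout:
        # The harness can log and continue instead of hanging forever.
        return False
    finally:
        # Don't block waiting for a stuck worker on the timeout path.
        pool.shutdown(wait=False)
```

The key design point is that the wait is bounded at the call site, so a stuck compression step can no longer hold the whole task until max-run-time.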

Flags: needinfo?(canaltinova)

I sent a PR to browsertime to fix this issue and it's landed: https://github.com/sitespeedio/browsertime/pull/1896

Now, it should either fix this issue or give us more insight on the problem. The next step is to update our browsertime to use the latest version (I think). But I'm not sure how that works. Could you help me with that?

Hey Nazim,
As per your comment, I wanted to follow up and see if you still need any help with the next steps.

(In reply to Nazım Can Altınova [:canova][:canaltinova on phabricator] from comment #11)

> I sent a PR to browsertime to fix this issue and it's landed: https://github.com/sitespeedio/browsertime/pull/1896
>
> Now, it should either fix this issue or give us more insight on the problem. The next step is to update our browsertime to use the latest version (I think). But I'm not sure how that works. Could you help me with that?

Flags: needinfo?(canaltinova)

Browsertime was updated via bug 1817204, so we should wait to see if this improves our failure rate.

Depends on: 1817204
Flags: needinfo?(canaltinova)
Attachment #9313237 - Attachment is obsolete: true
Attachment #9312820 - Attachment is obsolete: true

:canova, do you think you could take another look at this failure? I think the timeout you added helped in some cases, but in the failures below it looks like we're failing after the area covered by the timeout.

[task 2023-03-06T07:32:00.628Z] 07:32:00     INFO -  raptor-browsertime Info: Skip setting default preferences for Firefox
[task 2023-03-06T07:32:03.274Z] 07:32:03     INFO -  raptor-browsertime Info: Starting a browsertime benchamrk
[task 2023-03-06T07:32:03.275Z] 07:32:03     INFO -  raptor-browsertime Info: Waiting for 1000 ms (post_startup_delay)
[task 2023-03-06T07:32:04.275Z] 07:32:04     INFO -  raptor-browsertime Info: Navigating to about:blank
[task 2023-03-06T07:32:04.276Z] 07:32:04     INFO -  raptor-browsertime Info: Navigating to url about:blank iteration 1
[task 2023-03-06T07:32:09.308Z] 07:32:09     INFO -  raptor-browsertime Info: Cycle 0, waiting for 1000 ms
[task 2023-03-06T07:32:10.310Z] 07:32:10     INFO -  raptor-browsertime Info: Cycle 0, starting the measure
[task 2023-03-06T07:32:10.311Z] 07:32:10     INFO -  raptor-browsertime Info: Testing url http://127.0.0.1:49920/twitch-animation/index.html iteration 1
[task 2023-03-06T07:32:10.359Z] 07:32:10     INFO -  raptor-browsertime Info: Start GeckoProfiler with features: ["js","stackwalk","cpu","screenshots"] and threads: ["GeckoMain","Compositor","Renderer"]
[task 2023-03-06T07:32:19.439Z] 07:32:19     INFO -  raptor-browsertime Info: Collecting Gecko profile from C:\Users\task_1678061290\build\blobber_upload_dir\browsertime-results\twitch-animation\profiling\pages\127_0_0_1\twitch-animation\index.html\data\geckoProfile-1.json to C:\Users\task_1678061290\build\blobber_upload_dir\browsertime-results\twitch-animation\profiling\pages\127_0_0_1\twitch-animation\index.html\data\geckoProfile-1.json
[task 2023-03-06T07:32:21.883Z] 07:32:21     INFO -  raptor-browsertime Info: Stop GeckoProfiler.
[task 2023-03-06T07:32:21.895Z] 07:32:21     INFO -  raptor-browsertime Info: Gzip file the profile.
[task 2023-03-06T07:32:22.626Z] 07:32:22     INFO -  raptor-browsertime Info: Done stopping GeckoProfiler.
[taskcluster:error] Aborting task...
[taskcluster 2023-03-06T07:44:02.260Z] SUCCESS: The process with PID 2676 (child process of PID 7152) has been terminated.
Flags: needinfo?(canaltinova)
Assignee: afinder → nobody

Along with the 10 failures here, there are also 17 failures in bug 1809667 that are related to this issue.

Hey :sparky!

It looks like the last profiler log message is being executed properly, so I don't think this is related to the profiler code. See the "Done stopping GeckoProfiler." log here: https://github.com/sitespeedio/browsertime/blob/a8e8d133998c58947f49711b45717885b6351b14/lib/firefox/geckoProfiler.js#L184
When we arrive at this line, we are done with everything profiler related.

I specifically wanted to add this log at the end, to see if we are successfully exiting the profiler code. Before that, we were only seeing an earlier profiler log message and were never sure where it was failing.

So my theory is that, after this line, we return to this function and then hang somewhere in it: https://github.com/sitespeedio/browsertime/blob/a8e8d133998c58947f49711b45717885b6351b14/lib/firefox/webdriver/firefox.js#L121

I'm not sure where it's hanging here though, as we don't have log messages in that function. I suggest adding more logs here as well to understand what's happening.

Flags: needinfo?(canaltinova)

Ok, thanks :canova!

:kshampur, can you look into this a bit to see if there's a way we could fix it? I think the best approach here would be to set a shorter timeout for the code that does the extra profiler run, and to ignore the profile if that run fails: https://searchfox.org/mozilla-central/source/testing/raptor/raptor/browsertime/base.py#820-827

If we decrease the proc_timeout to at most 7 minutes, that should still give enough time, in a passing case, to complete a single test run for a profile.
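The suggestion above can be sketched as follows. This is illustrative only, not the actual code in testing/raptor/raptor/browsertime/base.py; PROFILER_PROC_TIMEOUT and run_extra_profiler_test are hypothetical names:

```python
import subprocess

# At most 7 minutes for the extra profiler run, per the suggestion above.
PROFILER_PROC_TIMEOUT = 7 * 60

def run_extra_profiler_test(cmd, timeout=PROFILER_PROC_TIMEOUT):
    """Run the extra-profiler browsertime invocation with its own cap."""
    try:
        subprocess.run(cmd, check=True, timeout=timeout)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # Discard the profile from this run; the main (non-profiling)
        # measurements are unaffected, so the job can still pass.
        return False
```

Because the extra profiler run is best-effort, treating its timeout as a soft failure keeps one stuck run from turning a green job into a max-run-time abort.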

Flags: needinfo?(kshampur)

I tried to reproduce this, but no luck; everything seems to pass on multiple retriggers: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&author=kshampur%40mozilla.com&selectedTaskRun=HBsB8V2tQEWFveSTsaPUiw.0

Also, from the latest orange factor graph, it looks like the failure rate is now zero:
https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2023-03-02&endday=2023-03-09&tree=trunk&failurehash=all&bug=1809652

Correspondingly, since this failure seeped into bug 1809667 (comment 42), the orange factor for that bug has also gone down considerably: https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2023-03-02&endday=2023-03-09&tree=all&failurehash=all&bug=1809667

I will keep the ni? open and keep monitoring.

I see the failure rate is still high in CI, but I am unable to reproduce on Try; a recent Try run with rebuilds is still largely passing.

So I may just write a patch for the timeout and, once it lands, monitor whether it helps m-c/autoland.

Flags: needinfo?(kshampur)
