Closed Bug 1609295 Opened 6 years ago Closed 4 years ago

Intermittent FIREFOX raptor-perftest Critical: Test failed to finish. Application timed out after <random> seconds

Categories

(Testing :: Raptor, defect, P3)

defect

Tracking

(Fission Milestone:Future, firefox73 unaffected, firefox74 affected, firefox75 disabled)

RESOLVED INCOMPLETE
Fission Milestone Future
Tracking Status
firefox73 --- unaffected
firefox74 --- affected
firefox75 --- disabled

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: intermittent-failure, test-disabled, Whiteboard: [stockwell disabled][ci-costs-2020:done][stockwell needswork:owner] fission-perf)

Attachments

(2 files, 3 obsolete files)

Filed by: nbeleuzu [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=284947604&repo=mozilla-central
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/MJj0e6ARTcms4Hm-Gxcr0Q/runs/0/artifacts/public/logs/live_backing.log


[task 2020-01-15T00:46:07.516Z] 00:46:07 INFO - raptor-mitmproxy Info: Verified mitmproxy CA certificate is installed in Firefox
[task 2020-01-15T00:46:07.517Z] 00:46:07 INFO - raptor-perftest Info: Playback recording date: 2019-06-24
[task 2020-01-15T00:46:07.517Z] 00:46:07 INFO - raptor-main Info: starting raptor test: raptor-tp6-pinterest-firefox-cold
[task 2020-01-15T00:46:07.517Z] 00:46:07 INFO - raptor-main Info: test settings: {'expected_browser_cycles': 25, 'alert_threshold': '2.0', 'playback_pageset_manifest': 'mitm4-linux-firefox-pinterest.manifest', 'browser_cycles': '25', 'here': 'C:\Users\task_1579040311\build\tests\raptor\raptor\tests\tp6\desktop', 'expected': 'pass', 'measure': ['fnbpaint', 'fcp', 'dcf', 'loadtime'], 'path': 'C:\Users\task_1579040311\build\tests\raptor\raptor\tests\tp6\desktop\raptor-tp6-pinterest-firefox-cold', 'cold': True, 'manifest_relpath': 'tests\tp6\desktop\raptor-tp6-28-cold.ini', 'unit': 'ms', 'alert_on': ['fcp', 'loadtime'], 'name': 'raptor-tp6-pinterest-firefox-cold', 'playback_recordings': 'pinterest.mp', 'page_cycles': 1, 'apps': 'firefox', 'test_url': 'https://pinterest.com/', 'manifest': 'C:\Users\task_1579040311\build\tests\raptor\raptor\tests\tp6\desktop\raptor-tp6-28-cold.ini', 'ancestor_manifest': 'raptor.ini', 'page_timeout': '60000', 'playback': 'mitmproxy', 'browser_cycle': 1, 'lower_is_better': True, 'type': 'pageload', 'relpath': 'tests\tp6\desktop\raptor-tp6-pinterest-firefox-cold'}
[task 2020-01-15T00:46:07.518Z] 00:46:07 INFO - raptor-main Info: raptor config: {'subtest_alert_on': ['fcp', 'loadtime'], 'e10s': True, 'app': 'firefox', 'symbols_path': 'https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/X3nGgh_NQz2aNlAcjSFKBw/artifacts/public/build/target.crashreporter-symbols.zip', 'power_test': False, 'playback_binary_zip': None, 'run_local': False, 'playback_files': ['C:\Users\task_1579040311\testing\mozproxy\pinterest.mp'], 'gecko_profile_interval': None, 'binary': 'C:\Users\task_1579040311\build\application\firefox\firefox.exe', 'local_profile_dir': 'c:\users\task_1579040311\appdata\local\temp\tmpncgvmv', 'playback_binary_manifest': 'mitmproxy-rel-bin-4.0.4-{platform}.manifest', 'enable_control_server_wait': False, 'no_conditioned_profile': False, 'platform': 'win', 'playback_version': '4.0.4', 'enable_webrender': False, 'gecko_profile': False, 'gecko_profile_entries': None, 'playback_pageset_manifest': 'C:\Users\task_1579040311\build\tests\raptor\raptor\playback\mitm4-linux-firefox-pinterest.manifest', 'playback_pageset_zip': None, 'playback_tool': 'mitmproxy', 'host': '127.0.0.1', 'enable_fission': False, 'memory_test': False, 'extra_prefs': {}, 'obj_path': None, 'device_name': None, 'cpu_test': False, 'is_release_build': False, 'processor': 'x86_64'}
[task 2020-01-15T00:46:07.518Z] 00:46:07 INFO - raptor-gen-test-config Info: writing test settings into background js, so webext can get it
[task 2020-01-15T00:46:07.518Z] 00:46:07 INFO - raptor-gen-test-config Info: finished writing test config to C:\Users\task_1579040311\build\tests\raptor\webext\raptor\auto_gen_test_config.js
[task 2020-01-15T00:46:07.518Z] 00:46:07 INFO - raptor-main Info: installing webext C:\Users\task_1579040311\build\tests\raptor\raptor..\webext\raptor
[task 2020-01-15T00:46:07.523Z] 00:46:07 INFO - raptor-main Info: starting firefox
[task 2020-01-15T00:46:07.523Z] 00:46:07 INFO - Application command: C:\Users\task_1579040311\build\application\firefox\firefox.exe --wait-for-browser -profile c:\users\task_1579040311\appdata\local\temp\tmpncgvmv
[task 2020-01-15T00:49:12.461Z] 00:49:12 CRITICAL - raptor-perftest Critical: Tests failed to finish! Application timed out.
[task 2020-01-15T00:49:12.461Z] 00:49:12 ERROR - raptor-perftest Error: Test failed to finish. Application timed out after 170 seconds
[task 2020-01-15T00:49:12.461Z] 00:49:12 INFO - raptor-mitmproxy Info: Stopping mitmproxy playback, killing process 3964
[task 2020-01-15T00:49:12.500Z] 00:49:12 INFO - raptor-mitmproxy Info: Successfully killed the mitmproxy playback process with exit code 572
[task 2020-01-15T00:49:12.500Z] 00:49:12 INFO - raptor-mitmproxy Info: Turning off the browser proxy
[task 2020-01-15T00:49:12.500Z] 00:49:12 INFO - raptor-mitmproxy Info: writing: C:\Users\task_1579040311\build\application\firefox\distribution\policies.json
[task 2020-01-15T00:49:12.501Z] 00:49:12 INFO - raptor-main Info: removing webext C:\Users\task_1579040311\build\tests\raptor\raptor..\webext\raptor
[task 2020-01-15T00:49:12.503Z] 00:49:12 INFO - perftest-results-handler Info: summarizing raptor test results
[task 2020-01-15T00:49:12.503Z] 00:49:12 INFO - perftest-output Error: no raptor test results found for raptor-tp6-pinterest-firefox-cold
[task 2020-01-15T00:49:12.503Z] 00:49:12 INFO - perftest-output Info: error: no raptor test results found, so no need to combine browser cycles
[task 2020-01-15T00:49:12.503Z] 00:49:12 INFO - perftest-output Error: no summarized raptor results found for raptor-tp6-pinterest-firefox-cold
[task 2020-01-15T00:49:12.523Z] 00:49:12 INFO - perftest-results-handler Critical: PERFHERDER_DATA was seen 0 times, expected 1.
[task 2020-01-15T00:49:12.523Z] 00:49:12 INFO - raptor-control-server Info: shutting down control server
[task 2020-01-15T00:49:12.705Z] 00:49:12 INFO - raptor-main Info: finished
[task 2020-01-15T00:49:12.705Z] 00:49:12 ERROR - raptor-main Critical: TEST-UNEXPECTED-FAIL: no raptor test results were found for raptor-tp6-pinterest-firefox-cold
[task 2020-01-15T00:49:18.466Z] 00:49:18 ERROR - Return code: 1
[task 2020-01-15T00:49:18.466Z] 00:49:18 WARNING - setting return code to 1
[task 2020-01-15T00:49:18.466Z] 00:49:18 INFO - Copying Raptor results to upload dir:
[task 2020-01-15T00:49:18.466Z] 00:49:18 INFO - C:\Users\task_1579040311\build\blobber_upload_dir\perfherder-data.json
[task 2020-01-15T00:49:18.467Z] 00:49:18 INFO - Copying raptor results from C:\Users\task_1579040311\build\raptor.json to C:\Users\task_1579040311\build\blobber_upload_dir\perfherder-data.json
[task 2020-01-15T00:49:18.467Z] 00:49:18 CRITICAL - Error copying results C:\Users\task_1579040311\build\raptor.json to upload dir C:\Users\task_1579040311\build\blobber_upload_dir\perfherder-data.json
[task 2020-01-15T00:49:18.467Z] 00:49:18 INFO - [Errno 2] No such file or directory: u'C:\Users\task_1579040311\build\raptor.json'
[task 2020-01-15T00:49:18.468Z] 00:49:18 INFO - Running post-action listener: _package_coverage_data
[task 2020-01-15T00:49:18.468Z] 00:49:18 INFO - Running post-action listener: _resource_record_post_action
[task 2020-01-15T00:49:18.468Z] 00:49:18 INFO - Running post-action listener: process_java_coverage_data
[task 2020-01-15T00:49:18.468Z] 00:49:18 INFO - Running post-action listener: stop_device
[task 2020-01-15T00:49:18.468Z] 00:49:18 INFO - [mozharness: 2020-01-15 00:49:18.468000Z] Finished run-tests step (success)
[task 2020-01-15T00:49:18.468Z] 00:49:18 INFO - Running post-run listener: _resource_record_post_run
[task 2020-01-15T00:49:18.600Z] 00:49:18 INFO - Total resource usage - Wall time: 275s; CPU: 3.0%; Read bytes: 143614464; Write bytes: 423882752; Read time: 1109; Write time: 830
[task 2020-01-15T00:49:18.600Z] 00:49:18 INFO - TinderboxPrint: CPU usage<br/>3.4%

Priority: P5 → P2

Bebe, this is showing up in our disable-recommended filter, could you, please, take a look?

Flags: needinfo?(fstrugariu)
Flags: needinfo?(fstrugariu)
Priority: P2 → P1

Update: this has 64 total failures in the last week and 192 failures in the last 30 days, most of the on windows10-64-ref-hw-2017: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-01-28&endday=2020-02-04&tree=trunk&bug=1609295

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=287417788&repo=autoland&lineNumber=1532

Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

I wonder if that has the same underlying problem as bug 1382162. I assume that Firefox might not fully start, and as such it hangs forever.

Having access to one of those machine would be great. So asking for a loaner might be a good idea to further investigate the problem.

Priority: P1 → P2

Hm worker type gecko-t-win10-64-ref-hw looks like to be bitbar. Not sure if we could request a loaner or similar at all. Joel, do you know more?

Otherwise (if a loaner doesn't work) does anyone have such a machine locally for testing?

Flags: needinfo?(jmaher)

we don't have local machines, the few machines we have are hosted at bitbar as they need hands on more often. Here are the specs:
https://wiki.mozilla.org/TestEngineering/Performance/Platforms

If you want a loaner, do ask :markco and he can help you get remote access.

Flags: needinfo?(jmaher)

I'm taking this bug. I only have to wait for the loaner as requested on bug 1613613.

Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Priority: P2 → P1

While waiting for the loaner I will prepare a special Firefox build with way more logging during startup. Maybe that will help us to identify where the hang actually happens.

I tried to run a couple of the raptor tests for this platform but due to the limited availability it is very hard to get any kidn of job running. So most of them timed out. The rest were passing so far:

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=288097487&revision=ab1a1f92d6d7e6b01f0b219a5ec5053b038e9e2b

Means I will really have to wait until I got remote access to one of those machine. :/

OS: Unspecified → Windows 10
Hardware: Unspecified → x86_64
Summary: Intermittent raptor-perftest Critical: Tests failed to finish! Application timed out. → Intermittent windows10-64-ref-hw-2017 raptor-perftest Critical: Tests failed to finish! Application timed out.

Hi Henrik, these tests fail on both win and android on integration. Should we file a different bug for the android ones?
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-03&endday=2020-02-10&tree=trunk&bug=1609295

Flags: needinfo?(hskupin)

(In reply to Andreea Pavel [:apavel] from comment #35)

Hi Henrik, these tests fail on both win and android on integration. Should we file a different bug for the android ones?
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-03&endday=2020-02-10&tree=trunk&bug=1609295

Oh, interesting. If that also fails on Android for G5 then we should clearly revert the summary change. Maybe that gives me a better chance to claim some workers for testing the try build. Thanks a lot!

Flags: needinfo?(hskupin)
Summary: Intermittent windows10-64-ref-hw-2017 raptor-perftest Critical: Tests failed to finish! Application timed out. → Intermittent raptor-perftest Critical: Tests failed to finish! Application timed out.
OS: Windows 10 → All
Hardware: x86_64 → All

You're welcome.

There are 58 total failures in the last 7 days on android-hw-g5-7-0-arm7-api-16 pgo and windows10-64-ref-hw-2017 opt

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288144747&repo=autoland&lineNumber=1541

[task 2020-02-10T08:33:36.636Z] 08:30:20 INFO - raptor-webext-android Info: starting geckoview
[task 2020-02-10T08:33:36.636Z] 08:30:21 INFO - adb shell_output: adb -s ZY32259STL wait-for-device shell am force-stop org.mozilla.geckoview_example, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-02-10T08:33:36.636Z] 08:30:21 INFO - adb Granting important runtime permissions to org.mozilla.geckoview_example
[task 2020-02-10T08:33:36.636Z] 08:30:22 INFO - adb shell_output: adb -s ZY32259STL wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.WRITE_EXTERNAL_STORAGE, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-02-10T08:33:36.636Z] 08:30:23 INFO - adb shell_output: adb -s ZY32259STL wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.READ_EXTERNAL_STORAGE, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-02-10T08:33:36.636Z] 08:30:24 INFO - adb shell_output: adb -s ZY32259STL wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.ACCESS_COARSE_LOCATION, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-02-10T08:33:36.636Z] 08:30:25 INFO - adb shell_output: adb -s ZY32259STL wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.ACCESS_FINE_LOCATION, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-02-10T08:33:36.636Z] 08:30:26 INFO - adb shell_output: adb -s ZY32259STL wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.CAMERA, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-02-10T08:33:36.636Z] 08:30:27 INFO - adb shell_output: adb -s ZY32259STL wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.RECORD_AUDIO, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-02-10T08:33:36.636Z] 08:30:27 INFO - adb launch_application: am start -W -n org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity -a android.intent.action.MAIN -d about:blank
[task 2020-02-10T08:33:36.636Z] 08:30:29 INFO - adb shell_output: adb -s ZY32259STL wait-for-device shell am start -W -n org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity -a android.intent.action.MAIN -d about:blank, timeout: None, root: False, timedout: None, exitcode: 0, output: Starting: Intent { act=android.intent.action.MAIN dat=about:blank cmp=org.mozilla.geckoview_example/.GeckoViewActivity }
[task 2020-02-10T08:33:36.636Z] 08:30:29 INFO - Status: ok
[task 2020-02-10T08:33:36.636Z] 08:30:29 INFO - Activity: org.mozilla.geckoview_example/.GeckoViewActivity
[task 2020-02-10T08:33:36.636Z] 08:30:29 INFO - ThisTime: 705
[task 2020-02-10T08:33:36.636Z] 08:30:29 INFO - TotalTime: 705
[task 2020-02-10T08:33:36.636Z] 08:30:29 INFO - WaitTime: 719
[task 2020-02-10T08:33:36.636Z] 08:30:29 INFO - Complete
[task 2020-02-10T08:33:36.636Z] 08:30:29 INFO - adb shell_output: adb -s ZY32259STL wait-for-device shell pidof org.mozilla.geckoview_example, timeout: None, root: False, timedout: None, exitcode: 0, output: 9114
[task 2020-02-10T08:33:36.636Z] 08:33:16 INFO - adb shell_output: adb -s ZY32259STL wait-for-device shell am force-stop org.mozilla.geckoview_example, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-02-10T08:33:36.636Z] 08:33:16 CRITICAL - raptor-perftest Critical: Tests failed to finish! Application timed out.
[task 2020-02-10T08:33:36.636Z] 08:33:16 ERROR - raptor-perftest Error: Test failed to finish. Application timed out after 170 seconds
[task 2020-02-10T08:33:36.636Z] 08:33:16 INFO - raptor-webext-android Info: removing reverse socket connections
[task 2020-02-10T08:33:36.636Z] 08:33:16 INFO - adb command_output: adb -s ZY32259STL wait-for-device reverse --remove-all, timeout: None, timedout: None, exitcode: 0, output:
[task 2020-02-10T08:33:36.636Z] 08:33:16 INFO - adb shell_bool: adb -s ZY32259STL wait-for-device shell test -d /sdcard/raptor/profile/minidumps, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-02-10T08:33:36.636Z] 08:33:16 INFO - adb shell_output: adb -s ZY32259STL wait-for-device shell sync, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-02-10T08:33:36.636Z] 08:33:16 INFO - adb shell_bool: adb -s ZY32259STL wait-for-device shell test -d /sdcard/raptor/profile/minidumps, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-02-10T08:33:36.636Z] 08:33:16 INFO - adb command_output: adb -s ZY32259STL wait-for-device pull /sdcard/raptor/profile/minidumps /tmp/tmpB897xm/minidumps, timeout: None, timedout: None, exitcode: 0, output: /sdcard/raptor/profile/minidumps/: 0 files pulled.
[task 2020-02-10T08:33:36.636Z] 08:33:16 INFO - raptor-mitmproxy Info: Stopping mitmproxy playback, killing process 823
[task 2020-02-10T08:33:36.636Z] 08:33:17 INFO - raptor-mitmproxy Info: Successfully killed the mitmproxy playback process
[task 2020-02-10T08:33:36.636Z] 08:33:17 INFO - raptor-webext Info: removing webext /builds/task_1581323280/workspace/build/tests/raptor/raptor/webextension/../../webext/raptor
[task 2020-02-10T08:33:36.636Z] 08:33:17 INFO - perftest-results-handler Info: summarizing raptor test results
[task 2020-02-10T08:33:36.636Z] 08:33:17 INFO - perftest-output Error: no raptor test results found for raptor-tp6m-bing-geckoview-cold
[task 2020-02-10T08:33:36.636Z] 08:33:17 INFO - perftest-output Info: error: no raptor test results found, so no need to combine browser cycles
[task 2020-02-10T08:33:36.636Z] 08:33:17 INFO - perftest-output Error: no summarized raptor results found for raptor-tp6m-bing-geckoview-cold
[task 2020-02-10T08:33:36.636Z] 08:33:17 INFO - perftest-results-handler Critical: PERFHERDER_DATA was seen 0 times, expected 1.
[task 2020-02-10T08:33:36.636Z] 08:33:17 INFO - raptor-webext-android Info: removing test folder for raptor: /sdcard/raptor
[task 2020-02-10T08:33:36.636Z] 08:33:17 INFO - adb shell_output: adb -s ZY32259STL wait-for-device shell sync, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-02-10T08:33:36.636Z] 08:33:17 INFO - adb shell_output: adb -s ZY32259STL wait-for-device shell rm -r /sdcard/raptor, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-02-10T08:33:36.636Z] 08:33:17 INFO - adb shell_output: adb -s ZY32259STL wait-for-device shell sync, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-02-10T08:33:36.636Z] 08:33:17 INFO - adb shell_bool: adb -s ZY32259STL wait-for-device shell test -e /sdcard/raptor, timeout: None, root: False, timedout: None, exitcode: 1, output:
[task 2020-02-10T08:33:36.636Z] 08:33:17 INFO - raptor-control-server Info: shutting down control server
[task 2020-02-10T08:33:36.636Z] 08:33:17 INFO - raptor-webext Info: finished
[task 2020-02-10T08:33:36.636Z] 08:33:17 ERROR - raptor-main Critical: TEST-UNEXPECTED-FAIL: no raptor test results were found for raptor-tp6m-bing-geckoview-cold
[task 2020-02-10T08:33:36.636Z] 08:33:17 ERROR - Return code: 1

Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]
Depends on: 1614331
No longer depends on: 1614331

Henrik this has reached the disable-recommended queue. is there a way to disable the test?

Flags: needinfo?(hskupin)

(In reply to Andreea Pavel [:apavel] from comment #48)

Henrik this has reached the disable-recommended queue. is there a way to disable the test?

I will have to forward this to Dave. It might still take a while until we are able to investigate and fix the problem. Underlying reason is that I do not have hardware and using try is not possible.

Flags: needinfo?(hskupin) → needinfo?(dave.hunt)

(In reply to Andreea Pavel [:apavel] from comment #48)

Henrik this has reached the disable-recommended queue. is there a way to disable the test?

It's not a case of disabling a test. This is more than likely a harness issue, or perhaps a number of issues potentially including the specific hardware. The slower devices appear to be most affected, which makes sense given this is a timeout. My suggestion would be to temporarily disable performance tests on windows10-64-ref-hw-2017. This will free up devices for investigation on try, and will reduce the number of failures we see.

Flags: needinfo?(dave.hunt)

that sounds logical- if there is work CI-Automation can do to help, please needinfo in this bug.

Dave can you take a look?

Flags: needinfo?(dave.hunt)
Flags: needinfo?(dave.hunt) → needinfo?(fstrugariu)

Patch Reviewed

Flags: needinfo?(fstrugariu)

I scheduled the landing, but autoland is currently closed due to hg issues. It will land once the trees are open again.

Keywords: leave-open
Pushed by hskupin@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/93761375e661 temporarily disable performance tests on windows10-64-ref-hw-2017 r=perftest-reviewers,Bebe,whimboo
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]
Whiteboard: [stockwell disabled] → [stockwell disabled][ci-costs-2020:done]

Henrik, this started spiking on autoland starting with the ~27th of February and there are 70 total failures since, on android-hw-g5-7-0-arm7-api-16
Looks like spikes started from this merge: https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=7a5cb26a2d518e9cfaf512ba9a06239b573d7f0e

Tarek, can it be from bug 1602657?

Henrik, should we temporarily update the disabling patch or wait for a fix here?

Flags: needinfo?(tarek)
Flags: needinfo?(hskupin)

Interesting. I would have thought that bug 1614275 could have caused that but that's not in this changeset. Before its landing we had a wrong reporting of various different failures all with no raptor results found.

Now that this is not only on Windows 10 HW ref 2017 devices it might be easier to investigate. Given that this failure should be across all Raptor tests I cannot see that we can disable anything here without affecting Raptor overall. Given that Tarek's patch is the only Raptor change in that merge, maybe you could run a retriggered try build for Moto G5 with his patch backed-out? Thanks

Depends on: 1614275
Flags: needinfo?(hskupin) → needinfo?(apavel)
No longer blocks: 1582337
No longer blocks: 1593239

Andreea, bug 1602657 is known to change the performance results baseline of some tests on windows. We are not entirely sure yet why, but we've decided to use it as the new baseline. We did not have any case of failures or timeout with the patch so I don't think it's related.

Flags: needinfo?(tarek)

Most of the failing jobs are on Android for the G5 device, and there are the following entries visible in the Android logcat:

03-03 00:17:46.491 10403 10418 I GeckoConsole: 1583194666490	addons.xpi-utils	WARN	addMetadata: Add-on raptor@mozilla.org is invalid: Error: Error while loading 'file:///storage/emulated/0/raptor/profile/extensions/raptor@mozilla.org/manifest.json' (NS_ERROR_FILE_ACCESS_DENIED)(resource://gre/modules/Extension.jsm:557:20) JS Stack trace: readJSON/</<@Extension.jsm:557:20
03-03 00:17:46.491 10403 10418 I GeckoConsole: onStopRequest@NetUtil.jsm:128:18
03-03 00:17:46.491 10403 10418 I GeckoConsole: awaitPromise@XPIProvider.jsm:226:15
03-03 00:17:46.491 10403 10418 I GeckoConsole: syncLoadManifest@XPIInstall.jsm:754:22
03-03 00:17:46.491 10403 10418 I GeckoConsole: addMetadata@XPIDatabase.jsm:2782:32
03-03 00:17:46.491 10403 10418 I GeckoConsole: processFileChanges@XPIDatabase.jsm:3221:26
03-03 00:17:46.491 10403 10418 I GeckoConsole: checkForChanges@XPIProvider.jsm:2933:55
03-03 00:17:46.491 10403 10418 I GeckoConsole: startup@XPIProvider.jsm:2393:12
03-03 00:17:46.491 10403 10418 I GeckoConsole: callProvider@AddonManager.jsm:215:31
03-03 00:17:46.491 10403 10418 I GeckoConsole: _startProvider@AddonManager.jsm:587:17
03-03 00:17:46.491 10403 10418 I GeckoConsole: startup@AddonManager.jsm:811:14
03-03 00:17:46.491 10403 10418 I GeckoConsole: startup@AddonManager.jsm:3482:26
03-03 00:17:46.491 10403 10418 I GeckoConsole: observe@addonManager.js:87:29

Which means that something prevents access to the manifest.json file.

For desktop we actually hit a crash of Firefox (see bug 1544522), which is not correctly reported. I filed bug 1619512 to fix the order of log output.

Here a try push for G5 with the extensions.logging.enabled pref set to true:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2255d12b8e7607b78dbd81e014c0ec68207adfa8

Maybe that reveals some more information, which can be used to narrow down this problem.

There are also cases when we send the test settings to the web extension but then nothing happens:

[task 2020-03-03T10:40:26.763Z] 10:38:10     INFO -  raptor-control-server Info: sent test settings to webext runner
[task 2020-03-03T10:40:26.763Z] 10:40:26     INFO -  adb shell_output: adb -s ZY322M5BVC wait-for-device shell am force-stop org.mozilla.geckoview_example, timeout: None, root: False, timedout: None, exitcode: 0, output:

Maybe there is an uncaught exception in the web extension which we do not handle correctly.

When checking the appropriate Android logcat I can see that the last console output from the webextension is:

03-03 10:38:10.791 7365 7380 I GeckoConsole: [raptor-runnerjs] using page timeout: 60000ms

It means that the test settings have been received and processed, but then storing the settings seem to fail, because we never get the log entry for wrote settings to ext local storage. As such this block of code is related:

https://searchfox.org/mozilla-central/rev/91f6c02fcf4c16f78fdc4417f61f192688294066/testing/raptor/webext/raptor/runner.js#155-198

Given that nothing can fail here in the switch block, the failure is with actually saving the test settings in local storage. This explains the following log messages:

03-03 10:38:10.818  7365  7433 E GeckoConsole: [JavaScript Error: "IndexedDB UnknownErr: ActorsParent.cpp:612"]
03-03 10:38:10.819  7365  7380 E GeckoConsole: [JavaScript Error: "UnknownError: The operation failed for reasons unrelated to the database itself and not covered by any other error code." {file: "resource://gre/modules/ExtensionStorageIDB.jsm" line: 812}]
03-03 10:38:10.819  7365  7380 E GeckoConsole: normalizeStorageError@resource://gre/modules/ExtensionStorageIDB.jsm:812:10
03-03 10:38:10.819  7365  7380 E GeckoConsole: getAPI/local[method]@chrome://extensions/content/child/ext-storage.js:273:37

This can easily be reproduced by throwing a custom error anywhere within raptorRunner. As such we need an outer try/catch, and report back such an error similar to page load timeouts, which is then appropriately handled by the control server.

Flags: needinfo?(apavel)

If this is specific to certain devices, we can ask bitbar to reimage or replace them. If you need help figuring out which devices have this failure NI me and I'll figure it out and update the bug.

Bob, those failures seem to happen are across all devices. So not specific to just some specific ones. But maybe this catch all failure message is also covering other bugs, which I haven't seen yet. With my upcoming patch we will report specific failure messages.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3f35c8018719f0f8fb9c700e673340ade18d5b50

(In reply to Tarek Ziadé (:tarek) from comment #68)

Andreea, bug 1602657 is known to change the performance results baseline of some tests on windows. We are not entirely sure yet why, but we've decided to use it as the new baseline. We did not have any case of failures or timeout with the patch so I don't think it's related.

Thanks for looking.

The try still shows problems for Android, and the reason might be the following...

03-03 22:10:08.289 7853 7868 E GeckoConsole: [JavaScript Error: "SyntaxError: The URI is malformed." {file: "moz-extension://925e6aad-1e42-4cab-a95e-412bdcc1f2fc/runner.js" line: 615}]

That failure is thrown inside of postToControlServer(), which means the web extension failed to send a message to the control server. And this may have been the error message.

Here a try build with some logging in that method:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b7411827fa4e6f115ab19f14534dcfb8fd4c3c50

In some cases my former patch works and we get a specific failure:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b7411827fa4e6f115ab19f14534dcfb8fd4c3c50&selectedJob=291599862

[task 2020-03-04T09:34:16.315Z] 09:33:50 CRITICAL - raptor-perftest Critical: Failed to run raptor-tp6m-amazon-geckoview-cold: An unexpected error occurred

Sadly the Android logcat doesn't contain any details, so I assume this specific log output hasn't reached the console on the Android device before the application died.

In another case we still have the generic message:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=291599825&repo=try

[task 2020-03-04T09:34:36.728Z] 09:34:36 CRITICAL - raptor-perftest Critical: Test failed to finish. Application timed out after 141 seconds

And that is indeed a problem with the URL of the control server:

03-04 09:31:52.984 7285 7300 I Gecko : console.info: "[raptor-runnerjs] About to send message to control server: http://null:null/"

As it looks like both the host and port haven't been set correctly. And indeed the test settings are sometimes wrong. Here what I can see in my local file:

    return {"browser": "firefox",
            "cs_port": "53901",
            "test_name": "raptor-firefox-tp6",
            "test_settings_url": "http://None:53901/json/raptor-firefox-tp6.json",
            "post_startup_delay": "0",
            "benchmark_port": "0",
            "host": "None",
            "debug_mode": "0",
            "browser_cycle": "1"};
}

There is no host set, but a port. So something is wrong in the generation of the auto_gen_test_config.js file.

Here another try build with a hardened method for writing out the test config for the web extension. It doesn't yet contain the fix for supplying a valid host and port for the control server, but makes sure that an exception is raised under such a condition. That means we actually won't start the test at all, and as result won't run into an application timeout.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ef98b77d0f930616bbd87a2c713352875006931b

There is actually a problem on Android to load the auto_gen_test_config.js file:

03-04 15:37:55.999 10646 10661 W GeckoConsole: [JavaScript Warning: "Loading failed for the <script> with source “moz-extension://d8d6ff4f-f0c0-471d-91e4-f79828a68ca8/auto_gen_test_config.js”." {file: "moz-extension://d8d6ff4f-f0c0-471d-91e4-f79828a68ca8/_generated_background_page.html" line: 5}]
03-04 15:37:56.008 10646 10661 I Gecko   : console.info: "[raptor-runnerjs] About to send message to control server: http://null:null/"

Maybe we sometimes miss to transfer it to the device when installing the extension. That's something to check next.

The method should never allow invalid settings for the
control server host and port to be written to the test
config as read by the webextension. It would mean that
it will never be able to send messages to the control
server, which will cause Raptor to kill the test with
an application timeout.

:bebe as Henrik will be out for a few days, can someone from the team pick this up?

Version: Version 3 → unspecified

There are 339 total failures in the last 7 days on

  • android-hw-g5-7-0-arm7-api-16 pgo
  • android-hw-p2-8-0-android-aarch64 pgo
  • linux64-shippable opt
  • macosx1014-64-shippable opt
  • windows10-64-ref-hw-2017 opt

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=291944105&repo=autoland&lineNumber=1548

[task 2020-03-06T05:44:19.958Z] 05:41:52 INFO - raptor-webext-android Info: starting geckoview
[task 2020-03-06T05:44:19.958Z] 05:41:53 INFO - adb shell_output: adb -s ZY3223LVXH wait-for-device shell am force-stop org.mozilla.geckoview_example, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-03-06T05:44:19.958Z] 05:41:53 INFO - adb Granting important runtime permissions to org.mozilla.geckoview_example
[task 2020-03-06T05:44:19.958Z] 05:41:54 INFO - adb shell_output: adb -s ZY3223LVXH wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.WRITE_EXTERNAL_STORAGE, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-03-06T05:44:19.958Z] 05:41:55 INFO - adb shell_output: adb -s ZY3223LVXH wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.READ_EXTERNAL_STORAGE, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-03-06T05:44:19.958Z] 05:41:56 INFO - adb shell_output: adb -s ZY3223LVXH wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.ACCESS_COARSE_LOCATION, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-03-06T05:44:19.958Z] 05:41:57 INFO - adb shell_output: adb -s ZY3223LVXH wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.ACCESS_FINE_LOCATION, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-03-06T05:44:19.958Z] 05:41:58 INFO - adb shell_output: adb -s ZY3223LVXH wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.CAMERA, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-03-06T05:44:19.958Z] 05:41:59 INFO - adb shell_output: adb -s ZY3223LVXH wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.RECORD_AUDIO, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-03-06T05:44:19.958Z] 05:41:59 INFO - adb launch_application: am start -W -n org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity -a android.intent.action.MAIN -d about:blank
[task 2020-03-06T05:44:19.958Z] 05:42:01 INFO - adb shell_output: adb -s ZY3223LVXH wait-for-device shell am start -W -n org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity -a android.intent.action.MAIN -d about:blank, timeout: None, root: False, timedout: None, exitcode: 0, output: Starting: Intent { act=android.intent.action.MAIN dat=about:blank cmp=org.mozilla.geckoview_example/.GeckoViewActivity }
[task 2020-03-06T05:44:19.958Z] 05:42:01 INFO - Status: ok
[task 2020-03-06T05:44:19.958Z] 05:42:01 INFO - Activity: org.mozilla.geckoview_example/.GeckoViewActivity
[task 2020-03-06T05:44:19.958Z] 05:42:01 INFO - ThisTime: 649
[task 2020-03-06T05:44:19.958Z] 05:42:01 INFO - TotalTime: 649
[task 2020-03-06T05:44:19.958Z] 05:42:01 INFO - WaitTime: 663
[task 2020-03-06T05:44:19.958Z] 05:42:01 INFO - Complete
[task 2020-03-06T05:44:19.958Z] 05:42:01 INFO - adb shell_output: adb -s ZY3223LVXH wait-for-device shell pidof org.mozilla.geckoview_example, timeout: None, root: False, timedout: None, exitcode: 0, output: 7408
[task 2020-03-06T05:44:19.958Z] 05:44:19 INFO - adb shell_output: adb -s ZY3223LVXH wait-for-device shell am force-stop org.mozilla.geckoview_example, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-03-06T05:44:19.958Z] 05:44:19 CRITICAL - raptor-perftest Critical: Tests failed to finish! Application timed out.
[task 2020-03-06T05:44:19.958Z] 05:44:19 ERROR - raptor-perftest Error: Test failed to finish. Application timed out after 141 seconds
[task 2020-03-06T05:44:19.958Z] 05:44:19 INFO - raptor-webext-android Info: removing reverse socket connections
[task 2020-03-06T05:44:19.958Z] 05:44:19 INFO - adb command_output: adb -s ZY3223LVXH wait-for-device reverse --remove-all, timeout: None, timedout: None, exitcode: 0, output:
[task 2020-03-06T05:44:19.958Z] 05:44:19 INFO - adb shell_bool: adb -s ZY3223LVXH wait-for-device shell test -d /sdcard/raptor/profile/minidumps, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-03-06T05:44:45.551Z] 05:44:19 INFO - adb shell_output: adb -s ZY3223LVXH wait-for-device shell sync, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - adb shell_bool: adb -s ZY3223LVXH wait-for-device shell test -d /sdcard/raptor/profile/minidumps, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - adb command_output: adb -s ZY3223LVXH wait-for-device pull /sdcard/raptor/profile/minidumps /tmp/tmpgNUT81/minidumps, timeout: None, timedout: None, exitcode: 0, output: /sdcard/raptor/profile/minidumps/: 0 files pulled, 0 skipped.
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - raptor-mitmproxy Info: Stopping mitmproxy playback, killing process 826
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - raptor-mitmproxy Info: Successfully killed the mitmproxy playback process
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - raptor-webext Info: removing webext /builds/task_1583473174/workspace/build/tests/raptor/raptor/webextension/../../webext/raptor
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - raptor-webext-android Info: removing test folder for raptor: /sdcard/raptor
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - adb shell_output: adb -s ZY3223LVXH wait-for-device shell sync, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - adb shell_output: adb -s ZY3223LVXH wait-for-device shell rm -r /sdcard/raptor, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - adb shell_output: adb -s ZY3223LVXH wait-for-device shell sync, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - adb shell_bool: adb -s ZY3223LVXH wait-for-device shell test -e /sdcard/raptor, timeout: None, root: False, timedout: None, exitcode: 1, output:
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - raptor-control-server Info: shutting down control server
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - raptor-webext Info: finished
[task 2020-03-06T05:44:45.551Z] 05:44:20 ERROR - Return code: 1
[task 2020-03-06T05:44:45.551Z] 05:44:20 WARNING - setting return code to 1
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - Killing logcat pid 600.
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - Copying Raptor results to upload dir:
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - /builds/task_1583473174/workspace/build/blobber_upload_dir/perfherder-data.json
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - Copying raptor results from /builds/task_1583473174/workspace/build/raptor.json to /builds/task_1583473174/workspace/build/blobber_upload_dir/perfherder-data.json
[task 2020-03-06T05:44:45.551Z] 05:44:20 CRITICAL - Error copying results /builds/task_1583473174/workspace/build/raptor.json to upload dir /builds/task_1583473174/workspace/build/blobber_upload_dir/perfherder-data.json
[task 2020-03-06T05:44:45.551Z] 05:44:20 INFO - [Errno 2] No such file or directory: u'/builds/task_1583473174/workspace/build/raptor.json'

Flags: needinfo?(fstrugariu)

for the android tests we can find in logcat:

03-05 19:44:46.908 11451 11466 I Gecko   : console.info: "[raptor-runnerjs] starting raptorRunner"
03-05 19:44:46.911 11451 11466 I GeckoConsole: [raptor-runnerjs] starting raptorRunner
03-05 19:44:46.918 11451 11466 I Gecko   : console.info: "[raptor-runnerjs] test name is: raptor-tp6m-espn-geckoview-cold"
03-05 19:44:46.918 11451 11466 I GeckoConsole: [raptor-runnerjs] test name is: raptor-tp6m-espn-geckoview-cold
03-05 19:44:46.919 11451 11466 I Gecko   : console.info: "[raptor-runnerjs] test settings url is: http://127.0.0.1:54485/json/raptor-tp6m-espn-geckoview-cold.json"
03-05 19:44:46.919 11451 11466 I GeckoConsole: [raptor-runnerjs] test settings url is: http://127.0.0.1:54485/json/raptor-tp6m-espn-geckoview-cold.json
03-05 19:44:46.928 11451 11466 E GeckoConsole: [JavaScript Error: "SyntaxError: The URI is malformed." {file: "moz-extension://dc1542ce-64bf-4c49-8295-f68097c4d9dc/runner.js" line: 615}]
03-05 19:44:46.929 11491 11506 W ResourceType: Too many attribute references, stopped at: 0x01010099
03-05 19:44:46.936 11451 11451 D GeckoNetworkManager: Incoming event enableNotifications for state OnWithListeners -> null
03-05 19:44:46.936 11451 11451 W GeckoNetworkManager: Invalid event enableNotifications for state OnWithListeners
03-05 19:44:46.970 11491 11506 D GeckoViewContent[C]: onInit

It look like the raptor web extension is returning a js error:

03-05 19:44:46.928 11451 11466 E GeckoConsole: [JavaScript Error: "SyntaxError: The URI is malformed." {file: "moz-extension://dc1542ce-64bf-4c49-8295-f68097c4d9dc/runner.js" line: 615}]

https://searchfox.org/mozilla-central/source/testing/raptor/webext/raptor/runner.js#615

Not sure if this is fixed in Henrik's patch... Can someone with more in depth js knowledge take a look pls

Flags: needinfo?(hskupin)
Flags: needinfo?(fstrugariu)
Flags: needinfo?(dave.hunt)

I consistently see that malformed URI error in both passing and failing tests so I think it's a red herring. That said, we do have an issue where the host is "None" from time to time. I noticed that the mozharness --host setting doesn't have a default setting so that might be the root cause of it but I haven't dug into that too much; :whimboo's patch will help us debug this further.

It seems like the error GeckoConsole: [JavaScript Error: "IndexedDB UnknownErr: ActorsParent.cpp:613"] is the real one here. I haven't been able to find out why it happens, but I managed to find that it's caused by clearing the extensions local storage so it's not related to what we are setting:

03-06 18:14:31.753  8014  8029 I Gecko   : console.info: "[raptor-runnerjs] attempting to write to local storage"
03-06 18:14:31.754  8014  8029 I GeckoConsole: [raptor-runnerjs] attempting to write to local storage
03-06 18:14:31.754  8014  8029 I Gecko   : console.info: "[raptor-runnerjs] [object Object]"
03-06 18:14:31.755  8014  8029 I GeckoConsole: [raptor-runnerjs] [object Object]
03-06 18:14:31.788  8014  8080 E GeckoConsole: [JavaScript Error: "IndexedDB UnknownErr: ActorsParent.cpp:613"]
03-06 18:14:31.789  8014  8029 E GeckoConsole: [JavaScript Error: "UnknownError: The operation failed for reasons unrelated to the database itself and not covered by any other error code." {file: "resource://gre/modules/ExtensionStorageIDB.jsm" line: 812}]

This was found using this debugging patch - had we cleared the extension data we would have seen a "cleared storage" log message: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bcc0fce02e7fdda70b032ed30903018e318bc76f&selectedJob=292022290

I've read that some people have seen this UnknownErr error when the disk drives are faulty but that's not happening here since all of our devices have this error and it seems to be spread quite uniformly across all of them. The failure does occur much more often on G5 than P2 so it's possible that something specific to the devices is making the error more prominent.

I don't see a reason why we might be clearing the extension storage since the set() function overwrites existing values with the new ones. Maybe this clearing is needed to get rid of settings which are not in the new settings. In either case, I'm testing to see what happens if we remove this clear here - it works locally without any problems: https://treeherder.mozilla.org/#/jobs?repo=try&revision=089bd80b6ca40d2c8434c1a9d94a4b539993ee21

Hi Greg, your last try push has 2 failures in 390 retriggers and 2 lint failures. Based on that, would the changes in your try push reduce the failures on integration?

There are 358 total failures in the last 7 days, the large majority on android-hw-g5-7-0-arm7-api-16 pgo, but also some failures on linux, mac and windows.

Whiteboard: [stockwell disabled][ci-costs-2020:done] → [stockwell disabled][ci-costs-2020:done][stockwell needswork:owner]
Depends on: 1620827
See Also: → 1620828

Hi :apavel, I'm a bit surprised that the change worked as well as it did. The intermittent failure rate seems to have dropped from 4% to 0.5% for Moto G5 at least. I've filed bug 1620827 and provided a patch with this fix. I haven't seen anything broken in all the tests I've done so I'm relatively confident that the clear isn't needed. I also can't find any historical information about why it's there, so I think it was added as a "just in case" scenario.

I've also filed bug 1620828 since we shouldn't be failing with an unknown error when we try to clear the local storage in the web extension.

Thanks for working on this Greg!

Flags: needinfo?(dave.hunt)

Hi :malexandru, I don't think it could be caused by those changes.

For a check, I ran those tests on my try push that used this patch and there were no failures: https://treeherder.mozilla.org/#/jobs?repo=try&revision=db6df2226e158bb9ed1d4944efc8baba8f257c2c

Flags: needinfo?(gmierz2)

For the tp6 issues caused by Bug 1614565
I open and submitted a fix in :
Bug 1621233 - [mozproxy] Fix mitm version detection

Flags: needinfo?(fstrugariu)
Flags: needinfo?(hskupin)

(In reply to Greg Mierzwinski [:sparky] from comment #95)

I consistently see that malformed URI error in both passing and failing tests so I think it's a red herring.

It's happening for all the tests because we are trying to post the attaching of the event listener to the control server before the test config has been loaded! I will fix that be refactoring the initialization routine of the extension.

With my latest try build everything is green on try and I don't know why. On mozilla-central we still have a lot of failures. There are no specific changes implemented, which could have caused that.

Here another try build with the latest changes:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=bf4585d950adb2dbe951ed5a91ece99e26f2b270

Comment on attachment 9130987 [details]
Bug 1609295 - [raptor] Hardening gen_test_config.

Revision D65450 was moved to bug 1622213. Setting attachment 9130987 [details] to obsolete.

Attachment #9130987 - Attachment is obsolete: true

Comment on attachment 9130988 [details]
Bug 1609295 - [raptor] Report explicit Runtime Error instead of generic 'Application timed out'.

Revision D65451 was moved to bug 1622213. Setting attachment 9130988 [details] to obsolete.

Attachment #9130988 - Attachment is obsolete: true

With my patch from the other bug landed the summary needs to be updated.

Summary: Intermittent raptor-perftest Critical: Tests failed to finish! Application timed out. → Intermittent raptor-perftest Critical: Test failed to finish. Application timed out after 141 seconds

The remaining problem here seem to be bug 1622610 now.

As of Bug 1623360 - Re-enable raptor tests for the 2017 reference hardware can we backout patch https://hg.mozilla.org/mozilla-central/rev/93761375e661

Flags: needinfo?(apavel)

I would suggest that we do a try build first.

See Also: → 1624077

I receive hunks failed on taskcluster/ci/test/raptor.yml.rej when I try to backout this revision. (In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #125)

I would suggest that we do a try build first.

Flags: needinfo?(apavel) → needinfo?(fstrugariu)

Since I couldn’t do the backout becuase of the hunk failures, I made the changes from the backout and pushed to try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=99303d1bd68560aae117d7e23b9419d3a1bc8ea6

(In reply to Razvan Maries from comment #128)

Since I couldn’t do the backout becuase of the hunk failures, I made the changes from the backout and pushed to try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=99303d1bd68560aae117d7e23b9419d3a1bc8ea6

It looks great for Rap jobs, but seems to still has a higher failure rate for Fission. I would suggest we only re-enable for non-Fission yet, and do the Fission work later. Bebe, can you take care of that in your newly filed bug? Please also mark it blocking this one so that we have a reference. Thanks.

Looks like we use different timeouts for products, platforms and build types. To bundle all classifications on this bug lets get rid of the 141s timeout here.

Summary: Intermittent raptor-perftest Critical: Test failed to finish. Application timed out after 141 seconds → Intermittent raptor-perftest Critical: Test failed to finish. Application timed out after <random> seconds
Depends on: 1624077
See Also: 1624077
Depends on: 1623360

Thanks @Razvan for the help ended up to create a patch in Bug 1623360

Flags: needinfo?(fstrugariu)

bebe, sparky, dave: We have seen a spike in these failures for the moto g5 devices. It slowly began on Friday and spiked on Saturday. If it continues, I'm sure it will be the top failure today. You have all been involved in raptor changes during the period. Could you help figure what the root cause of this is?

Bug 1600193 - Reduce page load timeouts for some raptor benchmark tests seems like a candidate but the other recent changes are possibilities also.

Flags: needinfo?(gmierz2)
Flags: needinfo?(fstrugariu)
Flags: needinfo?(dave.hunt)

:bc, thanks for bringing this up. I'm not sure how bug 1600193 could have caused this since it only modifies the timeout for benchmarks and I don't see any benchmark timeouts in the test failure list: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-03-23&endday=2020-03-30&tree=trunk&bug=1609295

:whimboo thinks he might have found the issue to many of these timeouts here: https://bugzilla.mozilla.org/show_bug.cgi?id=1625892

I've ni?'ed whimboo about this uptick as well.

Flags: needinfo?(gmierz2) → needinfo?(hskupin)

Defer to :whimboo and others. Thanks.

Flags: needinfo?(dave.hunt)

From the logcat the raptor extension on cycle 20 failed to start

03-30 11:00:11.384 15774 15774 D GeckoNetworkManager: New network state: UP, WIFI, WIFI
03-30 11:00:11.657 15774 15790 I GeckoConsole: 1585566011656	addons.xpi-utils	WARN	addMetadata: Add-on raptor@mozilla.org is invalid: Error: Error while loading 'file:///storage/emulated/0/raptor/profile/extensions/raptor@mozilla.org/manifest.json' (NS_ERROR_FILE_ACCESS_DENIED)(resource://gre/modules/Extension.jsm:559:20) JS Stack trace: readJSON/</<@Extension.jsm:559:20
03-30 11:00:11.657 15774 15790 I GeckoConsole: onStopRequest@NetUtil.jsm:128:18
03-30 11:00:11.657 15774 15790 I GeckoConsole: awaitPromise@XPIProvider.jsm:226:15
03-30 11:00:11.657 15774 15790 I GeckoConsole: syncLoadManifest@XPIInstall.jsm:754:22
03-30 11:00:11.657 15774 15790 I GeckoConsole: addMetadata@XPIDatabase.jsm:2783:32
03-30 11:00:11.657 15774 15790 I GeckoConsole: processFileChanges@XPIDatabase.jsm:3219:26
03-30 11:00:11.657 15774 15790 I GeckoConsole: checkForChanges@XPIProvider.jsm:2955:55
03-30 11:00:11.657 15774 15790 I GeckoConsole: startup@XPIProvider.jsm:2415:12
03-30 11:00:11.657 15774 15790 I GeckoConsole: callProvider@AddonManager.jsm:215:31
03-30 11:00:11.657 15774 15790 I GeckoConsole: _startProvider@AddonManager.jsm:587:17
03-30 11:00:11.657 15774 15790 I GeckoConsole: startup@AddonManager.jsm:811:14
03-30 11:00:11.657 15774 15790 I GeckoConsole: startup@AddonManager.jsm:3482:26
03-30 11:00:11.657 15774 15790 I GeckoConsole: observe@addonManager.js:87:29
03-30 11:00:11.658 15774 15790 I Gecko   : 1585566011656	addons.xpi-utils	WARN	addMetadata: Add-on raptor@mozilla.org is invalid: Error: Error while loading 'file:///storage/emulated/0/raptor/profile/extensions/raptor@mozilla.org/manifest.json' (NS_ERROR_FILE_ACCESS_DENIED)(resource://gre/modules/Extension.jsm:559:20) JS Stack trace: readJSON/</<@Extension.jsm:559:20
03-30 11:00:11.658 15774 15790 I Gecko   : onStopRequest@NetUtil.jsm:128:18
Flags: needinfo?(fstrugariu)

We should wait until bug 1625892 has been landed. If there is a hick-up when sending/receiving messages between the different background scripts of the Raptor webextension, it might indeed cause a deadlock. Especially when that happens before we set the page timeout alarm. I fixed a couple of missing awaits over on bug 1625892.

Depends on: 1625892
Flags: needinfo?(hskupin)

I actually triggered a perma failure with the patch on the other bug, which always leads to this failure:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6548bc824524d4364f907eaf8072b9c912e2e95b&selectedJob=295444530

Also I can reproduce it locally with my Moto G7 device running Android 10. GeckoView Example starts for some seconds, and I can see a white screen. Then the application closes itself. This can even be reproduced when starting it manually. Only when I remove the profile location override in org.mozilla.geckoview_example-geckoview-config.yaml which points to /sdcard/raptor/profile/ the app starts successfully.

The adb locat shows the following failure locally:

03-30 22:56:29.946  2210  2350 I ActivityManager: Start proc 5790:org.mozilla.geckoview_example:tab0/u0a270 for service {org.mozilla.geckoview_example/org.mozilla.gecko.process.GeckoChildProcessServices$tab0}
03-30 22:56:30.010  2210  3866 V WindowManager: Changing focus from Window{8a3a85a u0 org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity} to null displayId=0
03-30 22:56:30.608  2210  4818 I WindowManager: Window drawn AppWindowToken{b12c7f0 token=Token{4d9d833 ActivityRecord{6bbc3a2 u0 org.mozilla.geckoview_example/.GeckoViewActivity t2515}}}
03-30 22:56:30.660  2210  2983 W InputDispatcher: channel '8a3a85a org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity (server)' ~ Consumer closed input channel or an error occurred.  events=0x9
03-30 22:56:30.660  2210  2983 E InputDispatcher: channel '8a3a85a org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity (server)' ~ Channel is unrecoverably broken and will be disposed!
03-30 22:56:30.665  2210 12573 I ActivityManager: Process org.mozilla.geckoview_example (pid 5687) has died: cch CRE
03-30 22:56:30.667  2210  7101 I WindowManager: WIN DEATH: Window{8a3a85a u0 org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity}
03-30 22:56:30.667  2210  7101 W InputDispatcher: Attempted to unregister already unregistered input channel '8a3a85a org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity (server)'
03-30 22:56:39.948  2210  2349 W ActivityManager: Process ProcessRecord{248445b 5790:org.mozilla.geckoview_example:tab0/u0a270} failed to attach

James or Nick, any idea what this could be or how to further debug it?

Flags: needinfo?(snorp)
Flags: needinfo?(nalexander)

In the failures above, they all appear to be timeouts. The test harness eventually kills it.

03-30 21:39:56.165  7686  7686 D AndroidRuntime: Calling main entry com.android.commands.am.Am
03-30 21:39:56.169  1464  2418 I ActivityManager: Force stopping org.mozilla.geckoview_example appid=10118 user=0: from pid 7686
03-30 21:39:56.169  1464  2418 I ActivityManager: Killing 7451:org.mozilla.geckoview_example/u0a118 (adj 0): stop org.mozilla.geckoview_example
Flags: needinfo?(snorp)

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #149)

In the failures above, they all appear to be timeouts. The test harness eventually kills it.

No, this is not when running GeckoView_Example via Raptor, but starting manually on the device. It's running for some seconds, and then closes automatically with the above entries in the adb logcat. I can only get around that when removing the custom profile location as pushed via the YAML file, which points to /sdcard/raptor/profile/.

Flags: needinfo?(snorp)

Ah, I think it's well-known that you can't use the raptor profile when launching the app "normally". Lots of folks have had this problem if there is a rogue yaml config left by raptor. In the logcats from automation you can see it does launch and load pages, but maybe has a problem reporting the results?

03-30 19:38:46.600  7928  7944 I GeckoConsole: [raptor-runnerjs] have 1 results for loadtime, as expected
03-30 19:38:46.630  7928  7944 E GeckoConsole: [JavaScript Error: "Error: Invalid tab ID 0" {file: "undefined" line: 0}]
Flags: needinfo?(snorp)

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #152)

Ah, I think it's well-known that you can't use the raptor profile when launching the app "normally". Lots of folks have had this problem if there is a rogue yaml config left by raptor. In the logcats from automation you can see it does launch and load pages, but maybe has a problem reporting the results?

I don't think the problem is the profile. Even when I completely empty /sdcard/raptor/profile geckoview_example doesn't start up. The yaml file is located at /data/local/tmp/org.mozilla.geckoview_example-geckoview-config.yaml. Interestingly I do not have access to that file when I try to check its content via adb shell cat %filename%.

Could that be the reason why it fails to start because files in that folder cannot be read? But I wonder why Raptor was able to push that file before to that location:

19:23:52     INFO -  adb command_output: adb -s ZY226X76LJ wait-for-device push /var/folders/c2/q_47wqh13n902fj9_zxczclr0000gn/T/tmpc6PSTw/org.mozilla.geckoview_example-geckoview-config.yaml /data/local/tmp/org.mozilla.geckoview_example-geckoview-config.yaml, timeout: None, timedout: None, exitcode: 0, output: /var/folders/c2/q_47wqh13n902fj9_zxczclr0000gn/T/tmpc6PSTw/org.mozilla.geckoview_example-geckoview-config.yaml: 1 file pushed, 0 skipped. 0.0 MB/s (127 bytes in 0.006s)

Also AFAICR I had access to files in that folder yesterday, and when I removed the --profile option, geckoview_example also did startup because it was using its own default profile location.

Flags: needinfo?(snorp)

Note that my patches from bug 1627253 seem to help a lot! We should be able to reduce the amount of intermittent failures a lot with this signature soon.

Depends on: 1627875

Please note that we have a massive drop of those failure now since we split off failures for loading the Raptor web extension (bug 1627867). As result only a hand-full of failures remain here! That's great to see.

As such I won't dig any further for now and we can de-prioritize the bug to P5.

Assignee: hskupin → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(nalexander)
Priority: P1 → P3
See Also: → 1627867
Priority: P3 → P5

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #153)

Could that be the reason why it fails to start because files in that folder cannot be read? But I wonder why Raptor was able to push that file before to that location:

There could be something to this. We target SDK 29 now, which changes how external storage works for devices running Q and higher.

Flags: needinfo?(snorp)
No longer blocks: 1612795
No longer depends on: 1624077
No longer depends on: 1622610

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #160)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #153)

Could that be the reason why it fails to start because files in that folder cannot be read? But I wonder why Raptor was able to push that file before to that location:

There could be something to this. We target SDK 29 now, which changes how external storage works for devices running Q and higher.

Are you referring to scoped storage (https://developer.android.com/training/data-storage#scoped-storage)? Does that mean we should better push the GV profile not to /sdcard/raptor/profile but the app-specific folder on the external storage? I assume otherwise we might have to set some specific permissions?

Flags: needinfo?(snorp)

Yup, scoped storage. Yes, I would try to use the app-specific external storage directory on those devices.

Flags: needinfo?(snorp)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #153)

19:23:52     INFO -  adb command_output: adb -s ZY226X76LJ wait-for-device push /var/folders/c2/q_47wqh13n902fj9_zxczclr0000gn/T/tmpc6PSTw/org.mozilla.geckoview_example-geckoview-config.yaml /data/local/tmp/org.mozilla.geckoview_example-geckoview-config.yaml, timeout: None, timedout: None, exitcode: 0, output: /var/folders/c2/q_47wqh13n902fj9_zxczclr0000gn/T/tmpc6PSTw/org.mozilla.geckoview_example-geckoview-config.yaml: 1 file pushed, 0 skipped. 0.0 MB/s (127 bytes in 0.006s)

Permissions are real on /data/local. Typically /data/local/tmp/ is set up as world readable/writable either by us or by default... I forget which.. but that means an adb push to /data/local/tmp will succeed but the file will be owned by shell and with recent android updates it may not be readable/writable by other users. If you need someone else to read it, it should have its permissions changed to be world readable/writable.

(In reply to Noemi Erli[:noemi_erli] from comment #165)

This seems to be permafailing on Android 7.0 MotoG5 and backfill goes way back till:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&resultStatus=success%2Ctestfailed%2Cbusted%2Cexception&searchStr=android%2C7.0%2Cmotog5%2Craptor&fromchange=c224327f2bddd5f1a00b2c8edc6b098188884202&selectedJob=298218678

If that is the case please file that as a separate bug for the f68 (fennec) jobs. It's not related to this bug. Mark it as blocking this bug. Thanks.

Florin, could this be triggered by the changes in https://hg.mozilla.org/mozilla-central/rev/f805ca6a339a82357e97e1c79a0a00097bbd187b?

Very unlikely.

Flags: needinfo?(fstrugariu) → needinfo?(nerli)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #167)

(In reply to Noemi Erli[:noemi_erli] from comment #165)

This seems to be permafailing on Android 7.0 MotoG5 and backfill goes way back till:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&resultStatus=success%2Ctestfailed%2Cbusted%2Cexception&searchStr=android%2C7.0%2Cmotog5%2Craptor&fromchange=c224327f2bddd5f1a00b2c8edc6b098188884202&selectedJob=298218678

If that is the case please file that as a separate bug for the f68 (fennec) jobs. It's not related to this bug. Mark it as blocking this bug. Thanks.

This is bug 1631344 now. Can we please fix the classifications for those fennec 68 jobs? Thanks.

Flags: needinfo?(nerli) → needinfo?(sheriffs)

Sure, will do.

Flags: needinfo?(sheriffs)
Depends on: 1632118

With my patch on bug 1630037 this bug might turn into Process has been unexpectedly closed (like bug 1632194). If that happens I would suggest that we do not morph the summary of this bug, but start a new one, and close this bug as fixed.

Ok, so we still have conditions when this happens. See:

[task 2020-04-26T23:45:43.360Z] 23:45:43 INFO - raptor-control-server Info: received webext_status: ended page cycle 1
[task 2020-04-27T00:03:43.575Z] 00:03:43 CRITICAL - raptor-perftest Critical: Test failed to finish. Application timed out after 1071 seconds

I have also seen those stalled test runs while the browser is still alive with Fennec when working on bug 1631344.

That means we have to keep this bug open.

Depends on: 1631344

We just got an interesting case for Windows 10 x64 2017 Ref HW opt:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=300179917&repo=mozilla-central&lineNumber=1536

[task 2020-04-30T10:37:36.855Z] 10:37:36     INFO -  PID 8348 | console.info: "[raptor-runnerjs] pageload test finished"
[task 2020-04-30T10:37:36.857Z] 10:37:36     INFO -  raptor-control-server Info: received request to shutdown the browser
[task 2020-04-30T10:37:36.857Z] 10:37:36     INFO -  raptor-control-server Info: shutting down browser (pid: 8348)
[task 2020-04-30T10:37:41.011Z] 10:37:41 CRITICAL -  raptor-perftest Critical: Test failed to finish. Application timed out after 141 seconds

The control server mostly tried to kill the application but something prevented it from happening?

But there are also several questionable pauses:

[task 2020-04-30T10:35:42.619Z] 10:35:42     INFO -  PID 8348 | console.info: "[raptor-runnerjs] using page timeout: 60000ms"
[task 2020-04-30T10:36:27.793Z] 10:36:27     INFO -  PID 8348 | console.info: "[raptor-runnerjs] wrote settings to ext local storage"

Does it really take that long to write the settings to the local storage?

[task 2020-04-30T10:36:29.863Z] 10:36:29     INFO -  raptor-control-server Info: received webext_status: ended page cycle 1
[task 2020-04-30T10:37:27.014Z] 10:37:27     INFO -  PID 8348 | console.info: "[raptor-pageloadjs] pageloadjs raptorContentHandler!"

Why this 1 minute pause between the ended page cycle and the next one?

(In reply to Bogdan Tara[:bogdan_tara | bogdant] from comment #176)

Could this recent spike be caused by bug 1627175?

No idea, but looks like the spike is gone. Only a single failure over the weekend. We can keep an eye on it over the next days.

Summary: Intermittent raptor-perftest Critical: Test failed to finish. Application timed out after <random> seconds → Intermittent FIREFOX raptor-perftest Critical: Test failed to finish. Application timed out after <random> seconds

I'm going to look into the windows ref 2017 failures (I have one machine available locally).

:aerickson, the failures on the windows10-64-ref-hw-2017 hardware seem to have been slowly increasing over the last few weeks without any initial spikes - I'm wondering if you have any ideas/thoughts about why that is? I'm also curious to know if there may be a missing update on them, or that they've been updated recently.

Flags: needinfo?(aerickson)

:mcornmesser handles the windows bitbar stuff. I don't see any recent changes to them (per https://changelog.dev.mozaws.net/).

Flags: needinfo?(aerickson) → needinfo?(mcornmesser)

(In reply to Greg Mierzwinski [:sparky] from comment #184)

:aerickson, the failures on the windows10-64-ref-hw-2017 hardware seem to have been slowly increasing over the last few weeks without any initial spikes - I'm wondering if you have any ideas/thoughts about why that is? I'm also curious to know if there may be a missing update on them, or that they've been updated recently.

The performance on this degrade over time. Sound like these have hit that tipping point. I will contact Bitbar and ask them to do a fresh install for each of these.

Flags: needinfo?(mcornmesser)

Bitbar will do a test deploy tomorrow, and if that goes well they will redeploy the rest. I am hoping to institute self recovery function in the future so that we can avoid this from happening.

That's great - thanks a lot :markco!

First deploy went well. Bitbar will go through and redeploy the rest doing 2 at a time.

Looks like the refresh has helped here - thanks again :markco!

This is still failing, there are 41 total failures in the last 7 days on linux64-shippable-qr opt and windows10-64-shippable-qr opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=307501342&repo=mozilla-central&lineNumber=804

Flags: needinfo?(gmierz2)

These mostly appear to be fission. I was looking at these while testing a patch for android testing. I noticed

raptor-fission.fennec.opt.log:06-28 14:46:18.837  4654  4692 E Web Content: [JavaScript Error: "TypeError: can't access property "querySelector", contentDocument is null" {file: "http://127.0.0.1:47379/Speedometer/resources/benchmark-runner.js" line: 51>
raptor-fission.fennec.opt.log:06-28 14:53:18.302  4582  4631 E GeckoConsole: [JavaScript Error: "ERROR: [raptor-runnerjs] raptor-page-timeout on http://127.0.0.1:47379/Speedometer/index.html?raptor"]

I'm running with a bunch of debugging output enabled and I'm not sure how well this translates to the desktop version but I think "TypeError: can't access property "querySelector", contentDocument is null" is something to look into. I don't see this error in my non-fission raptor-speedometer runs.

(In reply to Bob Clary [:bc] from comment #194)

These mostly appear to be fission. I was looking at these while testing a patch for android testing. I noticed

raptor-fission.fennec.opt.log:06-28 14:46:18.837  4654  4692 E Web Content: [JavaScript Error: "TypeError: can't access property "querySelector", contentDocument is null" {file: "http://127.0.0.1:47379/Speedometer/resources/benchmark-runner.js" line: 51>
raptor-fission.fennec.opt.log:06-28 14:53:18.302  4582  4631 E GeckoConsole: [JavaScript Error: "ERROR: [raptor-runnerjs] raptor-page-timeout on http://127.0.0.1:47379/Speedometer/index.html?raptor"]

I'm running with a bunch of debugging output enabled and I'm not sure how well this translates to the desktop version but I think "TypeError: can't access property "querySelector", contentDocument is null" is something to look into. I don't see this error in my non-fission raptor-speedometer runs.

:cpeterson, it looks like there's a browser issue when we run our tests with fission enabled (primarily on windows10-64-shippable-qr). Would you be able to find someone to look into this?

Flags: needinfo?(gmierz2) → needinfo?(cpeterson)

(In reply to Bob Clary [:bc] from comment #194)

These mostly appear to be fission. I was looking at these while testing a patch for android testing. I noticed

raptor-fission.fennec.opt.log:06-28 14:46:18.837  4654  4692 E Web Content: [JavaScript Error: "TypeError: can't access property "querySelector", contentDocument is null" {file: "http://127.0.0.1:47379/Speedometer/resources/benchmark-runner.js" line: 51>
raptor-fission.fennec.opt.log:06-28 14:53:18.302  4582  4631 E GeckoConsole: [JavaScript Error: "ERROR: [raptor-runnerjs] raptor-page-timeout on http://127.0.0.1:47379/Speedometer/index.html?raptor"]

Bob, what is "raptor-fission.fennec.opt"? We should not be running any Fission tests on Android.

:cpeterson, it looks like there's a browser issue when we run our tests with fission enabled (primarily on windows10-64-shippable-qr). Would you be able to find someone to look into this?

Sure. I'll send this bug to Fission triage.

Fission Milestone: --- → ?
Flags: needinfo?(cpeterson) → needinfo?(bob)

Those lines are from a log from a local test where I noticed that my android tests failed on raptor with fission which matched the information from the intermittent failures log.

Flags: needinfo?(bob)

This (ignoring the android tests because those should not be run with Fission) seems to be happening even without Fission. Dave, could you please find someone to debug this?

Flags: needinfo?(dave.hunt)

(In reply to Neha Kochar [:neha] from comment #200)

This (ignoring the android tests because those should not be run with Fission) seems to be happening even without Fission. Dave, could you please find someone to debug this?

:neha, the largest contributor to these timeouts is fission tests. Ignoring the youtube playback failures which are caused by bug 1649819, there are only 5/49 tests in the last week that have failed in this way on non-fission tests. One of them is not related to firefox (a mis-classification), two others are windows10 ref hardware (need refreshes), and two others are talos tests which are also mis-classified. So the recent spike in failures is certainly related to Fission.

Flags: needinfo?(dave.hunt)
Flags: needinfo?(nkochar)

Thanks Greg. Randell, could you please debug this further for Fission failures?

Flags: needinfo?(nkochar) → needinfo?(rjesup)
Depends on: 1649819

In the last 7 days there were 93 failures. This happens on linux64-shippable-qr, windows10-64-shippable-qr on opt build types.

Recent log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=309398608&repo=mozilla-central&lineNumber=850

This is generating a high load of failures. Can you take a look pls

Flags: needinfo?(gmierz2)

:bebe, if you read the comments above, :jesup should be looking into this.

EDIT: I've pinged him on riot about this.

Flags: needinfo?(gmierz2)

Neha says Randell is investigating now. The test failure rate seems to be increasing.

Looks like some test callback is never firing.

Tracking for Fission M6a Nightly.

Assignee: nobody → rjesup
Fission Milestone: ? → M6a
Status: NEW → ASSIGNED
Fission Milestone: M6a → M6b
Priority: P5 → P3

(In reply to Bob Clary [:bc] from comment #194)

I'm running with a bunch of debugging output enabled and I'm not sure how well this translates to the desktop version but I think "TypeError: can't access property "querySelector", contentDocument is null" is something to look into. I don't see this error in my non-fission raptor-speedometer runs.

That sounds a bit like an iframe is somehow ending up in the wrong process, or some chrome JS is expecting to be able to inspect inside of a cross-origin iframe, or something like that. It would be very interesting to get the stacks of these exceptions, and perhaps turn on this kind of logging on treeherder.

I think my new version of mozdevice is even more verbose than before and might show this. Unfortunately I was backed out this morning but I'll see if I can't reproduce it locally and also see if the normal verbose output raptor gets would see it.

Flags: needinfo?(bob)

pernosco would help, but IIRC it doesn't do raptor (edited)

Flags: needinfo?(rjesup)

Note that the failures are mostly cold-yahoo-mail-fis and cold-wikipedia-fis (though non-fission and non-cold failures happen, plus occasional failures on other tests)

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #218)

Note that the failures are mostly cold-yahoo-mail-fis and cold-wikipedia-fis (though non-fission and non-cold failures happen, plus occasional failures on other tests)

That might explain why I don't see any test results for "raptor-tp6-wikipedia-firefox-cold-loadtime opt fission webrender" in Perfherder (bug 1653059).

See Also: → 1653059
Pushed by gmierz2@outlook.com: https://hg.mozilla.org/integration/autoland/rev/d2c74fe4f71f Disable yahoo-mail and wikipedia on fission with webrender enabled. r=perftest-reviewers,AlexandruIonescu
Pushed by gmierz2@outlook.com: https://hg.mozilla.org/integration/autoland/rev/89daf9e89fad Disable yahoo-mail and wikipedia on fission with webrender enabled. r=perftest-reviewers,AlexandruIonescu

Fixed and pushed again.

Flags: needinfo?(gmierz2)

Locally I see in the fission enabled speedometer test running against a Pixel 2:

./mach raptor-test --test=raptor-speedometer --app=geckoview --binary=org.mozilla.geckoview_example --activity=org.mozilla.geckoview_example.GeckoViewActivity  --enable-fission
...
07-28 16:46:06.735 23789 23846 W Web Content: [JavaScript Warning: "Loading failed for the <script> with source “http://127.0.0.1:52839/resources/runner.js”." {file: "http://127.0.0.1:52839/Speedometer/index.html?raptor" line: 1}]
07-28 16:46:07.345 23789 23846 E Web Content: [JavaScript Error: "TypeError: can't access property "querySelector", contentDocument is null" {file: "http://127.0.0.1:52839/Speedometer/resources/benchmark-runner.js" line: 51}]

The line in benchmark-runner.js's BenchmarkRunner.prototype.waitForElement is using contentDocument from this._frame.contentDocument which is null.

I tried

 BenchmarkRunner.prototype.waitForElement = function (selector) {
     var promise = new SimplePromise;
-    var contentDocument = this._frame.contentDocument;
+    var frame = this._frame;
 
     function resolveIfReady() {
-        var element = contentDocument.querySelector(selector);
-        if (element)
-            return promise.resolve(element);
+        var contentDocument = frame.contentDocument;
+        if (contentDocument) {
+          var element = contentDocument.querySelector(selector);
+          if (element)
+              return promise.resolve(element);
+        }
         setTimeout(resolveIfReady, 50);
     }

but it did not fix the overall issue as the test continued to fail timing out while waiting for results.

Flags: needinfo?(bob)

(In reply to Bob Clary [:bc] from comment #231)

Locally I see in the fission enabled speedometer test running against a Pixel 2:

We shouldn't be running with Fission enabled against mobile. I'm not even sure if this is supported.

It isn't supported on mobile yet. That'll be a while

It seems like there are some bad interactions between webrender and fission going on. 11/14 of the windows*-qr platform failures are fission tests.

:jesup would you have any updates on the yahoo-mail/wikipedia fission tests?

Flags: needinfo?(rjesup)

Not really; it seems highly timing dependent - turning on MOZ_LOGs for Process:5,DocumentChannel:5,DocLoader:4,BrowsingContext:4 causes them to permafail. yahoo-mail and wikipedia were disabled.

You could add
self.runner.env["MOZ_LOG"] = "DocumentChannel:5,DocLoader:4,BrowsingContext:4,Process:4"
to desktop.py and see what happens with other tests.

Flags: needinfo?(rjesup)

:bebe, :alexandrui, could either of you try the wikipedia/yahoo-mail tests with a new recording and see if that helps? You can reproduce those failures locally by providing the --enable-fission and the --enable-webrender flags together.

Flags: needinfo?(fstrugariu)
Flags: needinfo?(aionescu)

I will

Flags: needinfo?(fstrugariu)

Sparky, I understand correctly that I should be able to reproduce the Application timed out after <random> seconds on wikipedia/yahoo-mail with the 2 flags enabled and
self.runner.env["MOZ_LOG"] = "DocumentChannel:5,DocLoader:4,BrowsingContext:4,Process:4"
should fix the issue?

Flags: needinfo?(gmierz2)

:alexandrui, yup you only need those two flags to reproduce it locally. The MOZ_LOG settings are for debugging Firefox and are not needed to reproduce the problem.

You should re-record wikipedia and yahoo-mail and test locally and on try to see if the new recordings fix the issue or if the problem still exists.

Flags: needinfo?(gmierz2)

Unfortunately first re-recording didn't fix the timeout. I can try a couple more times, but I don't see the benefit of trying more than that.

Flags: needinfo?(aionescu)

Yahoo-news re-recording with mitm5 and the 2 commands seems to be working locally on linux64.

(In reply to Alexandru Ionescu (needinfo me) [:alexandrui] from comment #242)

Yahoo-news re-recording with mitm5 and the 2 commands seems to be working locally on linux64.

It's yahoo-mail that is failing not yahoo-news.

Ahhh, sorry. Will do it again

Flags: needinfo?(aionescu)

It worked on yahoo-mail re-recording.
First replay attempt seemed to have loaded the page on first page cycle, but for some reason the webext didn't got that and blocked itself on timeout. The second replay attempt finished with exit code 0.
Note that my OS is ubuntu.

Flags: needinfo?(aionescu) → needinfo?(gmierz2)

:alexandrui, that's great! It was perma-failing on my linux ubuntu system so it sounds like a re-recording will fix the issue. Can you push to try with the new recording to see if it sticks there? We'll have to try re-recording the wikipedia page too now.

Flags: needinfo?(gmierz2) → needinfo?(aionescu)

Greg, both wikipedia and yahoo-mail mitm5 recordings landed a couple of weeks ago. I did another try with wikipedia and worked.
Here's a try on fission with web-render enabled for yahoo-mail and wikipedia: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ca084af2379edc92682e7f34a8aa3357e4659675

Flags: needinfo?(aionescu)

Hmm, that try run still looks problematic with the new recordings.

Sparky, none of the 2 websites are perma anymore, still, they are intermittent at a pretty high rate.
I would try and re-record those, but the thing here is we have the same intermittent issue with some websites on mitm5 (Bug 1654190). I would treat separately the perma issue from the intermittent, but I wouldn't land this high rate of intermittent either.
Given the priority of this and my tasks this month are pretty full, I don't see this issue fixed too soon.

What are your thoughts?

Flags: needinfo?(gmierz2)

(In reply to Alexandru Ionescu (needinfo me) [:alexandrui] from comment #251)

Sparky, none of the 2 websites are perma anymore, still, they are intermittent at a pretty high rate.
I would try and re-record those, but the thing here is we have the same intermittent issue with some websites on mitm5 (Bug 1654190). I would treat separately the perma issue from the intermittent, but I wouldn't land this high rate of intermittent either.
Given the priority of this and my tasks this month are pretty full, I don't see this issue fixed too soon.

What are your thoughts?

Yea we definitely shouldn't re-enable the test, they are failing far too often, please abandon your patch for it. No need for going much further here for now, but if you have a chance to try these sites with a new mitm5 recording that would be a good check to do.

Flags: needinfo?(gmierz2)

In the last 7 days there have been 29 occurrences, mostly on windows10-64 opt.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=313715505&repo=autoland&lineNumber=2133

Sparky, could you, please, take a look?

Flags: needinfo?(gmierz2)
Attachment #9168826 - Attachment is obsolete: true

:markco, it looks like the windows10-64-ref-hw-2017 machines need another refresh, they are failing more often lately. Could you look into that please?

Flags: needinfo?(gmierz2) → needinfo?(mcornmesser)

(In reply to Greg Mierzwinski [:sparky] from comment #256)

:markco, it looks like the windows10-64-ref-hw-2017 machines need another refresh, they are failing more often lately. Could you look into that please?

I have contacted Bitbar and asked for a redeployment. Also I am looking into deploying an automated restore function, so that we can avoid hitting this in the future.

Flags: needinfo?(mcornmesser)

Confirmed with Bitbar. They will redeploy these this week.

I'm looking at this; I've found that turning on debugs makes these permafails -> timing issue, likely with the harness

(In reply to Mark Cornmesser [:markco] from comment #258)

Confirmed with Bitbar. They will redeploy these this week.

Thanks :markco!

Fission Milestone: M6b → M6c

:jesup are you still actively working on this issue?

Flags: needinfo?(rjesup)
Assignee: rjesup → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(rjesup)

Moving to later milestone M7 because browsertime is expected to replace raptor.

Fission Milestone: M6c → M7

:markco, it looks like there are a lot of failures on the windows10-64-ref-hw-2017 machines, could we get those refreshed?

Flags: needinfo?(mcornmesser)

(In reply to Greg Mierzwinski [:sparky] from comment #278)

:markco, it looks like there are a lot of failures on the windows10-64-ref-hw-2017 machines, could we get those refreshed?

Just spoke with them Bitbar. They are going to start the refresh tomorrow.

Flags: needinfo?(mcornmesser)

There are 48 total failures in the last 7 days on opt:

  • macosx1014-64-shippable-qr
  • windows10-64-ref-hw-2017
  • windows10-64-shippable
  • windows10-64-shippable-qr

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=323679148&repo=mozilla-central&lineNumber=839

[task 2020-12-05T23:44:14.691Z] 23:44:14 INFO - raptor-webext-desktop Info: starting firefox
[task 2020-12-05T23:44:14.691Z] 23:44:14 INFO - Application command: /Users/cltbld/tasks/task_1607202651/build/application/Firefox Nightly.app/Contents/MacOS/firefox --allow-downgrade -foreground -profile /var/folders/9q/mkr7zwds17zf6v3j0g14fg44000017/T/tmpB6ViLL.mozrunner
[task 2020-12-05T23:44:17.786Z] 23:44:17 INFO - raptor-control-server Info: received webext_loaded: raptor runner.js is loaded!
[task 2020-12-05T23:44:17.867Z] 23:44:17 INFO - raptor-control-server Info: received webext_status: testing on Firefox 85.0a1 20201205214636
[task 2020-12-05T23:44:17.901Z] 23:44:17 INFO - raptor-control-server Info: received webext_status: test name is: raptor-tp6-cnn-ampstories-firefox-cold
[task 2020-12-05T23:44:17.907Z] 23:44:17 INFO - raptor-control-server Info: received webext_status: test settings url is: http://127.0.0.1:50104/json/raptor-tp6-cnn-ampstories-firefox-cold.json
[task 2020-12-05T23:44:17.923Z] 23:44:17 INFO - raptor-control-server Info: received webext_status: starting raptorRunner
[task 2020-12-05T23:44:17.965Z] 23:44:17 INFO - PID 2389 | console.info: "[raptor-runnerjs] getting test settings from control server"
[task 2020-12-05T23:44:17.966Z] 23:44:17 INFO - raptor-control-server Info: reading test settings from json/raptor-tp6-cnn-ampstories-firefox-cold.json
[task 2020-12-05T23:44:17.966Z] 23:44:17 INFO - raptor-control-server Info: sent test settings to webext runner
[task 2020-12-05T23:44:17.966Z] 23:44:17 INFO - PID 2389 | console.info: "[raptor-runnerjs] test settings received: {"raptor-options": {"expected_browser_cycles": 25, "subtest_unit": "ms", "alert_threshold": 5.0, "type": "pageload", "page_cycles": 1, "subtest_lower_is_better": true, "alert_on": ["dcf", "fcp", "fnbpaint", "loadtime"], "test_url": "https://cnn.com/ampstories/us/why-hurricane-michael-is-a-monster-unlike-any-other\", "page_timeout": 72000, "host": "127.0.0.1", "measure": {"fcp": true, "dcf": true, "fnbpaint": true, "loadtime": true}, "cold": true, "lower_is_better": true, "unit": "ms"}}"
[task 2020-12-05T23:44:17.966Z] 23:44:17 INFO - PID 2389 | console.info: "[raptor-runnerjs] test URL: https://cnn.com/ampstories/us/why-hurricane-michael-is-a-monster-unlike-any-other"
[task 2020-12-05T23:44:17.966Z] 23:44:17 INFO - PID 2389 | console.info: "[raptor-runnerjs] using page timeout: 72000ms"
[task 2020-12-05T23:44:17.975Z] 23:44:17 INFO - PID 2389 | console.info: "[raptor-runnerjs] wrote settings to ext local storage"
[task 2020-12-05T23:44:17.975Z] 23:44:17 INFO - PID 2389 | console.info: "[raptor-runnerjs] pageload test start"
[task 2020-12-05T23:44:18.008Z] 23:44:18 INFO - raptor-control-server Info: received webext_status: * pausing 30 seconds to let browser settle... *
[task 2020-12-05T23:44:48.038Z] 23:44:48 INFO - raptor-control-server Info: received webext_status: opening new tab
[task 2020-12-05T23:44:48.072Z] 23:44:48 INFO - raptor-control-server Info: received webext_status: opened new empty tab: 2
[task 2020-12-05T23:44:48.121Z] 23:44:48 INFO - raptor-control-server Info: received webext_status: found active tab with id 2
[task 2020-12-05T23:44:48.122Z] 23:44:48 INFO - raptor-control-server Info: received webext_status: running 1 pagecycles of https://cnn.com/ampstories/us/why-hurricane-michael-is-a-monster-unlike-any-other
[task 2020-12-05T23:44:49.134Z] 23:44:49 INFO - raptor-control-server Info: received webext_status: begin page cycle 1
[task 2020-12-05T23:44:49.134Z] 23:44:49 INFO - raptor-control-server Info: received webext_status: update tab 2 for https://cnn.com/ampstories/us/why-hurricane-michael-is-a-monster-unlike-any-other
[task 2020-12-05T23:44:49.149Z] 23:44:49 INFO - raptor-control-server Info: received webext_status: tab 2 updated
[task 2020-12-05T23:44:49.149Z] 23:44:49 INFO - raptor-control-server Info: received webext_status: ended page cycle 1
[task 2020-12-05T23:47:29.346Z] 23:47:29 CRITICAL - raptor-perftest Critical: Test failed to finish. Application timed out after 182 seconds
[task 2020-12-05T23:47:29.346Z] 23:47:29 INFO - raptor-webext Info: removing webext /Users/cltbld/tasks/task_1607202651/build/tests/raptor/raptor/webextension/../../webext/raptor
[task 2020-12-05T23:47:29.346Z] 23:47:29 INFO - raptor-control-server Info: shutting down control server
[task 2020-12-05T23:47:29.509Z] 23:47:29 INFO - raptor-webext Info: finished
[task 2020-12-05T23:47:29.705Z] 23:47:29 ERROR - Return code: 1
[task 2020-12-05T23:47:29.705Z] 23:47:29 WARNING - setting return code to 1
[task 2020-12-05T23:47:29.705Z] 23:47:29 INFO - Copying Raptor results to upload dir:
[task 2020-12-05T23:47:29.705Z] 23:47:29 INFO - /Users/cltbld/tasks/task_1607202651/build/blobber_upload_dir/perfherder-data.json
[task 2020-12-05T23:47:29.706Z] 23:47:29 INFO - Copying raptor results from /Users/cltbld/tasks/task_1607202651/build/raptor.json to /Users/cltbld/tasks/task_1607202651/build/blobber_upload_dir/perfherder-data.json
[task 2020-12-05T23:47:29.706Z] 23:47:29 CRITICAL - Error copying results /Users/cltbld/tasks/task_1607202651/build/raptor.json to upload dir /Users/cltbld/tasks/task_1607202651/build/blobber_upload_dir/perfherder-data.json
[task 2020-12-05T23:47:29.706Z] 23:47:29 INFO - [Errno 2] No such file or directory: u'/Users/cltbld/tasks/task_1607202651/build/raptor.json'
[task 2020-12-05T23:47:29.706Z] 23:47:29 INFO - Running post-action listener: _package_coverage_data
[task 2020-12-05T23:47:29.706Z] 23:47:29 INFO - Running post-action listener: _resource_record_post_action
[task 2020-12-05T23:47:29.706Z] 23:47:29 INFO - Running post-action listener: process_java_coverage_data
[task 2020-12-05T23:47:29.706Z] 23:47:29 INFO - Running post-action listener: stop_device
[task 2020-12-05T23:47:29.706Z] 23:47:29 INFO - [mozharness: 2020-12-05 23:47:29.706920Z] Finished run-tests step (success)
[task 2020-12-05T23:47:29.707Z] 23:47:29 INFO - Running post-run listener: _resource_record_post_run
[task 2020-12-05T23:47:29.790Z] 23:47:29 INFO - Total resource usage - Wall time: 224s; CPU: 9.0%; Read bytes: 127975424; Write bytes: 547614720; Read time: 4167; Write time: 1632
[task 2020-12-05T23:47:29.791Z] 23:47:29 INFO - TinderboxPrint: CPU usage<br/>9.0%
[task 2020-12-05T23:47:29.791Z] 23:47:29 INFO - TinderboxPrint: I/O read bytes / time<br/>127,975,424 / 4,167
[task 2020-12-05T23:47:29.791Z] 23:47:29 INFO - TinderboxPrint: I/O write bytes / time<br/>547,614,720 / 1,632
[task 2020-12-05T23:47:29.791Z] 23:47:29 INFO - TinderboxPrint: CPU idle<br/>813.8 (90.8%)
[task 2020-12-05T23:47:29.791Z] 23:47:29 INFO - TinderboxPrint: CPU system<br/>16.6 (1.9%)
[task 2020-12-05T23:47:29.791Z] 23:47:29 INFO - TinderboxPrint: CPU user<br/>66.1 (7.4%)
[task 2020-12-05T23:47:29.791Z] 23:47:29 INFO - TinderboxPrint: Swap in / out<br/>540,459,008 / 0
[task 2020-12-05T23:47:29.791Z] 23:47:29 INFO - install-chromium-distribution - Wall time: 0s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2020-12-05T23:47:29.792Z] 23:47:29 INFO - install - Wall time: 30s; CPU: 38.0%; Read bytes: 420529664; Write bytes: 406564864; Read time: 26199; Write time: 616
[task 2020-12-05T23:47:29.793Z] 23:47:29 INFO - run-tests - Wall time: 196s; CPU: 4.0%; Read bytes: 120893440; Write bytes: 113524736; Read time: 4088; Write time: 967
[task 2020-12-05T23:47:29.836Z] 23:47:29 WARNING - returning nonzero exit status 1
[taskcluster 2020-12-05T23:47:29.863Z] Exit Code: 1
[taskcluster 2020-12-05T23:47:29.863Z] User Time: 1m6.219209s
[taskcluster 2020-12-05T23:47:29.863Z] Kernel Time: 18.58191s
[taskcluster 2020-12-05T23:47:29.863Z] Wall Time: 5m3.585084537s
[taskcluster 2020-12-05T23:47:29.863Z] Result: FAILED
[taskcluster 2020-12-05T23:47:29.863Z] === Task Finished ===
[taskcluster 2020-12-05T23:47:29.863Z] Task Duration: 5m3.589058357s
[taskcluster 2020-12-05T23:50:01.959Z] Uploading artifact public/logs/localconfig.json from file logs/localconfig.json with content encoding "gzip", mime type "application/json" and expiry 2021-12-05T21:49:28.718Z
[taskcluster 2020-12-05T23:50:07.473Z] Uploading artifact public/test_info/resource-usage.json from file build/blobber_upload_dir/resource-usage.json with content encoding "gzip", mime type "application/json" and expiry 2021-12-05T21:49:28.718Z
[taskcluster:error] exit status 1

Greg can you take a look?

Flags: needinfo?(gmierz2)

:markco, did bitbar complete the refresh?

Flags: needinfo?(gmierz2) → needinfo?(mcornmesser)
Flags: needinfo?(gmierz2)

(In reply to Greg Mierzwinski [:sparky] from comment #282)

:markco, did bitbar complete the refresh?

Yes. It was completed last week.

Flags: needinfo?(mcornmesser)

:bebe, did you have a chance to look into this?

Flags: needinfo?(gmierz2) → needinfo?(fstrugariu)

In the last 7 days there have been 49 occurrences, most on windows10-64 opt, some on macosx1014-64-shippable-qr opt.

Recent failure: https://treeherder.mozilla.org/logviewer?job_id=325043106&repo=mozilla-central&lineNumber=1820

There are 33 total failures in the last 7 days on windows10-64-shippable-qr, windows10-64-ref-hw-2017, macosx1014-64-shippable-qr opt.
Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=325597814&repo=mozilla-central&lineNumber=1046

[task 2020-12-30T23:25:16.737Z] 23:25:16 INFO - Application command: C:\Users\task_1609362602\build\application\firefox\firefox.exe --allow-downgrade --wait-for-browser -profile c:\users\task_1609362602\appdata\local\temp\tmpx52ucs\profile
[task 2020-12-30T23:25:19.219Z] 23:25:19 INFO - raptor-control-server Info: received webext_loaded: raptor runner.js is loaded!
[task 2020-12-30T23:25:19.259Z] 23:25:19 INFO - raptor-control-server Info: received webext_status: testing on Firefox 86.0a1 20201230214743
[task 2020-12-30T23:25:19.269Z] 23:25:19 INFO - raptor-control-server Info: received webext_status: test name is: raptor-tp6-netflix-firefox-cold
[task 2020-12-30T23:25:19.274Z] 23:25:19 INFO - raptor-control-server Info: received webext_status: test settings url is: http://127.0.0.1:49896/json/raptor-tp6-netflix-firefox-cold.json
[task 2020-12-30T23:25:19.276Z] 23:25:19 INFO - raptor-control-server Info: received webext_status: starting raptorRunner
[task 2020-12-30T23:25:19.281Z] 23:25:19 INFO - PID 7964 | console.info: "[raptor-runnerjs] getting test settings from control server"
[task 2020-12-30T23:25:19.300Z] 23:25:19 INFO - raptor-control-server Info: reading test settings from json/raptor-tp6-netflix-firefox-cold.json
[task 2020-12-30T23:25:19.300Z] 23:25:19 INFO - raptor-control-server Info: sent test settings to webext runner
[task 2020-12-30T23:25:19.305Z] 23:25:19 INFO - PID 7964 | console.info: "[raptor-runnerjs] test settings received: {"raptor-options": {"expected_browser_cycles": 25, "subtest_unit": "ms", "alert_threshold": 2.0, "type": "pageload", "page_cycles": 1, "subtest_lower_is_better": true, "alert_on": ["fcp", "loadtime"], "test_url": "https://www.netflix.com/title/80117263\", "page_timeout": 60000, "host": "127.0.0.1", "measure": {"fcp": true, "dcf": true, "fnbpaint": true, "loadtime": true}, "cold": true, "lower_is_better": true, "unit": "ms"}}"
[task 2020-12-30T23:25:19.305Z] 23:25:19 INFO - PID 7964 | console.info: "[raptor-runnerjs] test URL: https://www.netflix.com/title/80117263"
[task 2020-12-30T23:25:19.305Z] 23:25:19 INFO - PID 7964 | console.info: "[raptor-runnerjs] using page timeout: 60000ms"
[task 2020-12-30T23:25:19.365Z] 23:25:19 INFO - PID 7964 | console.info: "[raptor-runnerjs] wrote settings to ext local storage"
[task 2020-12-30T23:25:19.365Z] 23:25:19 INFO - PID 7964 | console.info: "[raptor-runnerjs] pageload test start"
[task 2020-12-30T23:25:19.375Z] 23:25:19 INFO - raptor-control-server Info: received webext_status: * pausing 1 seconds to let browser settle... *
[task 2020-12-30T23:25:20.382Z] 23:25:20 INFO - raptor-control-server Info: received webext_status: opening new tab
[task 2020-12-30T23:25:20.421Z] 23:25:20 INFO - raptor-control-server Info: received webext_status: opened new empty tab: 2
[task 2020-12-30T23:25:20.421Z] 23:25:20 INFO - raptor-control-server Info: received webext_status: found active tab with id 2
[task 2020-12-30T23:25:20.426Z] 23:25:20 INFO - raptor-control-server Info: received webext_status: running 1 pagecycles of https://www.netflix.com/title/80117263
[task 2020-12-30T23:25:21.432Z] 23:25:21 INFO - raptor-control-server Info: received webext_status: begin page cycle 1
[task 2020-12-30T23:25:21.437Z] 23:25:21 INFO - raptor-control-server Info: received webext_status: update tab 2 for https://www.netflix.com/title/80117263
[task 2020-12-30T23:25:21.456Z] 23:25:21 INFO - raptor-control-server Info: received webext_status: tab 2 updated
[task 2020-12-30T23:25:21.458Z] 23:25:21 INFO - raptor-control-server Info: received webext_status: ended page cycle 1
[task 2020-12-30T23:27:49.399Z] 23:27:49 CRITICAL - raptor-perftest Critical: Test failed to finish. Application timed out after 141 seconds
[task 2020-12-30T23:27:49.399Z] 23:27:49 INFO - raptor-mitmproxy Info: Stopping mitmproxy playback, killing process 8064
[task 2020-12-30T23:27:49.399Z] 23:27:49 INFO - raptor-mitmproxy Info: Sending CTRL_BREAK_EVENT to mitmproxy
[task 2020-12-30T23:27:51.400Z] 23:27:51 INFO - raptor-mitmproxy Info: Successfully killed the mitmproxy playback process
[task 2020-12-30T23:27:51.400Z] 23:27:51 INFO - raptor-mitmproxy Info: Turning off the browser proxy
[task 2020-12-30T23:27:51.400Z] 23:27:51 INFO - raptor-mitmproxy Info: writing: C:\Users\task_1609362602\build\application\firefox\distribution\policies.json
[task 2020-12-30T23:27:51.401Z] 23:27:51 INFO - raptor-mitmproxy Info: Reading confidence values from: C:\Users\task_1609362602\build\blobber_upload_dir\mitm_netlocs_dump.json
[task 2020-12-30T23:27:51.401Z] 23:27:51 INFO - perftest-results-handler Info: RaptorResultsHandler.add_supporting_data received mozproxy data
[task 2020-12-30T23:27:51.401Z] 23:27:51 INFO - raptor-webext Info: removing webext C:\Users\task_1609362602\build\tests\raptor\raptor\webextension....\webext\raptor
[task 2020-12-30T23:27:51.401Z] 23:27:51 INFO - raptor-perftest Info: Removing temporary directory: c:\users\task_1609362602\appdata\local\temp\tmpz_k2dt
[task 2020-12-30T23:27:51.828Z] 23:27:51 INFO - raptor-perftest Info: Removing temporary directory: c:\users\task_1609362602\appdata\local\temp\tmpx52ucs
[task 2020-12-30T23:27:52.263Z] 23:27:52 INFO - raptor-control-server Info: shutting down control server
[task 2020-12-30T23:27:52.626Z] 23:27:52 INFO - raptor-webext Info: finished
[task 2020-12-30T23:27:52.675Z] 23:27:52 ERROR - Return code: 1
[task 2020-12-30T23:27:52.675Z] 23:27:52 WARNING - setting return code to 1
[task 2020-12-30T23:27:52.675Z] 23:27:52 INFO - Copying Raptor results to upload dir:
[task 2020-12-30T23:27:52.675Z] 23:27:52 INFO - C:\Users\task_1609362602\build\blobber_upload_dir\perfherder-data.json
[task 2020-12-30T23:27:52.675Z] 23:27:52 INFO - Copying raptor results from C:\Users\task_1609362602\build\raptor.json to C:\Users\task_1609362602\build\blobber_upload_dir\perfherder-data.json
[task 2020-12-30T23:27:52.677Z] 23:27:52 CRITICAL - Error copying results C:\Users\task_1609362602\build\raptor.json to upload dir C:\Users\task_1609362602\build\blobber_upload_dir\perfherder-data.json
[task 2020-12-30T23:27:52.677Z] 23:27:52 INFO - [Errno 2] No such file or directory: u'C:\Users\task_1609362602\build\raptor.json'
[task 2020-12-30T23:27:52.677Z] 23:27:52 INFO - Running post-action listener: _package_coverage_data
[task 2020-12-30T23:27:52.677Z] 23:27:52 INFO - Running post-action listener: _resource_record_post_action
[task 2020-12-30T23:27:52.677Z] 23:27:52 INFO - Running post-action listener: process_java_coverage_data
[task 2020-12-30T23:27:52.677Z] 23:27:52 INFO - Running post-action listener: stop_device
Florin, please take a look.

Flags: needinfo?(fstrugariu)
Flags: needinfo?(fstrugariu)

tested this locally but can't make it fail.
As we plan to migrate all tests to windows let's revisit this in 2 weeks

[:apavel] As the failure count is low is that OK?

Flags: needinfo?(fstrugariu) → needinfo?(apavel)

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #292)

tested this locally but can't make it fail.
As we plan to migrate all tests to windows let's revisit this in 2 weeks

[:apavel] As the failure count is low is that OK?

Looks like failure rate is going down, sounds good. Thank you!

Flags: needinfo?(apavel)

we are still migrating windows tests to browsertime.
and fail rate is still low I would still wait for the migration

Flags: needinfo?(fstrugariu)
Whiteboard: [stockwell disabled][ci-costs-2020:done][stockwell needswork:owner] → [stockwell disabled][ci-costs-2020:done][stockwell needswork:owner] fission-perf

Hi Florin, this seems to be spiking again, ca you take a look?

70 total failures in the last 7 days on

  • windows10-64-ref-hw-2017 opt
  • windows10-64-shippable opt
  • windows10-64-shippable-qr opt
  • linux1804-64 opt
  • macosx1014-64-shippable-qr opt
Flags: needinfo?(fstrugariu)

it's interesting that this alert is more reproducible on mozilla-central.

can't reproduce this locally because of Bug 1686708 - [Windows] Nightly does not start after updating to today version (Cisco AMP)

Flags: needinfo?(fstrugariu)

netflix cold fails permanently on Windows 10 x64 shippable with fission enabled (for ~10 days now after a few days with ~80% failure rate and ~50% before).
Florin, can this be investigated?

Flags: needinfo?(fstrugariu)

This appears to be a timeout; netflix never does more than 1 iteration (of 25).

Sparky: are we turning this off shortly?

Flags: needinfo?(gmierz2)

Yup, it should be turned off in a week. I'm looking into some of these timeouts anyway though since we will still have some tests running with webext.

Flags: needinfo?(gmierz2)
Flags: needinfo?(fstrugariu)

The number of failures is dropping now since we've disabled raptor webext pageload tests. I'll wait until next week before taking any action here.

:sparky does this still block Fission M7? are there any remaining Fission perf tests using raptor-webext?

Flags: needinfo?(gmierz2)

No, it shouldn't block that fission milestone anymore.

Flags: needinfo?(gmierz2)

(In reply to Greg Mierzwinski [:sparky] from comment #319)

No, it shouldn't block that fission milestone anymore.

Thanks. In that case, I will move this bug from Fission Beta experiment milestone M7 to Release experiment milestone M8, so we review this before shipping Fission MVP.

Fission Milestone: M7 → M8

This bug should not be blocking fission anymore. We also don't run very many tests in raptor-webext now.

Fission Milestone: M8 → ---

(In reply to Greg Mierzwinski [:sparky] from comment #324)

This bug should not be blocking fission anymore. We also don't run very many tests in raptor-webext now.

Thanks. In that case, I will set this bug's Fission Milestone to "Future", meaning the bug is related to Fission (IIUC?) but doesn't block shipping Fission MVP.

Fission Milestone: --- → Future

Sounds good, it may get closed by one of the bots since we don't have this failure anymore (or very rarely). This was a catch-all bug for all failures with this signature and, at one point, fission tests had the largest chunk of failures.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: