Closed Bug 1568050 Opened 5 years ago Closed 5 years ago

Intermittent raptor-main Critical: Tests failed to finish! Application timed out.

Categories

(Testing :: Raptor, defect, P1)

Version 3
defect

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1605991

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Regression)

Details

(Keywords: intermittent-failure, regression, Whiteboard: [comment 75][stockwell disable-recommended])

Attachments

(2 files)

Filed by: rmaries [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=257742678&repo=mozilla-central
Full log: https://queue.taskcluster.net/v1/task/D0lQPKNfRAaQXjScbx2DwQ/runs/0/artifacts/public/logs/live_backing.log


21:04:45 CRITICAL - raptor-main Critical: Tests failed to finish! Application timed out.
21:04:45 ERROR - raptor-main Error: Test failed to finish. Application timed out after 170 seconds
21:04:45 INFO - raptor-results-handler Info: summarizing raptor test results
21:04:45 ERROR - raptor-output Error: no raptor test results found for raptor-tp6m-wikipedia-geckoview-cold
21:04:45 INFO - raptor-output Info: error: no raptor test results found, so no need to combine browser cycles
21:04:45 ERROR - raptor-output Error: no summarized raptor results found for raptor-tp6m-wikipedia-geckoview-cold
21:04:45 CRITICAL - raptor-results-handler Critical: PERFHERDER_DATA was seen 0 times, expected 1.
21:04:45 INFO - raptor-main Info: removing test folder for raptor: /sdcard/raptor
21:04:46 INFO - adb shell_output: adb -s ZY3222CFLM wait-for-device shell rm -r /sdcard/raptor; echo adb_returncode=$?, timeout: None, root: False, timedout: None, exitcode: 0, output:
21:04:46 INFO - raptor-control-server Info: shutting down control server
21:04:46 INFO - raptor-main Info: finished
21:04:46 ERROR - raptor-main Critical: TEST-UNEXPECTED-FAIL: no raptor test results were found for raptor-tp6m-wikipedia-geckoview-cold
21:04:46 ERROR - Return code: 1

:bebe, could you (or someone on the SV team) please have a look and see if this is valid / if there is something to investigate here? Thanks!

Flags: needinfo?(fstrugariu)

There are 2 errors here:

  1. The app "goes unresponsive/crashes" during startup
    * https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=258538146&repo=mozilla-central&lineNumber=1880
  2. The app "goes unresponsive" during the test run
    * https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=258713871&repo=mozilla-central&lineNumber=1467

We would need to investigate both cases better.

Flags: needinfo?(marian.raiciof)
Flags: needinfo?(fstrugariu)
Flags: needinfo?(alexandru.ionescu)

I have investigated two logs so far, one for a passing test and one for a failing test.
It seems that the failure reason is related to raptor's runner.js file which is not loaded on the failing test.
The flow is blocked for ~ 3 minutes and then it fails with the timeout.

Passing test + logs:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&selectedJob=257742662&revision=6d98669f6869e4f050ba95d943c1577800e00462
https://taskcluster-artifacts.net/AssWdt-kSuyE74mgbGTV5A/0/public/logs/live_backing.log

Failing test + logs:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&selectedJob=257742678&revision=6d98669f6869e4f050ba95d943c1577800e00462
https://taskcluster-artifacts.net/D0lQPKNfRAaQXjScbx2DwQ/0/public/logs/live_backing.log

Side by side comparison of these 2 logs:
https://drive.google.com/file/d/1Y83cyZglXCtkkslg65RmwquAvLLSEkAg/view?usp=sharing

Flags: needinfo?(marian.raiciof)
Flags: needinfo?(alexandru.ionescu)

Over the last 7 days there are 50 failures present on this bug. These happen on: windows7-32-shippable, android-hw-p2-8-0-arm7-api-16, android-hw-g5-7-0-arm7-api-16.

Here is the most recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=260469132&repo=autoland&lineNumber=1380

Flags: needinfo?(rwood)
Summary: Intermittent tier 2 raptor-main Critical: Tests failed to finish! Application timed out. → Intermittent raptor-main Critical: Tests failed to finish! Application timed out.

Joel, any idea what to do here? Most failures are on android-hw machines and it has 212 failures in the last 30 days.
Marian, could you write a patch for the failures? Is this something you actively work on?

Flags: needinfo?(marian.raiciof)
Flags: needinfo?(jmaher)

these are only on the cold tests, 212 failures in the last 7 days is serious, I see good discussion in this bug and the right people on it, just not sure if this is a priority to fix (it is a P5, but one of the most frequent intermittent failures we have).

An alternative is to run cold load as tier-2 (m-c/try only) for the g5

:davehunt, can you ensure this is getting focus this week?

Flags: needinfo?(jmaher) → needinfo?(dave.hunt)

From the earlier comments it sounds like we're timing out waiting for the application to open, or that the application is crashing. Can we investigate to see if there are any patterns in the devices that are failing? Perhaps we have one or two defective devices in the lab. I also wonder if there's additional debug we could gather when we're expecting the application to be running. As this is happening only for the cold page load tests, perhaps we're trying to launch the application before the previous process has completely terminated? :bc do you have any suggestions for how we might proceed here?

:rwood if there's no progress or improvement by the end of this week, could you arrange for temporarily dropping g5 cold tests to tier 2 and disable them running on autoland/inbound.

Flags: needinfo?(dave.hunt) → needinfo?(bob)

There is clear that we have a problem with runner.js file :

I analyzed the logcat from android device and we have some issues:

  • NetworkError when fetching resources
  • Javascript loading failures
  • JavaScript SyntaxErrors - malformed URI for runner.js
  • we receive "starting raptorRunner", so raptorRunner() function is reached, but then
    we end up with -> JavaScript Error: "ReferenceError: getTestConfig is not defined"
    and from that point the script is not moving forward to

raptorLog(test name is: ${config.test_name});
raptorLog(test settings url is: ${config.test_settings_url});

More details here:
https://searchfox.org/mozilla-central/source/testing/raptor/webext/raptor/runner.js#669-673


JOB DETAILS:
https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=260469132&revision=ecbdc21c19e670e9881aa2a0238f2f8e95a2a521

FULL LOGCAT:
https://taskcluster-artifacts.net/bKPPZN_vTB2ErmLk-7pCbg/0/public/test_info/logcat-ZY322FZT7B.log

LOGCAT EXCERPT:

08-07 16:10:00.841 7316 7331 D GeckoThread: State changed to RUNNING
08-07 16:10:00.853 7295 7310 D GeckoViewRemoteDebugger: USB remote debugger - listening on org.mozilla.geckoview_example/firefox-debugger-socket
08-07 16:10:00.854 7295 7310 D GeckoViewConsole: enabled = true
08-07 16:10:00.906 7295 7310 D GeckoViewModule: queue GeckoView:LoadUri, data={"uri":"about:blank","flags":0}
08-07 16:10:00.907 7295 7310 D GeckoViewModule: queue GeckoView:SetFocused, data={"focused":true}
08-07 16:10:00.908 7295 7310 D GeckoViewModule: queue GeckoView:SetActive, data={"active":true}
08-07 16:10:00.908 7295 7310 D GeckoViewModule: queue GeckoView:SetFocused, data={"focused":true}

--->>>> 08-07 16:10:01.024 7295 7310 E GeckoConsole: [JavaScript Error: "NetworkError when attempting to fetch resource."]

08-07 16:10:01.024 7295 7310 E GeckoConsole: get@resource://services-settings/RemoteSettingsClient.jsm:306:12

--->>>> 08-07 16:10:01.025 7295 7310 W GeckoConsole: [JavaScript Warning: "Loading failed for the <script> with source “moz-extension://e7ef8f4c-4c1c-4b46-883a-b13de60c5407/auto_gen_test_config.jsâ€." {file: "moz-extension://e7ef8f4c-4c1c-4b46-883a-b13de60c5407/_generated_background_page.html" line: 5}]

08-07 16:10:01.034 7295 7310 I Gecko : console.log: "[raptor-runnerjs] Attaching event listener successful!"
08-07 16:10:01.037 7295 7310 I GeckoConsole: [raptor-runnerjs] Attaching event listener successful!

--->>>> 08-07 16:10:01.038 7295 7310 E GeckoConsole: [JavaScript Error: "SyntaxError: The URI is malformed." {file: "moz-extension://e7ef8f4c-4c1c-4b46-883a-b13de60c5407/runner.js" line: 625}]

08-07 16:10:01.038 7295 7310 E GeckoConsole: postToControlServer@moz-extension://e7ef8f4c-4c1c-4b46-883a-b13de60c5407/runner.js:625:10
08-07 16:10:01.038 7295 7310 E GeckoConsole: @moz-extension://e7ef8f4c-4c1c-4b46-883a-b13de60c5407/runner.js:737:22

--->>>> 08-07 16:10:01.053 7295 7310 I Gecko : console.log: "[raptor-runnerjs] starting raptorRunner"
08-07 16:10:01.054 7295 7310 I GeckoConsole: [raptor-runnerjs] starting raptorRunner

--->>>> 08-07 16:10:01.054 7295 7310 E GeckoConsole: [JavaScript Error: "ReferenceError: getTestConfig is not defined" {file: "moz-extension://e7ef8f4c-4c1c-4b46-883a-b13de60c5407/runner.js" line: 671}]
--->>>> 08-07 16:10:01.054 7295 7310 E GeckoConsole: raptorRunner@moz-extension://e7ef8f4c-4c1c-4b46-883a-b13de60c5407/runner.js:671:16

08-07 16:10:01.082 7295 7310 D GeckoViewNavigation: onLocationChange
08-07 16:10:01.083 7295 7310 D GeckoViewProgress: onLocationChange: location=about:blank, flags=0
08-07 16:10:01.084 7295 7310 D GeckoViewProgress: onSecurityChange
08-07 16:10:01.087 7316 7331 D GeckoViewContent[C]: onInit

Flags: needinfo?(marian.raiciof)

On another job:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&selectedJob=261228363&revision=1ca06cd3e226f80adf2b1b8837ed9a475fa90baa

We have this logcat :
https://taskcluster-artifacts.net/bQUQ7nTzQ2qIf_rxHMesZQ/0/public/test_info/logcat-ZY322MQ97M.log

After analysing the GeckoConsole lines i found this error:
GeckoConsole: [JavaScript Error: "Error while loading 'file:///storage/emulated/0/raptor/profile/extensions/raptor@mozilla.org/manifest.json' (NS_ERROR_FILE_ACCESS_DENIED)" {file: "resource://gre/modules/Extension.jsm" line: 522}]

If this manifest.json file is the same with:
https://searchfox.org/mozilla-central/source/testing/raptor/webext/raptor/manifest.json#12

We might lose the auto_gen_test_config.js which has the getTestConfig() definition, used by runner.js

Here is a copy of auto_gen_test_config.js from my local environment:
// this file is auto-generated by raptor, do not edit directly
function getTestConfig() {
return {"browser": "geckoview",
"cs_port": "52047",
"test_name": "raptor-tp6m-google-geckoview",
"test_settings_url": "http://127.0.0.1:52047/json/raptor-tp6m-google-geckoview.json",
"post_startup_delay": "30000",
"benchmark_port": "0",
"host": "127.0.0.1",
"debug_mode": "0",
"browser_cycle": "1"};
}

(In reply to Dave Hunt [:davehunt] [he/him] ⌚️UTC; 🏖 PTO August 15-26 from comment #14)

:rwood if there's no progress or improvement by the end of this week, could you arrange for temporarily dropping g5 cold tests to tier 2 and disable them running on autoland/inbound.

I'll make a patch for ^

Flags: needinfo?(rwood)
Depends on: 1574943
Whiteboard: [stockwell disable-recommended]

I don't have a g5 to test locally so I'm kind of flying in the dark here.

My first thought was about memory. The g5 devices appear to be from two cohorts: 1867MB and 2884MB. It seems the 1867MB devices have a higher failure rate.

The p2s all appear to have 3651MB. aerickson: Can you check with sakari or devarsh about the memory sizes on the devices?

Second thought has to do with the performance tweaks particularly the setting virtual memory parameters. Perhaps this is causing the low mem devices to fail due to swap issues?

Flags: needinfo?(bob) → needinfo?(aerickson)

Motorola G5's do come in two varieties: 16GB disk with 2GB RAM and 32GB disk with 3GB RAM.
All Pixel2's come with 4GB or RAM. Varieties: 64GB disk with 4GB RAM, 128GB disk with 4GB RAM

I've asked Bitbar for a report on the RAM for each of our G5's.

There are absolutely nodes that have a lower success ratio than others.

gecko-t-bitbar-gw-perf-g5.motog5-01  {sr: [          ]   0.0%, suc:  0, cmp: 19, exc:  0, rng:  1, alerts: ['Low health (less than 0.85)!']}
gecko-t-bitbar-gw-perf-g5.motog5-02  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-03  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-04  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-05  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-06  {sr: [          ]   5.3%, suc:  1, cmp: 19, exc:  0, rng:  1, alerts: ['Low health (less than 0.85)!']}
gecko-t-bitbar-gw-perf-g5.motog5-07  {sr: [          ]   0.0%, suc:  0, cmp: 19, exc:  0, rng:  1, alerts: ['Low health (less than 0.85)!']}
gecko-t-bitbar-gw-perf-g5.motog5-09  {sr: [==========] 100.0%, suc: 19, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-10  {sr: [========  ]  89.5%, suc: 17, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-11  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-12  {sr: [==========] 100.0%, suc: 19, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-13  {sr: [=======   ]  78.9%, suc: 15, cmp: 19, exc:  0, rng:  1, alerts: ['Low health (less than 0.85)!']}
gecko-t-bitbar-gw-perf-g5.motog5-14  {sr: [          ]   5.3%, suc:  1, cmp: 19, exc:  0, rng:  1, alerts: ['Low health (less than 0.85)!']}
gecko-t-bitbar-gw-perf-g5.motog5-16  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-17  {sr: [==========] 100.0%, suc: 19, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-18  {sr: [==========] 100.0%, suc: 19, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-19  {sr: [========  ]  89.5%, suc: 17, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-20  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-21  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-22  {sr: [========  ]  89.5%, suc: 17, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-23  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-24  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-25  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-26  {sr: [========  ]  89.5%, suc: 17, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-27  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-28  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-29  {sr: [========  ]  89.5%, suc: 17, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-30  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-31  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-32  {sr: [========  ]  89.5%, suc: 17, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-33  {sr: [========  ]  89.5%, suc: 17, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-34  {sr: [========  ]  89.5%, suc: 17, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-35  {sr: [========  ]  84.2%, suc: 16, cmp: 19, exc:  0, rng:  1, alerts: ['Low health (less than 0.85)!']}
gecko-t-bitbar-gw-perf-g5.motog5-36  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-37  {sr: [========= ]  94.7%, suc: 18, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-38  {sr: [========  ]  89.5%, suc: 17, cmp: 19, exc:  0, rng:  1}
gecko-t-bitbar-gw-perf-g5.motog5-39  {sr: [=======   ]  78.9%, suc: 15, cmp: 19, exc:  0, rng:  1, alerts: ['Low health (less than 0.85)!']}
Elapsed Time: 33.85214686393738

Here are the memory sizes for the g5s:

DEVICE_DISPLAY_NAME:motog5-01 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-02 MemTotal:1912812
DEVICE_DISPLAY_NAME:motog5-03 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-04 MemTotal:1912808
DEVICE_DISPLAY_NAME:motog5-05 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-06 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-07 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-08 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-09 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-10 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-11 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-12 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-13 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-14 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-15 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-16 MemTotal:2953344
DEVICE_DISPLAY_NAME:motog5-17 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-18 MemTotal:2953344
DEVICE_DISPLAY_NAME:motog5-19 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-20 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-21 MemTotal:2953344
DEVICE_DISPLAY_NAME:motog5-22 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-23 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-24 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-25 MemTotal:2953344
DEVICE_DISPLAY_NAME:motog5-26 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-27 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-28 MemTotal:2953344
DEVICE_DISPLAY_NAME:motog5-29 MemTotal:2953344
DEVICE_DISPLAY_NAME:motog5-30 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-31 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-32 MemTotal:2953344
DEVICE_DISPLAY_NAME:motog5-33 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-34 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-35 MemTotal:2953344
DEVICE_DISPLAY_NAME:motog5-36 MemTotal:2953344
DEVICE_DISPLAY_NAME:motog5-37 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-38 MemTotal:2953344
DEVICE_DISPLAY_NAME:motog5-39 MemTotal:1912804
DEVICE_DISPLAY_NAME:motog5-40 MemTotal:1912804
Flags: needinfo?(aerickson)
Attached file motog5 groups

we have 30 2G motog5 and 10 3G motog5. When I have some time to dig through the results, I'll try to get a picture of how the memory affects the success/failure rate for these two groups of devices. Perhaps we can segregate the tests so that tests which require more memory run on the high memory devices only.

davehunt, rwood: thoughts?

Flags: needinfo?(rwood)
Flags: needinfo?(dave.hunt)

(In reply to Bob Clary [:bc:] from comment #25)

we have 30 2G motog5 and 10 3G motog5. When I have some time to dig through the results, I'll try to get a picture of how the memory affects the success/failure rate for these two groups of devices. Perhaps we can segregate the tests so that tests which require more memory run on the high memory devices only.

davehunt, rwood: thoughts?

Interesting, thanks :bc. IMO instead of segregating the tests to run on the mogog5's with higher memory, we should just only run the tests that require more memory on the P2s and not on the motog5's at all.

Flags: needinfo?(rwood)

I looked at the last week of failures for g5 and see the following

Count, test, memory
33 tp6m-8, 1867
16 tp6m-27, 1867
11 tp6m-5, 1867
7 tp6m-4, 1867
5 tp6m-7, 1867
5 tp6m-2, 1867
4 unity-webgl, 1867
4 tp6m-3, 1867
3 tp6m-9, 1867
3 tp6m-8, 2884
3 tp6m-12, 1867
2 tp6m-6, 1867
2 tp6m-24, 1867
2 tp6m-21, 1867
2 tp6m-10, 2884
1 youtube-playback, 2884
1 unity-webgl, 2884
1 tp6m-24, 2884
1 tp6m-23, 1867
1 tp6m-20, 1867
1 tp6m-17, 2884
1 tp6m-16, 1867
1 tp6m-14, 1867
1 tp6m-11, 1867

If someone else could decide which tests to disable on g5 and follow through with a patch that would be awesome.

Are these all cold load tests? At least some of them must be, as the warm tests only go up to tp6m-9 and I see tp6m-8 in there twice. The cold tests only run a single site per test, whereas the warm tests run several. We could see if redistributing the chunks helps for any warm tests affected here. For cold tests, we should consider if we have a problem if we're running out of memory opening a single site 25 times. I don't think we want to disable these tests against the moto g5, as this is our reference device for low end. Perhaps a better question is should we review our reference devices, and what should the minimum memory be?

Flags: needinfo?(dave.hunt)
Flags: needinfo?(bob)

I didn't break it down sufficiently to answer that question. I'll redo it later today and get back to you.

They were not all cold tests. The worst offenders were

 29 tp6m-8-geckoview-e10s,tp6m-8,System memory: 1867MB.
 16 tp6m-27-geckoview-cold-e10s,tp6m-c-27,System memory: 1867MB.
  8 tp6m-5-geckoview-cold-e10s,tp6m-c-5,System memory: 1867MB.
  7 tp6m-4-geckoview-cold-e10s,tp6m-c-4,System memory: 1867MB.
  5 tp6m-7-geckoview-cold-e10s,tp6m-c-7,System memory: 1867MB.
  5 tp6m-2-geckoview-e10s,tp6m-2,System memory: 1867MB.
Flags: needinfo?(bob)

Based on OF there was a spike here between August 11th and 18th, however failure rate seems to have decreased. Removing disable recommended tag.

https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-07-30&endday=2019-08-29&tree=trunk&bug=1568050

There are 17 total failures in the last 7 days on android-hw-g5-7-0-arm7-api-16 opt and pgo.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=263895372&repo=mozilla-central&lineNumber=1520

[task 2019-08-28T17:06:56.008Z] 17:06:48 INFO - raptor-main Info: starting geckoview
[task 2019-08-28T17:06:56.008Z] 17:06:49 INFO - adb shell_output: adb -s ZY3223F52J wait-for-device shell am force-stop org.mozilla.geckoview_example, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T17:06:56.008Z] 17:06:49 INFO - adb Granting important runtime permissions to org.mozilla.geckoview_example
[task 2019-08-28T17:06:56.008Z] 17:06:50 INFO - adb shell_output: adb -s ZY3223F52J wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.WRITE_EXTERNAL_STORAGE, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T17:06:56.008Z] 17:06:51 INFO - adb shell_output: adb -s ZY3223F52J wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.READ_EXTERNAL_STORAGE, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T17:06:56.008Z] 17:06:52 INFO - adb shell_output: adb -s ZY3223F52J wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.ACCESS_COARSE_LOCATION, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T17:06:56.008Z] 17:06:53 INFO - adb shell_output: adb -s ZY3223F52J wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.ACCESS_FINE_LOCATION, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T17:06:56.008Z] 17:06:54 INFO - adb shell_output: adb -s ZY3223F52J wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.CAMERA, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T17:06:56.008Z] 17:06:56 INFO - adb shell_output: adb -s ZY3223F52J wait-for-device shell pm grant org.mozilla.geckoview_example android.permission.RECORD_AUDIO, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T17:09:46.519Z] 17:06:56 INFO - adb launch_application: am start -W -n org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity -a android.intent.action.MAIN -d about:blank
[task 2019-08-28T17:09:46.520Z] 17:06:57 INFO - adb shell_output: adb -s ZY3223F52J wait-for-device shell am start -W -n org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity -a android.intent.action.MAIN -d about:blank, timeout: None, root: False, timedout: None, exitcode: 0, output: Starting: Intent { act=android.intent.action.MAIN dat=about:blank cmp=org.mozilla.geckoview_example/.GeckoViewActivity }
[task 2019-08-28T17:09:46.520Z] 17:06:57 INFO - Status: ok
[task 2019-08-28T17:09:46.520Z] 17:06:57 INFO - Activity: org.mozilla.geckoview_example/.GeckoViewActivity
[task 2019-08-28T17:09:46.520Z] 17:06:57 INFO - ThisTime: 653
[task 2019-08-28T17:09:46.520Z] 17:06:57 INFO - TotalTime: 653
[task 2019-08-28T17:09:46.520Z] 17:06:57 INFO - WaitTime: 666
[task 2019-08-28T17:09:46.520Z] 17:06:57 INFO - Complete
[task 2019-08-28T17:09:46.520Z] 17:06:57 INFO - adb shell_output: adb -s ZY3223F52J wait-for-device shell pidof org.mozilla.geckoview_example, timeout: None, root: False, timedout: None, exitcode: 0, output: 7966
[task 2019-08-28T17:09:46.520Z] 17:09:44 INFO - adb shell_output: adb -s ZY3223F52J wait-for-device shell am force-stop org.mozilla.geckoview_example, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T17:09:46.520Z] 17:09:44 CRITICAL - raptor-main Critical: Tests failed to finish! Application timed out.
[task 2019-08-28T17:09:46.520Z] 17:09:44 ERROR - raptor-main Error: Test failed to finish. Application timed out after 170 seconds
[task 2019-08-28T17:09:46.520Z] 17:09:44 INFO - raptor-main Info: removing reverse socket connections
[task 2019-08-28T17:09:46.520Z] 17:09:44 INFO - adb command_output: adb -s ZY3223F52J wait-for-device reverse --remove-all, timeout: None, timedout: None, exitcode: 0, output:
[task 2019-08-28T17:09:46.520Z] 17:09:45 INFO - adb shell_bool: adb -s ZY3223F52J wait-for-device shell test -d /sdcard/raptor/profile/minidumps, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T17:09:46.520Z] 17:09:45 INFO - adb shell_output: adb -s ZY3223F52J wait-for-device shell sync, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T17:09:46.520Z] 17:09:45 INFO - adb shell_bool: adb -s ZY3223F52J wait-for-device shell test -d /sdcard/raptor/profile/minidumps, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T17:09:46.520Z] 17:09:45 INFO - adb command_output: adb -s ZY3223F52J wait-for-device pull /sdcard/raptor/profile/minidumps /tmp/tmpbI5XS4/minidumps, timeout: None, timedout: None, exitcode: 0, output: /sdcard/raptor/profile/minidumps/: 0 files pulled.
[task 2019-08-28T17:09:46.520Z] 17:09:45 INFO - mozproxy Stopping mitmproxy playback, killing process 787
[task 2019-08-28T17:09:46.520Z] 17:09:45 INFO - mozproxy Successfully killed the mitmproxy playback process
[task 2019-08-28T17:09:46.520Z] 17:09:45 INFO - raptor-main Info: removing webext /builds/task_1567011825/workspace/build/tests/raptor/raptor/../webext/raptor
[task 2019-08-28T17:09:46.520Z] 17:09:45 INFO - raptor-results-handler Info: summarizing raptor test results
[task 2019-08-28T17:09:46.520Z] 17:09:45 ERROR - raptor-output Error: no raptor test results found for raptor-tp6m-bing-geckoview-cold
[task 2019-08-28T17:09:46.520Z] 17:09:45 INFO - raptor-output Info: error: no raptor test results found, so no need to combine browser cycles
[task 2019-08-28T17:09:46.520Z] 17:09:45 ERROR - raptor-output Error: no summarized raptor results found for raptor-tp6m-bing-geckoview-cold
[task 2019-08-28T17:09:46.520Z] 17:09:45 CRITICAL - raptor-results-handler Critical: PERFHERDER_DATA was seen 0 times, expected 1.
[task 2019-08-28T17:09:46.520Z] 17:09:45 INFO - raptor-main Info: removing test folder for raptor: /sdcard/raptor
[task 2019-08-28T17:09:46.520Z] 17:09:45 INFO - adb shell_output: adb -s ZY3223F52J wait-for-device shell sync, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T17:09:46.520Z] 17:09:45 INFO - adb shell_output: adb -s ZY3223F52J wait-for-device shell rm -r /sdcard/raptor, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T17:09:46.520Z] 17:09:45 INFO - adb shell_output: adb -s ZY3223F52J wait-for-device shell sync, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T17:09:46.520Z] 17:09:46 INFO - adb shell_bool: adb -s ZY3223F52J wait-for-device shell test -e /sdcard/raptor, timeout: None, root: False, timedout: None, exitcode: 1, output:
[task 2019-08-28T17:09:46.520Z] 17:09:46 INFO - raptor-control-server Info: shutting down control server
[task 2019-08-28T17:09:46.520Z] 17:09:46 INFO - raptor-main Info: finished
[task 2019-08-28T17:10:05.918Z] 17:09:46 ERROR - raptor-main Critical: TEST-UNEXPECTED-FAIL: no raptor test results were found for raptor-tp6m-bing-geckoview-cold
[task 2019-08-28T17:10:05.918Z] 17:09:46 ERROR - Return code: 1

Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

I bisected the most recent regression regarding this bug is

changeset: 493035:e7364a10a663
user: Rob Wu <rob@robwu.nl>
date: Thu Sep 12 21:39:51 2019 +0000
summary: Bug 1570715 - Treat (deprecation) warnings as errors r=rpl

Edit: Actually this was already filed as bug 1581223

Regressed by: 1570715
No longer regressed by: 1570715
Whiteboard: [stockwell disable-recommended]
Flags: needinfo?(fstrugariu)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork]
Assignee: nobody → alexandru.ionescu
Whiteboard: [stockwell disable-recommended]
Flags: needinfo?(fstrugariu)

There are 36 failures present on this bug over the last 7 days, these happen on windows10-64-ref-hw-2017, linux64-shippable, android-hw-p2-8-0-arm7-api-16, android-hw-g5-7-0-arm7-api-16

Here is the most recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=271930000&repo=mozilla-central&lineNumber=1516

Flags: needinfo?(alexandru.ionescu)

(In reply to Stefan Hindli [:stefan_hindli] from comment #45)

There are 36 failures present on this bug over the last 7 days, these happen on windows10-64-ref-hw-2017, linux64-shippable, android-hw-p2-8-0-arm7-api-16, android-hw-g5-7-0-arm7-api-16

Here is the most recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=271930000&repo=mozilla-central&lineNumber=1516

I'm working on it.

Flags: needinfo?(alexandru.ionescu)
Whiteboard: [stockwell needswork:owner]
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner][comment 46]
Whiteboard: [comment 46][stockwell disable-recommended] → [comment 46]

Over the last 7 days there are 51 failures present on this bug. These happen on windows10-aarch64, windows10-64-ref-hw-2017, macosx1014-64-shippable, android-hw-p2-8-0-arm7-api-16, android-hw-g5-7-0-arm7-api-16,

Here is the most recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=273422281&repo=autoland&lineNumber=1458

Flags: needinfo?(alexandru.ionescu)

I am still working on it.

Flags: needinfo?(alexandru.ionescu)
Whiteboard: [comment 46][stockwell disable-recommended] → [comment 46]

I believe recent occurences here (15th of Nov) are due to https://hg.mozilla.org/mozilla-central/rev/738801392270

Agi can you provide some insight?

Flags: needinfo?(agi)

Thanks, I think I know what's going on there, I'll fix it before pushing again.

Flags: needinfo?(agi)
Whiteboard: [comment 46][stockwell disable-recommended] → [comment 46][comment 60]
Whiteboard: [comment 46][comment 60] → [comment 46][stockwell fixed:backout]

This is spiking again. Agi are there any updates?

Flags: needinfo?(agi)
Whiteboard: [comment 46][stockwell fixed:backout] → [comment 46][stockwell needswork:owner]

I don't see the spike? https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-10-28&endday=2019-11-27&tree=trunk&bug=1568050 Also my changes have been in m-c since the 18th so I doubt it's anything related to that?

Flags: needinfo?(agi)

(In reply to :Agi | ⏰ PST | he/him from comment #66)

I don't see the spike? https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-10-28&endday=2019-11-27&tree=trunk&bug=1568050 Also my changes have been in m-c since the 18th so I doubt it's anything related to that?

You cannot see it because that is filtered for the last month.

Since your patch landed it looks like this: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-11-16&endday=2019-11-27&tree=trunk&bug=1568050

Flags: needinfo?(agi)

I still don't understand, the spike on the 24th was fixed by this https://hg.mozilla.org/integration/autoland/rev/b4755981c1382cb88fed4e4fcff3ba73779b2080. Excluding that the level of intermittents is consistent with what looks to be the normal for this test? Why do you think it's my change? I looked into some of the failures and I don't see anything pointing to Bug 1530402

Flags: needinfo?(agi)

(In reply to :Agi | ⏰ PST | he/him from comment #68)

I still don't understand, the spike on the 24th was fixed by this https://hg.mozilla.org/integration/autoland/rev/b4755981c1382cb88fed4e4fcff3ba73779b2080. Excluding that the level of intermittents is consistent with what looks to be the normal for this test? Why do you think it's my change? I looked into some of the failures and I don't see anything pointing to Bug 1530402

I don't think it's your change, I'm just asking if it is. However, you answered what I actually wanted to know, meaning: "the spike on the 24th was fixed by this https://hg.mozilla.org/integration/autoland/rev/b4755981c1382cb88fed4e4fcff3ba73779b2080"

Thank you.

Ah sorry I misread your comments, cool. Thanks for looking into this!

(In reply to :Agi | ⏰ PST | he/him from comment #70)

Ah sorry I misread your comments, cool. Thanks for looking into this!

Hehe, no problem :) thank you for helping me figure this out.

This failure message is frequently shown in logs for tasks run which are set as 'Pass'. If this non-fatal error message gets shown on a run in which an actual failure occurred this bug might be considered the original one which caused the task to fail. Currently all occurrences I am aware of are for test-android-hw-g5-7-0-arm7-api-16/pgo-raptor-tp6m-<randomchunknumber>-geckoview-cold-e10s. Examples:

https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&resultStatus=superseded%2Cusercancel%2Cretry%2Csuccess%2Ctestfailed%2Cbusted%2Cexception&revision=8861d7d25b128f92873bfeedda8b31ef96c36729&selectedJob=278593443

https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&resultStatus=superseded%2Cusercancel%2Cretry%2Csuccess%2Ctestfailed%2Cbusted%2Cexception&revision=90701a3e29e1a2b75d93596166a95c9a7aacf592&selectedJob=278587009

Hi Alexandru. Is there any way to disable this until you have a fix?

There are 35 total failures in the last 7 days and 570 total failures in the last 30 on android-hw-g5-7-0-arm7-api-16 pgo.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=278929257&repo=mozilla-central&lineNumber=2054

Flags: needinfo?(aionescu)

Don't think so. The intermittent is caused by the raptor tp6 tests. I have a fix that might relate to this one, or at least make the occurrence lower. Just pushed to try and submitted to phab review.
Bebe, other ideas?

Flags: needinfo?(aionescu) → needinfo?(fstrugariu)
Whiteboard: [comment 46][stockwell disable-recommended] → [comment 75]
Assignee: aionescu → nobody

The intermittents here are mixed, but most often occurs CRITICAL - raptor-main Critical: Tests failed to finish! Application timed out.

looking at windows (!aarch64), the most recent 5 failures match that, most failures are on tp6-c-XX for firefox. here is a successful log snippet:
[task 2019-12-09T10:59:15.913Z] 10:59:15 INFO - raptor-mitmproxy Info: Verified mitmproxy CA certificate is installed in Firefox
[task 2019-12-09T10:59:15.913Z] 10:59:15 INFO - raptor-main Info: Playback recording date: 2019-06-21
[task 2019-12-09T10:59:15.913Z] 10:59:15 INFO - raptor-main Info: starting firefox
[task 2019-12-09T10:59:15.913Z] 10:59:15 INFO - Application command: C:\Users\task_1575887212\build\application\firefox\firefox.exe --wait-for-browser -profile c:\users\task_1575887212\appdata\local\temp\tmpte23fe.mozrunner
[task 2019-12-09T10:59:17.604Z] 10:59:17 INFO - PID 676 | console.log: "[raptor-runnerjs] Attaching event listener successful!"
[task 2019-12-09T10:59:17.609Z] 10:59:17 INFO - PID 676 | console.log: "[raptor-runnerjs] starting raptorRunner"
[task 2019-12-09T10:59:17.609Z] 10:59:17 INFO - PID 676 | console.log: "[raptor-runnerjs] test name is: raptor-tp6-sheets-firefox-cold"
[task 2019-12-09T10:59:17.609Z] 10:59:17 INFO - PID 676 | console.log: "[raptor-runnerjs] test settings url is: http://127.0.0.1:49783/json/raptor-tp6-sheets-firefox-cold.json"
[task 2019-12-09T10:59:17.609Z] 10:59:17 INFO - PID 676 | console.log: "[raptor-runnerjs] raptor runner.js is loaded!"
[task 2019-12-09T10:59:17.609Z] 10:59:17 INFO - PID 676 | console.log: "[raptor-runnerjs] posting to control server"
[task 2019-12-09T10:59:17.609Z] 10:59:17 INFO - PID 676 | console.log: "[raptor-runnerjs] raptor runner.js is loaded!"
[task 2019-12-09T10:59:17.614Z] 10:59:17 INFO - PID 676 | console.log: "[raptor-runnerjs] testing on Firefox 73.0a1 20191209095039"
[task 2019-12-09T10:59:17.614Z] 10:59:17 INFO - PID 676 | console.log: "[raptor-runnerjs] getting test settings from control server"
[task 2019-12-09T10:59:17.619Z] 10:59:17 INFO - raptor-control-server Info: received webext_status: raptor runner.js is loaded!
[task 2019-12-09T10:59:17.619Z] 10:59:17 INFO - raptor-control-server Info: reading test settings from json/raptor-tp6-sheets-firefox-cold.json
[task 2019-12-09T10:59:17.619Z] 10:59:17 INFO - raptor-control-server Info: sent test settings to webext runner
[task 2019-12-09T10:59:17.624Z] 10:59:17 INFO - PID 676 | console.log: "[raptor-runnerjs] post success"

and the failing ones:
task 2019-12-09T00:00:32.148Z] 00:00:32 INFO - Application command: C:\Users\task_1575841074\build\application\firefox\firefox.exe --wait-for-browser -profile c:\users\task_1575841074\appdata\local\temp\tmpu31emg.mozrunner
[task 2019-12-09T00:03:34.255Z] 00:03:34 CRITICAL - raptor-main Critical: Tests failed to finish! Application timed out.
[task 2019-12-09T00:03:34.255Z] 00:03:34 ERROR - raptor-main Error: Test failed to finish. Application timed out after 170 seconds
[task 2019-12-09T00:03:34.255Z] 00:03:34 INFO - raptor-mitmproxy Info: Stopping mitmproxy playback, killing process 10168
[task 2019-12-09T00:03:34.274Z] 00:03:34 ERROR - raptor-mitmproxy Error: Mitmproxy exited with error code 572
[task 2019-12-09T00:03:34.274Z] 00:03:34 INFO - raptor-mitmproxy Info: Turning off the browser proxy
[task 2019-12-09T00:03:34.274Z] 00:03:34 INFO - raptor-mitmproxy Info: writing: C:\Users\task_1575841074\build\application\firefox\distribution\policies.json

the problem is that "[raptor-runnerjs] Attaching event listener successful!" is never output in the failing case and the task times out. this seems predominately on windows, if there are concerns with windows startup, maybe asking :dmajor for help debugging would be a good idea.

Hi :dmajor, can you please debug into the above?

Flags: needinfo?(fstrugariu) → needinfo?(dmajor)

Sorry, I don't know how to debug this kind of thing. Whatever is happening here is at a very different level of the application than my expertise.

Flags: needinfo?(dmajor)
Priority: P5 → P1

This is the same as Bug 1605991. Duping there.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(snorp)
Resolution: --- → DUPLICATE
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: