Closed Bug 1610279 Opened 5 years ago Closed 4 years ago

tp6m-c-1-vismet failing on Android 7.0 Moto G5 (Try)

Categories

(Testing :: Raptor, defect, P2)

ARM
Android
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: acreskey, Unassigned)

References

Details

On pushes to Try with or without additional commits, I'm consistently seeing the tp6m-c-1-vismet test fail on Android 7.0 Moto G5.

e.g.
https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=e599f431034ecca4d749fda872a2de60c3c1d721&selectedJob=285499634

The error is
Error extracting the video frames from /builds/worker/workspace/jobs/4/pages/www.amazon.com/data/video/1.mp4

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=285499634&repo=try&lineNumber=358

Also failing on Android 7.0 MotoG5 opt

Dave, could you set a priority for this?

Flags: needinfo?(dave.hunt)

(In reply to Ionuț Goldan [:igoldan] from comment #2)

Dave, could you set a priority for this?

Anything blocking visual metrics results should be considered P1.

Flags: needinfo?(dave.hunt)
Priority: -- → P1

I thought I saw this even with the newer browsertime from upstream, but I can't find the link. No matter, I will do an initial investigation.

Assignee: nobody → nalexander
Status: NEW → ASSIGNED

(In reply to Nick Alexander :nalexander [he/him] from comment #4)

I thought I saw this even with the newer browsertime from upstream, but I can't find the link. No matter, I will do an initial investigation.

OK, first thoughts: first, it's not clear that we should be running the (red) vismet task at all, 'cuz its dependent (orange) browsertime task did, in fact, fail. Let's put a pin in that.

Inspecting the geckodriver logs suggests that we're killing ourselves in some way:

[task 2020-01-20T00:12:15.111Z] 00:11:06     INFO -  [2020-01-20 00:11:06] INFO: [browsertime.video] Stop firefox window recorder.
[task 2020-01-20T00:12:15.111Z] 00:11:06     INFO -  [2020-01-20 00:11:06] VERBOSE: [browsertime] Executing privileged async script       const cb = arguments[arguments.length-1];
[task 2020-01-20T00:12:15.111Z] 00:11:06     INFO -        Promise.resolve(
[task 2020-01-20T00:12:15.111Z] 00:11:06     INFO -          windowUtils.setCompositionRecording(false)
[task 2020-01-20T00:12:15.111Z] 00:11:06     INFO -        )
[task 2020-01-20T00:12:15.111Z] 00:11:06     INFO -          .then(() => cb());
[task 2020-01-20T00:12:15.111Z] 00:11:06     INFO -  1579479066047	webdriver::server	DEBUG	-> GET /session/32de162e-2cca-4d10-a7f7-b9f5062f1ae7/moz/context
[task 2020-01-20T00:12:15.111Z] 00:11:06     INFO -  1579479066054	webdriver::server	DEBUG	<- 200 OK {"value":"content"}
[task 2020-01-20T00:12:15.111Z] 00:11:06     INFO -  1579479066055	webdriver::server	DEBUG	-> POST /session/32de162e-2cca-4d10-a7f7-b9f5062f1ae7/moz/context {"context":"chrome"}
[task 2020-01-20T00:12:15.111Z] 00:11:06     INFO -  1579479066063	webdriver::server	DEBUG	<- 200 OK {"value":null}
[task 2020-01-20T00:12:15.111Z] 00:11:06     INFO -  1579479066065	webdriver::server	DEBUG	-> POST /session/32de162e-2cca-4d10-a7f7-b9f5062f1ae7/execute/async {"script":"      const cb = arguments[arguments.length-1];\n      Promise.resolve(\n        windowUtils.setCompositionRecording(false)\n      )\n        .then(() => cb());\n  ","args":[null]}
[task 2020-01-20T00:12:15.111Z] 00:11:51     INFO -  1579479111164	webdriver::server	DEBUG	Deleting session
[task 2020-01-20T00:12:15.111Z] 00:11:51     INFO -  1579479111164	geckodriver::android	DEBUG	Force stopping the Android package: org.mozilla.geckoview_example
[task 2020-01-20T00:12:15.111Z] 00:11:51     INFO -  1579479111164	mozdevice	DEBUG	Force stopping Android package: org.mozilla.geckoview_example
[task 2020-01-20T00:12:15.111Z] 00:11:51     INFO -  1579479111164	mozdevice	DEBUG	execute_host_command: >> "host:transport:ZY322LDKCW"
[task 2020-01-20T00:12:15.111Z] 00:11:51     INFO -  1579479111164	mozdevice	DEBUG	execute_host_command: << []
[task 2020-01-20T00:12:15.111Z] 00:11:51     INFO -  1579479111164	mozdevice	DEBUG	execute_host_command: >> "shell:am force-stop org.mozilla.geckoview_example"
[task 2020-01-20T00:12:15.111Z] 00:11:53     INFO -  1579479113783	mozdevice	DEBUG	execute_host_command: << ""
[task 2020-01-20T00:12:15.111Z] 00:11:53     INFO -  1579479113783	geckodriver::marionette	DEBUG	Android package force-stopped
[task 2020-01-20T00:12:15.111Z] 00:11:53     INFO -  1579479113783	mozdevice	DEBUG	execute_host_command: >> "host:transport:ZY322LDKCW"
[task 2020-01-20T00:12:15.111Z] 00:11:53     INFO -  1579479113783	mozdevice	DEBUG	execute_host_command: << []
[task 2020-01-20T00:12:15.111Z] 00:11:53     INFO -  1579479113783	mozdevice	DEBUG	execute_host_command: >> "shell:am clear-debug-app org.mozilla.geckoview_example"
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -  1579479114833	mozdevice	DEBUG	execute_host_command: << ""
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -  1579479114833	geckodriver::android	DEBUG	Disabled reading from configuration file
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -  1579479114834	geckodriver::android	DEBUG	Android port forward (34803 -> 2829) stopped
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -  1579479114834	webdriver::server	DEBUG	<- 500 Internal Server Error {"value":{"error":"unknown error","message":"Failed to decode response from marionette","stacktrace":""}}
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -  /builds/task_1579478599/fetches/browsertime/node_modules/browsertime/node_modules/selenium-webdriver/lib/promise.js:2626
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -          throw error;
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -          ^
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -  WebDriverError: Failed to decode response from marionette
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -      at Object.throwDecodedError (/builds/task_1579478599/fetches/browsertime/node_modules/browsertime/node_modules/selenium-webdriver/lib/error.js:514:15)
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -      at parseHttpResponse (/builds/task_1579478599/fetches/browsertime/node_modules/browsertime/node_modules/selenium-webdriver/lib/http.js:519:13)
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -      at doSend.then.response (/builds/task_1579478599/fetches/browsertime/node_modules/browsertime/node_modules/selenium-webdriver/lib/http.js:441:30)
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -      at <anonymous>
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -      at process._tickCallback (internal/process/next_tick.js:189:7)
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -  From: Task: WebDriver.executeScript()
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -      at Driver.schedule (/builds/task_1579478599/fetches/browsertime/node_modules/browsertime/node_modules/selenium-webdriver/lib/webdriver.js:807:17)
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -      at Driver.executeAsyncScript (/builds/task_1579478599/fetches/browsertime/node_modules/browsertime/node_modules/selenium-webdriver/lib/webdriver.js:891:17)
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -      at SeleniumRunner.runPrivilegedAsyncScript (/builds/task_1579478599/fetches/browsertime/node_modules/browsertime/lib/core/seleniumRunner.js:390:30)
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -      at <anonymous>
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -      at process._tickCallback (internal/process/next_tick.js:189:7)
[task 2020-01-20T00:12:15.111Z] 00:11:54     INFO -  raptor-main Info: removing reverse socket connections

Now, the geckodriver logging doesn't make it easy to determine the actual sequence, because there's no request ID threaded through response/request pairs. But it sure looks like there was an outstanding response/request pair. Or that the /execute/async response/request died in the middle (usually, "Failed to decode response from marionette" means a crash (often OOM, with window recorder)).

I can think of a few ways to proceed:

  1. Improve the geckodriver logging so that we get more information about in-flight response/request pairs. If we've forgotten an await somewhere in browsertime, this will help us find it. I will try to see if there is a mechanical concurrency issue of this type shortly.
  2. Actually observe crashes in browsertime jobs: Bug 1585036. This has independent value but will require some additional labour.

I realized that I didn't look at some of the logs to investigate crashes. I do think that we're seeing an OOM. From adb logcat:

01-20 00:11:51.092   535   535 E lowmemorykiller: Error writing /proc/8820/oom_score_adj; errno=22
01-20 00:11:51.094  1513  2623 W ActivityManager: Failed setting process group of 8820 to 1
...
01-20 00:11:51.105  1513  2567 I ActivityManager: Process org.mozilla.geckoview_example (pid 8820) has died
01-20 00:11:51.105  1513  2567 D ActivityManager: cleanUpApplicationRecord -- 8820
01-20 00:11:51.108  1513  2567 W ActivityManager: Force removing ActivityRecord{410e1e8 u0 org.mozilla.geckoview_example/.GeckoViewActivity t25798}: app died, no saved state

Which I think means that the GVE proc (in this case, it's 8820) was in fact killed by the OOM mechanism. However -- I don't see any evidence of crash reporting/breakpad. That means that we're unlikely to be able to do 2. in the comment above, i.e., to observe crashes.

Bas: could you suggest why we see OOM when we try to turn off (and write to disk, I guess) the window recorder rather than during regular use? That suggests that we can keep lots of frames but not process them. Any thoughts on why?

Flags: needinfo?(bas)

I'm not working on this.

Status: ASSIGNED → NEW
Assignee: nalexander → nobody
Priority: P1 → P2

This is fixed now for mobile.

Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(bas)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.