Open Bug 1585036 Opened 5 years ago Updated 2 months ago

Add support to raptor-browsertime to check for browser crashes for desktop

Categories

(Testing :: Raptor, enhancement, P2)

Version 3
enhancement

Tracking

(Not tracked)

People

(Reporter: rwood, Assigned: aglavic)

References

(Depends on 1 open bug, Blocks 3 open bugs)

Details

(Whiteboard: [fxp])

Attachments

(2 files)

Check for and report any browser crashes after raptor-browsertime test runs.

Blocks: 1582756

:bebe can you take a look over this?

Flags: needinfo?(fstrugariu)

Browsewrtime uses gecko driver to drive Firefox

Because of this we use a "temp" profile that is created and deleted by the driver

We can use this strategy to save the profile and collect the crash data:
https://firefox-source-docs.mozilla.org/testing/geckodriver/CrashReports.html

Notes on JS: https://selenium.dev/selenium/docs/api/javascript/module/selenium-webdriver/firefox.html

Or force browsertime/driver to use the preexisting profile and not clone info from a template
https://github.com/mozilla/browsertime/blob/master/lib/firefox/webdriver/index.js#L52

Assignee: nobody → tarek
Priority: P2 → P1

I talked with Tarek today and we agreed that I will pick up this ticket.

Assignee: tarek → nalexander
Status: NEW → ASSIGNED
Flags: needinfo?(fstrugariu)
Priority: P1 → P2

(In reply to Nick Alexander :nalexander [he/him] from comment #3)

I talked with Tarek today and we agreed that I will pick up this ticket.

And now it's been deprioritized.

For whomever looks at this next, we can force crashes in various processes by using chrome JS to load URIs, like:

await commands.js.runPrivileged('window.BrowserApp.selectedTab.browser.loadURI("about:crashcontent");');
await commands.js.runPrivileged('window.BrowserApp.selectedTab.browser.loadURI("about:crashparent");');
Assignee: nalexander → nobody
Status: ASSIGNED → NEW

An update on this: locally, I have geckodriver pulling minidumps off the device. I still need to wire up the mozcrash pieces, but that looks almost trivial once the minidumps are captured.

What I don't have great ideas for is how to test this in automation. Using privileged JS (see #c4) we can trigger various crashes, but adding Raptor/Browsertime test jobs that expect crashes is pretty ugly because there are a lot of assumptions baked into the existing code about pageload. (One can see the mismatch with the way the scenario tests were wedged in, and to a lesser extent, the way the benchmark tests were implemented.) At the end of the day I will probably add a new "type" and some flags to the test descriptions to describe the expected crashes, and the whole framework will get yet more overloaded :(

For the record, we see OOM crashes with the Window Recorder under browsertime a lot. They look like this "green" job, which crashes during the second run. (I.e, doesn't write 2.mp4.) Note the browsertime error:

[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -  [2020-01-31 19:02:39] VERBOSE: [browsertime.command.measure] Collecting metrics
[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -  [2020-01-31 19:02:39] INFO: [browsertime.video] Stop firefox window recorder.
[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -  [2020-01-31 19:02:39] TRACE: [browsertime] Executing privileged async script       const cb = arguments[arguments.length-1];
[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -        Promise.resolve(
[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -          windowUtils.setCompositionRecording(false)
[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -        )
[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -          .then(() => cb());
[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -      at SeleniumRunner.runPrivilegedAsyncScript (/builds/task_1580497051/fetches/browsertime/node_modules/browsertime/lib/core/seleniumRunner.js:417:9)
[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -      at enableWindowRecorder (/builds/task_1580497051/fetches/browsertime/node_modules/browsertime/lib/video/screenRecording/firefox/firefoxWindowRecorder.js:31:18)
[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -      at FirefoxWindowRecorder.stop (/builds/task_1580497051/fetches/browsertime/node_modules/browsertime/lib/video/screenRecording/firefox/firefoxWindowRecorder.js:175:11)
[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -      at Video.stop (/builds/task_1580497051/fetches/browsertime/node_modules/browsertime/lib/video/video.js:69:26)
[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -  [2020-01-31 19:02:39] VERBOSE: [browsertime] Executing privileged async script toggle composition recorder
[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -  1580497359322	webdriver::server	DEBUG	-> GET /session/d69b8b35-cbd1-4c86-9b50-d890a53b97e5/moz/context
[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -  1580497359324	webdriver::server	DEBUG	<- 200 OK {"value":"content"}
[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -  1580497359325	webdriver::server	DEBUG	-> POST /session/d69b8b35-cbd1-4c86-9b50-d890a53b97e5/moz/context {"context":"chrome"}
[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -  1580497359327	webdriver::server	DEBUG	<- 200 OK {"value":null}
[task 2020-01-31T19:03:09.004Z] 19:02:39     INFO -  1580497359328	webdriver::server	DEBUG	-> POST /session/d69b8b35-cbd1-4c86-9b50-d890a53b97e5/execute/async {"script":"      const cb = arguments[arguments.length-1];\n      Promise.resolve(\n        windowUtils.setCompositionRecording(false)\n      )\n        .then(() => cb());\n  ","args":[null]}
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386132	webdriver::server	DEBUG	Deleting session
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386132	geckodriver::android	DEBUG	Force stopping the Android package: org.mozilla.geckoview_example
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386132	mozdevice	DEBUG	Force stopping Android package: org.mozilla.geckoview_example
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386132	mozdevice	DEBUG	execute_host_command: >> "host:transport:FA79D1A07331"
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386132	mozdevice	DEBUG	execute_host_command: << []
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386132	mozdevice	DEBUG	execute_host_command: >> "shell:am force-stop org.mozilla.geckoview_example"
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386199	mozdevice	DEBUG	execute_host_command: << ""
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386199	geckodriver::marionette	DEBUG	Android package force-stopped
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386199	mozdevice	DEBUG	execute_host_command: >> "host:transport:FA79D1A07331"
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386199	mozdevice	DEBUG	execute_host_command: << []
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386199	mozdevice	DEBUG	execute_host_command: >> "shell:am clear-debug-app org.mozilla.geckoview_example"
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386223	mozdevice	DEBUG	execute_host_command: << ""
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386223	geckodriver::android	DEBUG	Disabled reading from configuration file
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386224	geckodriver::android	DEBUG	Android port forward (44607 -> 2829) stopped
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386224	webdriver::server	DEBUG	<- 500 Internal Server Error {"value":{"error":"unknown error","message":"Failed to decode response from marionette","stacktrace":""}}
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  [2020-01-31 19:03:06] ERROR: [browsertime] Couldn't execute async script named toggle composition recorder error:WebDriverError: Failed to decode response from marionette
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  [2020-01-31 19:03:06] ERROR: [browsertime] WebDriverError: Failed to decode response from marionette
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -      at Object.throwDecodedError (/builds/task_1580497051/fetches/browsertime/node_modules/browsertime/node_modules/selenium-webdriver/lib/error.js:550:15)
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -      at parseHttpResponse (/builds/task_1580497051/fetches/browsertime/node_modules/browsertime/node_modules/selenium-webdriver/lib/http.js:563:13)
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -      at Executor.execute (/builds/task_1580497051/fetches/browsertime/node_modules/browsertime/node_modules/selenium-webdriver/lib/http.js:489:26)
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -      at process._tickCallback (internal/process/next_tick.js:68:7)
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386225	webdriver::server	DEBUG	-> DELETE /session/d69b8b35-cbd1-4c86-9b50-d890a53b97e5
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386225	webdriver::server	DEBUG	<- 404 Not Found {"value":{"error":"invalid session id","message":"Tried to run command without establishing a connection","stacktrace":""}}
[task 2020-01-31T19:03:09.004Z] 19:03:06     INFO -  1580497386229	geckodriver	DEBUG	Listening on 127.0.0.1:54237

and the adb logcat, which I think shows the OOM killer reaping processes until the tab goes down:

01-31 19:02:39.257   806   806 E android.hardware.power@1.1-service.wahoo: Governor not supported by powerHAL, skipping
01-31 19:02:39.285  6618  6634 I Gecko   : 1580497359285	Marionette	DEBUG	0 -> [0,18,"WebDriver:ExecuteScript",{"args":[5000],"script":"\nreturn (function(waitTime) {\n    try { \n            var end = wi ... () > end - start + waitTime);\n    } \n    catch(e) {\n        return true;\n    }\n})(arguments[arguments.length - 1]);\n"}]
01-31 19:02:39.290  6618  6634 I Gecko   : 1580497359290	Marionette	DEBUG	0 <- [1,18,null,{"value":true}]
01-31 19:02:39.293  6618  6634 I Gecko   : 1580497359293	Marionette	DEBUG	0 -> [0,19,"WebDriver:ExecuteScript",{"args":[],"script":"return document.documentURI;"}]
01-31 19:02:39.297  6618  6634 I Gecko   : 1580497359297	Marionette	DEBUG	0 <- [1,19,null,{"value":"https://www.amazon.com/"}]
01-31 19:02:39.318  6618  6634 I Gecko   : 1580497359318	Marionette	DEBUG	0 -> [0,20,"Marionette:GetContext",{}]
01-31 19:02:39.318  6618  6634 I Gecko   : 1580497359318	Marionette	DEBUG	0 <- [1,20,null,{"value":"content"}]
01-31 19:02:39.321  6618  6634 I Gecko   : 1580497359321	Marionette	DEBUG	0 -> [0,21,"Marionette:SetContext",{"value":"chrome"}]
01-31 19:02:39.321  6618  6634 I Gecko   : 1580497359321	Marionette	DEBUG	0 <- [1,21,null,{"value":null}]
01-31 19:02:39.323  6618  6634 I Gecko   : 1580497359323	Marionette	DEBUG	0 -> [0,22,"WebDriver:ExecuteAsyncScript",{"args":[null],"script":"      const cb = arguments[arguments.length-1];\n      Promise.resolve(\n        windowUtils.setCompositionRecording(false)\n      )\n        .then(() => cb());\n  "}]
01-31 19:02:39.535   806   806 E android.hardware.power@1.1-service.wahoo: Governor not supported by powerHAL, skipping
01-31 19:02:40.264   806   806 I chatty  : uid=1000(system) power@1.1-servi identical 5 lines
01-31 19:02:40.536   806   806 E android.hardware.power@1.1-service.wahoo: Governor not supported by powerHAL, skipping
01-31 19:02:49.134  6618  6634 E GeckoConsole: [JavaScript Error: "Polling for changes failed: NetworkError when attempting to fetch resource.." {file: "resource://services-settings/remote-settings.js" line: 251}]
01-31 19:02:49.134  6618  6634 E GeckoConsole: remoteSettingsFunction/remoteSettings.pollChanges@resource://services-settings/remote-settings.js:251:13
01-31 19:02:56.575  2454  2454 E TaskServiceHelper: Cannot execute TaskName.UNKNOWN (or data got corrupted).
01-31 19:02:56.627  4584  6873 W Gcs.WfaEliRefreshTask: [327] h.b: NULL eligibility response.
01-31 19:02:56.928  1177  1556 E BatteryStatsService: no controller energy info supplied
01-31 19:02:56.953  1177  1556 E KernelUidCpuFreqTimeReader: Failed to read /proc/uid_time_in_state: java.io.FileNotFoundException: /proc/uid_time_in_state (No such file or directory)
01-31 19:02:59.241  1177  1188 D WificondControl: Scan result ready event
01-31 19:03:00.028   806   806 E android.hardware.power@1.1-service.wahoo: Governor not supported by powerHAL, skipping
01-31 19:03:04.105  1177  1586 I ActivityManager: Process com.google.android.gms.unstable (pid 3748) has died: cch+4CEM 
01-31 19:03:00.082   806   806 E android.hardware.power@1.1-service.wahoo: Governor not supported by powerHAL, skipping
01-31 19:03:04.105  1177  1197 W zygote64: kill(-3748, 9) failed: No such process
01-31 19:03:04.105  1177  1197 I zygote64: Successfully killed process cgroup uid 10017 pid 3748 in 0ms
01-31 19:03:04.106   806   806 E android.hardware.power@1.1-service.wahoo: Governor not supported by powerHAL, skipping
01-31 19:03:04.624  1177  1586 I ActivityManager: Process com.google.android.contacts (pid 5283) has died: cch+6CEM 
01-31 19:03:04.159   806   806 E android.hardware.power@1.1-service.wahoo: Governor not supported by powerHAL, skipping
01-31 19:03:04.624  1177  1197 W zygote64: kill(-5283, 9) failed: No such process
01-31 19:03:04.625  1177  1197 I zygote64: Successfully killed process cgroup uid 10051 pid 5283 in 0ms
01-31 19:03:04.625   806   806 E android.hardware.power@1.1-service.wahoo: Governor not supported by powerHAL, skipping
01-31 19:03:04.631  1177  1556 I ActivityManager: Process android.process.acore (pid 5263) has died: cch+6CEM 
01-31 19:03:04.631  1177  1197 W zygote64: kill(-5263, 9) failed: No such process
01-31 19:03:04.631  1177  1197 I zygote64: Successfully killed process cgroup uid 10026 pid 5263 in 0ms
01-31 19:03:04.680   806   806 E android.hardware.power@1.1-service.wahoo: Governor not supported by powerHAL, skipping
01-31 19:03:05.288  1177  1556 I ActivityManager: Process com.android.mtp (pid 5995) has died: cch+4CEM 
01-31 19:03:05.288  1177  1197 W zygote64: kill(-5995, 9) failed: No such process
01-31 19:03:05.288  1177  1197 I zygote64: Successfully killed process cgroup uid 10003 pid 5995 in 0ms
01-31 19:03:05.289   806   806 E android.hardware.power@1.1-service.wahoo: Governor not supported by powerHAL, skipping
01-31 19:03:05.289   828   828 E lowmemorykiller: Error writing /proc/3031/oom_score_adj; errno=22
01-31 19:03:05.289  1177  1556 I ActivityManager: Process android.process.media (pid 6036) has died: cch+4CEM 
01-31 19:03:05.290  1177  1197 W zygote64: kill(-6036, 9) failed: No such process
01-31 19:03:05.290  1177  1197 I zygote64: Successfully killed process cgroup uid 10003 pid 6036 in 0ms
01-31 19:03:05.298  1177  1498 D ConnectivityService: ConnectivityService NetworkRequestInfo binderDied(NetworkRequest [ LISTEN id=12, [ Capabilities: INTERNET&NOT_RESTRICTED&TRUSTED] ], android.os.BinderProxy@72d45a4)
01-31 19:03:05.298  1177  1188 I ActivityManager: Process com.android.vending (pid 3031) has died: cch+6CEM 
01-31 19:03:05.299  1177  1197 W zygote64: kill(-3031, 9) failed: No such process
01-31 19:03:05.299  1177  1197 I zygote64: Successfully killed process cgroup uid 10006 pid 3031 in 0ms
01-31 19:03:05.351   806   806 E android.hardware.power@1.1-service.wahoo: Governor not supported by powerHAL, skipping
01-31 19:03:05.645  6618  6715 W google-breakpad: ExceptionHandler::GenerateDump cloned child 
01-31 19:03:05.645  6876  6715 W google-breakpad: ExceptionHandler::WaitForContinueSignal waiting for continue signal...
01-31 19:03:05.645  6618  6715 W google-breakpad: 6876
01-31 19:03:05.645  6618  6715 W google-breakpad: 
01-31 19:03:05.645  6618  6715 W google-breakpad: ExceptionHandler::SendContinueSignalToChild sent continue signal to child
01-31 19:03:05.685   806   806 E android.hardware.power@1.1-service.wahoo: Governor not supported by powerHAL, skipping
01-31 19:03:05.950   806   806 I chatty  : uid=1000(system) power@1.1-servi identical 1 line
01-31 19:03:06.019   806   806 E android.hardware.power@1.1-service.wahoo: Governor not supported by powerHAL, skipping
01-31 19:03:06.126  6657  6729 I Gecko   : Exiting due to channel error.
01-31 19:03:06.137  1177  2351 I ActivityManager: Process org.mozilla.geckoview_example:socket (pid 6679) has died: fore TOP 
01-31 19:03:06.138  1177  2351 W ActivityManager: Scheduling restart of crashed service org.mozilla.geckoview_example/org.mozilla.gecko.process.GeckoServiceChildProcess$socket in 1000ms
01-31 19:03:06.138  1177  1197 W zygote64: kill(-6679, 9) failed: No such process
01-31 19:03:06.138  1177  1197 I zygote64: Successfully killed process cgroup uid 10120 pid 6679 in 0ms
01-31 19:03:06.140   806   806 E android.hardware.power@1.1-service.wahoo: Governor not supported by powerHAL, skipping
01-31 19:03:06.146  1177  1584 I ActivityManager: Process org.mozilla.geckoview_example:tab (pid 6657) has died: fore TOP 
01-31 19:03:06.146  1177  1197 W zygote64: kill(-6657, 9) failed: No such process
01-31 19:03:06.146  1177  1584 W ActivityManager: Scheduling restart of crashed service org.mozilla.geckoview_example/org.mozilla.gecko.process.GeckoServiceChildProcess$tab in 10991ms
01-31 19:03:06.147  1177  1286 W InputDispatcher: channel '3dc7eb9 org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity (server)' ~ Consumer closed input channel or an error occurred.  events=0x9
01-31 19:03:06.148  1177  1586 I ActivityManager: Process org.mozilla.geckoview_example (pid 6618) has died: fore TOP 
01-31 19:03:06.148  1177  1497 I WindowManager: WIN DEATH: Window{3dc7eb9 u0 org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity}
01-31 19:03:06.148  1177  1586 W ActivityManager: Force removing ActivityRecord{82e35aa u0 org.mozilla.geckoview_example/.GeckoViewActivity t5806}: app died, no saved state
01-31 19:03:06.149  1177  1286 E InputDispatcher: channel '3dc7eb9 org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity (server)' ~ Channel is unrecoverably broken and will be disposed!
01-31 19:03:06.149  1177  1497 W InputDispatcher: Attempted to unregister already unregistered input channel '3dc7eb9 org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity (server)'
01-31 19:03:06.152   738   738 I Zygote  : Process 6618 exited due to signal (11)

note that we are getting some kind of minidump written:

01-31 19:03:05.645  6618  6715 W google-breakpad: ExceptionHandler::GenerateDump cloned child 
01-31 19:03:05.645  6876  6715 W google-breakpad: ExceptionHandler::WaitForContinueSignal waiting for continue signal...
01-31 19:03:05.645  6618  6715 W google-breakpad: 6876
01-31 19:03:05.645  6618  6715 W google-breakpad: 
01-31 19:03:05.645  6618  6715 W google-breakpad: ExceptionHandler::SendContinueSignalToChild sent continue signal to child

I would very much like somebody to dig into these minidumps and ensure that we really are getting OOM and not some actual crash in Gecko!

(In reply to Nick Alexander :nalexander [he/him] from comment #5)

An update on this: locally, I have geckodriver pulling minidumps off the device. I still need to wire up the mozcrash pieces, but that looks almost trivial once the minidumps are captured.

Hey Nick. Do you have any kind of code here you could share?

I would like to see crash handling support in geckodriver sooner than later. Getting the minidump out of the profile that geckodriver deletes immediatily after the test finished is hard. For a quick workaround I filed bug 1613064, but on the longer term bug 1433495 would be good to have (with support on desktop and mobile).

Depends on: 1433495
Flags: needinfo?(nalexander)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #7)

(In reply to Nick Alexander :nalexander [he/him] from comment #5)

An update on this: locally, I have geckodriver pulling minidumps off the device. I still need to wire up the mozcrash pieces, but that looks almost trivial once the minidumps are captured.

Hey Nick. Do you have any kind of code here you could share?

Yes.

I would like to see crash handling support in geckodriver sooner than later. Getting the minidump out of the profile that geckodriver deletes immediatily after the test finished is hard. For a quick workaround I filed bug 1613064, but on the longer term bug 1433495 would be good to have (with support on desktop and mobile).

I've posted a patch series (some of which has r?whimboo, although even those bits aren't quite ready for review). That series started with (ato's?) existing work on respecting MINIDUMP_SAVE_PATH. (For the record, I think using this env var is janky for external consumers; an argument or other configuration mechanism would be better.). Then I built adb pull support for Android. I'm aware of one serious issue with that functionality: we need to avoid using Path for on-device paths to avoid tickets like Bug 1609263. (Might as well update adb push while you're in the neighbourhood: Bug 1610629.) Finally, I started to think about testing, including automated testing, and concluded that it's probably possible but it's tricky because the Raptor + Browsertime harness just isn't very flexible. (Crash test handling looks like functional testing, not performance testing, so there's an impedance mismatch.). The last mile would be to wire up and ensure that the mozcrash handling works for the Raptor + Browsertime harness variants; once the minidumps are being pulled that should be fairly straight-forward.

Feel free to commandeer everything. Or, if you prefer, we can break out the adb pull patches into a separate ticket, I can address review comments, and then you can pick up the geckodriver + Raptor + Browsertime bits. Good luck!

Flags: needinfo?(nalexander)

Oh crumbs, I put the patch series against Bug 1433495. Oh well, most of it is actually for that functionality and not the Raptor + Browsertime bits.

No longer blocks: 1582756

Mass-removing myself from cc; search for 12b9dfe4-ece3-40dc-8d23-60e179f64ac1 or any reasonable part thereof, to mass-delete these notifications (and sorry!)

See Also: → 1642573
See Also: → 1678320
Depends on: 1678320
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: browsertime
No longer blocks: 1680214
Summary: Add support to raptor-browsertime to check for browser crashes → Add support to raptor-browsertime to check for browser crashes for desktop
Blocks: 1680214
No longer blocks: browsertime-ci-android
Severity: normal → S3
Severity: normal → S3

This can now be fixed by setting the MINIDUMP_SAVE_PATH environment variable to a folder that is used for artifact upload in CI. You can have a look at the wpt test jobs which work fine for that.

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #11)

This can now be fixed by setting the MINIDUMP_SAVE_PATH environment variable to a folder that is used for artifact upload in CI. You can have a look at the wpt test jobs which work fine for that.

That's great to know, thanks @whimboo!

@andrej, you had expressed interest in working on this before. As this is a valuable feature missing since we moved to browsertime, I have assigned it to you. Perhaps we could schedule it for an upcoming "sprint"?

Assignee: nobody → aglavic
Status: REOPENED → ASSIGNED
Status: ASSIGNED → NEW

:whimboo, I spent a bit of time looking into this today, and I haven't been able to get crashes saved/moved to the correct location. I've set the MINIDUMP_SAVE_PATH environment variable in multiple locations, but none of them seem to have an effect on where the crash dump file gets moved too or saved in. I am seeing the MINIDUMP_SAVE_PATH in the capabilities in a geckodriver log though (it's also set in the process that the browser is started from):

17:22:43     INFO -  raptor-browsertime Info: 1708708963430	webdriver::server	DEBUG	-> POST /session {"capabilities":{"firstMatch":[{}],"alwaysMatch":{"browserName":"firefox","pageLoadStrategy":"none","moz:fir ... linux-gnu/dist/bin/firefox","args":["-no-remote"],"env":{"MINIDUMP_SAVE_PATH":"/home/sparky/crashes"}},"webSocketUrl":true}}}

I see the minidumps being produced in the temporary profile folder that geckodriver produces, but the folder gets deleted before the minidumps get moved or saved elsewhere.

Do you know what I might be missing here?

Flags: needinfo?(hskupin)

If you are trying to do that for desktop you cannot use the env capability (this way is only supported on Android atm and we might even remove it). Instead you will have to set the environment variable when calling browsertime internally.

Flags: needinfo?(hskupin)

I tried that as well by setting process.env.MINIDUMP_SAVE_PATH = "/home/sparky/crashes"; in this area of the browsertime firefox builder, and none of the minidumps were copied/moved over (I've been testing with about:crashparent, and about:crashcontent): https://github.com/sitespeedio/browsertime/blob/a4dfa25fa21b43d6a287043a0bb5bffe1ded1768/lib/firefox/webdriver/builder.js#L245

I double checked to see if the env variables being set there are being found by firefox by setting --firefox.collectMozLog since it modifies the same process.env to get mozlogs collected, and it works fine. Are there any additional flags I might have to pass, or is there a flag that might be preventing this to work?

Flags: needinfo?(hskupin)

That is strange. Do you actually have a full trace log for such a case?

There shouldn't any other additional flags or environment variables be needed. Note that this is purely a Firefox feature and if Firefox has this env variable set you might better want to talk to Gabriele Svelto about it.

Flags: needinfo?(hskupin)
Attached file crash-testing-log.txt

Yup, I've attached the log of a test run. In this one I navigated to about:crashparent.

:gsvelto, do you know what might be happening here? We're trying to set the MINIDUMP_SAVE_PATH variable to save the minidumps from browsertime test runs (uses selenium), but it doesn't seem to have an effect.

As I mentioned in comment #15 above, we're able to get environment variables set for the firefox process (tested with mozlogs). I checked to see if this was a local issue as well by running about:crashparent in a talos test, and I'm able to get the crash generated there.

I've attached a file with logs from geckodriver, but let me know if you need any additional logs/info.

Flags: needinfo?(gsvelto)

:gsvelto, see comment #18.

I checked the log file and I cannot actually see a navigation to about:crashparent but about:blank. So not sure which action actually triggered the crash. Does the same also happen for about:crashcontent? In such a case Marionette is shutting down the browser.

Ah sorry, I should have mentioned that I navigated to about:crashparent through the URL bar. In this log, I changed the about:blank navigation to about:crashcontent. The logs are a bit different with about:crashcontent, but the minidumps don't get moved/copied to the MINIDUMP_SAVE_PATH folder either.

I'm not familiar with how geckodriver works, but MINIDUMP_SAVE_PATH is honored by mozcrash.py if no dump_save_path argument has been passed. Is it possible that mozcrash is being invoked with a different dump_save_path argument? If that's the case then MINIDUMP_SAVE_PATH will be ignored (see here).

Flags: needinfo?(gsvelto)

Oh interesting, thanks for the info! I can't get to that point because the temporary profile that's generated, and contains the minidumps is deleted by the time I have a chance to call check_for_crashes in the raptor-browsertime code. The browser gets started, and stopped in a node.js process that's separate from the code that would call mozcrash.

It sounds like I'd need to catch the crash within browsertime, and move the minidumps there, or something else along those lines.

Note that MINIDUMP_SAVE_PATH exists in combination with geckodriver to exactly not be affected by the profile removal. geckodriver does not have any internals like what mozcrash has, so minidumps need to be checked independently. As such use a folder for MINIDUMP_SAVE_PATH which clearly is created by Raptor, and can then be checked and deleted as well.

I was able to get it working for some cases in browsertime. Afaict, the reason wpt captures them when MINIDUMP_SAVE_PATH is used, is because it starts/runs the firefox process through mozrunner which has logic for capturing the minidumps. One thing I'm wondering is if the documentation here is still correct, I tried running a basic test similar to what's documented there where I open a browser, and crash it but the minidumps don't get saved: https://firefox-source-docs.mozilla.org/testing/geckodriver/CrashReports.html

For browsertime, I had to add some code here to pull the minidumps into the save path right before we quit the driver: https://github.com/sitespeedio/browsertime/blob/0af4e411159185ea7ec5f258564df54ad1085d24/lib/core/seleniumRunner.js#L527

However, an issue I'm hitting now for the majority of cases is that when the browser crashes, browsertime might be waiting to check if the page load completed. The result of this is that the webdriver tears down the session before we can check for minidumps. Here's an example output where the browser crashes, then a page complete check that's done from a driver.wait call triggers the teardown:

# navigated to about:crashcontent
13:30:41     INFO -  raptor-browsertime Info: 1709040641863	Marionette	TRACE	[14] MarionetteCommands actor destroyed for window id 30
13:30:41     INFO -  raptor-browsertime Info: 1709040641864	webdriver::server	DEBUG	<- 200 OK {"value":null}
13:30:41     INFO -  raptor-browsertime Info: 1709040641870	Marionette	DEBUG	Closed connection 0
13:30:41     INFO -  raptor-browsertime Info: console.error: ({})
# browser is crashed now, then the next line is the page load complete check being done in the driver.wait method which destroys the temp profile folder
13:30:42     INFO -  raptor-browsertime Info: 1709040642867	webdriver::server	DEBUG	-> POST /session/88a8f5af-1fc6-473b-97eb-ab89a906527a/execute/sync {"script":"\nreturn (function(waitTime) {\n    try { \n            var end = window.performance.timing.loadEventEnd;\n            var start= window.performance.timing.navigationStart;\n            return (end > 0) && (performance.now() > end - start + waitTime);\n    } \n    catch(e) {\n        return true;\n    }\n})(arguments[arguments.length - 1]);\n","args":[5000]}
13:30:42     INFO -  raptor-browsertime Info: 1709040642867	webdriver::server	DEBUG	Teardown session
13:30:42     INFO -  raptor-browsertime Info: 1709040642876	geckodriver::marionette	ERROR	Failed to close browser connection: Socket not connected (os error 107)
# temp profile directory with minidumps is now deleted

Here's the line where this happens at, but the teardown gets triggered on any driver calls from what I noticed: https://github.com/sitespeedio/browsertime/blob/0af4e411159185ea7ec5f258564df54ad1085d24/lib/core/seleniumRunner.js#L189-L194

:whimboo, would you have any ideas about how I could get around this or know of some way to prevent the session teardown? Alternatively, I'm thinking of finding a way to check if the browser process is still alive, and prevent driver calls when it's not.

Flags: needinfo?(hskupin)

If the minidump files have not been saved then I assume that MINIDUMP_SAVE_PATH is not correctly set for the Firefox process. Once a crash happened the crash reporter should automatically save the minidump in that folder and not within the minidumps subfolder of the used Firefox profile.

When the browser crashes we should return the exit code of the browser, which could be used to determine if the (unexpected) shutdown was a normal exit or not. The exit code of a content process crash will be 245 (since my fix on bug 1370520 landed).

Flags: needinfo?(hskupin)

Can you point me to the code that does this? Maybe that will help me debug this further. I might be missing something here, but based on :gsvelto's comment, it sounds like it's mozcrash that does the saving/capturing, and not something within Firefox.

Thanks for the tip regarding the exit code, I'll see if I can use that to move forward.

Flags: needinfo?(hskupin)

Oh no. Now I see. I completely missed that. I was really thinking that the crash reporter could save the minidumps directly into that folder. I filed bug 1882338 to really implement it in geckodriver. Sorry again!

Gabriele, would it be possible to instruct the crash reporter (maybe via such an environment variable) to not use the default minidump folder? I'm happy to file a bug if wanted and possible.

Depends on: 1882338
Flags: needinfo?(hskupin) → needinfo?(gsvelto)

No worries, thanks for taking a look! Glad we figured it out :)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #28)

Gabriele, would it be possible to instruct the crash reporter (maybe via such an environment variable) to not use the default minidump folder? I'm happy to file a bug if wanted and possible.

If you can access JS we already have the minidumpPath property which can be used to set the path where minidumps are written. See how we do it for xpcshell tests.

Flags: needinfo?(gsvelto)

(In reply to Gabriele Svelto [:gsvelto] from comment #30)

If you can access JS we already have the minidumpPath property which can be used to set the path where minidumps are written. See how we do it for xpcshell tests.

Oh that is neat! I didn't know about that. It's neat and would make it especially much easier for us to pull the minidump files from a remote device by not having to take care of the actual profile cleanup. Thanks!

No longer depends on: 1433495

Hi Gabriele, thanks for the pointers and we are looking at solutions using JS to set minidumpPath.

But in case of a very early crash, there is still a risk that we might not set minidumpPath in time. Do you think we could add a preference to set the default minidump path? This way geckodriver could set this preference in the profile before starting the browser.

Flags: needinfo?(gsvelto)

It would be rather easy. When we set the exception handler we use a temporary location for putting the minidumps which is later changed to the final location. We could set that to the contents of an environment variable, but not a preference, it's too early in the startup process for that.

That being said the temporary path is stored in minidumpPath, so the harness could read it and check for crashes under that path before setting it to the desired location.

Flags: needinfo?(gsvelto)

Please note that I've moved the recent conversation to bug 1884695 comment 1, where it's better suited. It's not really related to raptor (at least at the discussed level).

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: