crashes in telemetry marionette tests should symbolicate minidumps
Categories
(Toolkit :: Telemetry, enhancement, P1)
Tracking

| | Tracking | Status |
|---|---|---|
| firefox96 | --- | affected |
People
(Reporter: froydnj, Assigned: willkg)
References
Details
Attachments
(1 file)
Otherwise, C/C++/Rust crashes in these tests are very hard to debug.
(It's possible this is a more general marionette test harness problem, in which case feel free to move this bug around.)
Updated•4 years ago
Comment 1•4 years ago
Try run that deliberately crashes Firefox: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5b74ed389e36ff3d9455ee89e293978519889b59
Comment 2•4 years ago
The priority flag is not set for this bug.
:chutten, could you have a look please?
For more information, please visit auto_nag documentation.
Updated•4 years ago
Comment 4•4 years ago
I have been looking into this and asked folks for advice: At this point, I don't know what exactly needs to be done to get the minidumps symbolicated for the marionette/mozharness-based telemetry-tests-client suite. The build job seems to upload symbols as a zip archive and the test job downloads that archive, but as far as I can tell it doesn't add the symbols correctly.
Reporter
Comment 5•4 years ago
Does the harness know where to find the symbols that the test job downloads? Does the test job even invoke minidump_stackwalk, or is minidump_stackwalk even present for the test job? Does the harness point minidump_stackwalk at the symbols?
Updated•4 years ago
Comment 6•4 years ago
Is this still a P1?
Comment 7•4 years ago
Did this get fixed? This task from yesterday looks symbolified (mingw is being tracked in a different bug as far as I know).
Comment 8•4 years ago
I'm not aware of any patches to resolve this issue. Perhaps something changed in the harness we depend on for these tests.
Chutten, do you know more about this?
Comment 10•2 years ago
This is still an issue. Could this get back on the roadmap? It's one of the integration tests, and those can catch real-world issues; e.g., they reported the Windows failures today, but without symbolication the error message matched the one for intermittent failures.
Comment 11•2 years ago
Hi :aryx, I wasn't able to resolve this issue back when I looked into it and currently don't have the bandwidth to investigate further.
I have brought this up in a team meeting and I am actively looking for someone who can help with this.
Assignee
Comment 12•2 years ago
I spent some time looking into this. I haven't looked at any of this before, so if I appear clueless, it's because I probably am!
The log for the failed job in comment #10 is here:
The line where it logged running minidump_stackwalk is here:
[task 2021-10-27T11:17:54.602Z] 11:17:54 INFO - mozcrash Copy/paste: Z:/task_163533189718508/fetches\minidump_stackwalk\minidump_stackwalk.exe Z:\task_163533189718508\build\tmpmeh_w7gm.mozrunner\minidumps\f69c1f39-1be2-4f8c-b021-0b235e4af1f1.dmp C:\Users\task_163533189718508\AppData\Local\Temp\tmp8uqptln_ https://symbols.mozilla.org/
Then this line is from the modules output from minidump_stackwalk:
[task 2021-10-27T11:17:59.568Z] 11:17:59 INFO - 0x7ff815ca0000 - 0x7ff81cdd4fff xul.dll 95.0.0.7970 (WARNING: No symbols, xul.pdb, 2E2202753857E7524C4C44205044422E1)
That line says it can't find the symbols for xul.pdb 2E2202753857E7524C4C44205044422E1.
I checked https://symbols.mozilla.org/ for that file and it's in the bucket (you may not have permissions to this page):
https://symbols.mozilla.org/uploads/files/file/27964530
That file was uploaded at 2021-10-27T11:12:25.200Z. That's 5 minutes before the timestamp in the log line from the taskcluster job where it printed the line for how it was running minidump_stackwalk.
minidump_stackwalk on Windows has this usage:
Usage: minidump_stackwalk.exe [-m|-s] <miniump-file> [<symbol server URL> | <symbol-path>]+
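For illustration, here's how a harness might assemble that command line, sketched in Python (the helper name and structure are hypothetical, not the actual mozcrash code):

```python
def build_stackwalk_command(stackwalk_bin, dump_path, symbol_sources, output_mode=None):
    """Assemble a minidump_stackwalk invocation per the usage above:
    [-m|-s] <minidump-file> [<symbol server URL> | <symbol-path>]+"""
    cmd = [stackwalk_bin]
    if output_mode in ("-m", "-s"):
        cmd.append(output_mode)          # optional machine-readable output
    cmd.append(dump_path)                # the .dmp file to walk
    cmd.extend(symbol_sources)           # local symbol dirs and/or server URLs
    return cmd

cmd = build_stackwalk_command(
    "minidump_stackwalk.exe",
    "f69c1f39-1be2-4f8c-b021-0b235e4af1f1.dmp",
    ["symbols", "https://symbols.mozilla.org/"],
)
print(" ".join(cmd))
```

Note that both a local symbol path and a symbol server URL can appear in the same invocation, which is exactly the shape of the command mozcrash logged above.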
Given all that, the answers to Nathan's questions in comment #5 are these:
- Does the harness know where to find the symbols that the test job downloads? Yes.
- Does the test job even invoke minidump_stackwalk, or is minidump_stackwalk even present for the test job? Yes it's invoked and yes it's present.
- Does the harness point minidump_stackwalk at the symbols? Yes.
I downloaded minidump_stackwalk and the dmp file and ran it on my Windows 11 machine:
minidump_stackwalk.exe f69c1f39-1be2-4f8c-b021-0b235e4af1f1.dmp . https://symbols.mozilla.org/
The error output that http_symbol_supplier.cc is spitting out is this:
2021-11-08 16:27:40: http_symbol_supplier.cc:507: INFO: HTTPSymbolSupplier: querying https://symbols.mozilla.org/xul.pdb/2E2202753857E7524C4C44205044422E1/xul.sym
2021-11-08 16:27:40: http_symbol_supplier.cc:278: INFO: HTTPSymbolSupplier: HttpSendRequest: Error: 12152
I'm not sure what error 12152 is. I've also never run minidump_stackwalk on Windows 11 before.
Is that a bug in minidump_stackwalk? Is it the same issue that the Telemetry Test Client job in TaskCluster is hitting? Is it a problem with how I ran it locally? Is it a case of the Mondays?
Gabriele: Do you think this is a bug in minidump_stackwalk?
Comment 13•2 years ago
Looks like a bug in minidump_stackwalk. If I try this on my machine:
minidump_stackwalk.exe f69c1f39-1be2-4f8c-b021-0b235e4af1f1.dmp . https://symbols.mozilla.org/
It fails, and I get this type of error in the log:
2021-11-08 22:48:56: simple_symbol_supplier.cc:196: INFO: No symbol file at ./xul.pdb/2E2202753857E7524C4C44205044422E1/xul.sym
2021-11-08 22:48:56: http_symbol_supplier.cc:507: INFO: HTTPSymbolSupplier: querying https://symbols.mozilla.org/xul.pdb/2E2202753857E7524C4C44205044422E1/xul.sym
2021-11-08 22:49:00: http_symbol_supplier.cc:424: INFO: HTTPSymbolSupplier: symbol exists, saving to ./xul.pdb/2E2202753857E7524C4C44205044422E1/xul.sym
2021-11-08 22:49:00: http_symbol_supplier.cc:444: INFO: HTTPSymbolSupplier: failed to rename file, errno=18
However if I try with an absolute path:
minidump_stackwalk.exe f69c1f39-1be2-4f8c-b021-0b235e4af1f1.dmp /tmp https://symbols.mozilla.org/
Boom! It works! I haven't dug too deep but my gut feeling is that specifying '.' as the destination dir for the symbols triggers a bug where the file is getting downloaded and renamed in-place which leads to the error.
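For context, errno 18 is EXDEV ("cross-device link"): a bare rename() fails when source and destination resolve to different filesystems, which is one way a download-then-rename can blow up. A Python sketch of the robust pattern (a hypothetical helper, not the actual http_symbol_supplier.cc logic):

```python
import os
import shutil
import tempfile

def save_symbol_file(data: bytes, dest_path: str) -> None:
    """Write downloaded symbol data to a temp file, then move it into place.
    shutil.move falls back to copy+unlink when os.rename fails with
    EXDEV (errno 18), which a bare rename across filesystems does not."""
    os.makedirs(os.path.dirname(dest_path), exist_ok=True)
    fd, tmp_path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    shutil.move(tmp_path, dest_path)  # survives cross-filesystem moves
```

The failing case above downloads to a system temp location and then renames into the "." symbol directory; if those live on different volumes (or the in-place rename collides with itself), the rename fails where a copy-based move would succeed.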
Assignee
Comment 14•2 years ago
Oh, I didn't realize you had to pass in a temp directory. I changed the run line to this:
minidump_stackwalk.exe f69c1f39-1be2-4f8c-b021-0b235e4af1f1.dmp c:\Users\willk\AppData\Local\Temp https://symbols.mozilla.org/
I still get the same error.
An internet search suggests error 12152 on Windows is from not passing INTERNET_FLAG_SECURE in HttpOpenRequestA, which is this line:
I don't have a build environment on my Windows 11 machine, but I could try a fix, do a try build, and see if that fixes anything.
Assignee
Comment 15•2 years ago
Updated•2 years ago
Assignee
Comment 16•2 years ago
The function in question has these docs:
https://docs.microsoft.com/en-us/windows/win32/api/wininet/nf-wininet-httpopenrequesta
The fix here is to check the scheme and pass in the INTERNET_FLAG_SECURE flag accordingly.
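In Python pseudocode (the real fix lives in the C++ WinINet code; the value 0x00800000 is the INTERNET_FLAG_SECURE definition from wininet.h), the scheme check looks like:

```python
from urllib.parse import urlsplit

INTERNET_FLAG_SECURE = 0x00800000  # from wininet.h

def request_flags(symbol_server_url: str, base_flags: int = 0) -> int:
    """Add INTERNET_FLAG_SECURE only when the symbol server URL is https,
    mirroring the scheme check described above. Without the flag, WinINet's
    HttpSendRequest fails against an https server with error 12152."""
    if urlsplit(symbol_server_url).scheme == "https":
        return base_flags | INTERNET_FLAG_SECURE
    return base_flags
```

Deriving the flag from the URL rather than hard-coding it keeps plain http symbol servers working too, which matters for the http/https double-check mentioned below.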
When I pull down minidump_stackwalk.exe with that fix, I can run it on the dump in question with both http://symbols.mozilla.org/ (HTTP) and https://symbols.mozilla.org/ (HTTPS), and both work.
I think this is the right fix, but I'm not positive. Maybe there are other problems? Maybe the test harness is doing something subtly different? I'm not sure how to trigger the problematic situation where Firefox crashes during a telemetry-tests-client run to verify the issue.
Also, this fix is Windows-specific. Everything I've seen so far suggests this is only a problem on Windows. If there are logs that show this problem on other platforms, I'd love to see them.
Comment 17•2 years ago
Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/205695ca22d2 Pass INTERNET_FLAG_SECURE when using https symbol supplier. r=gsvelto
Comment 18•2 years ago
bugherder
Assignee
Comment 19•2 years ago
I'm reopening this. The fix I did fixes an issue, but I'm not sure if there are additional issues. I need to do a verification pass with the steps Raphael walked me through.
Updated•2 years ago
Assignee
Comment 20•2 years ago
When I left this, I had fixed an issue with minidump_stackwalk (with an underscore) where it couldn't make HTTPS connections when run on Windows so it couldn't talk to symbols.mozilla.org.
Then Aria switched minidump_stackwalk (with an underscore) with minidump-stackwalk (with a hyphen) which is the Rust rewrite that we're now using on Socorro (hooray!).
Prior to October 2021, mozcrash ran minidump(-|_)stackwalk such that it only looked at the symbols on disk that were generated during the build. After October 2021, it also looked up symbols on symbols.mozilla.org. The problem here is that we don't want mozcrash to rely on symbols.mozilla.org for symbols that were in the build, because that would be slower, more error-prone, and incur egress costs.
Thus I kept this bug open in order to verify two things:
- mozcrash is running minidump-stackwalk and successfully symbolicating crash stacks on all platforms when telemetry test client tests fail because Firefox crashed
- the symbols for the build are on disk and minidump-stackwalk prefers them over symbols it's fetching from symbols.mozilla.org
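The second point can be sketched as a local-first lookup (helper names here are hypothetical; neither mozcrash nor minidump-stackwalk literally contains this code). Symbol files are conventionally laid out as <debug-file>/<debug-id>/<debug-file stem>.sym, matching the xul.pdb/.../xul.sym paths in the logs above:

```python
import os

def find_symbol_file(debug_file, debug_id, local_symbol_dirs, fetch_remote):
    """Prefer the build's on-disk symbols; fall back to the symbol server
    callback only when no local .sym file exists."""
    sym_name = os.path.splitext(debug_file)[0] + ".sym"
    rel_path = os.path.join(debug_file, debug_id, sym_name)
    for symbol_dir in local_symbol_dirs:
        candidate = os.path.join(symbol_dir, rel_path)
        if os.path.isfile(candidate):
            return candidate           # local build symbols win
    return fetch_remote(rel_path)      # e.g. download from symbols.mozilla.org
```

This ordering is what keeps the suite working even when a build's symbols were never uploaded, and avoids paying network cost for symbols that shipped with the build artifacts.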
In order to do that, I need to be able to crash Firefox during a telemetry test client run.
I can do that by adding a navigation to about:crashparent to one of the telemetry test client tests. A resulting try run of that shows mozcrash running and minidump-stackwalk producing human-readable stacks for macos and linux.
However, when telemetry test client tests run on Windows and Firefox crashes in the way I'm doing it, mozprocess times out waiting for interprocess communication, mozcrash never runs, and minidump-stackwalk never walks the stack of the crash dump.
Here's my latest attempt:
Am I crashing Firefox wrong such that it doesn't work right on Windows? Is there a better way to do this? Am I hitting some bug in marionette or something similar?
Tagging :whimboo because Raphael said that was a good idea.
Comment 21•2 years ago
That is interesting, and I don't expect that behavior on Windows. A debug build also doesn't show much more besides the following output in the log:
https://treeherder.mozilla.org/logviewer?job_id=361505532&repo=try&lineNumber=2810-2855
[task 2021-12-16T16:02:39.586Z] 16:02:39 INFO - [Child 2528, IPC I/O Child] WARNING: [17B17B39CE863D7B.464D727C04E73B32]: Ignoring message 'EVENT_MESSAGE' to peer 1.1 due to a missing broker: file /builds/worker/checkouts/gecko/ipc/glue/NodeController.cpp:346
[task 2021-12-16T16:02:39.586Z] 16:02:39 INFO - Exiting due to channel error.
[task 2021-12-16T16:07:39.607Z] 16:07:39 INFO - Automation Error: mozprocess timed out after 300 seconds running ['Z:\\task_163966767388828\\build\\venv\\Scripts\\python', 'Z:\\task_163966767388828\\build\\venv\\Lib\\site-packages\\telemetry_harness\\runtests.py', '--binary', 'Z:\\task_163966767388828\\build\\application\\firefox\\firefox.exe', '--address', 'localhost:2828', '--server-root', 'Z:\\task_163966767388828\\build\\tests\\telemetry\\marionette\\harness\\www', '--workspace', 'Z:\\task_163966767388828\\build', '--gecko-log=-', '--log-raw=-', '--log-html', 'Z:\\task_163966767388828\\build\\blobber_upload_dir\\report.html', '--log-xunit', 'Z:\\task_163966767388828\\build\\blobber_upload_dir\\report.xml', '-vv', '--enable-webrender', '--setpref=media.peerconnection.mtransport_process=false', '--setpref=network.process.enabled=false', '--setpref=layers.d3d11.enable-blacklist=false', 'Z:\\task_163966767388828\\build\\tests\\telemetry\\marionette\\tests\\client\\manifest.ini', 'Z:\\task_163966767388828\\build\\tests\\telemetry\\marionette\\tests\\unit\\manifest.ini']
Note that we also have a dedicated Marionette unit test that actually checks for crashes and triggers these via about:crashparent or about:crashcontent:
We only skip it for ASAN builds and when Marionette is not managing the instance. But these two cases don't seem to apply here for the Telemetry client.
Will, could you please try what happens in a CI run when you run the Mn jobs? Does the above test module succeed?
Nevertheless this seems to be an issue with Firefox on Windows. CC'ing Toshi for now, given that he mostly works on improvements for Windows and maybe knows.
Comment 22•2 years ago
Thank you for your help, Henrik!
Assignee
Comment 23•2 years ago
Henrik: I updated the changeset to the current tip and squashed the commits and did a try run. Then I added Mn jobs to it.
https://treeherder.mozilla.org/jobs?repo=try&revision=ceff17d271c63312237df002af287340a754c953
I don't see any crashes or mozcrash output in the logs I looked at, but it looks like the test_crash tests pass:
[task 2021-12-17T13:41:15.814Z] 13:41:15 INFO - TEST-PASS | testing/marionette/harness/marionette_harness/tests/unit/test_crash.py TestCrash.test_crash_chrome_process | took 11865ms
...
[task 2021-12-17T13:47:17.402Z] 13:47:17 INFO - TEST-PASS | testing/marionette/harness/marionette_harness/tests/unit/test_crash.py TestCrash.test_crash_content_process | took 361587ms
...
[task 2021-12-17T13:47:28.650Z] 13:47:28 INFO - TEST-PASS | testing/marionette/harness/marionette_harness/tests/unit/test_crash.py TestCrashInSetUp.test_crash_in_setup | took 11247ms
...
[task 2021-12-17T13:47:40.515Z] 13:47:40 INFO - TEST-PASS | testing/marionette/harness/marionette_harness/tests/unit/test_crash.py TestCrashInTearDown.test_crash_in_teardown | took 11865ms
Did I run what you asked for correctly?
Comment 24•2 years ago
Yes, all the tests are passing, but you won't see any Gecko output because we have to suppress it. If we didn't do that, this output would show the MOZ_CRASH, which triggers Treeherder to think the test is failing. But interestingly we also have quite a long delay in the shutdown of the browser:
https://treeherder.mozilla.org/logviewer?job_id=361639018&repo=try&lineNumber=50895-50902
But that didn't trigger the mozprocess failure. Maybe because we have a longer timeout set. But that's an issue with Firefox when about:crashparent is loaded.
Assignee
Comment 25•2 years ago
Awesome--that worked super! Thank you, Henrik!
I did a try run that increases the timeout to 1000 and now I'm seeing mozcrash output for Linux, macOS, and Windows.
https://treeherder.mozilla.org/jobs?repo=try&revision=64cf0beaba279d96af6c10ff9163dcc923d7f682
Then I removed the symbols.mozilla.org symbols URL, forcing minidump-stackwalk to use the symbols from the build that should be on disk, and none of the stacks are symbolicated on Linux, macOS, or Windows.
https://treeherder.mozilla.org/jobs?repo=try&revision=4b1cf6306b1a19cf3867bd3aa9e392dd00b305fe
Therefore, minidump-stackwalk is symbolicating stacks solely with symbols it's downloading from symbols.mozilla.org.
I claim this is not what we want. What we want is for mozcrash to use the symbols that were created with the build and use symbols.mozilla.org only for system symbols. This will work in situations where symbols were not uploaded to symbols.mozilla.org, and it may speed up the tests a little (downloading the symbols from the build artifacts as one big zip vs. downloading individual symbol files from symbols.mozilla.org and wherever the symbols live: AWS S3 for now, but GCP GCS soon).
However, because mozcrash is currently set up to use symbols.mozilla.org for stackwalking on all platforms and crash stacks are symbolicated, I think this bug is fixed, and it was fixed by my fix to minidump_stackwalk's support for https on Windows. (And then that was all moot-ified by Aria's replacing minidump_stackwalk with the far superior Rust minidump-stackwalk, which also works with https.)
Given that, I'm going to mark this bug as FIXED and open a new bug: "mozcrash should use build symbols and then degrade to symbols.mozilla.org".
Comment 26•2 years ago
Thank you so much for your work on this bug, Will!