Closed Bug 1594515 Opened 4 years ago Closed 2 years ago

crashes in telemetry marionette tests should symbolicate minidumps

Categories: Toolkit :: Telemetry, enhancement, P1
Status: RESOLVED FIXED
Target Milestone: 96 Branch
Tracking Status: firefox96 --- affected
People: (Reporter: froydnj, Assigned: willkg)
Attachments: (1 file)

Otherwise, C/C++/Rust crashes in these tests are very hard to debug.

(It's possible this is a more general marionette test harness problem, in which case feel free to move this bug around.)

Assignee: nobody → rpierzina
Status: NEW → ASSIGNED

The priority flag is not set for this bug.
:chutten, could you have a look please?

For more information, please visit the auto_nag documentation.

Flags: needinfo?(chutten)
Priority: -- → P1

:raphael is assigned and working on it.

Flags: needinfo?(chutten)

I have been looking into this and asked folks for advice. At this point, I don't know exactly what needs to be done to get the minidumps symbolicated for the marionette/mozharness-based telemetry-tests-client suite. The build job seems to upload symbols as a zip archive and the test job downloads that archive, but as far as I can tell it doesn't add the symbols correctly.

Does the harness know where to find the symbols that the test job downloads? Does the test job even invoke minidump_stackwalk, or is minidump_stackwalk even present for the test job? Does the harness point minidump_stackwalk at the symbols?

Severity: normal → S4
Type: defect → enhancement

Is this still a P1?

Blocks: 1642164
Blocks: 1644300
Blocks: 1644860
Blocks: 1646285
Blocks: 1650282
No longer blocks: 1650282
Blocks: 1573767

Did this get fixed? This task from yesterday looks symbolicated (mingw is being tracked in a different bug as far as I know).

Flags: needinfo?(rpierzina)

I'm not aware of any patches to resolve this issue. Perhaps something changed in the harness we depend on for these tests.

Chutten, do you know more about this?

Flags: needinfo?(rpierzina) → needinfo?(chutten)

It's news to me!

Flags: needinfo?(chutten)

This is still an issue. Could this get back on the roadmap? It's one of the integration tests, and those can catch real-world issues; for example, they reported the Windows failures today, but without symbolication the error message matched the one for intermittent failures.

Hi :aryx, I wasn't able to resolve this issue back when I looked into it and currently don't have the bandwidth to investigate further.

I have brought this up in a team meeting and I am actively looking for someone who can help with this.

Assignee: rpierzina → nobody
Status: ASSIGNED → NEW

I spent some time looking into this. I haven't looked at any of this before, so if I appear clueless, it's because I probably am!

The log for the failed job in comment #10 is here:

https://firefox-ci-tc.services.mozilla.com/tasks/aJ_GYErDRF-u7-Q5C3Ymaw/runs/0/logs/public/logs/live.log

The line where it logged running minidump_stackwalk is here:

https://firefox-ci-tc.services.mozilla.com/tasks/aJ_GYErDRF-u7-Q5C3Ymaw/runs/0/logs/public/logs/live.log#L1897

[task 2021-10-27T11:17:54.602Z] 11:17:54     INFO - mozcrash Copy/paste: Z:/task_163533189718508/fetches\minidump_stackwalk\minidump_stackwalk.exe Z:\task_163533189718508\build\tmpmeh_w7gm.mozrunner\minidumps\f69c1f39-1be2-4f8c-b021-0b235e4af1f1.dmp C:\Users\task_163533189718508\AppData\Local\Temp\tmp8uqptln_ https://symbols.mozilla.org/

Then this line is from the modules output from minidump_stackwalk:

[task 2021-10-27T11:17:59.568Z] 11:17:59     INFO - 0x7ff815ca0000 - 0x7ff81cdd4fff  xul.dll  95.0.0.7970  (WARNING: No symbols, xul.pdb, 2E2202753857E7524C4C44205044422E1)

That line says it can't find the symbols for xul.pdb 2E2202753857E7524C4C44205044422E1.

I checked https://symbols.mozilla.org/ for that file and it's in the bucket (you may not have permissions to this page):

https://symbols.mozilla.org/uploads/files/file/27964530

That file was uploaded at 2021-10-27T11:12:25.200Z. That's 5 minutes before the timestamp in the log line from the taskcluster job where it printed the line for how it was running minidump_stackwalk.

minidump_stackwalk on Windows has this usage:

Usage: minidump_stackwalk.exe [-m|-s] <minidump-file> [<symbol server URL> | <symbol-path>]+

Given all that, the answers to Nathan's questions in comment #5 are these:

  1. Does the harness know where to find the symbols that the test job downloads? Yes.
  2. Does the test job even invoke minidump_stackwalk, or is minidump_stackwalk even present for the test job? Yes it's invoked and yes it's present.
  3. Does the harness point minidump_stackwalk at the symbols? Yes.
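To make the moving pieces concrete, the "Copy/paste:" log line above boils down to an argv like the following. This is a minimal Python sketch, not mozcrash's actual code; the function name and the example paths are hypothetical:

```python
def build_stackwalk_command(stackwalk_bin, dump_path, symbols_path, symbols_url):
    """Mirror the invocation mozcrash logs as a "Copy/paste:" line:
    <binary> <minidump-file> <local-symbol-path> <symbol-server-URL>.
    Per the usage string above, local symbol paths and symbol server
    URLs can both be passed, and more than one of each is allowed."""
    return [stackwalk_bin, dump_path, symbols_path, symbols_url]

cmd = build_stackwalk_command(
    "minidump_stackwalk.exe",
    "f69c1f39-1be2-4f8c-b021-0b235e4af1f1.dmp",
    r"C:\Temp\symbols",
    "https://symbols.mozilla.org/",
)
print(" ".join(cmd))
```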

I downloaded minidump_stackwalk and the dmp file and ran it on my Windows 11 machine:

minidump_stackwalk.exe f69c1f39-1be2-4f8c-b021-0b235e4af1f1.dmp . https://symbols.mozilla.org/

The error output that http_symbol_supplier.cc is spitting out is this:

2021-11-08 16:27:40: http_symbol_supplier.cc:507: INFO: HTTPSymbolSupplier: querying https://symbols.mozilla.org/xul.pdb/2E2202753857E7524C4C44205044422E1/xul.sym
2021-11-08 16:27:40: http_symbol_supplier.cc:278: INFO: HTTPSymbolSupplier: HttpSendRequest: Error: 12152

I'm not sure what error 12152 is. I've also never run minidump_stackwalk on Windows 11 before.

Is that a bug in minidump_stackwalk? Is it the same issue that the Telemetry Test Client job in TaskCluster is hitting? Is it a problem with how I ran it locally? Is it a case of the Mondays?

Gabriele: Do you think this is a bug in minidump_stackwalk?

Flags: needinfo?(gsvelto)

Looks like a bug in minidump_stackwalk. If I try this on my machine:

minidump_stackwalk.exe f69c1f39-1be2-4f8c-b021-0b235e4af1f1.dmp . https://symbols.mozilla.org/

It fails, and I get errors of this type in the log:

2021-11-08 22:48:56: simple_symbol_supplier.cc:196: INFO: No symbol file at ./xul.pdb/2E2202753857E7524C4C44205044422E1/xul.sym
2021-11-08 22:48:56: http_symbol_supplier.cc:507: INFO: HTTPSymbolSupplier: querying https://symbols.mozilla.org/xul.pdb/2E2202753857E7524C4C44205044422E1/xul.sym
2021-11-08 22:49:00: http_symbol_supplier.cc:424: INFO: HTTPSymbolSupplier: symbol exists, saving to ./xul.pdb/2E2202753857E7524C4C44205044422E1/xul.sym
2021-11-08 22:49:00: http_symbol_supplier.cc:444: INFO: HTTPSymbolSupplier: failed to rename file, errno=18

However if I try with an absolute path:

minidump_stackwalk.exe f69c1f39-1be2-4f8c-b021-0b235e4af1f1.dmp /tmp https://symbols.mozilla.org/

Boom! It works! I haven't dug too deep but my gut feeling is that specifying '.' as the destination dir for the symbols triggers a bug where the file is getting downloaded and renamed in-place which leads to the error.
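One way that gut feeling plays out: if the downloaded temp file and the destination end up on different filesystems (or the rename targets the file itself), the rename fails; errno 18 is EXDEV ("cross-device link") on POSIX. A hedged Python sketch of the pattern that avoids both problems — create the temp file in the destination directory, then replace. This is illustrative only, not the actual http_symbol_supplier.cc code:

```python
import os
import tempfile

def save_symbol_file(dest_path, data):
    """Write downloaded symbol data, then move it into place.
    Creating the temp file in the destination directory guarantees the
    final rename stays on one filesystem, so it can't fail with
    errno 18 (EXDEV) the way a cross-device rename can."""
    dest_dir = os.path.dirname(dest_path) or "."
    os.makedirs(dest_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, dest_path)  # atomic on the same filesystem
    finally:
        if os.path.exists(tmp_path):  # only if the replace didn't happen
            os.remove(tmp_path)
```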

Flags: needinfo?(gsvelto)

Oh, I didn't realize you had to pass in a temp directory. I changed the run line to this:

minidump_stackwalk.exe f69c1f39-1be2-4f8c-b021-0b235e4af1f1.dmp c:\Users\willk\AppData\Local\Temp https://symbols.mozilla.org/

I still get the same error.

An internet search suggests error 12152 on Windows is from not passing INTERNET_FLAG_SECURE in HttpOpenRequestA, which is this line:

https://searchfox.org/mozilla-central/source/tools/crashreporter/minidump_stackwalk/http_symbol_supplier.cc#266

I don't have a build environment on my Windows 11 machine, but I could try a fix, do a try build, and see if that fixes anything.

Assignee: nobody → willkg
Status: NEW → ASSIGNED

The function in question has these docs:

https://docs.microsoft.com/en-us/windows/win32/api/wininet/nf-wininet-httpopenrequesta

The fix here is to check the scheme and pass in the INTERNET_FLAG_SECURE flag accordingly.

When I pull down minidump_stackwalk.exe with that fix, then I can run that on the dump in question with both http://symbols.mozilla.org/ (HTTP) and https://symbols.mozilla.org/ (HTTPS) and both work.
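For the record, the logic of that fix can be sketched in Python. INTERNET_FLAG_SECURE's value is from wininet.h; request_flags is a hypothetical name standing in for the flags computation in the HttpOpenRequestA call, not code from the patch:

```python
from urllib.parse import urlsplit

# INTERNET_FLAG_SECURE from wininet.h: request uses TLS.
INTERNET_FLAG_SECURE = 0x00800000

def request_flags(url, base_flags=0):
    """Sketch of the fix: add INTERNET_FLAG_SECURE only for https://
    symbol server URLs, so the request negotiates TLS instead of
    failing the way the 12152 error above did. http:// URLs keep
    working unchanged."""
    flags = base_flags
    if urlsplit(url).scheme == "https":
        flags |= INTERNET_FLAG_SECURE
    return flags

assert request_flags("https://symbols.mozilla.org/") & INTERNET_FLAG_SECURE
assert not request_flags("http://symbols.mozilla.org/") & INTERNET_FLAG_SECURE
```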

I think this is the right fix, but I'm not positive. Maybe there are other problems? Maybe the test harness is doing something subtly different? I'm not sure how to trigger the problematic situation where Firefox crashes during a telemetry-tests-client run to verify the issue.

Also, this fix is Windows-specific. Everything I've seen so far suggests this is only a problem on Windows. If there are logs that show this problem on other platforms, I'd love to see them.

Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/205695ca22d2
Pass INTERNET_FLAG_SECURE when using https symbol supplier. r=gsvelto
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 96 Branch

I'm reopening this. The fix I did fixes an issue, but I'm not sure if there are additional issues. I need to do a verification pass with the steps Raphael walked me through.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

When I left this, I had fixed an issue with minidump_stackwalk (with an underscore) where it couldn't make HTTPS connections when run on Windows so it couldn't talk to symbols.mozilla.org.

Then Aria replaced minidump_stackwalk (with an underscore) with minidump-stackwalk (with a hyphen), which is the Rust rewrite that we're now using on Socorro (hooray!).

Prior to October 2021, mozcrash ran minidump(-|_)stackwalk such that it only looked at the symbols on disk that were generated during the build. After October 2021, it also looked up symbols on symbols.mozilla.org. The problem here is that we don't want mozcrash to rely on symbols.mozilla.org for symbols that were in the build, because that will be slower, more error-prone, and incur egress costs.

Thus I kept this bug open in order to verify two things:

  1. mozcrash is running minidump-stackwalk and successfully symbolicating crash stacks on all platforms when telemetry test client tests fail because Firefox crashed
  2. the symbols for the build are on disk and minidump-stackwalk prefers them over symbols it's fetching from symbols.mozilla.org
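The precedence item 2 asks for can be sketched in Python (find_symbol_file is a hypothetical helper; minidump-stackwalk's real lookup lives in the Rust tool): check each local symbol directory for the <debug-file>/<debug-id>/<name>.sym layout first, and only fall back to the symbol server when nothing is on disk.

```python
import os

def find_symbol_file(debug_file, debug_id, local_dirs, symbols_url):
    """Prefer build symbols on disk over the symbol server.
    Symbol files live at <dir>/<debug_file>/<debug_id>/<name>.sym,
    e.g. xul.pdb/2E2202753857E7524C4C44205044422E1/xul.sym, matching
    the server URLs seen in the logs above."""
    sym_name = os.path.splitext(debug_file)[0] + ".sym"
    rel = os.path.join(debug_file, debug_id, sym_name)
    for d in local_dirs:
        candidate = os.path.join(d, rel)
        if os.path.isfile(candidate):
            return candidate  # build symbols win
    # Fall back to the symbol server's layout for anything not on disk
    # (e.g. system library symbols).
    return "/".join([symbols_url.rstrip("/"), debug_file, debug_id, sym_name])
```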

In order to do that, I need to be able to crash Firefox during a telemetry test client run.

I can do that by adding a navigation to about:crashparent to one of the telemetry-tests-client tests. A resulting try run of that shows mozcrash running and minidump-stackwalk producing human-readable stacks for macOS and Linux.

However, when telemetry test client tests run on Windows and Firefox crashes in the way I'm doing it, mozprocess times out waiting for interprocess communication, mozcrash never runs, and minidump-stackwalk never walks the stack of the crash dump.

Here's my latest attempt:

https://treeherder.mozilla.org/jobs?repo=try&revision=397a678efb16570f81c52d428794b0df1ae5b84e&selectedTaskRun=Z2iDxdHRRkCkehfQTdLESQ.0

Am I crashing Firefox wrong such that it doesn't work right on Windows? Is there a better way to do this? Am I hitting some bug in marionette or something similar?

Tagging :whimboo because Raphael said that was a good idea.

Flags: needinfo?(hskupin)

That is interesting, and I wouldn't expect that behavior on Windows. A debug build also doesn't show much more besides the following output in the log:

https://treeherder.mozilla.org/logviewer?job_id=361505532&repo=try&lineNumber=2810-2855

[task 2021-12-16T16:02:39.586Z] 16:02:39     INFO -  [Child 2528, IPC I/O Child] WARNING: [17B17B39CE863D7B.464D727C04E73B32]: Ignoring message 'EVENT_MESSAGE' to peer 1.1 due to a missing broker: file /builds/worker/checkouts/gecko/ipc/glue/NodeController.cpp:346
[task 2021-12-16T16:02:39.586Z] 16:02:39     INFO -  Exiting due to channel error.
[task 2021-12-16T16:07:39.607Z] 16:07:39     INFO - Automation Error: mozprocess timed out after 300 seconds running ['Z:\\task_163966767388828\\build\\venv\\Scripts\\python', 'Z:\\task_163966767388828\\build\\venv\\Lib\\site-packages\\telemetry_harness\\runtests.py', '--binary', 'Z:\\task_163966767388828\\build\\application\\firefox\\firefox.exe', '--address', 'localhost:2828', '--server-root', 'Z:\\task_163966767388828\\build\\tests\\telemetry\\marionette\\harness\\www', '--workspace', 'Z:\\task_163966767388828\\build', '--gecko-log=-', '--log-raw=-', '--log-html', 'Z:\\task_163966767388828\\build\\blobber_upload_dir\\report.html', '--log-xunit', 'Z:\\task_163966767388828\\build\\blobber_upload_dir\\report.xml', '-vv', '--enable-webrender', '--setpref=media.peerconnection.mtransport_process=false', '--setpref=network.process.enabled=false', '--setpref=layers.d3d11.enable-blacklist=false', 'Z:\\task_163966767388828\\build\\tests\\telemetry\\marionette\\tests\\client\\manifest.ini', 'Z:\\task_163966767388828\\build\\tests\\telemetry\\marionette\\tests\\unit\\manifest.ini']

Note that we also have a dedicated Marionette unit test that actually checks for crashes and triggers these via about:crashparent or about:crashcontent:

https://searchfox.org/mozilla-central/source/testing/marionette/harness/marionette_harness/tests/unit/test_crash.py

We only skip it for ASAN builds and when Marionette is not managing the instance. But these two cases don't seem to apply here for the Telemetry client.

Will, could you please try what happens in a CI run when you run the Mn jobs? Does the above test module succeed?

Nevertheless this seems to be an issue with Firefox on Windows. CC'ing Toshi for now, given that he mostly works on improvements for Windows and may know more.

Flags: needinfo?(hskupin) → needinfo?(willkg)

Thank you for your help, Henrik!

Henrik: I updated the changeset to the current tip and squashed the commits and did a try run. Then I added Mn jobs to it.

https://treeherder.mozilla.org/jobs?repo=try&revision=ceff17d271c63312237df002af287340a754c953

I don't see any crashes or mozcrash output in the logs I looked at, but it looks like test_crash tests pass:

[task 2021-12-17T13:41:15.814Z] 13:41:15     INFO -  TEST-PASS | testing/marionette/harness/marionette_harness/tests/unit/test_crash.py TestCrash.test_crash_chrome_process | took 11865ms
...
[task 2021-12-17T13:47:17.402Z] 13:47:17     INFO -  TEST-PASS | testing/marionette/harness/marionette_harness/tests/unit/test_crash.py TestCrash.test_crash_content_process | took 361587ms
...
[task 2021-12-17T13:47:28.650Z] 13:47:28     INFO -  TEST-PASS | testing/marionette/harness/marionette_harness/tests/unit/test_crash.py TestCrashInSetUp.test_crash_in_setup | took 11247ms
...
[task 2021-12-17T13:47:40.515Z] 13:47:40     INFO -  TEST-PASS | testing/marionette/harness/marionette_harness/tests/unit/test_crash.py TestCrashInTearDown.test_crash_in_teardown | took 11865ms

Did I run what you asked for correctly?

Flags: needinfo?(willkg) → needinfo?(hskupin)

Yes, all the tests are passing, but you won't see any Gecko output because we have to suppress it. If we didn't, the output would show the MOZ_CRASH, which triggers Treeherder to think the test is failing. But interestingly, we also have quite a long delay in the shutdown of the browser:

https://treeherder.mozilla.org/logviewer?job_id=361639018&repo=try&lineNumber=50895-50902

But that didn't trigger the mozprocess failure, maybe because we have a longer timeout set. Either way, that's an issue with Firefox when about:crashparent is loaded.

Flags: needinfo?(hskupin)

Awesome--that worked super! Thank you, Henrik!

I did a try run that increases the timeout to 1000 and now I'm seeing mozcrash output for Linux, macOS, and Windows.

https://treeherder.mozilla.org/jobs?repo=try&revision=64cf0beaba279d96af6c10ff9163dcc923d7f682

Then I removed the symbols.mozilla.org symbols-url, forcing minidump-stackwalk to use the built symbols that should be on disk, and none of the stacks are symbolicated on Linux, macOS, or Windows.

https://treeherder.mozilla.org/jobs?repo=try&revision=4b1cf6306b1a19cf3867bd3aa9e392dd00b305fe

Therefore, minidump-stackwalk is symbolicating stacks solely with symbols it's downloading from symbols.mozilla.org.

I claim this is not what we want. What we want is for mozcrash to use the symbols that were created with the build and to use symbols.mozilla.org only for system symbols. That will work in situations where symbols were not uploaded to symbols.mozilla.org, and it may speed up the tests a little (downloading the build's symbols as one big zip artifact vs. downloading individual symbol files from symbols.mozilla.org and wherever the symbols live: AWS S3 for now, but GCP GCS soon).

However, because mozcrash is currently set up to use symbols.mozilla.org for stackwalking on all platforms and crash stacks are symbolicated, I think this bug is fixed, and it was fixed by my fix to minidump_stackwalk's https support on Windows. (And then that was all moot-ified by Aria's replacing it with the far superior Rust minidump-stackwalk, which also works with https.)

Given that, I'm going to mark this bug as FIXED and open a new bug for making mozcrash use build symbols first and fall back to symbols.mozilla.org.

Status: REOPENED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED

Thank you so much for your work on this bug, Will!

Blocks: 1704344
