Closed Bug 1460433 Opened 7 years ago Closed 6 years ago

Implement failure logging for bootstrap process

Categories

(Firefox :: General, enhancement, P1)

Unspecified
Windows
enhancement

Tracking

()

RESOLVED FIXED
Firefox 67
Tracking Status
firefox67 --- fixed

People

(Reporter: bugzilla, Assigned: bugzilla)

References

(Blocks 1 open bug)

Details

(Whiteboard: inj+)

Attachments

(6 files, 3 obsolete files)

Per https://bugzilla.mozilla.org/show_bug.cgi?id=1430092#c10 We need to make sure that we have a good story for failure logging.
Note that MOZ_CRASH calls in the launcher process should do something useful. We probably need to replace those with something else.
Hey Zibi, for the Firefox launcher process, I'm thinking that we'll want to use the same localization facilities that we're using for the crash reporter client, like https://searchfox.org/mozilla-central/source/toolkit/locales/en-US/crashreporter/crashreporter.ini. If I add a new ini file for translation under a path such as toolkit/locales/en-US/launcher/launcher.ini, what else do I need to do to get this on the radar for translation?
Assignee: nobody → aklotz
Status: NEW → ASSIGNED
Flags: needinfo?(gandalf)
Priority: P3 → P1
Is adding launcher.ini to LOCALIZED_FILES in moz.build sufficient?
Is there a chance you can use Rust library in the launcher process? We'd love to avoid adding another place where we use `ini` files, as we're trying to get rid of them and unify everything to use Fluent.
Flags: needinfo?(gandalf) → needinfo?(aklotz)
Maybe? The problem is that the launcher process needs to be incredibly fast; since it runs during the critical path of browser startup, we cannot afford to initialize libraries until we absolutely need them. The launcher process does not use any facilities from Gecko. OTOH, we don't need to display any messages unless there has been an error of some kind. As long as it is possible to delay initialization of the Rust library (and its dependencies) until we absolutely need it, that would be acceptable. Thoughts?
Flags: needinfo?(aklotz) → needinfo?(gandalf)
Totally. I don't know the infra, but I don't see why rust library would add any overhead over C++ code. In principle, they're equally fast and zero overhead, right? The difference is that we have a rust library for Fluent - https://github.com/projectfluent/fluent-rs - which you could use to format the messages. I don't see a reason why you couldn't initialize it lazily, but again, I'm not experienced with this area. I don't even know how the `ini` parser/runtime works, and my hope is that we can just replace it with the fluent parser/runtime 1-1. We're also planning to migrate Crash Reported, Installer and Profile Manager to fluent-rs, so maybe that's an opportunity to lay the foundation for all of them?
Flags: needinfo?(gandalf)
I spent a lot of time last week learning about the fluent-rs interface and figuring out what needs to be done here. Between that research and an IRC discussion that I had with Pike, I think we need to move in a different direction for the launcher process. In particular, I am increasingly moving away from wanting to display error messages to end users; there is limited utility to doing so and it is unlikely that such messages would be actionable by most users. Given the amount of work that is still outstanding to get Fluent up and running on the standalone binaries, I'm not sure it is worth doing for the purposes of the launcher process. I think that the best route forward is for us to log errors to telemetry, with a potential fallback to writing them into a log of some sort (perhaps the Windows Event Log).
(In reply to Aaron Klotz [:aklotz] from comment #7) > Given the amount of work that is still outstanding to get Fluent up and > running on the standalone binaries, I'm not sure it is worth doing for the > purposes of the launcher process. (ie, I don't think it is worth the launcher process being blocked on l10n)
> Given the amount of work that is still outstanding to get Fluent up and running on the standalone binaries, I'm not sure it is worth doing for the purposes of the launcher process. Can you elaborate on what fluent-rs is missing for that purpose?
Flags: needinfo?(aklotz)
Depends on: 1508460
No longer depends on: 1508460
Flags: needinfo?(aklotz)
Depends on: 1508468
Flags: needinfo?(aklotz)
Depends on: 1511078
Flags: needinfo?(aklotz)

(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #9)

Given the amount of work that is still outstanding to get Fluent up and running on the standalone binaries, I'm not sure it is worth doing for the purposes of the launcher process.

Can you elaborate on what fluent-rs is missing for that purpose?

Sure. The primary issue is that instantiating a FluentBundle requires a list of locales. But the launcher process is the very first thing that is run in Firefox; we don't have prefs, we don't have omnijar, we don't have anything other than the C++ runtime and Win32.

It appears to me that the only way that we could obtain information about the desired locale(s) is to dig it out from somewhere, which currently appears to be difficult when so few services are available to us.

Having said that, concurrently with this discovery, I also concluded that displaying error messages to users is of limited utility anyway, so we're not going that route regardless.

But in case we need to use fluent-rs from the launcher process in the future, being able to determine the desired locale for the Firefox installation without having to dive into the Omnijar (or some other non-trivial method) would be ideal.

Great, thank you!

It seems like a not-that-rare request to get locale info (and locale resources) out of the Firefox package for the external app to use via fluent-rs. @ted mentioned that for the crash reporter as well.

In this particular case, my guess is that you don't actually want Firefox locale - you want the OS locale which you could get out of some winapi.

Okay, back to working on this: I'm going to make the error handler generate a telemetry ping, and then invoke our existing pingsender binary to send it, since it already does everything that I need it to do.

Depends on: 1400344
Attached file Sample ping (obsolete) —

BasicDllServices is used to gain access to the authenticode APIs in non-Gecko
contexts. One feature that WinDllServices provides is the ability to register
a callback interface to be notified when a DLL has been loaded.

This is not particularly useful in the BasicDllServices use case, and in the
"handle a launcher process failure on a background thread" use case, would
actually be harmful.

This patch modifies the DLLServices backend to offer a "basic" option that
omits the callback stuff.

This patch does a few things:

  • Fleshes out the launcher process failure ping;
  • Sends that ping via pingsender;
  • If there is any failure in doing so, we fall back to the Windows event log;
  • Any launcher process failures will result in us falling back to the normal
    startup code path, ensuring that users will still see a browser.

A sample ping will be attached to the bug.

Depends on D19696

Attachment #9045098 - Flags: data-review?(chutten)

Before I get started on Data Collection Review, allow me to put on my Firefox Telemetry Engineer hat.

First allow me to apologize for our C++ ping-sending story not being great. Ultimately one of the future aims of the glean project is to have a cross-platform library with lovely interfaces that'll take care of a lot of this for you. But that's a ways off (Firefox Desktop isn't even the second product we plan to support), so for now it isn't the friendliest.

Now to the tech: if possible and if it makes sense in this case I would recommend having this ping adhere to the Common Ping Format. This'll help with ingestion as quite a few things may depend on, for example, the ping having a unique identifying "type" and "id" and both being present in the outermost part of the JSON body. (Having the "application" structure match is also helpful for channel normalization and other stuff). Of course, this depends on the information being present at the time you're assembling the ping and relevant to your data needs. We can always adapt ingestion to your needs, but it might be faster/more helpful for your analysis requirements to fit it into the same rough shape as other Firefox Telemetry pings.

The example ping is fairly small, and seems like it should only be sent a maximum of once per session which is nice from a bandwidth/budget standpoint. It appears as though there are some structures with potentially-unbounded sizes that could be exploited to make our lives a bit more annoying, though. If possible, consider putting byte limits on the strings and length limits on the arrays to ensure things are sane, and to ensure those persistent haxxors fuzzing our Telemetry ingestion point don't waste too many of our buffers with their SQL injection and buffer overflow attempts. We use 80 bytes for string Scalars, and shorter ones for Telemetry Event strings, if you're looking for ideas.

The considerations of the pipeline aren't exactly my wheelhouse, so consider reaching out to mreid to put you in touch with a pipeline Data Engineer to consult. You'll probably need to talk to them about how to shape your data for analysis anyway :) (ie, do you want this in sql.tmo? Will you need scheduled jobs?)

Comment on attachment 9045098 [details] Data review request for launcher failure ping This is data-review- for a couple of reasons. 1) We're going to need documentation (like [this](https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/data/downgrade-ping.html)) for the ping to describe the collection. 2) The path of the build machine is Category 4 data, as anyone's allowed to build Firefox and user names can be considered personal. Think of a Linux distro volunteer building Firefox at home and packaging it for everyone. There do exist mechanisms (which I think you helped write and review) for sanitizing path names to only include Cat 1 data (mostly by ensuring that we only include subfolders (or, for externally-located files, only the file name)). Would your analysis requirements be satisfied by sanitized paths? 3) The opt-out must be available to Nightly users as well. It sounds as though the launcher process is enabled for all Nightly users irrespective of their study/data choices prefs so we'll need some signal to disable data reporting if the pref's off. (Or maybe I'm misunderstanding the precedence) Let me know if you have any questions. You know where to find me : )
Attachment #9045098 - Flags: data-review?(chutten) → data-review-
Depends on: 1529429

(In reply to Chris H-C :chutten from comment #19)

Comment on attachment 9045098 [details]
Data review request for launcher failure ping

This is data-review- for a couple of reasons.

  1. We're going to need documentation (like
    [this](https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/
    telemetry/data/downgrade-ping.html)) for the ping to describe the collection.

OK, will do.

  1. The path of the build machine is Category 4 data, as anyone's allowed to
    build Firefox and user names can be considered personal. Think of a Linux
    distro volunteer building Firefox at home and packaging it for everyone.

Makes sense.

There do exist mechanisms (which I think you helped write and review) for
sanitizing path names to only include Cat 1 data (mostly by ensuring that we
only include subfolders (or, for externally-located files, only the file
name)).

I can't use those functions in the launcher process because they contain Gecko dependencies, but I can write new ones.

Would your analysis requirements be satisfied by sanitized paths?

The leaf name of the file is probably sufficient. I'm adding that to a subsequent revision.

  1. The opt-out must be available to Nightly users as well. It sounds as
    though the launcher process is enabled for all Nightly users irrespective of
    their study/data choices prefs so we'll need some signal to disable data
    reporting if the pref's off. (Or maybe I'm misunderstanding the precedence)

I filed (and am in the process of landing) dependent bug 1529429 that will tie this to the SHIELD enable pref. I'll update the data review request accordingly.

(In reply to Chris H-C :chutten from comment #18)

Now to the tech: if possible and if it makes sense in this case I would recommend having this ping adhere to the Common Ping Format. This'll help with ingestion as quite a few things may depend on, for example, the ping having a unique identifying "type" and "id" and both being present in the outermost part of the JSON body. (Having the "application" structure match is also helpful for channel normalization and other stuff). Of course, this depends on the information being present at the time you're assembling the ping and relevant to your data needs. We can always adapt ingestion to your needs, but it might be faster/more helpful for your analysis requirements to fit it into the same rough shape as other Firefox Telemetry pings.

Unfortunately some of the stuff in the common ping format is not usable or sensible in the case of the launcher process. Having said that, I did see some value in some of the fields that were included in that format, so I have modified both the patch and the schema to add those additional fields.

Updated schema for reference purposes

Attachment #9044264 - Attachment is obsolete: true
Attached file Sample ping (r2)

Updating the sample ping. (This is a multi-line sample because it was generated by a debug build. Release builds generate single-line JSON)

Attachment #9043692 - Attachment is obsolete: true
Pushed by aklotz@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0291384ba660 Part 1 - Change BasicDllServices to not do any dll notification callbacks; r=mhowell

All of your concerns should be handled now. The data review request has been updated accordingly.

(In reply to Chris H-C :chutten from comment #19)

Comment on attachment 9045098 [details]
Data review request for launcher failure ping

This is data-review- for a couple of reasons.

  1. We're going to need documentation (like
    [this](https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/
    telemetry/data/downgrade-ping.html)) for the ping to describe the collection.

Okay, done with reluctance - I have added it as a Part 3 to this patch. However, I really think that this is a waste of developer time and should be automatically derived from the schema. mreid has filed an issue on GitHub for looking into this.

  1. The path of the build machine is Category 4 data, as anyone's allowed to
    build Firefox and user names can be considered personal. Think of a Linux
    distro volunteer building Firefox at home and packaging it for everyone.
    There do exist mechanisms (which I think you helped write and review) for
    sanitizing path names to only include Cat 1 data (mostly by ensuring that we
    only include subfolders (or, for externally-located files, only the file
    name)). Would your analysis requirements be satisfied by sanitized paths?

I am now just including the leaf name.

  1. The opt-out must be available to Nightly users as well. It sounds as
    though the launcher process is enabled for all Nightly users irrespective of
    their study/data choices prefs so we'll need some signal to disable data
    reporting if the pref's off. (Or maybe I'm misunderstanding the precedence)

Okay, in bug 1529429 I tied launcher process to the SHIELD pref.

Attachment #9045098 - Attachment is obsolete: true
Attachment #9046036 - Flags: data-review?(chutten)

(In reply to Aaron Klotz [:aklotz] from comment #26)

(In reply to Chris H-C :chutten from comment #19)

Comment on attachment 9045098 [details]
Data review request for launcher failure ping

This is data-review- for a couple of reasons.

  1. We're going to need documentation (like
    [this](https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/
    telemetry/data/downgrade-ping.html)) for the ping to describe the collection.

Okay, done with reluctance - I have added it as a Part 3 to this patch. However, I really think that this is a waste of developer time and should be automatically derived from the schema. mreid has filed an issue on GitHub for looking into this.

It's the price we pay for data collection. Telemetry tries to make this easy for default types via the "description" field, and I'm glad to hear :mreid's looking into making it similarly easy to generate from schemas.

Comment on attachment 9046036 [details] Data review request for launcher failure ping (r2) DATA COLLECTION REVIEW RESPONSE: Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate? Yes. This collection is documented in the in tree docs and will live here: https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/data/launcher-process-failure.html Is there a control mechanism that allows the user to turn the data collection on and off? Yes. This is presently tied to the Studies preference which is subordinate to Firefox's Data Choices preference. If the request is for permanent data collection, is there someone who will monitor the data over time? Yes. :aklotz is responsible. Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under? Category 1, Technical. Is the data collection request for default-on or default-off? Default-on for all channels. Does the instrumentation include the addition of any new identifiers? No. (It does include the names of antivirus, antispyware, and firewall products installed on the host machine, but those are product names) Is the data collection covered by the existing Firefox privacy notice? Yes. Does there need to be a check-in in the future to determine whether to renew the data? No. This collection is permanent. --- Result: datareview+
Attachment #9046036 - Flags: data-review?(chutten) → data-review+
Pushed by aklotz@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/7733f36c6e38 Part 2 - Log launcher process failures to a telemetry ping; r=mhowell
Pushed by aklotz@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9105b6f8d62c Part 3 - Documentation for the launcher-process-failure ping; r=chutten

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=busted&selectedJob=230350963&revision=7733f36c6e38cf684cf612e81f6b89cad4df3566

Failure log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=230354802&repo=autoland&lineNumber=29458
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=230354785&repo=autoland&lineNumber=33407

Backout link: https://hg.mozilla.org/integration/autoland/rev/73f63e74eaa1b62e0d702c21c1284757e51d6c08

18:33:45 INFO - z:/build/build/src/browser/app/winlauncher/ErrorHandler.cpp(667,27): error: unused function 'SendPingThread' [-Werror,-Wunused-function]
18:33:45 INFO - static unsigned __stdcall SendPingThread(void* aContext) {
18:33:45 INFO - ^
18:33:45 INFO - 1 error generated.
18:33:45 INFO - z:/build/build/src/config/rules.mk:805: recipe for target 'Unified_cpp_app_winlauncher0.obj' failed
18:33:45 INFO - mozmake.EXE[4]: *** [Unified_cpp_app_winlauncher0.obj] Error 1
18:33:45 INFO - mozmake.EXE[4]: Leaving directory 'z:/build/build/src/obj-firefox/browser/app/winlauncher'
18:33:45 INFO - z:/build/build/src/config/recurse.mk:74: recipe for target 'browser/app/winlauncher/target' failed
18:33:45 INFO - mozmake.EXE[3]: *** [browser/app/winlauncher/target] Error 2
18:33:45 INFO - mozmake.EXE[3]: *** Waiting for unfinished jobs....
18:33:45 INFO - mozmake.EXE[4]: Entering directory 'z:/build/build/src/obj-firefox/mfbt/tests'
18:33:45 INFO - z:/build/build/src/sccache2/sccache.exe z:/build/build/src/clang/bin/clang.exe --driver-mode=cl -FoTestAlgorithm.obj -c -DDEBUG=1 -DIMPL_MFBT -Iz:/build/build/src/mfbt/tests -Iz:/build/build/src/obj-firefox/mfbt/tests -Iz:/build/build/src/obj-firefox/dist/include -Iz:/build/build/src/obj-firefox/dist/include/testing -Iz:/build/build/src/obj-firefox/dist/include/nspr -Iz:/build/build/src/obj-firefox/dist/include/nss -MD -FI z:/build/build/src/obj-firefox/mozilla-config.h -DMOZILLA_CLIENT -Qunused-arguments -guard:cf -U_FORTIFY_SOURCE -Xclang -fno-common -Qunused-arguments -fsanitize=address -fsanitize-blacklist=z:/build/build/src/build/sanitizers/asan_blacklist_win.txt -fcrash-diagnostics-dir=z:/build/public/build -fcrash-diagnostics-dir=/z/build/public/build -TP -nologo -w15038 -wd5026 -wd5027 -Zc:sizedDealloc- -wd4091 -wd4577 -D_HAS_EXCEPTIONS=0 -guard:cf -U_FORTIFY_SOURCE -Xclang -fno-common -W3 -Gy -Zc:inline -wd4251 -wd4244 -wd4267 -wd4800 -wd4595 -wd4065 -Wno-inline-new-delete -Wno-invalid-offsetof -Wno-microsoft-enum-value -Wno-microsoft-include -Wno-unknown-pragmas -Wno-ignored-pragmas -Wno-deprecated-declarations -Wno-invalid-noreturn -Wno-inconsistent-missing-override -Wno-implicit-exception-spec-mismatch -Wno-unused-local-typedef -Wno-ignored-attributes -Wno-used-but-marked-unused -we4553 -D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING -GR- -Z7 -Xclang -load -Xclang z:/build/build/src/obj-firefox/build/clang-plugin/clang-plugin.dll -Xclang -add-plugin -Xclang moz-check -O1 -Oy- -WX -wd4275 -wd4530 -Xclang -MP -Xclang -dependency-file -Xclang .deps/TestAlgorithm.obj.pp -Xclang -MT -Xclang TestAlgorithm.obj -FdTestAlgorithm.pdb z:/build/build/src/mfbt/tests/TestAlgorithm.cpp
18:33:45 INFO - mozmake.EXE[4]: Leaving directory 'z:/build/build/src/obj-firefox/mfbt/tests'
18:33:45 INFO - mozmake.EXE[4]: Entering directory 'z:/build/build/src/obj-firefox/mfbt/tests'

[task 2019-02-25T18:11:42.122Z] 18:11:42 INFO - /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/interceptor/VMSharingPolicies.h:141:12: note: remove std::move call here
[task 2019-02-25T18:11:42.123Z] 18:11:42 INFO - return std::move(items);
[task 2019-02-25T18:11:42.123Z] 18:11:42 INFO - ^~~~~~~~~~ ~
[task 2019-02-25T18:11:42.123Z] 18:11:42 INFO - In file included from /builds/worker/workspace/build/src/obj-firefox/browser/app/winlauncher/Unified_cpp_app_winlauncher0.cpp:11:
[task 2019-02-25T18:11:42.123Z] 18:11:42 ERROR - /builds/worker/workspace/build/src/browser/app/winlauncher/ErrorHandler.cpp:162:52: error: unknown type name '_bstr_t'
[task 2019-02-25T18:11:42.123Z] 18:11:42 INFO - static mozilla::UniquePtr<char[]> WideToUTF8(const _bstr_t& aStr) {
[task 2019-02-25T18:11:42.123Z] 18:11:42 INFO - ^
[task 2019-02-25T18:11:42.123Z] 18:11:42 ERROR - /builds/worker/workspace/build/src/browser/app/winlauncher/ErrorHandler.cpp:325:9: error: no matching function for call to '_wcslwr_s'
[task 2019-02-25T18:11:42.123Z] 18:11:42 INFO - if (_wcslwr_s(leaf)) {
[task 2019-02-25T18:11:42.123Z] 18:11:42 INFO - ^~~~~~~~~

Flags: needinfo?(aklotz)
Flags: needinfo?(aklotz)
Pushed by aklotz@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/7f034553129d Part 2 - Log launcher process failures to a telemetry ping; r=mhowell
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 67
Depends on: 1530700
Depends on: 1531030
Depends on: 1531168
Depends on: 1532068
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: