Closed Bug 1608971 Opened 2 years ago Closed 2 months ago

end accepting crash reports for Thunderbird (November 8th, 2021)

Categories

(Socorro :: General, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

This is a followup bug for bug #1604848 which added a throttling rule to reject all incoming crashes for Thunderbird major version > 68.

This bug covers the rest of that roadmap towards ending support for Thunderbird in Socorro.

Current plan:

August 2020:

  • If Thunderbird has had a release with a major version > 68 by then, we'll continue to accept crash reports for 68 and lower. We do not plan to forward crash reports to another system.
  • If Thunderbird has not had a release with a major version > 68, then we'll move Thunderbird to the unsupported products list and reject all crash reports from that point going forward.

September 2020:

  • If we haven't added Thunderbird to the unsupported products list by this point, we will do so.
  • We'll remove Thunderbird-specific code from the rest of Socorro and remove crash report data from storage systems.
Depends on: 1604848
Priority: -- → P1
Summary: end support for Thunderbird (January 2021) → end accepting crash reports for Thunderbird (January 2021)
Summary: end accepting crash reports for Thunderbird (January 2021) → end accepting crash reports for Thunderbird (September 2020)
Depends on: 1624033

A bunch of things have happened since we set the original plan.

The new plan is as follows:

March 2020:

  • Remove existing collector throttle rules from previous plan (bug #1621033)
  • Work out new schedule with Thunderbird project

October 2020:

  • Check in with Thunderbird project on progress towards switching crash ingestion systems.

July 31st, 2021:

  • End support for Thunderbird project in Socorro.
  • Remove Thunderbird as a supported project from collector.

January 2022:

  • Any existing Thunderbird crash report data expires from Socorro.
  • Remove Thunderbird as a supported project from processor and webapp.
Summary: end accepting crash reports for Thunderbird (September 2020) → end accepting crash reports for Thunderbird (July 31st, 2021)

How is this going? Does the Thunderbird project have a plan for switching crash report ingestion, yet?

Flags: needinfo?(vseerror)
Flags: needinfo?(philipp)
Flags: needinfo?(mkmelin+mozilla)

We've been in a holding pattern, though there was some initial investigation of alternatives. Also see bug 1628329.

I'd still be very much interested in discussing the possibility of the moco team running a second Thunderbird-specific instance of Socorro under contract. The Thunderbird infra team also grew, so perhaps now there is a possibility Thunderbird could run its own clone too. But I don't think we understand enough of what that would take to evaluate the feasibility. Maybe we should set up a meeting to discuss? Is the system documented somewhere publicly?

Blocks: 1628329
Flags: needinfo?(vseerror)
Flags: needinfo?(philipp)
Flags: needinfo?(mkmelin+mozilla)

The options haven't really changed in the several years we've been talking about this.

Socorro is pretty MoCo-specific. While someone could spend time and effort making it more general and easier to run, I'm not planning to do more work beyond what I did late 2019 and early 2020. Further, in 2021, I'm planning to rebuild Socorro to be more like Telemetry architecturally and infrastructurally. You will probably need to fork Socorro if you want to keep running it.

I suggest going the hosted-service route. There are several hosted options listed at:

https://github.com/mozilla-services/socorro/#support

I'll talk to my manager about whether a MoCo team would run a Thunderbird-specific Socorro under contract.

If you are rebuilding Socorro for Firefox, then as long as the code lives in toolkit there is a high chance that Thunderbird would need little to no changes to continue using it, and hence would not require any special casing on the server side, aside maybe from the server supporting more than one product. I'm convinced we can find a solution that is low to zero effort for MoCo while avoiding major effort for MZLA, and I would be happy to meet. Looking forward to hearing back once you've had a chance to speak to your manager.

We're not talking about the crash reporter which has its code in toolkit. We're talking about the crash ingestion system. Currently Thunderbird sends crash reports to a system called Socorro that MoCo runs. This bug is about how we're not going to continue accepting Thunderbird crash reports and y'alls need to find a new place to send them.

I talked to my manager. We don't have the time or resources to run another instance of Socorro for Thunderbird. That option is off the table.

That leaves Thunderbird with these options:

  1. Fork Socorro and run it yourself.
  2. Find a service that can handle breakpad-style crash reports like Backtrace or Sentry.

Looks like Socorro code is here: https://github.com/mozilla-services/socorro
Since Thunderbird is so similar to Firefox code-wise, it's hard to see what would need forking in a new instance. I guess we'll need to investigate feasibility of running such an instance. Is it currently run on Google cloud, AWS, or where?

In 2021, I'm planning to rebuild Socorro to be more like Telemetry architecturally and infrastructurally. I'll be rewriting it to work on GCP. You will probably need to fork Socorro if you want to keep running it.

The collector code is in: https://github.com/mozilla-services/antenna

The processor, webapp, and crontabber code is in: https://github.com/mozilla-services/socorro

It currently runs on AWS.

How is this going?

Flags: needinfo?(vseerror)
Flags: needinfo?(philipp)
Flags: needinfo?(mkmelin+mozilla)

This is fully in the hands of the development team.

Flags: needinfo?(vseerror)

For specifying the endpoint there is bug 1675676.

@Sancus, please re-visit our options.

Assignee: nobody → sancus
Flags: needinfo?(philipp)
Flags: needinfo?(mkmelin+mozilla)

I've run into a lot of problems and questions trying to upload our crash reports to third party services.

  1. Instead of using POST parameters, we dump all of our attributes into an "extra.json" file. This isn't supported by any third party crash aggregators.
  2. I'm not exactly sure how we decide whether crash report submission was successful or not, but it isn't a 200 OK response from the server. This means the integration in about:crashes doesn't work, and all crash reports show as failed.
  3. However the "signature" calculation and deduplication works, third parties are not doing things the same way and it's impossible to tell whether we'll get good deduplication/aggregation without testing on live data. Nightly doesn't even output enough crash data for us to test on it. Beta may have barely enough, but only a couple thousand per week.
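For problem 1, the reshaping that an intermediary would have to do can be sketched roughly as below. This is only an illustration, not code from Thunderbird or any vendor: the helper name `flatten_extra` is hypothetical, and it assumes the annotations arrive as a single JSON object in the "extra.json" part next to ordinary form fields.

```python
import json


def flatten_extra(form_fields, extra_json_text):
    """Merge annotations from an extra.json payload into flat POST-style fields.

    Hypothetical helper: `form_fields` holds fields already sent as ordinary
    POST parameters, `extra_json_text` is the text of the extra.json part.
    """
    annotations = json.loads(extra_json_text)
    if not isinstance(annotations, dict):
        raise ValueError("extra.json must contain a JSON object")

    merged = dict(form_fields)
    for key, value in annotations.items():
        # Form fields are strings, so non-string values are re-serialized as JSON.
        merged[key] = value if isinstance(value, str) else json.dumps(value)
    return merged


# Illustrative values only.
fields = flatten_extra(
    {"upload_file_minidump": "<binary>"},
    '{"ProductName": "Thunderbird", "Version": "78.13.0"}',
)
print(fields["ProductName"])
```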

Problems 1 and 2 can be overcome by changing the Breakpad client implementation in Thunderbird, which is painful. I am not sure how to even evaluate #3 or whether it would be fixable if it turns out to be a problem...

An alternative would be to run some version of Socorro, even if we have to pin at an old version when changes are made. I don't really want to do this, but I'm worried about spending a lot of time modifying the breakpad client and ending up with a poor solution due to deduplication and sorting problems...

For #2, in toolkit/crashreporter/CrashSubmit.jsm, the response that the server sends back should contain some data that's formatted like key=value. One of those keys is CrashID. That's checked against a regex.

https://searchfox.org/mozilla-central/rev/cecdac0aa5733fee515a166b6e31e38cc58abf32/toolkit/crashreporter/CrashSubmit.jsm#27,193-210
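The check described above can be sketched like this. It is a rough Python rendering, not the CrashSubmit.jsm code: the function names are made up for illustration, and the regex is an assumption modeled on the "bp-" plus UUID crash IDs that appear in this bug.

```python
import re

# Assumed pattern: "bp-" followed by a UUID. The authoritative regex is the
# one in CrashSubmit.jsm linked above.
CRASH_ID_RE = re.compile(
    r"^bp-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def parse_key_value_pairs(body):
    """Parse a response body made of key=value lines into a dict."""
    pairs = {}
    for line in body.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that have no "="
            pairs[key.strip()] = value.strip()
    return pairs


def submission_succeeded(body):
    """A submission counts as successful only if a valid CrashID came back."""
    crash_id = parse_key_value_pairs(body).get("CrashID", "")
    return bool(CRASH_ID_RE.match(crash_id))


print(submission_succeeded("CrashID=bp-213f4e6c-61bb-4ea4-8113-a362b0211108\n"))
print(submission_succeeded("OK"))
```

Under these assumptions, a server that answers with a bare 200 OK and no CrashID line would be treated as a failed submission, which matches the about:crashes symptom described in #2.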

Replacing the breakpad implementation is possible, but it won't be easy since its inclusion is automatic (on supported platforms) for anything that uses old-configure in the root of M-C. This is just a guess, but I expect that we would need to explicitly disable it with --disable-crashreporter and then create a separate option. That will cause problems though, because now MOZ_CRASHREPORTER is undefined, so all of the #ifdefs that go with it will need updating.

Alternatively, if we port MOZ_CRASHREPORTER out of old-configure into mozconfigure for them, we could arrange for a Thunderbird build to use an unmodified breakpad client.

The other not-so-obvious thing: there are two client implementations. One is the patched google-breakpad client code, the other is CrashSubmit.jsm. I suspect the latter is for crashes submitted manually by the user through Firefox vs ones submitted by the crashreporter executable.

Hi! We're coming up on July 31st and I'm wondering how things are going. Are we still on schedule?

Flags: needinfo?(mkmelin+mozilla)

(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #15)

> Hi! We're coming up on July 31st and I'm wondering how things are going. Are we still on schedule?

Hey Will. Magnus is on PTO. We would appreciate a bit of extra time. Right now we have things set up to send crash reports to Backtrace via a diverter app. It works well on Nightly and Beta. However, we haven't pushed it live yet. The next version of Thunderbird 78 is scheduled for August 10th and it takes a few days for uptake.

I also had another question: How long is historical data going to remain in Socorro? This will remain relevant for some time as we tune Backtrace to aggregate/deduplicate reports correctly.

Flags: needinfo?(mkmelin+mozilla)

Regarding extra time, I can do that. I can see you're making progress and I appreciate that. We could move it from July 31st to August 31st. Does that work?

All crash data expires 6 months from when it's submitted. I can keep the Thunderbird crash data until it expires.

(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #17)

> Regarding extra time, I can do that. I can see you're making progress and I appreciate that. We could move it from July 31st to August 31st. Does that work?

Yep, that should be plenty of time. Thanks! I will update this bug when the domain change has been pushed out and everything looks good.

> All crash data expires 6 months from when it's submitted. I can keep the Thunderbird crash data until it expires.

OK, 6 months sounds great. Thanks again!

One other question: When Socorro starts refusing reports, what exactly does it return for Thunderbird clients (e.g. old versions) that remain pointed at it? I'm not sure if this matters really, I don't know if there are any side effects to unsubmitted crash reports building up or anything like that. I assume they're treated the same as regular ones except they lack a link to crash-stats.

We have a spec for the crash report payload and submission:

https://socorro.readthedocs.io/en/latest/spec_crashreport.html#collector-response

For unsupported products, the Socorro collector will return an HTTP 200 OK with a response body of:

Discarded=rule_unsupported_product
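A client or diverter could tell these outcomes apart along the following lines. This is a minimal sketch, assuming the body is made of key=value lines where an accepted report carries a CrashID key and a rejected one carries a Discarded key; `classify_collector_response` is a hypothetical name, not part of Socorro or Thunderbird.

```python
def classify_collector_response(status_code, body):
    """Return ("accepted", crash_id), ("discarded", rule), or ("error", detail)."""
    if status_code != 200:
        return ("error", f"HTTP {status_code}")
    for line in body.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        key = key.strip()
        if key == "CrashID":
            return ("accepted", value.strip())
        if key == "Discarded":
            # e.g. rule_unsupported_product once Thunderbird is unsupported
            return ("discarded", value.strip())
    return ("error", "unrecognized body")


print(classify_collector_response(200, "Discarded=rule_unsupported_product"))
```

The point of the sketch is that a 200 OK alone says nothing about acceptance; the body has to be inspected.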

Assuming Thunderbird is like Firefox, there should be a crash report manager that expires unsubmitted crash reports after some period of time. I don't know offhand how long the period is, but if that's working, unsubmitted crash reports shouldn't accumulate unbounded.

Summary: end accepting crash reports for Thunderbird (July 31st, 2021) → end accepting crash reports for Thunderbird (August 31st, 2021)

(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #20)

> Assuming Thunderbird is like Firefox, there should be a crash report manager that expires unsubmitted crash reports after some period of time. I don't know offhand how long the period is, but if that's working, unsubmitted crash reports shouldn't accumulate unbounded.

Thanks, yes, Thunderbird uses the exact same code for this.

I REALLY hate to ask this, but is there any chance we could stretch this to Oct 31, 2021? Unfortunately, it seems like it's not possible for us to uplift the code changes that alter the crash reporter domain to 78. It requires uplift of Bug 1675676 and possibly some others (apparently the regression trail seems broken), which is unlikely at this point.

If the answer is no, then we won't have crash reporting for the last two months of 78 (78.14, 78.15 at a minimum). This isn't the end of the world, but it's not great either. I do understand if extending that far conflicts with your goals too much.

I also have one other question:
Backtrace has rule-based functionality that lets you perform different actions while building the callstack signature. I've found Socorro's lists. Do you have any suggestions for which rules should be used to implement the same lists, and is there any way to prune those lists, or are all 386 prefix signatures definitely necessary? It doesn't seem like Backtrace has any code-based method of adding these and it will take a long time to do it manually through their web UI...

Thanks!

(In reply to Andrei Hajdukewycz [:sancus] from comment #22)

> I also have one other question:
> Backtrace has rule-based functionality that lets you perform different actions while building the callstack signature. I've found Socorro's lists. Do you have any suggestions for which rules should be used to implement the same lists, and is there any way to prune those lists, or are all 386 prefix signatures definitely necessary? It doesn't seem like Backtrace has any code-based method of adding these and it will take a long time to do it manually through their web UI...

From the screenshot you linked it seems that Backtrace supports regular expressions so you might be able to cut those down by using rules like: __strcmp_[a-zA-Z0-9_]+ instead of enumerating all the processor-specific variations of those functions. That being said manually adapting Socorro's lists is not going to be a fun job :-( It might be worth seeing what crashes you're getting in the first few weeks and then start adding rules to get proper bucketing just for your use cases; it might be faster that way.
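To illustrate the collapsing idea, here is a small sketch using Python's `re` module. The enumerated symbol names are illustrative examples of processor-specific variants, not entries copied from Socorro's lists, and whether Backtrace's rule engine accepts exactly this regex syntax is an assumption.

```python
import re

# Illustrative processor-specific variants that a list would otherwise enumerate.
enumerated = [
    "__strcmp_avx2",
    "__strcmp_sse2_unaligned",
    "__strcmp_ssse3",
]

# One rule in place of the whole enumeration.
pattern = re.compile(r"__strcmp_[a-zA-Z0-9_]+")


def rule_covers(regex, names):
    """True if the regex matches every name in full."""
    return all(regex.fullmatch(name) for name in names)


print(rule_covers(pattern, enumerated))
print(rule_covers(pattern, ["strcmp"]))
```

A broad rule like this also matches variants nobody enumerated, so it is worth checking that it doesn't swallow frames that should stay in the signature.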

I can adjust the timeline to this:

  • October 2021: Remove Thunderbird from the list of supported products in the collector. After this change, incoming crash reports for Thunderbird will be rejected. Crash reports that are already in Socorro will still be available for viewing until they expire.
  • January 2022 or so: Remove any Thunderbird-specific code from the rest of Socorro and remove Thunderbird from the supported products in the webapp. Delete any remaining Thunderbird crash reports from data stores that haven't expired yet.

Andrei: Does that work?

Regarding the signature lists, someone could look at which symbols don't show up anymore and prune those from the lists. I don't know how long that would take--maybe a week? They could use the "proto_signature" field for this, I think.
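The pruning pass could look something like the sketch below. It rests on several assumptions that should be checked before relying on it: that each list entry is a regular expression matched against a whole frame, that "proto_signature" is the frame list joined by " | ", and that the sample of crash reports is large enough to be representative. `unused_entries` is a hypothetical helper name.

```python
import re


def unused_entries(list_entries, proto_signatures):
    """Return the list entries that match no frame in any proto_signature."""
    frames = set()
    for proto in proto_signatures:
        # Assumed format: frames joined by " | ".
        frames.update(frame.strip() for frame in proto.split(" | "))

    unused = []
    for entry in list_entries:
        regex = re.compile(entry)
        if not any(regex.fullmatch(frame) for frame in frames):
            unused.append(entry)
    return unused


# Illustrative data only.
entries = ["__strcmp_[a-zA-Z0-9_]+", "old_removed_function"]
protos = ["__strcmp_avx2 | main"]
print(unused_entries(entries, protos))
```

Entries reported as unused are only candidates for removal; a symbol that is rare but still live would also show up here if the sample window is too short.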

Does Backtrace have an API for setting up rules? Maybe you could script something with Selenium?

Flags: needinfo?(sancus)

(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #24)

> I can adjust the timeline to this:
>
>   • October 2021: Remove Thunderbird from the list of supported products in the collector. After this change, incoming crash reports for Thunderbird will be rejected. Crash reports that are already in Socorro will still be available for viewing until they expire.
>   • January 2022 or so: Remove any Thunderbird-specific code from the rest of Socorro and remove Thunderbird from the supported products in the webapp. Delete any remaining Thunderbird crash reports from data stores that haven't expired yet.
>
> Andrei: Does that work?

Yes, that's great! Thanks.

> Does Backtrace have an API for setting up rules? Maybe you could script something with Selenium?

As far as I can tell, they do not. I've asked their support if they have some better way to do this; maybe I can send them a list and they have an API they can use internally to add them or something like that. Scripting something that interacts with the web UI is a good idea though.

Flags: needinfo?(sancus)
Summary: end accepting crash reports for Thunderbird (August 31st, 2021) → end accepting crash reports for Thunderbird (October 31st, 2021)

Is there some kind of special magic to do with shutdown hangs, like this one: https://crash-stats.mozilla.org/report/index/56f31c67-357f-4720-9736-faf4f0211019

I'm not seeing a sensible callstack in Backtrace for these and I have no idea how to fix that or if it's even fixable. What I get is something like this:
[ 0 ] mozilla::`anonymous namespace'::RunWatchdog(void*)
[ 1 ] _PR_NativeRunThread(void*)
[ 2 ] pr_root(void*)
[ 3 ] thread_start<unsigned int (__stdcall*)(void*)>
[ 4 ] @BaseThreadInitThunk@12
[ 5 ] patched_BaseThreadInitThunk(int, void*, void*)
[ 6 ] ___RtlUserThreadStart@8
[ 7 ] __RtlUserThreadStart@8

Andrei: That's the crashing thread (24). For shutdown hangs, Crash Stats shows thread 0 on the main page rather than the crashing thread. It's this code here:

https://github.com/mozilla-services/socorro/blob/058f371bbadb4651aee071252f759e01a250e31e/webapp-django/crashstats/crashstats/views.py#L146-L150
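In rough terms, the selection works like the sketch below. This is not the Socorro code at that link: it assumes shutdown hangs are recognized by a signature that starts with "shutdownhang", and `thread_to_display` is a made-up name for illustration.

```python
def thread_to_display(signature, crashing_thread):
    """Pick which thread's stack to show on the report's main page."""
    # Assumption: shutdown hangs are identified by a "shutdownhang" signature
    # prefix. For those, the crashing thread is just the watchdog, so thread 0
    # is shown instead.
    if signature.startswith("shutdownhang"):
        return 0
    return crashing_thread


print(thread_to_display("shutdownhang | SomeFrame", 24))
print(thread_to_display("SomeOtherSignature", 24))
```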

There's a Crash Reporting Working Group (https://wiki.mozilla.org/Data/WorkingGroups/CrashReporting) and we all hang out on #crashreporting on Matrix. That's a good place to ask questions like this.

(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #27)

> Andrei: That's the crashing thread (24). For shutdown hangs, Crash Stats shows thread 0 on the main page rather than the crashing thread. It's this code here:

Thanks. I don't think this logic can be duplicated in Backtrace, they don't support any kind of rule or option to change which thread you show. I'll ask their support, but I suspect we'll just have to live without this.

I talked with sancus and we agreed to push the collection end date out a week to November 8th. Updating the summary accordingly.

Summary: end accepting crash reports for Thunderbird (October 31st, 2021) → end accepting crash reports for Thunderbird (November 8th, 2021)

I'm just about ready to go here, should be able to turn off submissions to Socorro tomorrow at around 3pm PT. Once I've done that, I'll comment in this bug again.

Server side changes are live, we are no longer passing reports to Socorro.
Our crash report view (which uses the Backtrace API, and a template borrowed from Socorro :)) has also been live for a couple of days: https://crash-stats.thunderbird.net/report/bp-213f4e6c-61bb-4ea4-8113-a362b0211108

Assignee: sancus → willkg
Status: NEW → ASSIGNED

willkg merged PR #755: "bug 1608971: end crash report collection for Thunderbird" in e50cadf.

This ends support for collecting crash reports for Thunderbird. Landing this change will trigger an automatic deploy to the staging environment. I'll deploy it to production either today or tomorrow depending on how my day goes.

The next step will happen in January 2022 where we'll remove Thunderbird crash reports from storage systems and remove Thunderbird-specific supporting code from Socorro.

Ending crash collection was deployed just now to prod in bug #1741402.

I wrote up a bug for deleting crash data and removing support from the webapp. I think we're done here. Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED