Closed Bug 1257531 Opened 6 years ago Closed 4 years ago

[tracker] Stop saving crash data to postgresql

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: adrian, Assigned: willkg)

References

Details

Now that bug 1252461 has landed, we can change the configuration to stop saving the processed crashes in postgres.
Configuration change:

> [centos@ip-172-31-10-19 ~]$ consulate kv get socorro/processor/destination.storage_classes
> socorro.external.postgresql.crashstorage.PostgreSQLCrashStorage, socorro.external.es.crashstorage.ESCrashStorage, socorro.external.boto.crashstorage.BotoS3CrashStorage, socorro.external.statsd.crashstorage.StatsdCrashStorage
> [centos@ip-172-31-10-19 ~]$ consulate kv set socorro/processor/destination.storage_classes 'socorro.external.postgresql.crashstorage.PostgreSQLBasicCrashStorage, socorro.external.es.crashstorage.ESCrashStorage, socorro.external.boto.crashstorage.BotoS3CrashStorage, socorro.external.statsd.crashstorage.StatsdCrashStorage'

r? anyone
Depends on: 1252461
this is not complete.

while this will set up the defaults for the destination PolyCrashStore correctly, there is another key that will override it back to the original value.  In each of the separate numbered crashstores within the PolyCrashStore, we override the default with "socorro.external.statsd.statsd_base.StatsdBenchmarkingWrapper".  Then we give the original intended crashstorage class to that class.

Look for:
   destination.storage0.wrapped_crashstore

its current value is:
    socorro.external.postgresql.crashstorage.PostgreSQLCrashStorage

change it to:
    socorro.external.postgresql.crashstorage.PostgreSQLBasicCrashStorage

If I were to start over with the configuration, and knowing that the benchmarking wrappers are permanent rather than temporary, I would have set the key "socorro/processor/destination.storage_classes" to be use the "socorro.external.statsd.statsd_base.StatsdBenchmarkingWrapper" 3 times. That way there would be less ambiguity that lead you to your proposal.
Thanks Lars! We clarified on IRC, the key is actually `wrapped_object_class` instead of `wrapped_crashstore` (which is legacy).

Verification: 

> [centos@ip-172-31-10-19 ~]$ consulate kv get socorro/processor/destination.storage0.wrapped_object_class
> socorro.external.postgresql.crashstorage.PostgreSQLCrashStorage


New proposal:
> $ consulate kv set socorro/processor/destination.storage_classes 'socorro.external.postgresql.crashstorage.PostgreSQLBasicCrashStorage, socorro.external.es.crashstorage.ESCrashStorage, socorro.external.boto.crashstorage.BotoS3CrashStorage, socorro.external.statsd.crashstorage.StatsdCrashStorage'
> $ consulate kv set socorro/processor/destination.storage0.wrapped_object_class 'socorro.external.postgresql.crashstorage.PostgreSQLBasicCrashStorage'
I have just applied those changes to stage. I'm watching datadog and logs for the next 30 minutes, after what I'll be gone for the weekend. If things go wrong, here is the config rollback: 

consulate kv set socorro/processor/destination.storage_classes 'socorro.external.postgresql.crashstorage.PostgreSQLCrashStorage, socorro.external.es.crashstorage.ESCrashStorage, socorro.external.boto.crashstorage.BotoS3CrashStorage, socorro.external.statsd.crashstorage.StatsdCrashStorage'

consulate kv set socorro/processor/destination.storage0.wrapped_object_class 'socorro.external.postgresql.crashstorage.PostgreSQLCrashStorage'
Red Alert. We need to undo this. Turns out that the current correlations scripts, which we run on crash-analysis (the old dbaron scripts) RELY on processed_crashes (joined with reports). See
https://github.com/mozilla/socorro/blob/f9a84eb267c457c04adbec7891b7e4039f2da857/scripts/crons/cron_libraries.sh#L56
https://github.com/mozilla/socorro/blob/f9a84eb267c457c04adbec7891b7e4039f2da857/scripts/crons/cron_libraries.sh#L81

Basically, it does a join on `processed_crashes` AND `reports_$DATE`.

To quote Lars (we noticed this today whilst vidyo'ing), "That's what happens when you have too many moving parts."

I think we can leave the source code as is. Just need to make sure our config is set to to continue to write processed crashes to PG.
Once we've rewritten how correlations are aggregated and accumulated we should ideally not use PG at all. At all! (since we can get product versions from /api/ProductVersions/).

Lars first unfinished prototype of correlations_app.py (which I'm taking over slowly) relies on PG (for UUIDs) and S3 (for processed crash JSON blobs). To start with, we'll get UUIDs out of ES (SuperSearch maybe). Maybe we'll use ES for the processed crashes too and not even involve S3.
I have rolled back the configuration on stage, applying the commands I wrote in comment #4.
Peter, what's the status here? Do we still rely on the old correlation scripts?
(In reply to Adrian Gaudebert [:adrian] from comment #8)
> Peter, what's the status here? Do we still rely on the old correlation
> scripts?

I don't know. I'm certainly not working on it. For the moment we're still relying on pulling down the correlations from crash-analysis with http so they're not even stored in Postgres.
If cron_libraries.sh is still running on crash-analysis, then we still rely on this. There are three lines in there that rely on direct postgres access. Replace those with ES backed API calls, update crash-analysis, and we can go ahead with this.
Component: Infra → General
What's the status of this bug? Do we still use the old correlation scripts mentioned in comment #6?
It's messy. 

1) https://github.com/mozilla/socorro/blob/master/scripts/crons/cron_libraries.sh is still in git. It still depends on talking directly to postgres to query the `processed_crashes` table. 

2) The cron job on crash-analysis is still going. Nobody updates as far as I know. And they shouldn't have a reason to either now since figuring out what the featured versions are is automated. Correlations are still generated daily. E.g. https://crash-analysis.mozilla.com/crash_analysis/20170427/ 

3) The webapp does NOT depend on crash-analysis.m.c. We used to pull on data from crash-analysis.m.o by requests.get() but we longer do. (:marco replaced this will a query to telemetry)

4) If we re-do the configuration so that our crashstorage stops saving processed crashes in PG, then the output on crash-analysis.m.o will cease. The socorro team certainly "doesn't care" since we neither use or depend on it for anything else in socorro. But other stability people might care. 

Rough outline of a plan
-----------------------

1. Ask once and twice the stability list if people are dependent of the output on https://crash-analysis.mozilla.com/crash_analysis/*

2. If someone says they are; Ask :marco how we can guide these people away from it and start relying on Telemetry.

3. If nobody says they are; 
  3.1. make the change (again) mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1257531#c4 (and on prod too)
  3.2 remove cron_libraries.sh from our code
  3.3 talk to calixte if they still need the crash-analysis server at all and if we can recyle it


:marco, :calixte Do you know of anybody using the old crash-analysis.m.c?

Adrian, did I get this right?

Who wants to do the comms to find out if we can shut this down?
Flags: needinfo?(mcastelluccio)
Flags: needinfo?(cdenizet)
Blocks: 1314814
Depends on: 1361396
:peterbe, I'm currently removing the stuff we've on c-a.m.c.
I think Marcia continues to use https://crash-analysis.mozilla.com/release-mgmt/0000.overview.html (Latest Daily Reports) but I'm rewriting this stuff and will put it on Heroku.
I'll ping you when everything will be finished (I hope very soonly).
I already filled a bug for that purpose: https://bugzilla.mozilla.org/show_bug.cgi?id=1315007
Flags: needinfo?(cdenizet)
(In reply to Calixte Denizet (:calixte) from comment #14)
> :peterbe, I'm currently removing the stuff we've on c-a.m.c.
> I think Marcia continues to use
> https://crash-analysis.mozilla.com/release-mgmt/0000.overview.html (Latest
> Daily Reports) but I'm rewriting this stuff and will put it on Heroku.
> I'll ping you when everything will be finished (I hope very soonly).
> I already filled a bug for that purpose:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1315007

That's nice but not answering the question :)

Do you know if anybody uses https://crash-analysis.mozilla.com/crash_analysis/
(Not; Do you use crash-analysis.mozilla.com)

There's a cron job that generates these .txt and .txt.gz files. We *used* download those from the Socorro webapp and display their content as part of the signature report and/or the report index pages. We don't any more. The question is; do you know anybody else who depends on those .txt (and .txt.gz) files at all?
:peterbe, sorry I missed that...
As far as I know, nobody use this. Maybe you should check the server logs to see if someone download this stuff.
The only thing I know of is DLL correlations with versions (see bug 1361396).
Flags: needinfo?(mcastelluccio)
(In reply to Peter Bengtsson [:peterbe] from comment #15)
> 
> There's a cron job that generates these .txt and .txt.gz files. We *used*
> download those from the Socorro webapp and display their content as part of
> the signature report and/or the report index pages. We don't any more. The
> question is; do you know anybody else who depends on those .txt (and
> .txt.gz) files at all?

We at Beijing office scraps these reports to find out if any of our extensions have unusual high correlation with high volumn crashes, it helped a few times in the past.

We just noticed we've been gotten empty results since 5 May, and it turns out since that day only the first character of the extension id was left. [1][2]

[1]: https://crash-analysis.mozilla.com/crash_analysis/20170504/20170504_Firefox_52.0.2-interesting-addons.txt
[2]: https://crash-analysis.mozilla.com/crash_analysis/20170505/20170505_Firefox_52.0.2-interesting-addons.txt
Hector: Can you write up a new bug for that? That's a different (but related) issue.
Hector, that problem very likely comes from us changing the format of the addons field. See bug 1250132 for context.
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #19)
> Hector: Can you write up a new bug for that? That's a different (but
> related) issue.

Filed as bug 1371574.

I'm also interested in knowing what can I do after bug 1361396.
Assignee: adrian → nobody
Blocks: 1361394
No longer blocks: 1314814
Summary: Stop saving processed crash in PostgreSQL → [tracker] Stop saving processed crash in PostgreSQL
Duplicate of this bug: 1399124
Depends on: 1375434
Depends on: 1434425
Depends on: 1435551
Adjusting the summary a bit since the postgresql crash storage saves raw and processed crash data to a few different places.
Summary: [tracker] Stop saving processed crash in PostgreSQL → [tracker] Stop saving crash data to postgresql
Depends on: 1459272
Depends on: 1459276
Depends on: 1435071
Depends on: 1460125
The last changes removing all the bits from the db related to crashstorage landed today. We're all done here!
Assignee: nobody → willkg
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
We pushed it all out today in Socorro tag 319. That's the end of that.
You need to log in before you can comment on or make changes to this bug.