Write a cron job to populate the `signatures` table from Super Search data

Status: NEW (Unassigned)
Product: Socorro :: Backend
Reported: 6 months ago, last updated: 5 days ago
People: (Reporter: adrian, Unassigned)
Tracking: (Blocks: 2 bugs)
Firefox Tracking Flags: (Not tracked)

Description (Reporter: adrian, 6 months ago)
Feature URL
-----------

https://crash-stats.mozilla.com/topcrashers/?product=Firefox&version=54.0&days=7

Parts impacted
--------------

 * socorro, cron.jobs

Rationale
---------

 * the `signatures` table is currently populated by the `update_reports_clean` stored procedure; we want to remove that stored procedure, but keep the contents of the `signatures` table
 * Elasticsearch, via Super Search, should be the only source of crash data and aggregations (see the sketch below for what the Super Search side of such a job could look like)
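
As an illustration of the Super Search side, here is a minimal sketch against the public API with plain requests (the real job would live under socorro's cron jobs and use the internal Super Search machinery; all parameter values below are illustrative). Faceting on signature over a recent window yields the candidate rows:

```python
# Minimal sketch: facet on signature over the last hour via the public
# Super Search API. All parameter values are illustrative.
import datetime

import requests

SUPERSEARCH_URL = "https://crash-stats.mozilla.com/api/SuperSearch/"

since = datetime.datetime.utcnow() - datetime.timedelta(hours=1)
resp = requests.get(SUPERSEARCH_URL, params={
    "date": ">=" + since.strftime("%Y-%m-%dT%H:%M:%S"),
    "_facets": "signature",
    "_facets_size": 1000,   # up to 1000 distinct signatures
    "_results_number": 0,   # aggregation only, no individual crash hits
})
resp.raise_for_status()
for bucket in resp.json()["facets"]["signature"]:
    print(bucket["term"], bucket["count"])
```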

Comment 1

Why do we need the signatures table?

I suspect the topcrashers page needs it. Perhaps it's because of the first_seen column. Is that the only reason to have that table?
Flags: needinfo?(adrian)

Comment 2 (Reporter: adrian, 4 months ago)
Yes, that's the table we use for the SignatureFirstDate service, which is used by topcrashers. I don't remember what else uses it. I was under the impression it was a generally useful service for our users, but you might want to challenge that assumption.
Flags: needinfo?(adrian)
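
For context, the service is reachable through the public API. A minimal usage sketch (the endpoint name comes from the webapp's API models; the exact response keys shown are assumptions):

```python
# Hypothetical usage sketch of the SignatureFirstDate service via the
# public crash-stats API; the response keys are assumptions.
import requests

resp = requests.get(
    "https://crash-stats.mozilla.com/api/SignatureFirstDate/",
    params={"signatures": "OOM | small"},
)
resp.raise_for_status()
for hit in resp.json()["hits"]:
    print(hit["signature"], hit["first_date"], hit["first_build"])
```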

Comment 3

The table looks like this:

breakpad=> \d signatures
                                          Table "public.signatures"
    Column    |           Type           |                             Modifiers
--------------+--------------------------+-------------------------------------------------------------------
 signature_id | integer                  | not null default nextval('signatures_signature_id_seq'::regclass)
 signature    | text                     |
 first_report | timestamp with time zone |
 first_build  | numeric                  |
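
Given that schema, the write side of the cron job boils down to an upsert that never clobbers an earlier value. A sketch with psycopg2 (this assumes a unique index on signature, which may not exist yet, and a made-up DSN):

```python
# Hypothetical upsert keeping the earliest first_report/first_build.
# Assumes a unique index on signatures.signature (an assumption!) and
# a placeholder DSN. LEAST ignores NULLs, so partially filled rows are
# handled gracefully.
import psycopg2

UPSERT_SQL = """
INSERT INTO signatures (signature, first_report, first_build)
VALUES (%(signature)s, %(first_report)s, %(first_build)s)
ON CONFLICT (signature) DO UPDATE SET
    first_report = LEAST(signatures.first_report, EXCLUDED.first_report),
    first_build = LEAST(signatures.first_build, EXCLUDED.first_build)
"""

with psycopg2.connect("dbname=breakpad") as conn:
    with conn.cursor() as cursor:
        cursor.execute(UPSERT_SQL, {
            "signature": "OOM | small",
            "first_report": "2017-09-01 00:00:00+00",
            "first_build": 20170901000000,
        })
```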
Assignee: nobody → peterbe
Blocks: 1399124

Comment 4

https://github.com/mozilla-services/socorro/pull/3984 is quite big and non-trivial, but I'm confident it's on the right path.

We could solve this with a processor rule that has write access to Postgres (adrian's idea), but that's a bit sad. The processor needs to be simple and have as few side effects as possible; not needing Postgres would also mean we no longer have to worry about managing the database connection. Also, this feature (signatures' first date) feels like something that belongs more towards the web app side of things. Granted, crontabber isn't the webapp, but the work generally feels better off there, since it's massaging the data.

The PR needs unit tests, and I don't know the best way to test it. What I had in mind is that we install it in the -stage admin's crontabber job list, then watch the logs to see that it's spotting signatures; it should also report that it's not inserting them (because the stored procedure has already done so). But if you manually delete some newish signatures from the table and wait, this crontabber app should repopulate them.
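
For the manual part of that stage test, a throwaway helper along these lines would do it (table and column names come from the \d output above; the DSN is a placeholder):

```python
# Throwaway helper for the manual stage test: delete the ten most
# recently seen signatures so the crontabber app has rows to re-insert.
# "dbname=breakpad" is a placeholder DSN.
import psycopg2

with psycopg2.connect("dbname=breakpad") as conn:
    with conn.cursor() as cursor:
        cursor.execute("""
            DELETE FROM signatures
            WHERE signature_id IN (
                SELECT signature_id FROM signatures
                ORDER BY first_report DESC
                LIMIT 10
            )
        """)
        print("deleted", cursor.rowcount, "rows")
```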

Comment 5

See https://github.com/mozilla-services/socorro/pull/3984#issuecomment-329838697 for why I'm un-assigning this from myself.
Assignee: peterbe → nobody

Comment 6

One thing I was thinking about is that maybe we should treat this as a data flow. We already have crashes flowing from Antenna -> Processor -> various crash storage systems, keyed by crash id.

Maybe we should create a flow for signatures? Every time the processor computes a signature, it tosses it in a flow of signatures and we have a consumer pull those off and update Postgres.

We've got pieces of this in place already. We could use RabbitMQ which the processor already uses and write a RabbitMQ crash storage that tosses signatures into a queue. Then we also write a microservice component that consumes from the RabbitMQ queue and updates Postgres.

Seems like this would have similar properties as adding a processor rule. I wonder if there are other problems we could/should solve similarly.
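
To make that concrete, a rough sketch of the two halves with pika and psycopg2 (queue name, DSN, and the hook into the processor are all made up; the real producer would be a proper crash storage class, and the upsert assumes a unique index on signatures.signature):

```python
# Hypothetical sketch of the signature flow. The producer would hang
# off the processor; the consumer would run as a small standalone
# service. Queue name and DSN are placeholders.
import json

import pika
import psycopg2

QUEUE = "socorro.signatures"

def publish_signature(channel, signature, date_processed):
    """Producer side: called after the processor computes a signature."""
    channel.basic_publish(
        exchange="",
        routing_key=QUEUE,
        body=json.dumps({"signature": signature,
                         "date_processed": date_processed}),
    )

def consume_forever():
    """Consumer side: pull signatures off the queue, upsert Postgres."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    pg = psycopg2.connect("dbname=breakpad")

    def on_message(ch, method, properties, body):
        payload = json.loads(body)
        with pg.cursor() as cursor:
            cursor.execute(
                """
                INSERT INTO signatures (signature, first_report)
                VALUES (%s, %s)
                ON CONFLICT (signature) DO NOTHING
                """,
                (payload["signature"], payload["date_processed"]),
            )
        pg.commit()
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
    channel.start_consuming()

if __name__ == "__main__":
    consume_forever()
```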

Comment 7

(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #6)

I like that a lot more than a non-trivial crontabber app that spans ES and PG. 
However, it might be worth "making a deal": switch to this app nowish, and then you can much more confidently switch off storing anything crash-related in PG and disable it in the PolyCrashStorage. Once that's in place, it'll be easier to innovate on new cool Jansky solutions. In other words, an admittedly cryptic "side-step": get rid of PG-based crash storage first, then add it back in a better way.

Comment 8

I can't think of a reason why we couldn't do this now; I don't think we need to wait for Jansky. I think we can implement this and throw it in the mix, and it'll work just fine with the existing stuff. Then once we're happy with it, we can remove the existing stuff.
Blocks: 1257531