Closed Bug 1178097 Opened 9 years ago Closed 9 years ago

Search by email sometimes results in no hits

Categories

(Socorro :: Backend, task)

Tracking

(firefox42 affected)

RESOLVED FIXED

People

(Reporter: wsmwk, Assigned: lars)

Details

Yesterday I encountered one address with one crash that could not be found via search.

Today I encountered another example user address with many crash IDs where search returns no results. See next comment.
Lars, I suspect that because we are using a redactor in our elasticsearch CrashStorage class, along with the json_dump, we remove emails and URLs from processed crashes. I can't seem to find any of that data in elasticsearch anymore. Can you please confirm?
Component: General → Backend
Flags: needinfo?(lars)
ES is the only crashstore that uses redaction during the save process.  That redaction should only remove the results of the stackwalker output and have no bearing on the output of the UserData processor rule.  That rule copies the email address from the raw crash into the processed crash. 

There is a bit of a mystery here.  If I fetch the unredacted processed crash 66c29e9d-2603-4b94-ab27-78eb52150625, I see it does not have the email field.  I then reprocessed that crash and fetched the unredacted version again.  This time the email field was present.

I'm starting some research as to why some processed crashes appear to be redacted at the wrong time...
Flags: needinfo?(lars)
I've found some problems in the AWS processor configuration.  The writing redactor was falling back to default redaction, which meant that the standard fields (email, url, etc.) were being removed before saving to ES.  We'll be doing some backfill reprocessing.
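
For readers unfamiliar with the failure mode described above, here is a minimal sketch of how a missing setting can silently widen redaction. It is not Socorro's actual code: the `redact` and `forbidden_keys_for` helpers, the `forbidden_keys` setting name, and the contents of the default list are all assumptions made for illustration.

```python
import copy

# Hypothetical default list, modeled on the fields named in this bug;
# it is not copied from Socorro's real configuration.
DEFAULT_FORBIDDEN_KEYS = ("email", "url", "json_dump.sensitive")


def redact(crash, forbidden_keys):
    """Return a copy of `crash` with each (possibly dotted) key removed."""
    redacted = copy.deepcopy(crash)
    for dotted in forbidden_keys:
        *parents, leaf = dotted.split(".")
        node = redacted
        for name in parents:
            node = node.get(name)
            if not isinstance(node, dict):
                break  # path does not exist; nothing to remove
        else:
            node.pop(leaf, None)
    return redacted


def forbidden_keys_for(config):
    """Use the configured key list, falling back to the defaults when unset."""
    return config.get("forbidden_keys", DEFAULT_FORBIDDEN_KEYS)


processed_crash = {
    "uuid": "66c29e9d-2603-4b94-ab27-78eb52150625",
    "email": "user@example.com",
    "url": "https://example.com/page",
    "json_dump": {"sensitive": {"exploitability": "none"}, "threads": []},
}

# Intended configuration: strip only part of the stackwalker output.
intended = redact(
    processed_crash,
    forbidden_keys_for({"forbidden_keys": ("json_dump.sensitive",)}),
)

# Misconfiguration: the setting is absent, so the broader defaults apply
# and the email and URL are dropped before the crash is saved.
fallback = redact(processed_crash, forbidden_keys_for({}))
```

In this sketch `intended` still carries `email` and `url`, while `fallback` has lost both, which matches the symptom reported here: searches by email finding nothing.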
This config change has been pushed to production, so this will be fixed for any incoming crashes.

We need to determine how far back we want to reprocess, to fix existing crashes.
If we were to reprocess all crashes since AWS processing went live:

breakpad=> select count(uuid) from reports where (email is not null and email <> '') or (url is not null and url <> '') and date_processed between '2015-06-23' and '2015-07-02';
  count  
---------
 6579957
(1 row)
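
One caveat when reading that count: in SQL, AND binds more tightly than OR, so as written the date range constrains only the URL condition, and the total may include crashes with an email from outside the range. The sketch below demonstrates the difference on a toy in-memory SQLite table; the five rows are invented for illustration, and only the shape of the WHERE clause is taken from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports (uuid TEXT, email TEXT, url TEXT, date_processed TEXT)"
)
conn.executemany(
    "INSERT INTO reports VALUES (?, ?, ?, ?)",
    [
        ("a", "u@example.com", None, "2015-06-25"),        # email, in range
        ("b", None, "https://example.com", "2015-06-25"),  # url, in range
        ("c", "u@example.com", None, "2015-01-01"),        # email, out of range
        ("d", None, "https://example.com", "2015-01-01"),  # url, out of range
        ("e", None, "", "2015-06-25"),                     # neither, in range
    ],
)

HAS_EMAIL = "(email IS NOT NULL AND email <> '')"
HAS_URL = "(url IS NOT NULL AND url <> '')"
IN_RANGE = "date_processed BETWEEN '2015-06-23' AND '2015-07-02'"

# Same shape as the query in this comment: parsed as email OR (url AND range).
as_written = (
    f"SELECT count(uuid) FROM reports WHERE {HAS_EMAIL} OR {HAS_URL} AND {IN_RANGE}"
)
# Extra parentheses apply the date range to both conditions.
grouped = (
    f"SELECT count(uuid) FROM reports WHERE ({HAS_EMAIL} OR {HAS_URL}) AND {IN_RANGE}"
)

count_as_written = conn.execute(as_written).fetchone()[0]
count_grouped = conn.execute(grouped).fetchone()[0]
print(count_as_written, count_grouped)
```

On the toy data the ungrouped form also counts row "c", whose email predates the range. If the estimate is meant to cover only the AWS period, grouping the two OR conditions in parentheses would be the safer form.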

Not too unreasonable, I'd just watch datadog and spin up more processors if appropriate (that is, raise the "max" and "desired" number of nodes on the auto-scaler settings.)

Lars, you mentioned that this would affect aggregate reports that happened before a recent change to the signature processing rule - one thing to note is that the aggregate PG reports won't change unless you *also* backfill_matviews() for the affected range. We don't do any aggregate reports involving email or URLs that I am aware of.

You may want to backfill just for consistency's sake though, generally we expect the data stores to be in sync, though this is an odd case.
Assignee: nobody → lars
Flags: needinfo?(lars)
 The template collapse signature generation rule coincided with our move to AWS.  No additional signature changing rules have been enabled since.  However, if new symbols have been loaded for crashes that were previously processed without them, some signatures may change.
Flags: needinfo?(lars)
(In reply to K Lars Lohn [:lars] [:klohn] from comment #8)
>  The template collapse signature generation rule coincided with our move to
> AWS.  No additional signature changing rules have been enabled since. 
> However, if new symbols have been loaded for crashes that were previously
> processed without them, some signatures may change.

Oh ok! Well I'd say it's safe to just reprocess all of these in that case, backfill is optional but encouraged.
I'd like to have URLs present in ES for at least as far as two weeks back.
Please retry your failing queries.  Over this past weekend, a reprocessing job restored the missing information to Elasticsearch.
LGTM. Thanks!
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED