New crash reports don't get processed

VERIFIED FIXED

Status

--
critical
VERIFIED FIXED
9 years ago
4 years ago

People

(Reporter: whimboo, Assigned: aravind)

Attachments

(1 attachment)

(Reporter)

Description

9 years ago
Created attachment 411955 [details]
current status

Somehow the average time until a crash report gets processed rose to over 3000s. Furthermore, there are no more crash reports in the queue, which is kind of strange. Are we failing to push crash reports into the queue?

Can we please get this fixed ASAP so we do not lose too many crash reports?
(Reporter)

Comment 1

9 years ago
OK, the average time has dropped to zero, but we still don't process any crash reports.
Summary: Average time to process crash reports raised to greater than 3000s → New crash reports don't get processed
(Reporter)

Updated

9 years ago
Assignee: nobody → server-ops
Component: Socorro → Server Operations
Product: Webtools → mozilla.org
QA Contact: socorro → mrz
Version: Trunk → other

Updated

9 years ago
Assignee: server-ops → aravind
(Assignee)

Comment 2

9 years ago
Some of the processors were not restarted correctly. I have since restarted them, and we should be processing reports correctly now.
(Reporter)

Comment 3

9 years ago
Thanks, Aravind. It looks like it is working again. Don't we have a Nagios alert for when processors haven't been restarted correctly?

So can we mark it as fixed?
Status: NEW → ASSIGNED
(Assignee)

Comment 4

9 years ago
We do have log check monitors in place that let us know if the processors aren't running. In this case, it appears that the processors were restarted, but the old processes didn't die because of NFS store problems. I will document the restart procedure better and have people check for leftover processes when they restart.
Status: ASSIGNED → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
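
The post-restart check described in comment 4 could look roughly like the sketch below. It is only an illustration, not the actual server-ops procedure: the function names `pid_is_alive` and `stale_pids` are hypothetical, and it assumes the PIDs of the old processors were recorded before the restart. PIDs can be reused by the operating system, so a hit should be treated as a hint to investigate rather than as proof.

```python
import os


def pid_is_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence and permissions
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stale_pids(pids_before_restart, is_alive=pid_is_alive):
    """Return the pre-restart PIDs that are still running after the restart."""
    return sorted(pid for pid in pids_before_restart if is_alive(pid))
```

Passing `is_alive` as a parameter keeps the comparison logic testable without touching real processes. Any PID returned by `stale_pids` would correspond to an old processor that survived the restart.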

Comment 5

9 years ago
This was fallout from this morning, bug 528186.

Comment 6

9 years ago
(In reply to comment #4)
The page exposes a mood which was set to Deadly.

The page exposes other metrics, such as the number of jobs in the queue. Why don't we alarm on either of these metrics?

We can expose this as a simple name/value pair with very little dev work; just let us know what needs to happen.
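
As a rough illustration of alarming on those metrics, here is a minimal sketch of a Nagios-style check over name/value pairs. The metric names (`mood`, `jobs_in_queue`, `avg_process_seconds`), the `name=value` line format, and the thresholds are all assumptions made for this example, not what the Socorro status page actually exposes. Only the 0/1/2/3 exit codes follow the usual Nagios plugin convention.

```python
# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_metrics(text):
    """Parse 'name=value' lines into a dict, skipping blank or malformed lines."""
    metrics = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip():
            metrics[name.strip()] = value.strip()
    return metrics


def check_processing(metrics, warn_avg_seconds=600.0, crit_avg_seconds=3000.0):
    """Return (exit_code, message) for the given metrics (thresholds are assumed)."""
    try:
        avg = float(metrics["avg_process_seconds"])
        queued = int(metrics["jobs_in_queue"])
    except (KeyError, ValueError):
        return UNKNOWN, "UNKNOWN: metrics missing or malformed"
    mood = metrics.get("mood", "")
    detail = f"mood={mood} avg={avg:.0f}s queue={queued}"
    if mood.lower() == "deadly" or avg >= crit_avg_seconds:
        return CRITICAL, "CRITICAL: " + detail
    if avg >= warn_avg_seconds:
        return WARNING, "WARNING: " + detail
    return OK, "OK: " + detail
```

A wrapper script would fetch the status text, print the message, and exit with the returned code. Note that this bug's second symptom (average time at zero with an empty queue while nothing is processed) is only caught here through the mood value; detecting it directly would need an additional metric such as the time since the last processed report.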
Comment 7

9 years ago
(In reply to comment #0)
> Can we please get this fixed ASAP so we do not lose too many crash reports?

fwiw, when the processor isn't running, we don't lose any reports at all. The Socorro system was designed so that submissions can still take place even if the db and processor are offline. It catches up when it's brought back online.
(Reporter)

Comment 8

9 years ago
(In reply to comment #7)
> fwiw, when the processor isn't running, we don't lose any reports at all. The
> Socorro system was designed so that submissions can still take place even if
> the db and processor are offline. It catches up when it's brought back online.

Yes, I saw this yesterday after the processors had been started. The number of unprocessed crashes jumped to nearly 40,000. But good to know.

We are at 4000 unprocessed reports. So it's working fine. Marking as verified.

Austin, shall we file a new bug on your comment so it's separated from this already verified bug?
Status: RESOLVED → VERIFIED
(Assignee)

Comment 9

9 years ago
There is already a bug in place - https://bugzilla.mozilla.org/show_bug.cgi?id=480167
Comment 10

9 years ago
Ermm, I have recent crash IDs here (not mine, found in a support forum) that

a) give a

Please Wait...
Fetching this archived report will take 30 seconds to 5 minutes

and then

Oh Noes!
This archived report could not be located.

example:
bp-6e50c93e-8395-47f1-a20e-b5483aadabc5 (11.11.2009 19:10)
bp-1ecdcdd3-c24b-436b-b724-0db2f1c33361 (09.11.2009 19:51) (CET times)

b) give a

Oh Noes!
This archived report has expired because it is greater than 3 years of age.

example:
bp-929344d7-c9af-4fa7-b0c0-c54d9b394915 (09.11.2009 19:34)

=> those don't seem to ever get processed anymore and so are lost
=> this leads to my assumption that a certain number of reports got lost, right?
(Reporter)

Comment 11

9 years ago
Those crashes are not related to this bug. They probably weren't sent and would need to be resubmitted via about:crashes. The user should click on that entry.
Product: mozilla.org → mozilla.org Graveyard