Closed Bug 444150 Opened 12 years ago Closed 12 years ago

Processor is not processing new reports

Categories

(Socorro :: General, task)

x86
macOS
task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: morgamic, Assigned: lars)

References

Details

This is a tracking bug for the issues we're currently working on.

Expected:
1) submit report
2) click on link
3) view report in < 10 minutes

Actual:
1) submit report
2) click on link
3) see pending page (sometimes with queue info) for hours
3b) sometimes never see a report

Issues currently being worked on:
1) monitor crashes randomly when scanning a large backlog of reports (cause unknown). Potential causes:
- NFS issues
- Python threading issues
- pgsql connection timeout

2) priority jobs are not being processed correctly as a result of main monitor thread dying

Solutions being worked on:
1) separating monitor threads
2) debugging python, pgsql and NFS to figure out why the monitor is crashing
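A minimal sketch of solution 1, not the actual Socorro monitor code: each loop runs in its own thread, so a failure in the backlog scan cannot take priority processing down with it, and the failure is recorded instead of vanishing. The names `run_isolated`, `backlog_scan` and `priority_loop` are hypothetical stand-ins.

```python
import queue
import threading


def run_isolated(name, work, errors):
    """Run work() in its own thread and record any exception it raises."""
    def target():
        try:
            work()
        except Exception as exc:
            # Report the failure instead of letting the thread die silently.
            errors.put((name, exc))

    thread = threading.Thread(target=target, name=name, daemon=True)
    thread.start()
    return thread


def demo():
    errors = queue.Queue()
    processed = []

    def backlog_scan():
        # Hypothetical stand-in for the long scan that dies partway through.
        raise RuntimeError("scan failed")

    def priority_loop():
        # Hypothetical stand-in for priority job handling.
        for job in ("uuid-a", "uuid-b"):
            processed.append(job)

    threads = [
        run_isolated("backlog-scan", backlog_scan, errors),
        run_isolated("priority-jobs", priority_loop, errors),
    ]
    for thread in threads:
        thread.join()

    failed = []
    while not errors.empty():
        failed.append(errors.get()[0])
    return processed, failed
```

Here `demo()` returns the priority jobs that were handled along with the names of the threads that failed, so the dead scan shows up as data rather than as a missing log line.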

Why this is taking so long:
- each new monitor loop takes a matter of hours to populate the database
- the problem is not reproducible on identical hardware with an identical codebase

Current strategy:
1) using a collector update to create simple uuid entries in a top-level directory in order to avoid having to do deep scans for new jobs
2) having the separate main monitor thread be responsible for processing just these new uuid entries and nothing else
3) remove the need for priority job processing and queueing to do any scans

We are currently working on implementing the above in staging.
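A rough sketch of the strategy above, assuming the collector can write one empty marker file per report into a flat directory. The function names and the marker layout are hypothetical, not the real collector or monitor API. The monitor then needs a single directory listing instead of a deep scan of the report tree.

```python
import os
import tempfile


def mark_new_report(index_dir, uuid):
    """Collector side: drop an empty marker file named after the report uuid."""
    open(os.path.join(index_dir, uuid), "w").close()


def take_new_reports(index_dir):
    """Monitor side: list the flat index once and claim every marker found."""
    claimed = []
    for entry in sorted(os.listdir(index_dir)):
        try:
            # Removing the marker is what claims the job.
            os.remove(os.path.join(index_dir, entry))
        except FileNotFoundError:
            continue  # another worker already claimed it
        claimed.append(entry)
    return claimed


def demo():
    index_dir = tempfile.mkdtemp()
    mark_new_report(index_dir, "uuid-b")
    mark_new_report(index_dir, "uuid-a")
    first = take_new_reports(index_dir)
    second = take_new_reports(index_dir)  # nothing left to claim
    return first, second
```

The cost of finding new jobs then depends on the number of unclaimed markers rather than on the size of the whole backlog. How marker removal behaves on the NFS mount is not covered by this sketch.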
Blocks: 444152
nominating to get this on the radar. As we get closer to major releases as well as during the beta cycle, we need to have crash data to evaluate whether any top crashes were introduced during the release.

Also, is there an explanation as to why this happened now versus before final ship - is it a case of pure volume?
Flags: blocking1.9.1?
Problems started when the backlog was > 800k, and we've been working for about a week to debug and refactor the monitor and filesystem scanning portion of the processor.

ETA on the monitor refactor is 2 days, possibly sooner.  We are still collecting data, but priority processing is delayed by 1+ hours because of the time it takes to scan for new jobs on NFS.

We need to refactor the monitor before inserting new queue jobs because the current main monitor scan fails before it can complete and does so without any logged errors.  Lars or Aravind -- anything I missed?
(In reply to comment #1)
> nominating to get this on the radar. As we get closer to major releases as well
> as during the beta cycle, we need to have crash data to evaluate whether any
> top crashes were introduced during the release.

We have processed a significant number of crashes post 3.0 already.  Is this data not enough to get trends?  I am not clear how this is blocking any kind of trending.  If it's individual reports you are after, then the current system should still get you what you need, only the reports will be delayed by a few hours (instead of minutes).

We are working on fixes that should get things back to normal processing for priority reports - within a few minutes.  However the backlog of unprocessed reports will continue to grow until we throttle the crash reporter somehow.
I am not seeing reports delayed by a few hours. In my case more than 24 hours elapse before I see a report. QA needs stack information to diagnose what might be happening during our testing cycle. During my 3.0.1 testing I crashed several times and was not able to get my information before we pushed the builds out to beta. During the 3.0.1 beta cycle, which began yesterday, we need to be able to receive the data in a timely manner in order to evaluate whether a top crash may have been introduced into 3.0.1.

Also, just curious as to how the backlog grew to over 800K?
(In reply to comment #4)
> Also, just curious as to how the backlog grew to over 800K?

Reports per day are almost 4x projected values -- so either something is going on in client throttling that is causing more reports to come in than before, people are crashing more, we have more users, or all of the above.

800K reports is less than 4 days of reports.  Our cumulative backlog grew to that value over 2 weeks, which does not necessarily mean we are 4x behind, but we are still quite a bit behind.
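As a back-of-the-envelope check on those figures, assuming constant intake and processing rates over the two weeks (a simplification, not measured data):

```python
backlog = 800_000       # unprocessed reports
days_of_reports = 4     # the backlog is "less than 4 days" of intake
growth_days = 14        # the backlog built up over about 2 weeks

# Intake is at least this many reports per day.
min_daily_intake = backlog / days_of_reports

# Average number of reports per day that went unprocessed.
daily_shortfall = backlog / growth_days

# Upper bound on the share of daily intake left unprocessed.
max_shortfall_fraction = daily_shortfall / min_daily_intake

print(f"intake    > {min_daily_intake:,.0f} reports/day")
print(f"shortfall ~ {daily_shortfall:,.0f} reports/day")
print(f"behind by < {max_shortfall_fraction:.1%} of intake")
```

Under those assumptions the processor was keeping up with more than 70% of incoming reports, which fits being behind without being 4x behind.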
Oh, I get it - I forgot about the throttling. If that is not working then I can see how we are getting more data.
I think it's working, it's just not throttling as much as we expected for some reason -- there is another bug about dropping it to 10% instead of 25%.
This should be fixed now.  Priority processing is working correctly as well.  We still have issues with the reporter that are being tracked in a separate bug (444749).

We have a huge backlog of nearly 1.8M reports. Not sure if we will ever catch up, but anyway, the reports you care about should be ready about 5 minutes after you first visit them.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Flags: blocking1.9.1?
Component: Socorro → General
Product: Webtools → Socorro