Bug 422581 (Closed): Opened 16 years ago, Closed 16 years ago

Instantly process a queued report if requested

(Socorro :: General, task, P1)
(Not tracked)
(Reporter: vlad, Assigned: morgamic)
(1 file)

Reports seem to be taking a really long time to process (at least 3 hours, if not more) -- e.g. ffffce33-f099-11dc-bb23-001a4bd46e84, submitted 6:08 PM, is still not processed as of 9:38 PM.  We really need to scale up whatever needs scaling so that processing delays are down to minutes under the beta load, because the load at go-live will be significantly greater.  This is currently a problem for development: if a crash happens, it's too easy to forget to ever look back at it 6+ hours later to see what actually happened.

Can we just run parallel instances of the component that parses the report and generates symbolic backtraces and stuff?

I'd even consider this blocking 1.9, because we're going to need this data to get an idea of what needs to be looked into for dot releases for 1.9.
Flags: blocking1.9?
The processor can be run multiple times in parallel. AFAIK IT is working on bringing up more hardware capacity, and morgamic has Lars working on improving processor throughput.
Assignee: nobody → morgamic
morgamic and I came up with a simple idea that would solve at least the developer problem: a simple page, probably behind LDAP, where someone can go and stick in a crash UUID to have that crash processed instantly in a separate queue.  (Writing this here so we don't forget about it)
That's a pretty good idea. If you're on Windows, you might also consider using the symbol server to get stack traces from nightlies on your machine.
Yeah, we'll need to fix this for release.
Flags: blocking1.9? → blocking1.9+
Priority: -- → P2
Update on this: lars checked in a new monitor and processor this afternoon that is the fix for the faulty mutual exclusion method and also uses a proper queue for pending reports that can be used to flag for priority.  We're staging this atm and should be able to push this early next week.

The UUID flagging mentioned in comment #2 would use the queue table to push things to the front for immediate processing.

Also of note, the previous monitor randomly selected new items for processing, which caused erratic behavior as well (new reports instantly go through, old ones still waiting).  This should also be fixed with the new patch.
Been looking at Lars' queue and I think we should make the current (crappy) 404 page scenario do this:
* flag that UUID as having priority in the queue (if it exists)
* display a js refresh on the page, maybe with a monkey/shovel saying "I'm working"
* reload the page after 10 sec, which should be about how long a flagged report takes to come through
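The "I'm working" page above is just a self-reloading placeholder; a minimal sketch in Python (the function name, URL path, and 10-second interval are illustrative, not taken from the actual Socorro code):

```python
def render_pending(uuid, refresh_secs=10):
    """Build a placeholder "I'm working" page that reloads itself.

    The meta-refresh makes the browser re-request the report URL after
    refresh_secs, by which time a priority-flagged report should be done.
    """
    return (
        "<html><head>"
        f'<meta http-equiv="refresh" content="{refresh_secs};url=/report/index/{uuid}">'
        "</head><body>"
        f"<p>I'm working... this page will reload in {refresh_secs} seconds.</p>"
        "</body></html>"
    )
```

If the report still isn't processed when the browser comes back, the same page can simply be served again, which is the loop the later patch implements.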

This would eliminate the need to:
* log in
* enter any uuids

In the case where the UUID doesn't exist anywhere (queue or db), I think we should make the 404 friendlier as noted in bug 414258, but I'm not sure what text would make the "we have no idea what that UUID is" case more manageable.

Also, changing title to be more appropriate.  And adding refactor as a dependency (bug 420809) since deploying that adds our queue.
Depends on: 420809
Priority: P2 → P1
Summary: Socorro takes too long to process reports → Instantly process a queued report if requested
Target Milestone: --- → 0.6
Could probably just dupe bug 411347 over here as well if we're going to bump priority on queued reports when you hit the URL.
Lars' queue patch was pushed; bug 426940 remains to eliminate the lag between the point when the collector receives a file and when it's queued.  Worst-case wait would be ~30 seconds.

Once it's in the db, it's a really simple patch to:
* query jobs.uuid
* if exists, set priority=1
** load templates/working.html that has a meta-refresh to reports/index/[uuid] in 15 sec
** add sexy apng (
* else do 404 or 410
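The flag-or-404 branch above is a single UPDATE plus a rowcount check. A minimal sketch using an in-memory sqlite3 stand-in for the jobs table (the real Socorro schema and column names may differ):

```python
import sqlite3

# Hypothetical stand-in for the Socorro jobs table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (uuid TEXT PRIMARY KEY, priority INTEGER DEFAULT 0)")
db.execute("INSERT INTO jobs (uuid) VALUES ('ffffce33-f099-11dc-bb23-001a4bd46e84')")

def flag_priority(conn, uuid):
    """If the uuid is queued, bump it to priority processing and return
    'pending' (caller renders working.html with its meta-refresh);
    otherwise return 'not_found' (caller answers 404 or 410)."""
    cur = conn.execute("UPDATE jobs SET priority = 1 WHERE uuid = ?", (uuid,))
    return "pending" if cur.rowcount else "not_found"
```

Doing it as one UPDATE avoids a separate existence lookup: the rowcount tells you whether the job was there at all.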

I think this is better than the login/enter UUIDs scenario...
Depends on: 426940
Target Milestone: 0.6 → 0.5
(In reply to comment #9)
> I think this is better than the login/enter UUIDs scenario...

That is a great idea.
No longer depends on: 426940
Attached patch: v1, first run
Ted - this is what I have so far.  It needs review, and if you want to hack on it today, go for it.

It does some simple/cool stuff:
* flags jobs entry for priority processing
* shows queue status on pending page
* redirects to report page after 10s
* redirects back to pending page for another 10s if report isn't ready yet

There's a small delay before an active Thread picks up the queued report -- the delta between the priority update and pickup is probably ~5-8 seconds, and process time is ~5-10 seconds.  So we might want to bump the refresh time to 15 seconds, but 10s doesn't hurt that much.

I'm a little concerned about our ability to send no-cache headers.  None of this should be cached, so one of the things we need to do is force no-cache headers so the pending page doesn't get stuck in the proxy cache.  I fear pylons sucks horribly at this, but that's unconfirmed.
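If Pylons turns out to be awkward about this, the no-cache headers can be forced below the framework with plain WSGI middleware. A framework-agnostic sketch (the class name is made up; Pylons can presumably also set these per-controller):

```python
class NoCacheMiddleware:
    """Wrap a WSGI app and force no-cache headers on every response so
    intermediate proxies never serve a stale pending/report page."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        def no_cache_start_response(status, headers, exc_info=None):
            # Drop any caching headers the app set, then force no-cache.
            headers = [(k, v) for k, v in headers
                       if k.lower() not in ("cache-control", "pragma", "expires")]
            headers += [
                ("Cache-Control", "no-cache, no-store, must-revalidate"),
                ("Pragma", "no-cache"),
                ("Expires", "0"),
            ]
            return start_response(status, headers, exc_info)
        return self.app(environ, no_cache_start_response)
```

This sidesteps the framework entirely, so it works regardless of what Pylons does with response headers.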
Attachment #313566 - Flags: review?(ted.mielczarek)
Comment on attachment 313566 [details] [diff] [review]
v1, first run

This looks good. I would prefer if we could just 404 on the report/index page instead of redirecting to report/pending and then 404ing, but that'd mean an extra db lookup for each queued job, I guess, which is silly. (Unless you have a way to pass this data along with the redirect, but we'd have to stick it in a cookie or something, wouldn't we?)

I would bump the timeout to 15 or even 20 seconds if you expect processing to take more than 10 seconds.  There's no point in redirecting them to the same page twice if we can just make the refresh a bit slower.
Attachment #313566 - Flags: review?(ted.mielczarek) → review+
Alright, checked in on trunk with bumped refresh time (rev 351).  Working with Lars and Aravind to stage this so we can get this sucker pushed.
Closed: 16 years ago
Keywords: push-needed
Resolution: --- → FIXED
Keywords: push-needed
Component: Socorro → General
Product: Webtools → Socorro