Closed Bug 608309 Opened 14 years ago Closed 13 years ago

middleware should gracefully handle missing data

Categories

(Socorro :: General, task)

Type: task
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: stephend, Assigned: brandon)


Attachments

(1 file)

http://crash-stats.stage.mozilla.com/topcrasher/byversion/Thunderbird/3.1b1pre is currently giving me:

"Unable to load data

System error, please retry in a few minutes"

Ditto for http://crash-stats.stage.mozilla.com/topcrasher/byversion/Fennec/2.0.2, though that might actually be a code issue.
The same goes for http://crash-stats.stage.mozilla.com/topcrasher/byversion/Fennec/2.0a1, which I don't think is a code issue, since that version is available from the Versions pulldown.
Assignee: server-ops → aravind
I think this might be an application/code problem.  Here is the error I see in the middleware logs.

Oct 29 14:02:12 dhcp-10-2-11-235 Socorro Web Services (pid 22548): 2010-10-29 14:02:12,556 DEBUG - MainThread - TopCrashBySignatureTrends get {'crashType': 'browser', 'product': 'Fennec', 'endDate': datetime.datetime(2010, 10, 29, 14, 0), 'listSize': 300, 'productdims_id': 120, 'version': '4.0b1', 'duration': datetime.timedelta(14)}
Oct 29 14:02:12 dhcp-10-2-11-235 Socorro Web Services (pid 22548): 2010-10-29 14:02:12,564 DEBUG - MainThread - entered twoPeriodTopCrasherComparison
Oct 29 14:02:12 dhcp-10-2-11-235 Socorro Web Services (pid 22548): 2010-10-29 14:02:12,569 DEBUG - MainThread - endDate 2010-10-29 14:00:00
Oct 29 14:02:12 dhcp-10-2-11-235 Socorro Web Services (pid 22548): 2010-10-29 14:02:12,569 DEBUG - MainThread - rangeOfQueriesGenerator for 2010-10-01 14:00:00 to 2010-10-15 14:00:00
Oct 29 14:02:12 dhcp-10-2-11-235 Socorro Web Services (pid 22548): 2010-10-29 14:02:12,572 ERROR - MainThread - MainThread Caught Error: exceptions.TypeError
Oct 29 14:02:12 dhcp-10-2-11-235 Socorro Web Services (pid 22548): 2010-10-29 14:02:12,572 ERROR - MainThread - int argument required

If I am looking at the wrong messages here, please toss it back to server ops and we will debug more.
Assignee: aravind → nobody
Component: Server Operations → Socorro
Product: mozilla.org → Webtools
QA Contact: mrz → socorro
And here is the message corresponding to the Thunderbird (TB) requests.

Oct 29 13:41:34 dhcp-10-2-11-235 Socorro Web Services (pid 9572): 2010-10-29 13:41:34,031 DEBUG - MainThread - TopCrashBySignatureTrends get {'crashType': 'browser', 'product': 'Thunderbird', 'endDate': datetime.datetime(2010, 10, 29, 14, 0), 'listSize': 300, 'productdims_id': 69, 'version': '3.1b1pre', 'duration': datetime.timedelta(14)}
Oct 29 13:41:34 dhcp-10-2-11-235 Socorro Web Services (pid 9572): 2010-10-29 13:41:34,037 DEBUG - MainThread - entered twoPeriodTopCrasherComparison
Oct 29 13:41:34 dhcp-10-2-11-235 Socorro Web Services (pid 9572): 2010-10-29 13:41:34,039 DEBUG - MainThread - endDate 2010-02-23 04:00:00
Oct 29 13:41:34 dhcp-10-2-11-235 Socorro Web Services (pid 9572): 2010-10-29 13:41:34,039 DEBUG - MainThread - rangeOfQueriesGenerator for 2010-01-26 04:00:00 to 2010-02-09 04:00:00
Oct 29 13:41:34 dhcp-10-2-11-235 Socorro Web Services (pid 9572): 2010-10-29 13:41:34,041 ERROR - MainThread - MainThread Caught Error: exceptions.TypeError
Oct 29 13:41:34 dhcp-10-2-11-235 Socorro Web Services (pid 9572): 2010-10-29 13:41:34,041 ERROR - MainThread - int argument required
Oct 29 13:41:34 dhcp-10-2-11-235 Socorro Web Services (pid 9572): 2010-10-29 13:41:34,041 ERROR - MainThread - trace back follows:
  File "/data/breakpad/processor/socorro/webapi/webapiService.py", line 33, in GET
    result = self.get(*args)
  File "/data/breakpad/processor/socorro/services/topCrashBySignatureTrends.py", line 228, in get
    return twoPeriodTopCrasherComparison(cursor, parameters)
  File "/data/breakpad/processor/socorro/services/topCrashBySignatureTrends.py", line 193, in twoPeriodTopCrasherComparison
    listOfTopCrashers = listOfListsWithChangeInRank(rangeOfQueriesGenerator(databaseCursor, context, listOfTopCrashersFunction))[0]
  File "/data/breakpad/processor/socorro/services/topCrashBySignatureTrends.py", line 135, in listOfListsWithChangeInRank
    for i, aListOfTopCrashers in enumerate(listOfQueryResultsIterable):
  File "/data/breakpad/processor/socorro/services/topCrashBySignatureTrends.py", line 109, in rangeOfQueriesGenerator
    yield queryExecutionFunction(aCursor, parameters)
  File "/data/breakpad/pro
Oct 29 13:52:00 dhcp-10-2-11-235 Socorro Web Services (pid 9571): 2010-10-29 13:52:00,409 DEBUG - MainThread - MainThread - creating crashStorePool
http://crash-stats.stage.mozilla.com/topcrasher/byversion/Camino/2.0.2, also?  Have we started putting checks on the cron jobs, to ensure they're returning the right data?
Target Milestone: --- → 1.7.5
Target Milestone: 1.7.5 → 1.7.6
Changing the summary, since (per IRC discussion) we think this is a case where middleware is not error checking appropriately.

In the face of missing data, middleware should:

1) log appropriately
2) return something the UI can interpret as "no data" rather than "error"
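
The two requirements above could be sketched roughly as follows. This is only an illustration, not Socorro's actual code: `fetch_top_crashers`, `run_query`, and the shape of the returned dictionary are hypothetical names chosen for the example.

```python
import logging

logger = logging.getLogger("middleware_sketch")


def fetch_top_crashers(run_query, parameters):
    """Return a top-crasher payload, treating missing data as an empty result.

    `run_query` is a hypothetical callable standing in for the real database
    query; it is assumed to return a list of rows, or None when no data exists.
    """
    rows = run_query(parameters)
    if not rows:
        # Requirement 1: log the missing data with enough context to debug it.
        logger.warning("no top-crasher data for parameters: %r", parameters)
        # Requirement 2: return a well-formed "no data" payload, not an error.
        return {"crashes": [], "has_data": False}
    return {"crashes": list(rows), "has_data": True}
```

With a payload like this, the UI could check the (hypothetical) `has_data` flag and show a "no data" message instead of the generic system error.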
Summary: Missing data for Thunderbird 3.1b1pre on staging → middleware should gracefully handle missing data
Target Milestone: 1.7.6 → 1.7.7
(In reply to comment #5)
> http://crash-stats.stage.mozilla.com/topcrasher/byversion/Camino/2.0.2, also? 
> Have we started putting checks on the cron jobs, to ensure they're returning
> the right data?

This is covered by bug 616480, and I think jabba has actually taken care of most of it.
Assignee: nobody → laura
(In reply to comment #0)
> "Unable to load data
> 
> System error, please retry in a few minutes"

FWIW, I've seen this message several times this week in production (for https://crash-stats.mozilla.com/topcrasher/byversion/Camino/2.1a1, which then has data a bit later on a reload).  Dunno if this is a generic error message or related to this bug?
I'm trying to repro this on devdb and not having much luck.  It's getting new data shortly so I'll try again after that.
Stephen: got any other test cases?  None of these reproduce the specific error for me; they all just 404.

What's the desired behavior?
(In reply to comment #10)
> Stephen: got any other test cases?  None of these reproduce the specific error
> for me, they all just 404.
> 
> What's the desired behavior?

This was reproducible before by using NetSparker Community Edition [1] and/or Acunetix [2], two free scanning/fuzzing tools (Windows-only, I'm afraid, though any good fuzzer/crawler should trigger this).

[1] http://www.mavitunasecurity.com/communityedition/
[2] http://www.acunetix.com/cross-site-scripting/scanner.htm

If you're up for it, I can do a trial run right now, and see if it's still reproducible.  When it was, though, we flooded the error logs, and iirc, took down the staging server (or at least made it unavailable for a long while).
(In reply to comment #11)
> 
> This was reproducible before by using NetSparker Community Edition and/or
> Acunetix, two free scanning/fuzzing tools (Windows-only, I'm afraid, though any
> good fuzzer/crawler should trigger this).
> 
> [1] http://www.mavitunasecurity.com/communityedition/
> [2] http://www.acunetix.com/cross-site-scripting/scanner.htm
> 
> If you're up for it, I can do a trial run right now, and see if it's still
> reproducible.  When it was, though, we flooded the error logs, and iirc, took
> down the staging server (or at least made it unavailable for a long while).

Can you just run it for a short time period?  That'd be dandy.
(In reply to comment #12)

> Can you just run it for a short time period?  That'd be dandy.

Did just that tonight, with jabba and rhelmer around. At 4pm, I fired off Netsparker (it begins in "crawl" mode). Around 4:21pm, once it was in "attack" mode, jabba saw around 2.4GB of core dumps within a few minutes, at which point I stopped.
I'm going to bump this until we have a reproducible test case again.
Target Milestone: 1.7.7 → ---
Talked in person w/Laura, and we decided this can wait for new staging, where hopefully the "attack" mode won't be as big a deal, and where we could also fix logging problems as we see them coming in.
Can this be considered for 1.7.8, or at least 1.7.9 (or whichever milestone is next)?  We've uncovered some pretty good bugs via scanners/fuzzers, and it sounds like improving logging in the middleware is a win overall, for obvious reasons.

If we run this on https://crash-stats.allizom.org, which should be a mirror of production, both hardware and config/code-wise, then the concerns about DoS'ing it should be ameliorated.

Lars, iirc, Brandon had questions about the approach to addressing this problem.
Any chance we can 2.0 this?  It's pretty important that WebQA be able to negative-test the app.
Target Milestone: --- → 2.0
Assignee: laura → bsavage
This patch raises a BadRequest exception for the user in the event that they give us data that results in a TypeError. This patch is short and sweet.
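
The patch itself is not reproduced in this bug, so the following is only a sketch of the idea described above, assuming a wrapper that catches the TypeError and re-raises it as a client error. `BadRequest` here is a locally defined stand-in exception, and `get_with_bad_request` and `handler` are hypothetical names, not the ones used in the attachment.

```python
class BadRequest(Exception):
    """Stand-in for an HTTP 400 error; defined locally for this sketch."""

    status = 400


def get_with_bad_request(handler, *args):
    """Call `handler`, converting a TypeError from bad input into BadRequest."""
    try:
        return handler(*args)
    except TypeError as error:
        # Report the problem as a bad request (400) rather than a system error.
        raise BadRequest("invalid request parameters: %s" % error)
```

The design choice is that input the service cannot interpret is treated as the caller's error, so the UI can distinguish it from a genuine server failure.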
Attachment #538235 - Flags: review?(lars)
Attachment #538235 - Flags: review?(lars) → review+
Fixed in revision 3210.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro