Closed Bug 704410 Opened 13 years ago Closed 13 years ago

missing topcrasher correlation reports cause ~900 HTTP connections

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rhelmer, Assigned: rhelmer)

Details

I just did a quick test on crash-stats-dev just, and it appears that the topcrasher report is logging 1800 HTTP 404 requests for each topcrasher page. This is for a total of three unique files. STR: Hit this URL after restarting httpd+memcached: https://crash-stats-dev.allizom.org/topcrasher/byversion/Firefox/10.0a1/7 While running this: $ tail -f 2011-11-21.log.php | tee /tmp/out It takes a while for the log to actually be written to (presumably we're buffering), after waiting a few minutes I run this: $ grep '10.0a1' /tmp/out | grep warn | wc -l 1800
One of the thing the topcrasher page does is load correlation reports, which happens via 3 HTTP POSTS: /correlation/bulk_ajax/cpu/Firefox/10.0a1/ /correlation/bulk_ajax/addon/Firefox/10.0a1/ /correlation/bulk_ajax/module/Firefox/10.0a1/ Each POST sends a list of signatures for that page (we show 300 topcrashers per report, so that's 300 signatures). I haven't read through all of this yet, but here's the controller for bulk_ajax: https://github.com/mozilla/socorro/blob/master/webapp-php/application/controllers/correlation.php Here's the Correlation object which does the actual work: https://github.com/mozilla/socorro/blob/master/webapp-php/application/libraries/Correlation.php The error handler that's doing the logging is set up for: https://github.com/mozilla/socorro/blob/master/webapp-php/application/libraries/Correlation.php#L112
(In reply to Robert Helmer [:rhelmer] from comment #0) > $ grep '10.0a1' /tmp/out | grep warn | wc -l > 1800 I generated a very minimal test DB to see this without quite so much noise: http://socorro.readthedocs.org/en/latest/populatepostgres.html With a single signature on the topcrasher report, I see 8 404 warnings (presumably we made 8 HTTP connections, though I haven't tried to verify this directly). Looking closer at these, I see: """ 1 log message: fopen(http://people.mozilla.com/crash_analysis/20111121/20111121_WaterWolf_1.0-core-counts.txt) 1 log message: gzopen(http://people.mozilla.com/crash_analysis/20111121/20111121_WaterWolf_1.0-core-counts.txt.gz) 2 log messages: fopen(http://people.mozilla.com/crash_analysis/20111121/20111121_WaterWolf_1.0-core-counts.txt) 2 log messages: gzopen(http://people.mozilla.com/crash_analysis/20111121/20111121_WaterWolf_1.0-core-counts.txt.gz) 1 log message: fopen(http://people.mozilla.com/crash_analysis/20111121/20111121_WaterWolf_1.0-interesting-modules.txt) 1 log message: gzopen(http://people.mozilla.com/crash_analysis/20111121/20111121_WaterWolf_1.0-interesting-modules.txt.gz) """ Looks like it's trying .txt first then falling back to .gz, and for some reason it tries "core-counts" twice (?). I went ahead and generated a DB with two sigs in topcrashers, and it logs 16 connection attempts, so this seems consistent with above. Also, I took a look at wireshark while this was running, and only count 6 "HTTP GET"s: 0.054079 192.168.1.14 -> 63.245.208.169 HTTP GET /crash_analysis/20111121/20111121_WaterWolf_1.0-interesting-addons.txt HTTP/1.0 0.055170 192.168.1.14 -> 63.245.208.169 HTTP GET /crash_analysis/20111121/20111121_WaterWolf_1.0-core-counts.txt HTTP/1.0 0.055224 192.168.1.14 -> 63.245.208.169 HTTP GET /crash_analysis/20111121/20111121_WaterWolf_1.0-interesting-modules.txt HTTP/1.0 2.405245 192.168.1.14 -> 63.245.208.169 HTTP GET /crash_analysis/20111121/20111121_WaterWolf_1.0-interesting-addons.txt HTTP/1.0 2.426665 192.168.1.14 -> 63.245.208.169 HTTP GET /crash_analysis/20111121/20111121_WaterWolf_1.0-interesting-modules.txt HTTP/1.0 2.430087 192.168.1.14 -> 63.245.208.169 HTTP GET /crash_analysis/20111121/20111121_WaterWolf_1.0-core-counts.txt HTTP/1.0 That is consistent with each signature making one HTTP request, so something is not agreeing here with the logging and the actual number of connections. Still, this seems to suggest that we're making at least 3 HTTP connections per signature on the topcrasher page, for the same 3 unique files...
Summary: topcrasher correlation fetch generating way too many connections → missing topcrasher correlation reports cause ~900 HTTP connections
(In reply to Robert Helmer [:rhelmer] from comment #1) > I haven't read through all of this yet Read through this more and did some testing with wireshark, looks like what happens is that when the report exists it gets cached and is not fetched over and over (which makes sense otherwise I'd expect it to take much longer than it does!) We only do the pathological thing in the case when the report can't be fetched and attempt to fetch it 900 times (topcrasher_rows*3). It happens in an XHR call so it probably isn't hurting pageload time, but it's a silly thing to do and causes a *ton* of log noise. Looks easy to patch.
Commit pushed to https://github.com/mozilla/socorro https://github.com/mozilla/socorro/commit/2557ab1120b5432559f363c67f9546c09b31e349 Merge pull request #177 from rhelmer/bug704410-missing-correlation bug 704410 - cache fact that report failed
r+
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Tested on crash-stats-dev by "tail -f"ing the log and loading: (has valid correlation reports): https://crash-stats-dev.allizom.org/topcrasher/byversion/Firefox/7.0.1/7 (does not have valid correlation reports): https://crash-stats-dev.allizom.org/topcrasher/byversion/Firefox/10.0a1/7 Before the new build, ~1800 "warning" messages in the logs for the first URL, second one works as usual. After the new build (plus restarting memcached), 3 warnings on first URL (and none on refresh thanks to caching) and second works fine. Two things about this, not sure if it's worth worrying about/reopening or not: 1) we discussed on irc but did not change the cache to be lower, error load will be cached for 14 hours just like correlation reports AFAICT from the code+config 2) cached errors display "UNKNOWN: No Data" rather than an error message, it looks like the PHP code intends to return a message so I would rather just treat this as a pre-existing bug (that we probably don't care about since we want to move correlation reports into Socorro proper at some point anyway)
Assignee: nobody → rhelmer
Target Milestone: --- → 2.3.4
(In reply to Robert Helmer [:rhelmer] from comment #6) > (does not have valid correlation reports): > https://crash-stats-dev.allizom.org/topcrasher/byversion/Firefox/10.0a1/7 > (has valid correlation reports): > https://crash-stats-dev.allizom.org/topcrasher/byversion/Firefox/7.0.1/7 Of course I got these backwards saying "first" and "second" below, I just flipped them around in this quoting ^ :) > Before the new build, ~1800 "warning" messages in the logs for the first > URL, second one works as usual. > > After the new build (plus restarting memcached), 3 warnings on first URL > (and none on refresh thanks to caching) and second works fine.
Component: Socorro → General
Product: Webtools → Socorro
You need to log in before you can comment on or make changes to this bug.