Closed Bug 1492768 Opened 7 years ago Closed 7 years ago

Perfherder graphs frequently don't load due to /alertsummary/ API calls timing out

Categories

(Tree Management :: Perfherder, enhancement, P1)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: wlach)

Details

Attachments

(1 file)

I have spent a lot of time refreshing pages and trying to get graphs to load- this seems to be a problem for the tart test. Here is a recent example: https://treeherder.mozilla.org/perf.html#/graphs?timerange=1209600&series=mozilla-inbound,1644601,1&series=autoland,41098223c1c7585b9af92ff51b83ea579d41e6d0,0 If I CTRL+K to see the devtools console, it loads right up, I see this in the console error: TypeError: "H.zoom is undefined; can't access its "x" property" zoomGraph https://treeherder.mozilla.org/perf.1085b7484a9698d040ea.bundle.js:21:43553 plotGraph https://treeherder.mozilla.org/perf.1085b7484a9698d040ea.bundle.js:21:45477 processQueue https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:136126 scheduleProcessQueue https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:136792 digest https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:144507 evalAsync https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:146563 completeTask https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:154211 Zi https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:74056 defer https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:74028 evalAsync https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:146532 get https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:135248 scheduleProcessQueue https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:136684 reject https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:137529 rejectPromise https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:137448 reject https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:137853 get https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:41:270288 invoke https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:69230 qi https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:70322 getService https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:68458 <anonymous> https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:41:289817 invoke https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:69230 createInjector https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:70530 forEach https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:45661 createInjector https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:70502 Di https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:55448 bootstrap https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:55796 angularInit https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:54941 <anonymous> https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:269259 jr https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:1:48658 Fr https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:1:49077 resolve https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:1:49310 Dr https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:1:46768 fireWith https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:1:47622 fire https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:1:47657 Dr https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:1:46768 fireWith https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:1:47622 ready https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:1:50917 completed https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:1:3371 EventListener.handleEvent* https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:1:51053 <anonymous> https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:1:1830 <anonymous> https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:1:1729 __webpack_require__ https://treeherder.mozilla.org/manifest.fea02df2c5ce4f6b21b2.bundle.js:1:114 <anonymous> https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:45:269675 __webpack_require__ https://treeherder.mozilla.org/manifest.fea02df2c5ce4f6b21b2.bundle.js:1:114 <anonymous> https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:8:6648 __webpack_require__ https://treeherder.mozilla.org/manifest.fea02df2c5ce4f6b21b2.bundle.js:1:114 <anonymous> https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:71:12334 __webpack_require__ https://treeherder.mozilla.org/manifest.fea02df2c5ce4f6b21b2.bundle.js:1:114 webpackJsonp https://treeherder.mozilla.org/manifest.fea02df2c5ce4f6b21b2.bundle.js:1:454 <anonymous> https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js:1:1 Possibly unhandled rejection: {} vendor.b94667f4e12c463094ae.bundle.js:45:127585 Source map error: request failed with status 404 Resource URL: https://treeherder.mozilla.org/vendor.b94667f4e12c463094ae.bundle.js Source Map URL: vendor.b94667f4e12c463094ae.bundle.js.map[Learn More] :emorley has anything changed in treeherder recently that would be a good cause of this error?
Flags: needinfo?(emorley)
The graphs URL above works for me? The line that appears to be the one referenced in the stack trace above, is: https://github.com/mozilla/treeherder/blob/a0c1b607debad384642b9cfb258d5da9e3c4bba6/ui/js/controllers/perf/graphs.js#L370 The history of that file is here: https://github.com/mozilla/treeherder/commits/a0c1b607debad384642b9cfb258d5da9e3c4bba6/ui/js/controllers/perf/graphs.js Though these two calls are very slow - perhaps they timed out for you and Perfherder doesn't correctly handle failed requests/display a nice message? 10:35:02.969 XHR GET https://treeherder.mozilla.org/api/performance/alertsummary/?alerts__series_signature=1644601&repository=2 [HTTP/1.1 200 OK 17593ms] 10:35:03.078 XHR GET https://treeherder.mozilla.org/api/performance/alertsummary/?alerts__series_signature=1653659&repository=77 [HTTP/1.1 200 OK 11816ms]
Flags: needinfo?(emorley)
It seems as though this morning all my perfherder loads are timing out (I have gigabit fiber to my house, so I doubt it is on my end unless it is my computer or web browser - firefox nightly freshly restarted). The code sheriffs have been leaving notes on bugs that they cannot load the graphs, this is new as of the last 24 hours. when I try to get the alertsummary lines you provided (that returned in 11 and 18 seconds...that is a long time), I get application error which looks like a heroku message or something.
using the console trick, I see the same heroku error message: Possibly unhandled rejection: {"data":"<!DOCTYPE html>\n\t<html>\n\t <head>\n\t\t<meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n\t\t<meta charset=\"utf-8\">\n\t\t<title>Application Error</title>\n\t\t<style media=\"screen\">\n\t\t html,body,iframe {\n\t\t\tmargin: 0;\n\t\t\tpadding: 0;\n\t\t }\n\t\t html,body {\n\t\t\theight: 100%;\n\t\t\toverflow: hidden;\n\t\t }\n\t\t iframe {\n\t\t\twidth: 100%;\n\t\t\theight: 100%;\n\t\t\tborder: 0;\n\t\t }\n\t\t</style>\n\t </head>\n\t <body>\n\t\t<iframe src=\"//www.herokucdn.com/error-pages/application-error.html\"></iframe>\n\t </body>\n\t</html>","status":503,"config":{"method":"GET","transformRequest":[null],"transformResponse":[null],"jsonpCallbackParam":"callback","url":"/api/performance/alertsummary/?alerts__series_signature=1651299&repository=77","headers":{"Accept":"application/json, text/plain, */*","X-CSRFToken":"s23IFzzAFPFHOyMl2wnUVn1sgZ7lssGlY68nk4WgYRxQKWEyf751b9hjvTA93KRP"}},"statusText":"Service Unavailable","xhrStatus":"complete"} I don't know why we are getting errors from the api- that has always worked well; has the API had recent changes?
Flags: needinfo?(emorley)
Summary: perfherder fails to load tart graphs → perfherder fails to load most graphs
The HTTP 503 means the request has been killed for taking too long (over 20 seconds). Looking at new relic, I can see that there are indeed Perfherder requests that are taking that long (and so are killed by gunicorn with a SystemExit exception): https://rpm.newrelic.com/accounts/677903/applications/14179757/filterable_errors?tw%5Bend%5D=1537437517&tw%5Bstart%5D=1537351117#/table?top_facet=transactionUiName&barchart=barchart (Looking at the New Relic user list I see you have an account already, but here's a screenshot for anyone else: https://screenshots.firefox.com/VJOaGNIaF7fVnJEl/rpm.newrelic.com) Someone (Ionut?) needs to do some profiling and speed up that API: https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?tw%5Bend%5D=1537437798&tw%5Bstart%5D=1537351398#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e706572666f726d616e63655f646174613a506572666f726d616e6365416c65727453756d6d617279566965775365742e6c697374222c22225d Screenshots showing profile breakdown: https://screenshots.firefox.com/qKkjvqE645d0cxF8/rpm.newrelic.com https://screenshots.firefox.com/taIZNuh3GWnucJlQ/rpm.newrelic.com In addition, the error handling in the frontend needs to be adjusted such that failed requests display a user-friendly in-app notification message rather than resulting in an exception.
Flags: needinfo?(emorley)
Priority: -- → P1
Summary: perfherder fails to load most graphs → Perfherder graphs frequently don't load due to /alertsummary/ API calls timing out
1600+ database queries for one API request seems pretty broken: https://screenshots.firefox.com/taIZNuh3GWnucJlQ/rpm.newrelic.com
:wlach, I don't know if you have any thoughts here- I think :emorley has outlined where the problems are, maybe you have some background or suspicions based on the original implementation. My main concern is the api timeout and the large volume of queries for a single api call. :igoldan is out until next week and I will be out as of this afternoon- I suspect :igoldan could start looking into this next week- at the very least getting proper error reporting.
Flags: needinfo?(wlachance)
Filing a PR for this as week speak.
Flags: needinfo?(wlachance)
Ok, the PR seems to improve query times by a factor of 2 on prototype, mainly by prefetching framework and user data. There's still a ridiculous number of queries, but that's a little harder to fix given how the endpoint is constructed -- we're currently filtering on the field of a relation, which is difficult to express in a performant way using django rest framework. We may want to consider either using a custom endpoint which looks up alert summaries based on alerts (essentially the inverse of what we're doing right now) or a graphql query here.
Comment on attachment 9010652 [details] [review] Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/4050 Ed, I think you're in the best place to review this.
Attachment #9010652 - Flags: review?(emorley)
(In reply to William Lachance (:wlach) (use needinfo!) from comment #10) > we're currently filtering on the field of a relation, > which is difficult to express in a performant way using django rest framework Are you sure it's a limitation of django-rest-framework, and not due to the API using the auto-generated queries via django-filters? Sadly I'm not really familiar enough with d-r-f to suggest anything more concrete. George would have been a good person to ask for this. Thank you for the PR though and nice work on the 2x speedup :-)
Attachment #9010652 - Flags: review?(emorley) → review+
(In reply to Ed Morley [:emorley] from comment #12) > (In reply to William Lachance (:wlach) (use needinfo!) from comment #10) > > we're currently filtering on the field of a relation, > > which is difficult to express in a performant way using django rest framework > > Are you sure it's a limitation of django-rest-framework, and not due to the > API using the auto-generated queries via django-filters? Sadly I'm not > really familiar enough with d-r-f to suggest anything more concrete. George > would have been a good person to ask for this. > > Thank you for the PR though and nice work on the 2x speedup :-) No, I'm not sure at all. :) I feel like I'm constantly learning about new things that are possible with drf and django-filters, and SQL for that matter. Basically this PR just fixes some things that were obviously wrong with my initial naive implementation.
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/2806e500714486fde9c745fb71acdf445d825825 Bug 1492768 - Optimize alert summary queries executed from graphs view (#4050) Not a complete solution but hopefully this will improve things enough to stop the timeouts we're seeing.
Many thanks!
Assignee: nobody → wlachance
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: