Closed Bug 1202081 Opened 9 years ago Closed 9 years ago

Try results take more than an hour to be available on perfherder compare

Categories

(Tree Management :: Perfherder, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: glandium, Unassigned)

Details

As of writing, perfherder compare view shows no results for 7c37d2e0cdb4, which has jobs that finished an hour ago.
More than 1 hour and a half later, still nothing.
It appears that they're there now: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=e5c83bfca854&newProject=try&newRevision=7c37d2e0cdb4 I suppose the only thing that could be happening is that the task that updates the performance series is being delayed for some unknown reason. I believe bug 1192976 should help with this, as performance data will be inserted into the database immediately after the job is processed. Anyway, we should figure out what's happening. NI'ing myself.
Flags: needinfo?(wlachance)
What's frightening is that it looks like some transactions aren't even completing. For example, look at the missing subtest results here: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=e5c83bfca854&newProject=try&newRevision=7c37d2e0cdb4 They did appear on stage, though: https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=e5c83bfca854&newProject=try&newRevision=7c37d2e0cdb4 Looking at New Relic, some of the times for populate-performance-series are bordering on ridiculous (nearly a minute). Maybe some of them are timing out? :emorley, is this a possible explanation? If that's the case, this should all go away when bug 1192976 lands, since inserting new perf data into the database should then be practically instantaneous.
Flags: needinfo?(wlachance) → needinfo?(emorley)
There have been other reports of strange delays with jobs appearing on Try etc.
Things I would look into:
* celery queue sizes
* DB replication lag
* transaction throughput/errors on New Relic
The trouble is catching one of these in progress. Unfortunately I did not see this bug until I was CCed just now: to improve my bugmail signal-to-noise ratio, I do not component-watch Perfherder (unlike the Treeherder components).
Flags: needinfo?(emorley)
(In reply to Ed Morley [:emorley] from comment #4)
> There have been other reports of strange delays with jobs appearing on Try
> etc.
>
> Things I would look into:
> * celery queue sizes
> * DB replication lag
> * transaction throughput/errors on new relic
I'm not seeing any errors related specifically to performance in New Relic, though maybe I'm not looking hard enough. I do see some timeouts like this in the celery log:
/var/log/celery/celery_worker.log:[2015-09-07 23:12:54,347: INFO/Worker-102] MySQL operational error `(1205, 'Lock wait timeout exceeded; try restarting transaction')` hit. Retry #1 in 0.026s: {'debug_show': False, 'placeholders': [1441692723, 'x\x9c\xad}\xed\xaa$H\x8e\xdd\xab,\xf3{I\xe2CRD\xf8U\x8c\x19\xc6\xde\x06\x8f\x99\xf1\x0e\xdb\xbd`0~w+\xea\xde\xa1B\'\xa4\xbc7U9\xd0\x03U\xd5\x9du\xaeR\xa1o\x1d\xfd\xd7\xff\xfb\ ...
But as that says, it should be retrying. I would like to understand what's going on more deeply. On the other hand, it's well known to us that the way we handle performance data in perfherder is pretty broken (and may even be causing problems for other parts of treeherder). I'm tempted to just address this as part of bug 1192976, which is quite close to landing.
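For readers unfamiliar with the "Retry #1 in 0.026s" behaviour in the celery log line quoted above, here is a minimal sketch of retrying on a MySQL lock wait timeout (error 1205) with jittered exponential backoff. This is not Treeherder's actual code: `OperationalError` is a stand-in for the DB driver's exception, and `retry_on_lock_timeout`, `max_retries`, and `base_delay` are hypothetical names chosen for illustration.

```python
import random
import time

LOCK_WAIT_TIMEOUT = 1205  # MySQL error code seen in the celery log


class OperationalError(Exception):
    """Stand-in for the DB driver's operational error (hypothetical)."""


def retry_on_lock_timeout(func, max_retries=5, base_delay=0.01, sleep=time.sleep):
    """Call func(), retrying with jittered exponential backoff on error 1205.

    Any other error, or a 1205 on the final attempt, is re-raised.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except OperationalError as exc:
            code = exc.args[0] if exc.args else None
            if code != LOCK_WAIT_TIMEOUT or attempt == max_retries:
                raise
            # Double the delay each attempt; jitter keeps competing
            # workers from retrying in lockstep.
            delay = base_delay * (2 ** attempt) * random.uniform(1.0, 2.0)
            sleep(delay)
```

If every retry also hits the lock timeout, the error eventually propagates, which would be one way for a task to fail to complete even though retries are in place.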
I'm pretty sure this was an intermittent. And as I said earlier, bug 1192976 (which just landed) should further reduce delays: as long as the log is being parsed, you should be seeing performance results in perfherder. No need to wait for a "populate performance series" task. Please do reopen if you're still seeing this though.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME