Closed
Bug 1283413
Opened 9 years ago
Closed 9 years ago
Log downloads from arewefastyet.com frequently time out
Categories
(Tree Management :: Treeherder: Data Ingestion, defect, P1)
Tree Management
Treeherder: Data Ingestion
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: h4writer)
References
(Blocks 1 open bug)
Details
Attachments
(1 file)
228.66 KB,
image/jpeg
Something like 80%+ of the failures during log parsing (which are mostly timeouts etc) are for logs downloaded from arewefastyet.com:
https://insights.newrelic.com/accounts/677903/explorer?eventType=TransactionError&timerange=day&filters=%255B%257B%2522key%2522%253A%2522transactionUiName%2522%252C%2522value%2522%253A%2522log-parser%2522%257D%255D&facet=job_log_url
(screenshot attached for those without New Relic access)
After filtering the log URL by 'arewefastyet', the errors are virtually all urllib2 URLError timeouts:
https://insights.newrelic.com/accounts/677903/explorer?eventType=TransactionError&timerange=day&filters=%255B%257B%2522key%2522%253A%2522job_log_url%2522%252C%2522value%2522%253A%2522arewefast%2522%252C%2522like%2522%253Atrue%257D%255D&facet=error.message
We really need their hosting to be made more reliable, to reduce the amount of noise we see on our side.
Reporter
Updated•9 years ago
Flags: needinfo?(hv1989)
Assignee
Comment 1•9 years ago
The log is a JSON rendering of the same data that was sent to Treeherder as an artifact.
I've annotated the log as already parsed, so we don't need to parse it again.
That should solve the matter.
https://github.com/h4writer/arewefastyet/commit/e76a6b4e29ceb69f7b7225a3c880bbe95b6fd7a7
Flags: needinfo?(hv1989)
Reporter
Comment 2•9 years ago
Thank you :-)
I'm still seeing timeouts on New Relic - is that change deployed?
Assignee: nobody → hv1989
Flags: needinfo?(hv1989)
Reporter
Comment 4•9 years ago
Hmm, I'm still seeing these logs being parsed, even in the last few hours. Could you double-check? I'll also have a look at our parsing code to see whether we're skipping them properly.
--
Extra context from the other day on IRC (to explain comment 1):
<wlach> I think if we mark the logs as already parsed, that will fix things on the treeherder end of things
<wlach> if you change that to job.add_log_reference('buildbot_text', loglink, parse_status='parsed') I think you should be good
<h4writer> ok awesome! I'll do that
...
<emorley> but if the logs don't need parsing, then do mark them as parsed and we can skip that entirely :-)
<h4writer> yeah we can skip that, so I will do that
<emorley> in fact the json blob doesn't really look like a log -- is it just that there are no logs? if so, perhaps don't submit a log at all, rather than giving a dummy one and marking as parsed
<h4writer> wlach asked me to provide it like that, since that is the input data to perfherder
<emorley> oh ok
<emorley> perhaps we should split that out to another artifact type and submit it as an artifact, not a raw log
<h4writer> I'm fine with that
<emorley> if the 'log' is being used for perfherder input, doesn't it still need to be parsed? I thought the perf ingestion happened as another step during the main log parse?
<emorley> it seems like the content that the perf ingestion scrapes could be submitted as an artifact in the main job submission
<wlach> awfy submits the performance artifacts directly, the logs are just for debugging
<emorley> ah right
<emorley> ok that sounds fine as is_parsed :-)
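The `add_log_reference` call quoted above can be illustrated with a minimal sketch. This is not the real treeherder-client library; the class below is a toy stand-in that only shows how a log reference with `parse_status='parsed'` would look in the submitted job payload, with field names assumed from the call quoted in the IRC log.

```python
# Toy stand-in for treeherder_client.TreeherderJob (an assumption for
# illustration, not the real client): it shows what a log reference
# marked parse_status='parsed' looks like in the job payload.

class TreeherderJobSketch:
    """Minimal sketch of the log-reference part of a Treeherder job."""

    def __init__(self):
        self._log_references = []

    def add_log_reference(self, name, url, parse_status='pending'):
        # parse_status='parsed' tells Treeherder's ingestion that the
        # log has already been handled, so no log-parse task (and no
        # download of the URL) is scheduled for it.
        self._log_references.append({
            'name': name,
            'url': url,
            'parse_status': parse_status,
        })

    @property
    def log_references(self):
        return list(self._log_references)


# Usage mirroring the IRC suggestion (loglink is a placeholder URL):
job = TreeherderJobSketch()
loglink = 'https://arewefastyet.com/data.php?file=treeherder-logs/example.json'
job.add_log_reference('buildbot_text', loglink, parse_status='parsed')
```

With `parse_status='parsed'`, the log URL is stored for display and debugging, but the parser never fetches it, which is exactly what avoids the arewefastyet.com download timeouts.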
Reporter
Comment 5•9 years ago
I'm just seeing the timeouts on Treeherder stage -- are you perhaps still submitting to stage using the older code? (In addition to now submitting to prod)
Reporter
Comment 6•9 years ago
(In reply to Ed Morley [:emorley] from comment #5)
> I'm just seeing the timeouts on Treeherder stage -- are you perhaps still
> submitting to stage using the older code? (In addition to now submitting to
> prod)
Flags: needinfo?(hv1989)
Assignee
Comment 7•9 years ago
(In reply to Ed Morley [:emorley] from comment #6)
> (In reply to Ed Morley [:emorley] from comment #5)
> > I'm just seeing the timeouts on Treeherder stage -- are you perhaps still
> > submitting to stage using the older code? (In addition to now submitting to
> > prod)
I'm submitting only to treeherder.mozilla.com currently. And I only have one slave sending information. I have no idea why you are still seeing timeouts.
I've disabled submitting altogether and will keep it disabled today and tomorrow. I'd be interested to see whether that removes the timeouts you're seeing.
Flags: needinfo?(hv1989)
Reporter
Comment 8•9 years ago
Thank you.
Picking one of the failures, it's for mozilla-inbound job_guid 28793052-203f-4f68-a055-294170bd7839, with log:
https://arewefastyet.com/data.php?file=treeherder-logs/842ba4c5f9ad474b9393312bc0f670c2.json
Which corresponds to:
https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&revision=49b3e41bb16b2f464f38a049943379ce6c612eb8&filter-searchStr=awfy&filter-tier=1&filter-tier=2&filter-tier=3&exclusion_profile=false&selectedJob=28151543
Which is a job that was scheduled on the 29th.
On the treeherder side, jobs retry over a course of up to 6 hours, so I'm not sure why this would still be processing, unless similarly there are queues on the awfy side that are also retrying?
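The retry behaviour described above can be sketched in a few lines. This is a hedged illustration, not Treeherder's actual task code: the function name, the exponential-backoff policy, and the 6-hour budget are assumptions chosen to match the description, and the `sleep` callable is injectable so the logic can be exercised without real waiting.

```python
import time
from urllib.error import URLError

# Illustrative sketch (not Treeherder's real code) of a log-parse task
# that retries a flaky download with exponential backoff until a total
# time budget -- roughly 6 hours in Treeherder's case -- is exhausted.

def fetch_with_retries(fetch, max_total_seconds=6 * 3600,
                       initial_delay=60, sleep=time.sleep):
    """Call fetch() until it succeeds or the time budget runs out.

    `fetch` is any zero-argument callable that raises URLError on a
    timeout. When the budget is exhausted the last URLError propagates,
    leaving the task failed -- unless, as in this bug, a large backlog
    keeps re-queueing work on the other side.
    """
    elapsed = 0
    delay = initial_delay
    while True:
        try:
            return fetch()
        except URLError:
            if elapsed + delay > max_total_seconds:
                raise  # budget exhausted: give up on this log
            sleep(delay)
            elapsed += delay
            delay *= 2  # exponential backoff between attempts
```

The key point for this bug is the bound: once the budget expires the task stops retrying, so timeouts seen days later had to be coming from somewhere else, which comment 9 confirms was a stage queue backlog.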
Reporter
Comment 9•9 years ago
Hmm, ok: it turned out the timeouts and retry policy had caused a ~100K backlog on the stage log-parser queues. I've cleared that manually, which should resolve the remaining timeouts.
Sorry for the confusion - you're good to re-enable submissions now :-)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Assignee
Comment 10•9 years ago
(In reply to Ed Morley [:emorley] from comment #9)
> Sorry for the confusion - you're good to re-enable submissions now :-)
Thanks for the quick investigation. They are now enabled again.
Reporter
Comment 11•9 years ago
In the chats we had around bug 1283845, we concluded that jobdetail may be a better place to store the URL for the results JSON blob, rather than calling it a log URL. We believe the jobs API is fine accepting a job with no log URL at all.
Anyway, everything should work fine for now, but if awfy ever starts submitting via the Pulse ingestion method, the API there doesn't support is_parsed as-is, so that change would be needed then.