Closed Bug 1283413 Opened 6 years ago Closed 6 years ago

Log downloads from arewefastyet.com frequently time out

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: h4writer)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Something like 80%+ of the failures during log parsing (which are mostly timeouts etc) are for logs downloaded from arewefastyet.com:
https://insights.newrelic.com/accounts/677903/explorer?eventType=TransactionError&timerange=day&filters=%255B%257B%2522key%2522%253A%2522transactionUiName%2522%252C%2522value%2522%253A%2522log-parser%2522%257D%255D&facet=job_log_url
(screenshot attached for those without New Relic access)

After filtering the log URL by 'arewefastyet', the errors virtually all urllib2 URLError timeouts:
https://insights.newrelic.com/accounts/677903/explorer?eventType=TransactionError&timerange=day&filters=%255B%257B%2522key%2522%253A%2522job_log_url%2522%252C%2522value%2522%253A%2522arewefast%2522%252C%2522like%2522%253Atrue%257D%255D&facet=error.message

We really need their hosting to be made more reliable, to reduce the amount of noise we see on our side.
Flags: needinfo?(hv1989)
The log is a json format of the same data that was send using an artifact to treeherder.

I annotated the log that is already parsed. We don't need to parse it again.
That should solve the matter.

https://github.com/h4writer/arewefastyet/commit/e76a6b4e29ceb69f7b7225a3c880bbe95b6fd7a7
Flags: needinfo?(hv1989)
Thank you :-)

I'm still seeing timeouts on New Relic - is that change deployed?
Assignee: nobody → hv1989
Flags: needinfo?(hv1989)
Yes it is rolled out. Since 16h ago.
Flags: needinfo?(hv1989)
Hmm I'm still seeing the logs attempting to be parsed even in the last few hours. Could you double check? I'll also have a look at our parsing code to see if we're not skipping them properly.

--

Extra context from the other day on IRC (to explain comment 1):

<wlach> I think if we mark the logs as already parsed, that will fix things on the treeherder end of things
<wlach> if you change that to job.add_log_reference('buildbot_text', loglink, parse_status='parsed') I think you should be good
<h4writer> ok awesome! I'll do that
...
<emorley> but if the logs don't need parsing, then do mark them as parsed and we can skip that entirely :-)
<h4writer> yeah we can skip that, so I will do that
<emorley> in fact the json blob doesn't really look like a log -- is it just that there are no logs? if so, perhaps don't submit a log at all, rather than giving a dummy one and marking as parsed
<h4writer> wlach asked me to provide it like that, since that is the input data to perfherder
<emorley> oh ok
<emorley> perhaps we should split that out to another artifact type and sumbit as an artifact, not a raw log
<h4writer> I'm fine with that
<emorley> if the 'log' is being used for perfherder input, doesn't it still need to be parsed? I thought the perf ingestion happened as another step during the main log parse?
<emorley> it seems like the content that the perf ingestion scrapes could be submitted as an artifact in the main job submission
<wlach> awfy submits the performance artifacts directly, the logs are just for debugging
<emorley> ah right
<emorley> ok that sounds fine as is_parsed :-)
I'm just seeing the timeouts on Treeherder stage -- are you perhaps still submitting to stage using the older code? (In addition to now submitting to prod)
(In reply to Ed Morley [:emorley] from comment #5)
> I'm just seeing the timeouts on Treeherder stage -- are you perhaps still
> submitting to stage using the older code? (In addition to now submitting to
> prod)
Flags: needinfo?(hv1989)
(In reply to Ed Morley [:emorley] from comment #6)
> (In reply to Ed Morley [:emorley] from comment #5)
> > I'm just seeing the timeouts on Treeherder stage -- are you perhaps still
> > submitting to stage using the older code? (In addition to now submitting to
> > prod)

I'm submitting only to treeherder.mozilla.com currently. And I only have one slave sending information. I have no idea why you are still seeing timeouts.

I just disabled submitting altogether and will do this for today and tomorrow. I would be interested to see if this removes the timeouts you see or not?
Flags: needinfo?(hv1989)
Thank you.

Picking one of the failures, it's for mozilla-inbound job_guid 28793052-203f-4f68-a055-294170bd7839, with log:
https://arewefastyet.com/data.php?file=treeherder-logs/842ba4c5f9ad474b9393312bc0f670c2.json

Which corresponds to:
https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&revision=49b3e41bb16b2f464f38a049943379ce6c612eb8&filter-searchStr=awfy&filter-tier=1&filter-tier=2&filter-tier=3&exclusion_profile=false&selectedJob=28151543

Which is a job that was scheduled on the 29th.

On the treeherder side, jobs retry over a course of up to 6 hours, so I'm not sure why this would still be processing, unless similarly there are queues on the awfy side that are also retrying?
Hmm ok it turned out the timeouts and retry policy had caused a 100K backlog on the stage log parser queues. I've cleared that manually which should resolve the remaining timeouts.

Sorry for the confusion - you're good to re-enable submissions now :-)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Depends on: 1284289
(In reply to Ed Morley [:emorley] from comment #9)
> Sorry for the confusion - you're good to re-enable submissions now :-)

Thanks for the quick investigation. They are now enabled again.
In the chats we had around bug 1283845, we think that jobdetail may be a better place to store the URL for the results json blob, rather than calling it a log url. We think that the jobs API may be fine with accepting a job with no log url.

Anyway, everything should work fine for now, but if awfy ever starts submitting via the Pulse ingestion method, the API there doesn't support is_parsed as-is, so you'd need to make that change.
You need to log in before you can comment on or make changes to this bug.