Closed Bug 1180055 Opened 10 years ago Closed 10 years ago
queue: reportException reason: internal-error
Categories: Taskcluster :: Services (defect)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: jonasfj; Assigned: jonasfj
Attachments: 1 file
@garndt, @pmoore,
For internal errors, that is, errors we could not foresee and handle
appropriately, I suggest we introduce "reason: internal-error",
meaning: you've hit an internal bug in the worker code.
Similarly, I'm tempted to suggest that we implement
"reason: resource-unavailable"
to indicate that a remote resource we depend on was unavailable:
- docker registry down (i.e. connection or 5xx error, NOT a 4xx error!)
- S3 artifact bucket down (i.e. connection or 5xx error, NOT a 4xx error!)
- a phone device we claimed disappeared
- can't contact index.taskcluster.net
- anything meaning: the resource exists, but we can't get to it.
Note, this is NOT the same as a missing resource, for which we either reportFailed or reportException with malformed-payload.
Like I said, I'm still not sure: there are cases where we can't tell the
difference between a resource that is missing and one that is unavailable.
These reasons are only for workers to report, covering bugs in the worker
and resources the worker uses (possibly referenced from task.payload).
Tasks should only report true/false and create logs/artifacts for the rest.
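A minimal sketch, assuming the reasons above get adopted, of how a worker could map failures onto them via the taskcluster-client Queue API. The error-classification helpers (err.isMalformedPayload, err.isRemoteOutage) are hypothetical, not real APIs:

    let taskcluster = require('taskcluster-client');
    let queue = new taskcluster.Queue({
      credentials: {clientId: '...', accessToken: '...'},
    });

    // Resolve a run with an exception, picking the reason by fault-line:
    //   malformed-payload    -> task owner's fault,
    //   resource-unavailable -> third party's fault (outage),
    //   internal-error       -> our fault (a bug in the worker).
    let resolveWithException = async (taskId, runId, err) => {
      let reason = 'internal-error';          // default: unforeseen worker bug
      if (err.isMalformedPayload) {           // hypothetical classifier
        reason = 'malformed-payload';
      } else if (err.isRemoteOutage) {        // connection error or 5xx, not 4xx
        reason = 'resource-unavailable';
      }
      await queue.reportException(taskId, runId, {reason});
    };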
@garndt, @pmoore,
I haven't formalized this before, but I think there is a fault-line in how
we resolve tasks, see comment 8:
https://bugzilla.mozilla.org/show_bug.cgi?id=1174105#c8
There is no way around the fact that the fault-line is gradual, as some
internal errors manifest themselves in worker crashes and "claim-expired".
Note, by fault-line I mean who is at fault for the given task resolution :)
I think maintaining a strict fault-line will save us a lot of intermittent
bugs that aren't ours. And keeping the number of "reasons" from exploding
will keep it feasible to consume status output from taskcluster
programmatically.
Anyways, what are your thoughts? Perhaps we should discuss this Tuesday.
Flags: needinfo?(pmoore)
Flags: needinfo?(garndt)
Comment 1•10 years ago
I'm totally in favour, and agree we should have as few meaningful states as possible for external consumption, in combination with more detailed logs for analysis. I agree with differentiating between "unavailable resource" and "internal worker exception" (e.g. due to a buggy worker).
We do rely entirely on the availability of S3, heroku, and docker hosting, and even a short outage can result in task exceptions, so I think it is better to handle this in a controlled fashion. Until now, a reason not to handle this has been the assumption that these external services are mostly reliable; however, we lose nothing by adding these controls, since if these services are forever reliable and never go down, implementing these changes will have no impact. Since these changes only take effect if external services go down, I see them as purely beneficial, with only a minor increase in complexity.
In summary: let's dooo eeet! =)
Flags: needinfo?(pmoore)
Comment 2•10 years ago
P.S. I feel like a kindergarten kid howling "he started it, it wasn't me!". Essentially what we are implementing here. =)
Comment 3•10 years ago
> P.S. I feel like a kindergarten kid howling "he started it, it wasn't me!".
> Essentially what we are implementing here. =)
I know I've aggressively phrased it as "who is at fault", as if it's all about finding someone to blame :)
The reason is that the second step of debugging is finding out what went wrong (the first step is to try a few random permutations to see if that fixes the problem -- it's so wrong, but we've all done it).
Anyways, when looking at a task-exception + reasonResolved, it's important that we can quickly determine
whether this was a bug in the queue/worker/provisioner or an issue in task-specific code/payload.
For issues in task-specific code/payload, we shouldn't have to open papertrail, ssh into machines, etc. But we might need to clarify details in the documentation.
Being able to tell on which side of the interface a problem originates dramatically reduces the places
we need to look for the issue. So we don't go looking through irrelevant docker-worker logs if the
problem was a spelling error in task.payload.image.
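To illustrate (a sketch, not prescriptive), a consumer of task status could branch on reasonResolved to decide where to start looking; queue.status is the existing Queue API, and the reasons are the ones proposed here:

    let {status} = await queue.status(taskId);
    let run = status.runs[status.runs.length - 1];   // latest run
    switch (run.reasonResolved) {
      case 'malformed-payload':
        // task-side problem: inspect task.payload, no worker logs needed
        break;
      case 'resource-unavailable':
        // third-party outage: check service status pages
        break;
      case 'internal-error':
        // our side: now papertrail / docker-worker logs are relevant
        break;
    }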
Comment 4•10 years ago
I'm all for having these two states. It will be useful for separating out faults of our own that we can act upon from faults of third parties for which we may need to open support tickets. We can also alert on different thresholds for each, and collect metrics about them.
Flags: needinfo?(garndt)
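A minimal sketch of the counting half of that idea; the metrics backend and the alerting rules are left out and entirely hypothetical:

    // reason -> count; flushed periodically to whatever metrics backend
    // we use, where internal-error can alert at a much lower threshold
    // than resource-unavailable.
    let exceptionCounts = {};
    let recordException = (reason) => {
      exceptionCounts[reason] = (exceptionCounts[reason] || 0) + 1;
    };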
Updated•10 years ago
Component: TaskCluster → Queue
Product: Testing → Taskcluster
Updated•10 years ago
Assignee: nobody → jopsen
Status: NEW → ASSIGNED
Comment 5•10 years ago
This adds:
- internal-error
- resource-unavailable
Neither of them retries anything.
IMO they shouldn't: "resource-unavailable" should only be used for declarative resources that you know exist... And workers should do their own retries, so additionally retrying the entire task would cause cascading retries.
Attachment #8640807 - Flags: review?(jhford)
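For reference, a sketch of the worker-side retry pattern argued for above: the worker retries a flaky remote fetch itself, with exponential backoff, and only surfaces resource-unavailable once it gives up. fetchResource and isServerError are hypothetical helpers:

    let fetchWithRetries = async (url, retries = 5) => {
      for (let attempt = 0; attempt < retries; attempt++) {
        try {
          return await fetchResource(url);       // hypothetical fetch
        } catch (err) {
          // Only retry connection errors / 5xx; a 4xx means the resource
          // is missing, i.e. the task's fault, not an outage.
          if (!isServerError(err) || attempt === retries - 1) {
            throw err;  // caller then reports reason: resource-unavailable
          }
          // exponential backoff: 1s, 2s, 4s, ...
          await new Promise(r => setTimeout(r, 1000 * Math.pow(2, attempt)));
        }
      }
    };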
Comment 6•10 years ago
@pmoore, @garndt,
I assigned this to jhford as he also has to review a PR this is based on.
But feel free to check whether this is something you like.
Comment 7•10 years ago
Hrm, for some reason I thought there was a discussion about having an exception state that would be retried, similar to worker-shutdown, but I might be mistaken.
Comment 8•10 years ago
garndt, there was... I didn't like it :)
Do you want to retry:
A) internal-error, or
B) resource-unavailable?
It seems to me that (B) might qualify, but that it would be a bad idea...
Because workers should already have retried, instead of relying on the entire task being retried.
Comment 9•10 years ago
What I would like to avoid is having tasks retried without bounds (other than the deadline) when something bad happens, if restarting on another worker won't solve it. What I'm thinking of is that for some reason the network is crappy on a node and the balrog vpn proxy can't start up. We can retry all we want, but it might not fix the situation where running on a different worker might be the solution. Of course, having the task retried doesn't mean that a new worker will grab it. I say we leave both as-is unless pmoore has a different opinion; if we find ourselves needing to retry a particular type of exception, we can discuss it at that point, when we have a better use case.
Comment 10•10 years ago
> What I'm thinking of is for some reason network is crappy on a node and the balrog vpn proxy
> can't start up for some reason.
Then we should restart the current node, i.e. worker-shutdown is appropriate.
Or maybe we should retry internal-error, hoping that such an error is related to lack of disk space, etc.
I fear we would encourage people to use unstable resources.
> it might not fix the situation where running on a different worker might be the solution.
I think that's a corner case, probably not likely to be an issue.
> if we find ourselves needing to retry a particular type of exception we can discuss it at
> that point when we have a better use case.
Yeah, we can also change the behavior of these later.
But we would need to somehow define the category of error cases for which a retry is valid.
Retrying just because it might work is IMO not okay... We shouldn't encourage intermittent issues this way.
Comment 11•10 years ago
I agree regarding the intermittent issues. I feel that adding an exception that can be retried and relied upon for this behavior just hides the larger issue of things being unstable.
Comment 12•10 years ago
Comment on attachment 8640807 [details] [review]
Github PR
Wow, I forgot about this PR...
@garndt... Let's not wait for jhford to get home.
Please review, and I'll roll it out.
Attachment #8640807 - Flags: review?(jhford) → review?(garndt)
Updated•10 years ago
Attachment #8640807 - Flags: review?(garndt) → review+
Comment 13•10 years ago
Merged and deployed.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 14•10 years ago
Thanks guys for sorting this out! =)
Updated•7 years ago
Component: Queue → Services