Closed Bug 1443503 Opened 7 years ago Closed 4 years ago

groupResolved notifications are often sent with a considerable delay

Categories

(Taskcluster :: Services, enhancement, P5)

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: marco, Unassigned)

I'm often getting groupResolved notifications the day after a build and all associated tests have finished. The groupResolved notifications trigger a code coverage parsing task. This means the coverage data is sometimes lagging behind.
Summary: groupResolved notifcations are often sent with a considerable delay → groupResolved notifications are often sent with a considerable delay
Flags: needinfo?(bstack)
Not sure if bug 1387027 is related, but it might be.
See Also: → 1387027
Are we sure that the late taskGroupResolved
Assignee: nobody → bstack
Status: NEW → ASSIGNED
Flags: needinfo?(bstack)
On IRC today we talked about looking into taskGroup EEiuah8fQmO_vFfGZ5axgw. Indeed, the task group did resolve a day after most of the tasks finished (and I believe a day after anything you would see in Treeherder finished). However, the taskGroupResolved notification, which was received at 2018-03-06T18:15:40.982036+00:00, matches up with the time at which the last task in the group, BbmlqYN0SoWRe2vIIU7fPw, was resolved. It resolved with deadline-exceeded at 2018-03-06T18:15:38.842Z. This is most likely because a build this test depended on failed, and it does appear that X06XTQYHRR-BG0wyxv1K9w failed.

It seems like taskGroupResolved is working pretty much as expected, although it also seems that, given how it works, it is not proving to be a useful tool for you. We use taskGroupResolved in taskcluster-github but also listen for failed tasks simultaneously, so it fits our use case there. Could the right thing to do here be to listen for taskFailed/taskException on any tasks in the group and use that knowledge to trigger coverage earlier when you know certain tasks won't end up running? You could also argue that we should change the semantics of the taskGroupResolved event, but that might be a bit harder to do at this point. :jonasfj, any thoughts on this?
Flags: needinfo?(mcastelluccio)
Flags: needinfo?(jopsen)
Maybe we ought to publish an event when all tasks are resolved other than those that depend on failed/exceptioned tasks?
I feel like what we provide right now doesn't satisfy a simple/good use-case for task group events. Sorry to keep adding these in subsequent comments, I just keep thinking of more things to say :p
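The condition discussed in the comments above (every task is resolved except those that can no longer run because something they depend on failed or hit an exception) could be checked on the consumer side roughly as follows. This is only a sketch under assumptions: the function names and the two input dictionaries are hypothetical, the state names mirror the queue's task states, and every task is assumed to use the default requires = 'all-completed'. A real listener would have to build these dictionaries from the task definitions and from the taskFailed/taskException/taskCompleted messages it receives.

```python
from collections import deque

# Task states in which the queue considers a task resolved.
RESOLVED = {"completed", "failed", "exception"}
# Resolutions that prevent dependents from ever being scheduled
# (assuming the dependents use the default requires = 'all-completed').
BLOCKING = {"failed", "exception"}


def blocked_tasks(dependencies, states):
    """Return unresolved tasks that depend, directly or transitively,
    on a task that failed or resolved with an exception.

    dependencies: hypothetical mapping taskId -> list of taskIds it depends on
    states:       hypothetical mapping taskId -> current state string
    """
    # Invert the dependency graph so we can walk from a task to its dependents.
    dependents = {}
    for task, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(task)

    blocked = set()
    queue = deque(t for t, s in states.items() if s in BLOCKING)
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, ()):
            if child not in blocked and states.get(child) not in RESOLVED:
                blocked.add(child)
                queue.append(child)
    return blocked


def group_effectively_resolved(dependencies, states):
    """True when every task is resolved or can no longer run."""
    blocked = blocked_tasks(dependencies, states)
    return all(s in RESOLVED or t in blocked for t, s in states.items())
```

With a failed build, a test that depends on it, and an upload that depends on the test, both the test and the upload count as blocked, so the group is treated as done as soon as every unrelated task has resolved, without waiting for the deadline of the blocked tasks.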
The best option for me would be to have the additional event, which might be useful in general and not just for code coverage. If that isn't feasible, I can implement what you do for taskcluster-github. It's a bit more complex as it requires saving state. The other option for me is to define a task that depends on all the code coverage tasks (with dummy tasks to avoid the 100 dependency limit) and then wait for the task-completed notification for this task.
Flags: needinfo?(mcastelluccio)
> The other option for me is to define a task that depends on all the code coverage tasks (with dummy tasks to avoid the 100 dependency limit) and then wait for the task-completed notification for this task.

In this case you need to set task.requires = 'all-resolved' (instead of the default), see the docs: https://docs.taskcluster.net/reference/platform/taskcluster-queue/references/api#task

Yet, if some task that the build depends on fails, you still wouldn't get your task running.

@bstack, in the future we could consider resolving all dependent tasks with exception: 'dependency-failed' whenever a task they depend on fails and task.requires = 'all-completed' (as is the default). We can't do this anytime soon, as it would have an impact on the remaining users of `queue.rerunTask`.
Flags: needinfo?(jopsen)
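The dummy-task approach from the comment above could be laid out roughly like this. It is a sketch, not how any existing tool does it: `build_aggregators` and `make_id` are hypothetical names, only the `dependencies` and `requires` fields of each task definition are filled in, and a real definition would need the remaining required fields and proper taskIds. The limit of 100 is the figure quoted in this thread.

```python
MAX_DEPENDENCIES = 100  # per-task dependency limit quoted in this bug


def chunk(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_aggregators(task_ids, make_id, limit=MAX_DEPENDENCIES):
    """Build a tree of dummy tasks so that one root task depends,
    directly or indirectly, on every task in `task_ids`.

    make_id: hypothetical callable returning a fresh taskId.
    Returns (root_task_id, tasks), where `tasks` maps each new taskId
    to a partial task definition.
    """
    if not task_ids:
        raise ValueError("need at least one task to depend on")

    tasks = {}
    level = list(task_ids)
    while True:
        next_level = []
        for deps in chunk(level, limit):
            task_id = make_id()
            tasks[task_id] = {
                "dependencies": deps,
                # Run once dependencies are resolved, even if some failed.
                "requires": "all-resolved",
            }
            next_level.append(task_id)
        if len(next_level) == 1:
            return next_level[0], tasks
        level = next_level
```

A consumer would then wait for the task-completed notification of the root task only. As noted in the comment above, the root still cannot run until every dependency is resolved, so a task stuck behind a failed build delays it until that task's deadline.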
It seems this bug hasn't seen activity in a while. Marco, Brian, do you know if this is still happening and causing trouble?
QA Contact: jhford
It's still happening, but since the delay is usually short it doesn't matter that much to me.
In that case, let's tackle this as part of the move to postgres.
Depends on: 1436478
Priority: -- → P5
Component: Queue → Services
Assignee: bstack → nobody
Status: ASSIGNED → NEW

We never really got to the bottom of any delay here -- the one case Brian looked at showed the notification being sent at the correct, if surprising, time.

For lack of data and lack of activity, after we've migrated everything to postgres, I'm going to close this as WORKSFORME.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME