Closed Bug 1120919 Opened 10 years ago Closed 10 years ago

[pulsetranslator] Do not skip messages with routing keys ending with .finished instead of .log_uploaded

Categories

(Webtools :: Pulse, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

(Whiteboard: [qa-automation-blocked])

Attachments

(1 file)

I'm currently setting up a script for bug 735184 which listens for all build messages coming in. It filters out specific messages for update channel activities. While doing that I noticed that not all tasks have a .log_uploaded message! Especially those which are about updates and do not upload a log file. What I can say is that we always have a .finished message. Here the example for update messages: > -rw-rw-r-- 1 henrik henrik 12694 Jan 13 12:15 2015-01-13T11:15:24Z-build.release-mozilla-beta-firefox_updates.0.started > -rw-rw-r-- 1 henrik henrik 17178 Jan 13 12:45 2015-01-13T11:45:13Z-build.release-mozilla-beta-firefox_updates.0.finished Currently pulsetranslator doesn't see those messages at all, so that they are not available on the normalized exchange. Mark, what do you think? Is that something we could easily switch?
Flags: needinfo?(mcote)
While further checking the order of .log_uploaded and .finished, I see a strange ordering problem. While in most of the cases the .finished message really comes last, in some cases it's before .log_uploaded. .log_uploaded before .finished: > -rw-rw-r-- 1 henrik henrik 6236 Jan 13 13:13 2015-01-13T12:13:21Z-build.release-mozilla-beta-win32_partner_repack.19.log_uploaded > -rw-rw-r-- 1 henrik henrik 6226 Jan 13 13:13 2015-01-13T12:13:26Z-build.release-mozilla-beta-win32_partner_repack.19.finished finished before .log_uploaded: -rw-rw-r-- 1 henrik henrik 1582 Jan 13 12:15 2015-01-13T11:15:21Z-build.release-mozilla-beta-win32_repack_complete.1.finished -rw-rw-r-- 1 henrik henrik 1904 Jan 13 12:15 2015-01-13T11:15:37Z-build.release-mozilla-beta-win32_repack_complete.1.log_uploaded Chris, could that be a problem with the builders and some kind of asynchronous tasks?
Flags: needinfo?(catlee)
(In reply to Henrik Skupin (:whimboo) from comment #0) > I'm currently setting up a script for bug 735184 which listens for all build > messages coming in. It filters out specific messages for update channel > activities. While doing that I noticed that not all tasks have a > .log_uploaded message! Especially those which are about updates and do not Looks like I was wrong here. We do have updates and also upload them. The specific message was kinda hidden between all those notifications. So here the correct list: > > -rw-rw-r-- 1 henrik henrik 12694 Jan 13 12:15 2015-01-13T11:15:24Z-build.release-mozilla-beta-firefox_updates.0.started > > -rw-rw-r-- 1 henrik henrik 17178 Jan 13 12:45 2015-01-13T11:45:13Z-build.release-mozilla-beta-firefox_updates.0.finished > -rw-rw-r-- 1 henrik henrik 16525 Jan 13 12:45 2015-01-13T11:45:47Z-build.release-mozilla-beta-firefox_updates.0.log_uploaded Here we have the .finished message coming in 30s before the .log_uploaded.
We don't want to do this categorically, although we may want to add support for certain message types. The reason we use .log_uploaded at present is that many consumers are interested in the logs (this is especially true for listeners that watch for unittest messages), and the logs are not guaranteed to be available when the .finished message is published. Switching from .log_uploaded to .finished would force those consumers to poll for logs, which is undesirable both from a design perspective and from a network perspective. What we really want to do, instead, is figure out which message types that we care about that don't have .log_uploaded messages and use .finish for those (we already do this for pulse messages for builds). Alternately, we could support both and allow the consumer to specify which to listen for.
(In reply to Jonathan Griffin (:jgriffin) from comment #3) > The reason we use .log_uploaded at present is that many consumers are > interested in the logs (this is especially true for listeners that watch for > unittest messages), and the logs are not guaranteed to be available when the > .finished message is published. Switching from .log_uploaded to .finished > would force those consumers to poll for logs, which is undesirable both from > a design perspective and from a network perspective. Sounds reasonable in case of those two messages are send in a non-defined order. I thought so far that .finish means that really all necessary tasks have been finished. Also including uploading the logs. But it looks like it is not the case. > What we really want to do, instead, is figure out which message types that > we care about that don't have .log_uploaded messages and use .finish for > those (we already do this for pulse messages for builds). For real builds we do not use .finished yet. We early return in the pulse callback: https://github.com/mozilla/pulsetranslator/blob/master/pulsetranslator/pulsetranslator.py#L375 So what does .finished exactly mean? Is this message send when the builds have been uploaded? > Alternately, we could support both and allow the consumer to specify which to listen for. I like that idea even more! If that would work, I can do this change.
catlee can correct me if I'm wrong, but I believe .finished means that buildbot is done. However, log uploading is done somewhat asynchronously, so .log_uploaded can happen before or after .finished. > > Alternately, we could support both and allow the consumer to specify which to listen for. > > I like that idea even more! If that would work, I can do this change. Sure, I think that's fine as long as we default to the current behavior.
I don't think there's much for me to add here; I support including both .finished and .log_uploaded.
Flags: needinfo?(mcote)
Jonathan is right, .log_uploaded is submitted after .finished.
Flags: needinfo?(catlee)
(In reply to Chris AtLee [:catlee] from comment #7) > Jonathan is right, .log_uploaded is submitted after .finished. Ok, but what about those inconsistencies then? There are lots of tasks which show up the opposite order. Means .log_uploaded first, and afterward .finished. Here just an example again beside the repack one from comment 1: -rw-rw-r-- 1 mozauto mozauto 5459 Jan 13 06:00 2015-01-13T14:00:17Z-build.release-mozilla-beta-macosx64_update_verify_2.21.log_uploaded -rw-rw-r-- 1 mozauto mozauto 5314 Jan 13 06:00 2015-01-13T14:00:25Z-build.release-mozilla-beta-macosx64_update_verify_2.21.finished In case of builds being uploaded, will .log_uploaded or .finished be safer?
Flags: needinfo?(catlee)
rabbitmq doesn't guarantee message delivery order; also there will be times when sending message (1) fails for various reasons, and so it gets put into a queue to be re-submitted. In the meanwhile message (2) is submitted and succeeds right away. The timestamps inside the messages should indicate the correct order. If you see .finished, that means that the job itself is done on buildbot. When you see .log_uploaded, it means that the log for that job has been processed and uploaded successfully. Obviously the job needs to finish before the log is processed and uploaded.
Flags: needinfo?(catlee)
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Summary: [pulsetranslator] Consumers should listen for routing keys ending with .finished instead of .log_uploaded → [pulsetranslator] Do not skip messages with routing keys ending with .finished instead of .log_uploaded
Attached patch patch v1Splinter Review
Mark or Jonathan, can one of you please review? Not sure if we also want to take it for unittests, but I think it would be a good idea. Consumers who do not correctly check the type of the message might run jobs twice with that change pushed to production. Would we have to inform someone?
Attachment #8563035 - Flags: review?(mcote)
Attachment #8563035 - Flags: review?(jgriffin)
Comment on attachment 8563035 [details] [diff] [review] patch v1 Review of attachment 8563035 [details] [diff] [review]: ----------------------------------------------------------------- lgtm
Attachment #8563035 - Flags: review?(mcote) → review+
Comment on attachment 8563035 [details] [diff] [review] patch v1 Thanks Mark. I assume it is enough to get r+ from you here. So removing jgriffin.
Attachment #8563035 - Flags: review?(jgriffin)
Landed as: https://github.com/mozilla/pulsetranslator/commit/bd25f2091fef7d5618bd2bb52b57464bb807fb1e Mark, would you mind to publish this changes to production? Thanks.
Flags: needinfo?(mcote)
Done!
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: needinfo?(mcote)
Resolution: --- → FIXED
Are we sure this is safe? It looks like it might result in duplicate unittest messages being delivered, once for .log_uploaded, and once for .finished.
Ah I was presuming we preserve that part of the routing key... apparently I don't know enough about pulsetranslator to be reviewing code. :) Should I back this out?
I'll run a test to see if my guess is correct, first.
Jonathan might be right, at least in terms of clients which do not look out for the suffix. This would imply pulsebuildmonitor: http://hg.mozilla.org/automation/pulsebuildmonitor/file/c392765232c6/pulsebuildmonitor/pulsebuildmonitor.py#l24 Not sure what the best action is here. Trying to update all clients first and wait until all consumers have updated clients, and push this change then, or see which consumers are misbehaving. Maybe we should at least get pulsebuildmonitor fixed first?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I think the right solution is to make pulsebuildmonitor handle repacks a little differently, since it seems to be those that are at issue here, and not the unittest jobs in general, correct?
I cannot say for sure which specific routing keys would be affected here. Given by Ben and Chris all tasks should have a .finished and .log_uploaded message. It's just not sure in which order those appear. Maybe we should get it backed-out and leave it by only processing .log_uploaded messages? If we are really in trouble we could do special casing as you suggest. If you wanna do a backout please do it yourself. I will have to head out in a bit for the next two days.
I did back this out, it was causing a pileup in the pulsetranslator, because it was waiting for logs to become available (due to a .finished message) when they weren't. Because pulsetranslator was changed to be single-threaded, this basically caused us to stop processing messages.
Given that problem and that we do not know how consumers would react lets wontfix this bug for now. If there is really a need later we could reopen it.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: