Closed Bug 735184 Opened 10 years ago Closed 7 years ago

RFE: Create Pulse notifications for update channel activities

Categories

(Release Engineering :: Release Automation: Other, enhancement, P4)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: bhearsum)

References

Details

(Whiteboard: [pulse][qa-automation-blocked])

Attachments

(6 files)

For release testing, Mozilla QA can now rely on Pulse notifications when builds have been made available. It's great that this part can run fully automated. Sadly, we lack the same for our update testing process.

I'm not sure how complicated it would be, but it would be awesome if Pulse could notify us when updates have been made available, changed, or even disabled. We see benefits from such notifications in a couple of projects, for example the DTC testing project (bug 718403).
What kind of events are you looking for?

Could you be better served by having some AUS urls that you poll regularly for changes?
Severity: normal → enhancement
Whiteboard: [pulse]
(In reply to Chris AtLee [:catlee] from comment #1)
> What kind of events are you looking for?

I'm looking for events which QA could use to start running update tests. I have no idea what happens behind the scenes, but it should be the final state after flipping the setting. I'm not interested in seeing the status of uploading the mar files, or of broadcasting to the mirror network.

> Could you be better served by having some AUS urls that you poll regularly
> for changes?

I don't like polling, and given that everything else is implemented as notifications in Pulse, what's blocking us on this front? I don't know the complexity of the RelEng tasks, so a short description of what would have to be done would be great.
(In reply to Henrik Skupin (:whimboo) from comment #2)
> (In reply to Chris AtLee [:catlee] from comment #1)
> > What kind of events are you looking for?
> 
> I'm looking for events which QA is using to start running update tests. I
> have no idea what happens behind the scenes but it should be the final state
> of flipping the setting. I'm not interested in seeing the status of
> uploading the mar files, or broadcasting to the mirror network.

I'd like to know what kind of events you'd find useful, and then try to map that to what happens behind the scenes. My suspicion right now is that:

1) currently with AUS2 there's no good way to map what happens behind the scenes to what you need, other than to generate an event that says "snippets have changed", or perhaps "snippets for Firefox X.Y.Z have been pushed" without much more detail

2) with Balrog (aka AUS3) we may be able to generate more detail like "Firefox release channel now pointing at Firefox X.Y.Z".

> > Could you be better served by having some AUS urls that you poll regularly
> > for changes?
> 
> I don't like polling and given that everything else is implemented as
> notification in Pulse, what's blocking us on this front? I don't know the
> complexity of RelEng tasks so a short description of what would have to be
> done would be great.

Because of 1), I think polling would be best for now. And depending on your requirements, 2) may not be sufficient.
(In reply to Chris AtLee [:catlee] from comment #3)
> I'd like to know what kind of events you'd find useful, and then try to map
> that to what happens behind the scenes. My suspicion right now is that:

Is there a list of available data points? Or would you have to build it up based on our needs?

> 1) currently with AUS2 there's no good way to map what happens behind the
> scenes to what you need, other than to generate an event that says "snippets
> have changed", or perhaps "snippets for Firefox X.Y.Z have been pushed"
> without much more detail

Where is AUS2 still in use? Is it for older Firefox builds? If so, which ones? For AUS2 I would be happy with the target version and buildid of Firefox, and the channel name, if that's possible.

> Because of 1), I think polling would be best for now. And depending on your
> requirements, 2) may not be sufficient.

Even if I polled, I would lack the information needed to verify that the applied patch is the correct one. I would need at least the version and buildid of the target build, served separately. Otherwise there is no way to qualify the build.
(In reply to Henrik Skupin (:whimboo) from comment #4)
> (In reply to Chris AtLee [:catlee] from comment #3)
> > I'd like to know what kind of events you'd find useful, and then try to map
> > that to what happens behind the scenes. My suspicion right now is that:
> 
> Is there a list of data points which are available? Or would you have to
> build it up based on the needs?

Yes, I want to know what you need. Nothing exists here yet, so I'm trying to gather your requirements.

> > 1) currently with AUS2 there's no good way to map what happens behind the
> > scenes to what you need, other than to generate an event that says "snippets
> > have changed", or perhaps "snippets for Firefox X.Y.Z have been pushed"
> > without much more detail
> 
> Where is AUS2 still in use? Is it for older Firefox builds? If yes, which
> are those? For AUS2 I would be happy with the target version and buildid of
> Firefox, and the channel name - if it would be possible.

AUS2 is still in use everywhere. We haven't moved anything official to Balrog yet.

> > Because of 1), I think polling would be best for now. And depending on your
> > requirements, 2) may not be sufficient.
> 
> Even if I would poll, I miss important information if the applied patch is
> the correct one. I would need at least the version and buildid of the target
> build served separately. Otherwise there is no way to qualify the build.

Can you clarify this? I don't understand what you mean here.
Priority: -- → P4
Also, you're going to have to do some polling to cope with various levels of caching. Even after new snippets are copied or new AUS configs are deployed, it can take many minutes before these changes show up on the production web heads.
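Because of that caching delay, a consumer that polls has to keep retrying until the change is visible. A minimal Python sketch of such a poll loop, assuming a hypothetical `check` callable (for example, one that fetches an AUS URL and compares the served buildid with the expected one); neither the URL nor the check is part of any existing tool:

```python
import time
from typing import Callable


def poll_until(check: Callable[[], bool], timeout: float, interval: float,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Call `check` every `interval` seconds until it returns True or `timeout` elapses.

    Returns True once the change is visible, False if the deadline passed first.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With caching measured in minutes, something like `interval=60` and a `timeout` of an hour or so would be a reasonable starting point, but those values are guesses rather than anything RelEng specified.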
Product: mozilla.org → Release Engineering
This came up again recently, and for us it is a blocker for running release update tests without intervention from QA people. All of this fits into our smooth release process goals.

So, at a minimum, we would need the following properties in the Pulse message:

* platform: the platform to run the tests on
* channel: the channel the update has been enabled for
* target version: version of the build we are updating to
* target build id: id of the build we are updating to

Also, the changes for bug 997499 bit us a bit this time, because older builds didn't update to the latest build but had to make a stop-over via 29.0b8. So it would be good to also have a property which gives us the oldest build to test. Otherwise I'm not sure how to figure out those specific situations without getting hundreds of failed test runs.
Component: General Automation → Release Automation
QA Contact: catlee → bhearsum
Whiteboard: [pulse] → [pulse][qa-automation-blocked]
(In reply to Henrik Skupin (:whimboo) from comment #7)
> So what we would need as minimal in the pulse message are the following
> properties:
> 
> * platform:		the platform to run the tests on

I don't think you need this. We enable updates for all platforms at the same time. We could provide a list of platforms if it helps, I guess.

> * target version 	version of the build we are updating to

Define "version". For example, when we build 30.0b2 do you want "30.0b2" or "30.0"? What about ESR? "24.5.0esr" or "24.5.0"?

> * target build id 	id of the build we are updating to

If we provide a list of platforms, we should probably associate buildids with them (even though they're typically the same for every platform).

> Also the changes for bug 997499 bite us a bit this time because older builds
> didn't update to the latest but had to make a stop-over via 29.0b8. So it
> would be good to even have a property which gives us the oldest build to
> test. Otherwise I'm not sure who to figure out those specific situations
> without getting hundreds of failed testruns.

You can deal with this the same way we do: modify your configs in advance. E.g.: http://hg.mozilla.org/build/tools/rev/6d68be2cb18a. These are relatively freakish events (we've only done 2 that I'm aware of), and I don't think we should build systems to support them when a one-off manual change can work around them.

Implementation-wise, I think we'd need to set some properties in the updates builder and start sending them as part of the buildFinished event. I don't have time to look at this further right now; I'm busy with b2g updates and other Balrog work.
(In reply to Ben Hearsum [:bhearsum] from comment #8)
> > * platform:		the platform to run the tests on
> 
> I don't think you need this. We enable updates for all platforms at the same
> time. We could provide a list of platforms if it helps, I guess.

The platform would be necessary to distinguish the buildids, given that they might differ. See below.

> > * target version 	version of the build we are updating to
> 
> Define "version". For example, when we build 30.0b2 do you want "30.0b2" or
> "30.0"? What about ESR? "24.5.0esr" or "24.5.0"?

What we need is the exact version, i.e. the full version specifier like 30.0b2 as listed on the FTP server. Same for 24.5.0esr, because otherwise we could get a conflict between 24.0 and 24.0esr.

> > * target build id 	id of the build we are updating to
> 
> If we provide a list of platforms, we should probably associate buildids
> with them (even though they're typically the same for every platform).

Typically yes, but recently we had a situation where the buildids were different for all platforms. Those cases would cause us a lot of trouble. By the way, how would you send out the notification? I assume it would be a single one containing all the information, not individual ones per platform?
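Pulling these requirements together (channel, exact version string, and per-platform buildids in a single message), one possible message shape can be sketched in Python. This is a hypothetical structure for discussion, not a format RelEng has agreed to send; the field names and the buildids in the usage example are placeholders:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UpdateNotification:
    """Hypothetical payload for one "updates available" message.

    A single message covers every platform; buildids are stored per
    platform because they are not guaranteed to be identical.
    """
    channel: str          # update channel, e.g. "beta"
    target_version: str   # exact version string, e.g. "30.0b2" or "24.5.0esr"
    build_ids: Dict[str, str] = field(default_factory=dict)  # platform -> buildid

    def platforms(self) -> List[str]:
        return sorted(self.build_ids)

    def build_id_for(self, platform: str) -> str:
        # Raises KeyError if the release did not ship on this platform.
        return self.build_ids[platform]


msg = UpdateNotification("beta", "30.0b2",
                         {"win32": "20140101000000", "linux": "20140101000001"})
```

Keeping the version as the exact string ("24.5.0esr", not "24.5.0") avoids the 24.0 vs. 24.0esr ambiguity mentioned above.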

> > Also the changes for bug 997499 bite us a bit this time because older builds
> > didn't update to the latest but had to make a stop-over via 29.0b8. So it
> > would be good to even have a property which gives us the oldest build to
> > test. Otherwise I'm not sure how to figure out those specific situations
> > without getting hundreds of failed testruns.
> 
> You can deal with this the same way we do - modify your configs in advance.
> Eg: http://hg.mozilla.org/build/tools/rev/6d68be2cb18a. These are relatively
> freakish events (we've only done 2 that I'm aware of) and I don't think we
> should build systems to support them when it's a one-off manual change to
> work around.

I think this is something we have to discuss with Anthony and the other QA people who run release update tests. We will most likely have to define which versions to use as source builds, so we might also be able to circumvent this problem. The thing is, in those cases we really need to be informed about such an interim step.

> Implementation-wise, I think we'd need to set some properties in the updates
> builder and start sending them as part of the buildFinished event. I don't
> have time to look at this further right now, I'm busy with b2g updates and
> other Balrog work.

OK, is there someone else who could help us here, or will it be delayed indefinitely?
If you want to start monitoring to see what data is already available, I think you can use routing keys like:
build.release-mozilla-beta-updates on org.mozilla.exchange.build.

Those will have subkeys of a number (the build number, according to the buildbot master - you don't care about that) and finished. So it would end up looking like:
build.release-mozilla-beta-updates.1.finished

I'm not sure what information is in that right now, but if it's got properties, it will have some of what you need.
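Splitting such a key into its parts is straightforward. A minimal Python sketch, assuming keys have exactly the four dot-separated parts of the example above (`build`, builder name, buildbot build number, event); `parse_build_key` is a hypothetical helper:

```python
from typing import NamedTuple, Optional


class BuildKey(NamedTuple):
    builder: str       # e.g. "release-mozilla-beta-updates"
    build_number: int  # the buildbot master's build number, not the release build number
    event: str         # e.g. "finished"


def parse_build_key(routing_key: str) -> Optional[BuildKey]:
    """Split 'build.<builder>.<number>.<event>'; return None for other shapes."""
    parts = routing_key.split(".")
    if len(parts) != 4 or parts[0] != "build" or not parts[2].isdigit():
        return None
    return BuildKey(parts[1], int(parts[2]), parts[3])
```

For `build.release-mozilla-beta-updates.1.finished` this yields the builder `release-mozilla-beta-updates`, build number 1, and event `finished`.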
Attached file pulse_printer.py
(In reply to Ben Hearsum [:bhearsum] from comment #10)
> If you want to start monitoring to see what data is already available, I
> think you can use routing keys like:
> build.release-mozilla-beta-updates on org.mozilla.exchange.build.
I used this script to listen to Pulse and check what data we get in build.release-mozilla-beta-updates messages.
On Friday I ran the script without filtering, to make sure it works; then I added this regex: 'build.release-mozilla-(.*?)-updates(.*?)' to catch only the update messages. I let it run over the weekend but we got no notification.
Ben, did I miss something here?
Flags: needinfo?(bhearsum)
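For reference, a slightly stricter Python variant of the filter above escapes the dots (an unescaped `.` matches any character) and captures the branch name, so each branch, such as esr24, can be told apart. This is an illustrative sketch, not the attached pulse_printer.py, and it assumes the `build.<builder>.<number>.<event>` key shape:

```python
import re
from typing import Optional

# Hypothetical stricter filter: escaped dots, named groups for branch, number and event.
UPDATES_KEY = re.compile(
    r"build\.release-mozilla-(?P<branch>[\w-]+?)-updates\.(?P<number>\d+)\.(?P<event>\w+)"
)


def updates_branch(routing_key: str) -> Optional[str]:
    """Return the branch part (e.g. "beta", "esr24") of an updates key, else None."""
    m = UPDATES_KEY.fullmatch(routing_key)
    return m.group("branch") if m else None
```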
(In reply to Cosmin Malutan from comment #11)
> Created attachment 8437508 [details]
> pulse_printer.py
> 
> (In reply to Ben Hearsum [:bhearsum] from comment #10)
> > If you want to start monitoring to see what data is already available, I
> > think you can use routing keys like:
> > build.release-mozilla-beta-updates on org.mozilla.exchange.build.
> I used this script to listen to pulse and check what data we have on
> messages build.release-mozilla-beta-updates.
> Friday I ran the script without filtering, to be sure it works, then I added
> this reg-ex: 'build.release-mozilla-(.*?)-updates(.*?)' to catch only the
> update messages. I let it run over the weekend but we had no notification.
> Ben, did I miss something here?

We didn't do any builds over the weekend. 30.0 and esr build1 were built on Tuesday. esr build2 was built last night.
Flags: needinfo?(bhearsum)
Attached file log.txt
I checked the script again this evening and I found this pulse message
key = build.release-mozilla-esr24-updates.0.finished
I stopped the processes and removed the script from staging.
(In reply to Cosmin Malutan from comment #13)
> I checked the script again this evening and I found this pulse message
> key = build.release-mozilla-esr24-updates.0.finished

Interesting. So it seems to be for the esr24 releases.

> I stopped the processes and removed the script from staging.

Why did you remove it? We still haven't gotten our expected update notifications for beta and releases! Please get it working again!
It's up again.
Duplicate of this bug: 622727
Cosmin, I'm missing an analysis of this Pulse notification. Can you please check which of the properties we need are present and which are missing? We need to know that before we can continue here. Thanks.
Attached file mozilla_beta.txt
This is the message for beta.
Of the properties we need, we have the platform, which looks like:
>["platform", "macosx64", "Builder"]
We don't have:
  * channel
  * previous_buildid
  * buildid
Those we do get in nightly Pulse messages.

We also have:
  * version '["version", "31.0b1", "Builder"]'
  * build_number '["build_number", 1, "Builder"]'
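The attached messages carry properties as buildbot-style `[name, value, source]` triples, so checking which required properties are missing is easy to automate. A minimal Python sketch, assuming that triple shape; the required names "channel" and "buildid" are placeholders, since the eventual property names were not settled at this point:

```python
from typing import Any, Dict, Iterable, List, Sequence


def properties_to_dict(properties: Iterable[Sequence[Any]]) -> Dict[str, Any]:
    """Turn [name, value, source] triples into a {name: value} dict."""
    return {name: value for name, value, _source in properties}


def missing_properties(props: Dict[str, Any], required: Iterable[str]) -> List[str]:
    """List required property names that are absent, in the order given."""
    return [name for name in required if name not in props]


# Placeholder names; "channel" and "buildid" were not yet sent at this point.
REQUIRED = ("platform", "version", "channel", "buildid")
```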
Cosmin, have we gotten any other notification since June 13th? I was hoping to get this feedback from you.
Flags: needinfo?(cosmin.malutan)
The listener was down; I restarted it. We will have to wait for the next beta, which is on Friday, to see how many logs we get.
We received one more message for beta on Saturday (21/06); afterwards we got a lot of these exceptions:
>Traceback (most recent call last):
>  File "pulse_printer.py", line 26, in main
>    pulse.listen()
>  File "/home/mozauto/Desktop/env_for_pulse_printer/local/lib/python2.7/site-packages/mozillapulse/consumers.py", line 148, in listen
>    self.connection.drain_events()
>  File "/home/mozauto/Desktop/env_for_pulse_printer/local/lib/python2.7/site-packages/kombu/connection.py", line 274, in drain_events
>    return self.transport.drain_events(self.connection, **kwargs)
>  File "/home/mozauto/Desktop/env_for_pulse_printer/local/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 91, in drain_events
>    return connection.drain_events(**kwargs)
>  File "/home/mozauto/Desktop/env_for_pulse_printer/local/lib/python2.7/site-packages/amqp/connection.py", line 299, in drain_events
>    chanmap, None, timeout=timeout,
>  File "/home/mozauto/Desktop/env_for_pulse_printer/local/lib/python2.7/site-packages/amqp/connection.py", line 362, in _wait_multiple
>    channel, method_sig, args, content = read_timeout(timeout)
>  File "/home/mozauto/Desktop/env_for_pulse_printer/local/lib/python2.7/site-packages/amqp/connection.py", line 326, in read_timeout
>    return self.method_reader.read_method()
>  File "/home/mozauto/Desktop/env_for_pulse_printer/local/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
>    raise m
>IOError: Socket closed
I restarted the script, but I don't have time to dive into this this week because I want to focus on TPS-CI.
Flags: needinfo?(cosmin.malutan)
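The traceback shows the listener dying when the broker closes the socket. One way to keep it running unattended is a reconnect loop with exponential backoff. A minimal Python 3 sketch, assuming a hypothetical `listen` callable that builds a fresh consumer and blocks in its `listen()` call (the consumer setup is not shown). On Python 3, `IOError` is an alias of `OSError`, so catching `OSError` covers the "Socket closed" error; the script in the traceback ran on Python 2.7, where that aliasing does not hold:

```python
import logging
import time
from typing import Callable


def listen_forever(listen: Callable[[], None], max_delay: float = 300.0,
                   sleep: Callable[[float], None] = time.sleep,
                   max_attempts: int = 0) -> int:
    """Re-run `listen` whenever the connection drops with an OSError.

    Backs off exponentially (1s, 2s, 4s, ... capped at max_delay).
    max_attempts=0 means retry forever; otherwise the error is re-raised
    after that many consecutive failures. Returns the number of reconnects
    once `listen` returns normally.
    """
    delay = 1.0
    reconnects = 0
    while True:
        try:
            listen()
            return reconnects
        except OSError as exc:
            reconnects += 1
            if max_attempts and reconnects >= max_attempts:
                raise
            logging.warning("Pulse connection lost (%s); retrying in %.0fs", exc, delay)
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

Creating a new consumer inside `listen` matters here, since the old connection object is unusable after the socket has closed.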
(In reply to Cosmin Malutan from comment #21)
> We received one more message here for beta on Saturday(21/06) afterwards we

And where is the message content? I don't see anything attached to the bug yet.
(In reply to Henrik Skupin (:whimboo) from comment #22)
> And where is the message content? I don't see anything attached to the bug
> yet.
I think it is similar to mozilla_beta.txt. Here is the message from the 21st; we had another one today, which I will upload in a moment.
This is the message from today.
Just to give an update from our side: in mid-June I sent out an email about goal planning for this quarter. We were thinking of working on this bug, but given that we haven't received any feedback, we are going to drop it from our goals proposal list and may re-evaluate it for Q4. So we will have to live with manually triggered on-demand update tests for releases for at least one more quarter.

If we find a bit of time, we may be able to make some smaller updates here, but that's not clear right now.
So the helpful properties of this Pulse notification would be:

{
  "properties": [
    [
      "branch",
      "release-mozilla-beta",
      "Builder"
    ],
    [
      "build_number",
      1,
      "Builder"
    ],
    [
      "platform",
      "macosx64",
      "Builder"
    ],
    [
      "product",
      "Firefox",
      "Builder"
    ],
    [
      "version",
      "31.0b4",
      "Builder"
    ]
  ]
}

What we are missing:
* exact description of the update channel
* buildid of the target build

Also, it looks like we only send it out once for each beta release, and here specifically for the beta update channel. I'm not sure how the platform factors into this. Should we get a notification for each platform?
The updates for all platforms are pushed at the same time.
(In reply to Chris AtLee [:catlee] from comment #27)
> The updates for all platforms are pushed at the same time.

So should this be a single notification or individual notifications? If it should be the latter, I wonder why we only receive a single one.
Blocks: 813629
Ben recently told me that every branch has to be on Balrog first, so it looks like bug 986990 is our dependency here.
Depends on: 986990
Now that Balrog is out, are we in a position to start generating pulse messages or (even better) sendchange notifications for updates?
Flags: needinfo?(catlee)
Flags: needinfo?(bhearsum)
I don't think we should be using sendchanges for these. Pulse messages we should be able to generate.

Where will the tests run?
Flags: needinfo?(catlee)
The same place where we run our update tests now. It's:
http://mm-ci-production.qa.scl3.mozilla.com:8080/

As we discussed in Portland, our system consumes Pulse messages to trigger all tests automatically. So once we have them it will be easy to get them into our system.

Chris, please let me know if there is anything else blocking you. Otherwise I have a pile of messages from a test run about a month ago, which I would have to analyze. As you said, it's not clear whether we already send messages for some cases, and if so, what they contain.

What we definitely need is:

* the update channel to be tested
* version of the target build to update to (if we can also get the target build by platform it would be great - if not we would have to grab it from the FTP server)

One thing which varies is the earliest version of Firefox to be tested. I'm not sure whether this should come with the Pulse message or be defined on our side.
In Balrog we have release blobs and rules. Some rules are static, so even if you have a new release blob, it doesn't mean that you will get an update to that blob.

A good example of the above was the 36.0b8 release. We had a rule to point Windows users to b6 and all other platforms to b7. When the b8 release blobs started arriving, the rules were still pointing to b7 (Linux, Mac) and b6 (Windows). Then beta-{localtest,cdntest} were changed to point to the new builds (b8), but since we had another rule with a higher priority, Windows was still pointing to b6.

This is a good illustration that pulse messages may not be sufficient to test things properly.
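To make the priority problem concrete, here is a deliberately simplified Python model of rule resolution: among the rules that match a request, the one with the highest priority wins. The rule fields, priorities, wildcard handling, and blob names below are made up to mirror the 36.0b8 situation described above; real Balrog rules match on many more fields:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    priority: int
    channel: str                     # exact name, or a prefix ending in "*"
    mapping: str                     # release blob served when this rule wins
    os_prefix: Optional[str] = None  # None matches every OS


def channel_matches(pattern: str, channel: str) -> bool:
    if pattern.endswith("*"):
        return channel.startswith(pattern[:-1])
    return channel == pattern


def resolve(rules: List[Rule], channel: str, build_target: str) -> Optional[str]:
    """Return the mapping of the highest-priority rule matching the request."""
    matching = [r for r in rules
                if channel_matches(r.channel, channel)
                and (r.os_prefix is None or build_target.startswith(r.os_prefix))]
    if not matching:
        return None
    return max(matching, key=lambda r: r.priority).mapping


# Illustrative rules only: a high-priority Windows rule pins all beta* channels to b6.
rules = [
    Rule(200, "beta*", "Firefox-36.0b6-build1", os_prefix="WINNT"),
    Rule(100, "beta", "Firefox-36.0b7-build1"),
    Rule(100, "beta-cdntest", "Firefox-36.0b8-build1"),
]
```

In this model, pointing the beta-cdntest rule at b8 changes nothing for Windows, because the higher-priority Windows rule still wins. A message saying "beta-cdntest now serves b8" would therefore overstate what Windows users actually get.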
(In reply to Henrik Skupin (:whimboo) from comment #32)
> The same place as where we do our update tests now. It's:
> http://mm-ci-production.qa.scl3.mozilla.com:8080/
> 
> As we talked in Portland our system is consuming pulse messages for
> triggering all tests automatically. So once we have them it will be easy to
> get them into our system.
> 
> Chris, please let me know if there is something else which is blocking you.
> Otherwise I have a pile of messages from a testrun about a month ago, which
> I would have to analyze. As you said, its not that clear if we send messages
> already for some cases, and if that is true - what is contained.
> 
> What we definitely need is:
> 
> * the update channel to be tested
> * version of the target build to update to (if we can also get the target
> build by platform it would be great - if not we would have to grab it from
> the FTP server)
> 
> One thing which varies is the earliest version of Firefox to be tested. Not
> sure if this is something which should come with the Pulse message, or will
> be defined on our side.

I strongly feel you should define this on your side. This is something that happens rarely, and once a watershed is defined, it doesn't move.

(In reply to Rail Aliiev [:rail] from comment #33)
> In balrog we have release blobs and rules. Some rules are static and even if
> you have a new release blob it doesn't mean that you will get an update to
> that blob.
> 
> As a good example of the above was 36.0b8 release. We had a rule to point
> Windows users to b6 and all other platforms to point to b7. When b8 release
> blobs started arriving, the rule was still pointing to b7 (linux mac) and b6
> (win). Then there was a change in the beta-{localtest,cdntest} to point to
> the new builds (b8), but since we have another rule with higher priority,
> windows was still pointing to b6.
> 
> This is a good illustration that pulse messages may not be sufficient to
> test things properly.

I've been thinking about this too. The most basic thing that we need here is the Pulse equivalent of "updates available on channel X". That shouldn't be too difficult to provide, with the proviso that exception cases like not shipping on Windows will NOT be supported. Maybe we should start with that, and then we might have a better idea of how to move forward if there are remaining issues?

Rail's right: the full picture of what updates look like is much more complicated. That's why our own tests rely on configs full of historical metadata (like https://github.com/mozilla/build-tools/blob/master/release/patcher-configs/mozBeta-branch-patcher2.cfg).
Flags: needinfo?(bhearsum)
(In reply to Ben Hearsum [:bhearsum] from comment #34)
> I've been thinking about this too. The most basic thing that we need here is
> the Pulse equivalent of the "updates available on channel X". That shouldn't
> be too difficult to provide with the proviso that exception cases like not
> shipping on Windows will NOT be supported. Maybe we should start with that
> and then we might have a better idea on how to move forward if there's
> remaining issues?

Let me think about it...

If we get such a notification for the update channel only, we will know if it is a beta or a release build. We can automatically check FTP for the latest beta or release version. With that we have the version information and the build ids (may be different per platform).

If there are e.g. only new builds for OS X and Linux, we will see that we don't have to test Windows because the latest build cannot be found on FTP for this platform. So we could simply ignore this platform.
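The platform-skipping logic can be sketched in a few lines of Python, assuming the FTP scan has already produced a hypothetical mapping from platform name to the set of versions found there (the scan itself is not shown, and the platform names are illustrative):

```python
from typing import Dict, List, Set


def platforms_to_test(target_version: str, ftp_versions: Dict[str, Set[str]]) -> List[str]:
    """Return platforms that have a build of target_version on FTP.

    Platforms without such a build (e.g. a release not shipped on Windows)
    are skipped instead of producing failed test runs.
    """
    return sorted(platform for platform, versions in ftp_versions.items()
                  if target_version in versions)


ftp = {"win32": {"36.0b6"}, "mac": {"36.0b6", "36.0b7"}, "linux": {"36.0b6", "36.0b7"}}
```

Here a 36.0b7 notification would trigger tests only for linux and mac.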

Logic which versions to use as source will be done by ourselves. We added some thoughts about that already at https://github.com/mozilla/mozmill-ci/issues/535.

Watershed releases are a problem with Mozmill at the moment. With the update tests rewritten for Marionette, I want to support multiple update steps, so even with a watershed release in between, we will be able to test updates to the latest build. In such a case we will only run the fallback update test from the watershed to the latest build, because the updates from source to watershed have already been tested earlier.
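The multi-step idea can be sketched as computing the chain of versions an update test has to pass through. A minimal Python sketch, assuming a hypothetical list of watershed versions maintained on the QA side and plain "X.Y[.Z][bN][esr]" version strings; it is not the Marionette harness's actual logic:

```python
import re
from typing import List, Tuple

_VERSION = re.compile(r"(\d+)\.(\d+)(?:\.(\d+))?(?:b(\d+))?(?:esr)?")


def version_key(version: str) -> Tuple[int, int, int, int, int]:
    """Sortable key for versions like "29.0b8", "30.0" or "24.5.0esr"."""
    m = _VERSION.fullmatch(version)
    if not m:
        raise ValueError(f"unrecognised version: {version}")
    major, minor, patch, beta = m.groups()
    # A final release sorts after all of its betas: 29.0b8 < 29.0
    return (int(major), int(minor), int(patch or 0),
            0 if beta else 1, int(beta or 0))


def update_path(source: str, target: str, watersheds: List[str]) -> List[str]:
    """Versions an update passes through: source, any watersheds in between, target."""
    lo, hi = version_key(source), version_key(target)
    stops = sorted((w for w in watersheds if lo < version_key(w) < hi), key=version_key)
    return [source, *stops, target]
```

With 29.0b8 as a watershed, a 28.0b5 source yields the path 28.0b5, 29.0b8, 30.0b2. Only the last leg (watershed to latest) would need a new test run, since the earlier leg was covered when the watershed shipped.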

How does that sound?
(In reply to Henrik Skupin (:whimboo) from comment #35)
> (In reply to Ben Hearsum [:bhearsum] from comment #34)
> > I've been thinking about this too. The most basic thing that we need here is
> > the Pulse equivalent of the "updates available on channel X". That shouldn't
> > be too difficult to provide with the proviso that exception cases like not
> > shipping on Windows will NOT be supported. Maybe we should start with that
> > and then we might have a better idea on how to move forward if there's
> > remaining issues?
> 
> Let me think about it...
> 
> If we get such a notification for the update channel only, we will know if
> it is a beta or a release build. We can automatically check FTP for the
> latest beta or release version. With that we have the version information
> and the build ids (may be different per platform).

Henrik and I chatted, and it looks like the only thing blocking us from doing _something_ here is including the update channel in the properties of the existing Pulse messages that are sent out when updates become available on the localtest, cdntest, and live channels. I should be able to get these added without much trouble. I plan to use "update_channel" as the property name.

The current routing keys are:
build.release-$branch-firefox_updates for localtest
build.release-$branch-firefox_ready_for_releasetest_testing for cdntest
build.release-$branch-update_shipping for the live channel

$branch will be things like "mozilla-beta" or "mozilla-release"

After bug 1105485 lands the localtest routing keys will change to things like:
build.release-$branch-firefox_$channel_updates

Where $channel is things like "beta" or "release". The localtest channel will be included in the properties of the Pulse message.

I may also change the cdntest routing key to get rid of the outdated "releasetest" part. I'll let you know if that ends up happening.
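A consumer could map these builder names to the three stages with a few patterns. A minimal Python sketch, assuming the builder name is the second dot-separated part of the routing key and accepting both the current `firefox_updates` and the post-bug-1105485 `firefox_$channel_updates` forms; `classify` is a hypothetical helper and would need updating if the cdntest key is renamed:

```python
import re
from typing import Optional, Tuple

# Checked in order; each pattern captures $branch from the builder name.
_STAGES = [
    (re.compile(r"release-(?P<branch>.+)-firefox_ready_for_releasetest_testing"), "cdntest"),
    (re.compile(r"release-(?P<branch>.+)-update_shipping"), "live"),
    # firefox_updates today, firefox_<channel>_updates after bug 1105485
    (re.compile(r"release-(?P<branch>.+)-firefox_(?:\w+_)?updates"), "localtest"),
]


def classify(routing_key: str) -> Optional[Tuple[str, str]]:
    """Return (branch, stage) for an update-related routing key, else None."""
    parts = routing_key.split(".")
    if len(parts) < 2 or parts[0] != "build":
        return None
    builder = parts[1]
    for pattern, stage in _STAGES:
        m = pattern.fullmatch(builder)
        if m:
            return m.group("branch"), stage
    return None
```

The exact channel name (for example "beta-localtest") should still be read from the "update_channel" property rather than inferred from the key.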
This should add an "update_channel" property to each builder that indicates a change in update availability. These properties should end up in the Pulse messages unless I'm missing something.
Attachment #8562841 - Flags: review?(rail)
Attachment #8562841 - Flags: review?(rail) → review+
Attachment #8562841 - Flags: checked-in+
(In reply to Ben Hearsum [:bhearsum] from comment #37)
> Created attachment 8562841 [details] [diff] [review]
> add channel to updates, releasetest, and update shipping builders
> 
> This should add an "update_channel" property to each builder that indicates
> a change in update availability. These properties should end up in the Pulse
> messages unless I'm missing something.

This patch went into production today. Henrik, the next messages you get should include "update_channel" in the properties for the mentioned builders.
Depends on: 1105485
I checked the notifications via Chris' logger. There is only one for beta on Feb 18th and none for the latest betas, but that should not be a problem for verification. Also, as discussed early last week, we are ultimately not going to use these Pulse notifications; instead we want our Marionette update tests to be executed directly on the RelEng infrastructure.

Firefox Beta 36.0b10 with channel 'beta-localtest':
https://s3.amazonaws.com/mozilla-releng-pulsedata/build%2Frelease-mozilla-beta-firefox_updates%2Flog_uploaded%2F2015%2F02%2F18%2F03%2F04%2F03%3A04%3A47-3

Firefox Beta 37.0b3 with channel 'beta-cdntest':
https://s3.amazonaws.com/mozilla-releng-pulsedata/build%2Frelease-mozilla-beta-firefox_ready_for_releasetest_testing%2Flog_uploaded%2F2015%2F03%2F06%2F15%2F13%2F15%3A13%3A34-3

Firefox Beta 37.0b3 with channel 'beta':
https://s3.amazonaws.com/mozilla-releng-pulsedata/build%2Frelease-mozilla-beta-update_shipping%2Ffinished%2F2015%2F03%2F06%2F17%2F04%2F17%3A04%3A11-7

Ben, I think we can close this bug now, given that everything is present. Thanks a lot!
Assignee: nobody → bhearsum
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED