Closed Bug 781129 Opened 12 years ago Closed 11 years ago

Notifications for outdated builds are getting sent via Pulse

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Unassigned)

Details

(Whiteboard: [mozmill-test-failure][qa-automation-blocked])

Attachments

(1 file)

Attached file pulse notification
I have already seen this a couple of times but wasn't able to nail it down because I was always too late. Today I again saw a report from our Mozmill CI system which shows a failure in the update tests:

http://mozmill-ci.blargon7.com/#/update/report/3491a2617d5af3ec9bb5c88aee015de5

According to that failure, we were trying to update a build from 20120801030520 to 20120802030533.

Thankfully I log the messages we receive via Pulse to the console. The following entry is visible:

INFO:automation:2012-08-02T05:23:06+01:00 - Product: firefox, Branch: mozilla-central, Platform: macosx64, Locale: fr
INFO:automation:Trigger tests for firefox 17.0a1 mac fr 20120802030533 20120801030520

I will attach the whole notification in a bit.

There were also some more notifications which we do not act on. Here are some examples:

INFO:automation:2012-08-02T05:23:29+01:00 - Product: firefox, Branch: mozilla-central, Platform: macosx64, Locale: it
INFO:automation:2012-08-02T05:23:53+01:00 - Product: firefox, Branch: mozilla-central, Platform: macosx64, Locale: kk
INFO:automation:2012-08-02T05:24:00+01:00 - Product: thunderbird, Branch: comm-central, Platform: win32, Locale: sr
INFO:automation:2012-08-02T05:24:16+01:00 - Product: firefox, Branch: mozilla-central, Platform: macosx64, Locale: hr
INFO:automation:2012-08-02T05:24:53+01:00 - Product: thunderbird, Branch: comm-aurora, Platform: linux64, Locale: pl
INFO:automation:2012-08-02T05:25:42+01:00 - Product: firefox, Branch: mozilla-central, Platform: macosx64, Locale: kn

It looks like messages get stuck somewhere and are sent out at a random later time.
I'm sorry, I don't really understand what you mean. In what way are the builds outdated?
possibly related to bug 781128?
Well, if build-finished notifications are being sent through Pulse on Aug 8th for builds from 2012-08-02T05:23:06+01:00, I would call those builds and notifications outdated.

I'm not sure if there is anything related to the bug you pointed out, given I don't know the details. CC'ing Ed for possible better input.
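To make "outdated" concrete, here is a minimal sketch of how a consumer could flag such a notification. It assumes the build ID is a 14-digit YYYYMMDDHHMMSS timestamp, ignores time zones, and uses a one-day threshold; the function names, the threshold, and the time of day on Aug 8th are made up for illustration and are not taken from mozmill-ci.

```python
from datetime import datetime, timedelta

BUILDID_FORMAT = "%Y%m%d%H%M%S"  # assumed layout, e.g. 20120802030533


def buildid_to_datetime(buildid):
    """Parse a 14-digit build ID into a naive datetime (time zone ignored)."""
    return datetime.strptime(buildid, BUILDID_FORMAT)


def is_outdated(buildid, received_at, max_age=timedelta(days=1)):
    """Return True if the build is older than max_age when the message arrives."""
    return received_at - buildid_to_datetime(buildid) > max_age


# Build from Aug 2nd, notification seen on Aug 8th (time of day is hypothetical).
print(is_outdated("20120802030533", datetime(2012, 8, 8, 12, 0, 0)))
```

A check like this would only hide the symptom on the consumer side. It does not explain why the message arrives late.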
I'm sorry I don't know anything about Pulse, this is a releng issue, rather than a sheriffing issue.
Blocks: 785649
The referenced bug was that new builds would use old buildids, which broke a whole bunch of stuff.

Unless this is still happening, I'm going to blame bug 781128 for this.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Thanks Chris. I will check that. I haven't seen such a situation in the last couple of weeks.
Not fixed. I have seen it again right now:

> INFO:automation:          jsshellUrl:   http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64/1348999570/jsshell-mac.zip
> INFO:automation:             project:   
> INFO:automation:            builddir:   m-cen-osx64-ntly
> INFO:automation:            filepath:   None
> INFO:automation:     packageFilename:   firefox-18.0a1.en-US.mac.dmg
> INFO:automation:             basedir:   /builds/slave/m-cen-osx64-ntly
> INFO:automation:completesnippetFilename:        build/obj-firefox/i386/dist/update/complete.update.snippet
> INFO:automation:          appVersion:   18.0a1
> INFO:automation:            comments:   
> INFO:automation:        purge_target:   12GB
> INFO:automation:            platform:   macosx64
> INFO:automation:              master:   http://buildbot-master30.srv.releng.scl3.mozilla.com:8001/
> INFO:automation:              branch:   mozilla-central
> INFO:automation:  partialMarFilename:   firefox-18.0a1.en-US.mac.partial.20120929191424-20120930030610.mar
> INFO:automation:      stage_platform:   macosx64
> INFO:automation:            revision:   a680fd777c3b92d81650dd51c8cb3e9e5faf6398
> INFO:automation:             product:   firefox
> INFO:automation:     completeMarSize:   45477330
> INFO:automation:          repository:   
> INFO:automation:         buildername:   OS X 10.7 mozilla-central nightly
> INFO:automation:             buildid:   20120930030610
> INFO:automation:      completeMarUrl:   http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/09/2012-09-30-03-06-10-mozilla-central/firefox-18.0a1.en-US.mac.complete.mar
> INFO:automation:         packageHash:   295e54cf07c17901542b9c26c56a5af34ecd7bd6c98b71aa974ce16e9c9a2938bf28cc40e55467d8761d36ab271d7c6e0b7cd1e5be53497185475263de63b907
> INFO:automation:     completeMarHash:   ef06fba7bbfd2b0ca6171d520a616263d24e2e73d48806cb7e2a66c47977f9d7af6cc62c5a615d9e2799c2fe450af1fac9c43cd59bf392a863b651085a1e2157
> INFO:automation:            hashType:   sha512
> INFO:automation:    previous_inipath:   previous/Contents/MacOS/application.ini
> INFO:automation:           scheduler:   mozilla-central nightly
> INFO:automation:          symbolsUrl:   http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64/1348999570/firefox-18.0a1.en-US.mac.crashreporter-symbols.zip
> INFO:automation:          packageUrl:   http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64/1348999570/firefox-18.0a1.en-US.mac.dmg
> INFO:automation:       partialMarUrl:   http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/09/2012-09-30-03-06-10-mozilla-central/firefox-18.0a1.en-US.mac.partial.20120929191424-20120930030610.mar
> INFO:automation:      purged_clobber:   False
> INFO:automation:       nightly_build:   True
> INFO:automation:         buildnumber:   33
> INFO:automation:            testsUrl:   http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64/1348999570/firefox-18.0a1.en-US.mac.tests.zip
> INFO:automation:    periodic_clobber:   False
> INFO:automation:      partialMarHash:   54044a09e7306d749c4dbaabf184d28463a555d5790cb945c33826f987951b5c5eae277b0f6f2e756f84c20a58221abd58a4663b3d7b7c6a6d15168dccad3925
> INFO:automation:      partialMarSize:   1930419
> INFO:automation:            builduid:   3a2d6e8e187b4abb822a7f6db9e3043c
> INFO:automation:       slavebuilddir:   m-cen-osx64-ntly
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
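For comparing the two builds involved, the partialMarFilename in the message above appears to carry both build IDs. A minimal sketch for pulling them out, assuming the name always ends in `.partial.<from>-<to>.mar` with 14-digit IDs (the helper name is hypothetical):

```python
import re

# Assumed naming pattern: ...partial.<from buildid>-<to buildid>.mar
PARTIAL_MAR_RE = re.compile(r"\.partial\.(\d{14})-(\d{14})\.mar$")


def partial_mar_buildids(filename):
    """Return (from_buildid, to_buildid) from a partial MAR filename, or None."""
    match = PARTIAL_MAR_RE.search(filename)
    return match.groups() if match else None


name = "firefox-18.0a1.en-US.mac.partial.20120929191424-20120930030610.mar"
print(partial_mar_buildids(name))
```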
AIUI rabbit makes no guarantees about message delivery order or delay.

Our systems have not received that pulse message within the past 7 days.

Did you restart your pulse consumer around the same time? Perhaps this was an un-ack'ed message that got re-delivered?
http://www.rabbitmq.com/semantics.html describes message order guarantees
(In reply to Chris AtLee [:catlee] from comment #8)
> AIUI rabbit makes no guarantees about message delivery order or delay.

That would be pretty bad. So why do we rely on Pulse then? If that's the case I see a big gap here. :/

> Did you restart your pulse consumer around the same time? Perhaps this was
> an un-ack'ed message that got re-delivered?

I can't say for sure, but I don't think so. We acknowledge messages right away, so if that were the case we would see it more often. Also, we do not use a persistent queue but get a new one on each reconnect.
The mozmill queues have collectively over 900 unack'd messages, that may be the root of this issue.

qa-auto@mozilla.com|mozmill_daily|mm-ci-master - 143 unack'd messages
qa-auto@mozilla.com|mozmill_daily|release3.qa.mtv1.mozilla.com - 226
qa-auto@mozilla.com|mozmill_l10n|release4-osx-106.qa.mtv1.mozilla.com - 179
qa-auto@mozilla.com|mozmill_release|mm-ci-master - 239
qa-auto@mozilla.com|mozmill_release|release3.qa.mtv1.mozilla.com - 204
I really can't see why that happens. Acknowledging is the first thing we do when receiving a new message:

https://github.com/whimboo/mozmill-ci/blob/master/pulse.py#L206
https://github.com/whimboo/mozmill-ci/blob/master/pulse.py#L171

As jgriffin mentioned on IRC, it's a very slow increase. So might this be something on the Pulse side?
I doubt it is on the pulse side, because the only queues I see with unack'd messages are the mozmill queues.  Could this be due to network problems between the machines hosting the mozmill automation and pulse?  I.e., the network is interrupted between the time pulse delivers the message and it gets acknowledged, or there are problems delivering the ack?
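The lost-ack scenario can be illustrated with a toy in-memory model. This is not the RabbitMQ or Pulse API; the class and method names are invented. It only shows how a message that was delivered but whose ack never arrived stays in the unacked set and is handed out again after the connection drops. Whether that is what happens with the mozmill queues is not established here.

```python
class ToyBroker:
    """In-memory stand-in for a broker queue; not the RabbitMQ or Pulse API."""

    def __init__(self):
        self.ready = []      # messages waiting for delivery
        self.unacked = {}    # delivery tag -> message awaiting an ack
        self._next_tag = 0

    def publish(self, message):
        self.ready.append(message)

    def deliver(self):
        """Hand the oldest ready message to the consumer and track it as unacked."""
        message = self.ready.pop(0)
        self._next_tag += 1
        self.unacked[self._next_tag] = message
        return self._next_tag, message

    def ack(self, tag, ack_lost=False):
        """Forget the message, unless the ack never reaches the broker."""
        if not ack_lost:
            del self.unacked[tag]

    def connection_dropped(self):
        """Requeue everything that was delivered but never acknowledged."""
        self.ready = list(self.unacked.values()) + self.ready
        self.unacked.clear()


broker = ToyBroker()
broker.publish("build 20120802030533")
tag, message = broker.deliver()
broker.ack(tag, ack_lost=True)   # consumer sent the ack, broker never saw it
print(len(broker.unacked))       # the message is still counted as unacked
broker.connection_dropped()
tag, message = broker.deliver()  # the same, by now old, message arrives again
print(message)
```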
whimboo, can you figure out why there are so many unack'ed messages in your queues?
Assignee: nobody → hskupin
As discussed in our Automation Developer Meeting, we want to look into using pulsebuildmonitor. I filed the issue directly against our CI and will hopefully have time next week to look at this.

https://github.com/mozilla/mozmill-ci/issues/176
Happened again today on Mac OS X 10.7.5 (x86_64) in:
 
/testDirectUpdate/test3.js
http://mozmill-ondemand.blargon7.com/#/update/report/ad726b5c70cf80fbf8135edfca1a9522

and
 /testFallbackUpdate/test4.js:
http://mozmill-ondemand.blargon7.com/#/update/report/ad726b5c70cf80fbf8135edfca1a4a12
If ondemand builds are failing that's most likely a misconfiguration by the QA person who triggered the builds.

For the other jobs, which are triggered by Pulse, you do not have to report more issues. We know about them and I'm working on getting us moved to pulsebuildmonitor. This will happen in a couple of days. If we still see this problem after the switch, it would be helpful to comment here. Thanks.
We have switched to pulsebuildmonitor now. So hopefully this bug should be fixed. We will reopen if it happens again.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
This is not fixed. Today we got a pulse message for the following build: Firefox 23.0a2 en-US on Linux Ubuntu 12.10 32bit (20130520004018)

This build is two days old and there was an en-US build yesterday:
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013/05/2013-05-21-00-40-18-mozilla-aurora/

I will retrieve and attach the pulse message in a bit.
Assignee: hskupin → nobody
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [mozmill-test-failure] → [mozmill-test-failure][qa-automation-blocked]
Drop that. From what I can see, the buildid contained in the pulse message is smaller than the previous_buildid. I will file a new bug.
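A consumer could guard against this case before triggering update tests. A minimal sketch, assuming the message properties are available as a dict with `buildid` and `previous_buildid` keys holding 14-digit strings (the dict layout and the function name are assumptions):

```python
def is_valid_update(properties):
    """Return True only if the target build is newer than the previous build."""
    buildid = properties.get("buildid")
    previous = properties.get("previous_buildid")
    if not buildid or not previous:
        return False
    # Build IDs of equal length sort chronologically when compared as strings.
    return len(buildid) == len(previous) == 14 and buildid > previous


print(is_valid_update({"buildid": "20120802030533",
                       "previous_buildid": "20120801030520"}))
```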
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General