Closed Bug 1204488 Opened 9 years ago Closed 9 years ago

[pulsetranslator] Missing Pulse notifications for Firefox beta/release builds on normalized queue

Categories

(Webtools :: Pulse, defect)

defect
Not set
blocker

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: whimboo, Assigned: whimboo)

Details

(Whiteboard: [qa-automation-blocked])

Attachments

(2 files, 3 obsolete files)

For the last Firefox 41.0b9 beta candidate builds we haven't gotten any pulse message that the builds have been made available. As result no single test has been run.

All was working fine for Firefox 41.0b8 on Sep 7th for both staging and production:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=4ed5ddd4ab04&exclusion_profile=false

https://treeherder.allizom.org/#/jobs?repo=mozilla-beta&revision=4ed5ddd4ab04

First I thought it might have been because of the replacement of the CI master machines on bug 1200800. But I'm not sure anymore because staging was replaced on Sep 4th and processed the builds on Sep 7th. So I feel it would be more of an issue regarding any of the Pulse tools we make use of.

Or maybe Releng has changed the way what gets exposed in the pulse messages. Ben or Rail, has there anything been changed here lately?
(In reply to Henrik Skupin (:whimboo) from comment #0)
> Or maybe Releng has changed the way what gets exposed in the pulse messages.
> Ben or Rail, has there anything been changed here lately?

I should also put ni? on them. ;)
Flags: needinfo?(rail)
Flags: needinfo?(bhearsum)
Maybe this is due to the landing of bug 1150162?
I don't think we changed anything that would stop sending pulse messages for any buildbot jobs. Maybe it was some kind of hick up?
Flags: needinfo?(rail)
(In reply to Henrik Skupin (:whimboo) from comment #2)
> Maybe this is due to the landing of bug 1150162?

This definitely didn't cause it. The release pulse messages aren't processed any differently than the CI ones, so I'd very surprised if they aren't coming through. I retriggered a release update verify job, and saw a "build.release-mozilla-release-macosx64_release_update_verify_6.5.started" message when watching on Pulse Inspector. I also saw "build.release-mozilla-release-macosx64_release_update_verify_6.5.finished" shortly after killing it.
Flags: needinfo?(bhearsum)
Ben, this bug is not about updates but about builds. Means routing keys like 'build.release-mozilla-beta-linux_build.8.log_uploaded'.
Asking for ni again to ensure that we clarify comment 4 and comment 5. Thanks.
Flags: needinfo?(bhearsum)
(In reply to Henrik Skupin (:whimboo) from comment #5)
> Ben, this bug is not about updates but about builds. Means routing keys like
> 'build.release-mozilla-beta-linux_build.8.log_uploaded'.

Doesn't matter, these messages aren't handled specially. I didn't trigger a build to test this, because it's more risky than triggering a test. Maybe you can watch in Pulse Inspector after the Beta starts today?
Flags: needinfo?(bhearsum)
Ok, so I have running a local debug script which will log all build messages containing 'release'. Lets see tomorrow what might be the cause of it. Thank you for the clarification.
So I was running the queue and got some build notifications from the builds exchange for release-mozilla-release and release-mozilla-esr38. I pushed them locally through pulsetranslator to check if the tool stumbles over a problem but all looks fine and the normalized build messages are getting created. Then I took some of them to push into mozmill-ci's pulse listener and also here everything is working fine. The Jenkins build is getting triggered.

Given those results I can only assume that something is wrong with publishing release builds to the normalized queue. Interestingly all works fine for nightly builds!?
Component: Infrastructure → Pulse
Flags: needinfo?(mcote)
Flags: needinfo?(jgriffin)
Product: Mozilla QA → Webtools
QA Contact: hskupin
Summary: Missing Pulse notifications for Firefox 41.0b9 builds (no single tests have been run) → [pulsetranslator] Missing Pulse notifications for Firefox beta/release builds on normalized queue
Version: Firefox 42 → other
This is definitely a problem with publishing release builds by pulsetranslator. Beside my local debug client for the exchange/build queue, I also setup one for exchange/build/normalized and received message differ in the fact that the normalized queue does not list any single release related message.
Whiteboard: [qa-automation-blocked]
Jonathan, I think it is time that we get someone else with Heroku access who is online during European time. I asked IT if they can help, but even the sysadmin doesn't have access to Heroku. It's like an isolated silo. :(
Attached file log.txt
fox2mike actually sent me some logs and it looks like we get timeout errors:

2015-09-15T06:30:56.762957+00:00 app[worker.1]: LogTimeoutError: key: build.release-mozilla-release-macosx64_repack_17.1.log_uploaded, url: http://stage.mozilla.org/pub/mozilla.org/firefox/nightly/41.0-candidates/build1/logs/release-mozilla-release-macosx64_repack_17-bm84-build1-build1.txt.gz

The full log can be seen as attachment.

Fox2mike also restarted the service but I don't see that this helped. We are still failing.
No longer blocks: 1200800
Just out of curiosity, why don't you watch for #.finished instead of #.log_uploaded? Is this to make sure that treeherder has logs to process?
pulsetranslator listens for #.log_uploaded because logs are relevant to some of its consumers, e.g., ActiveData
Flags: needinfo?(jgriffin)
Oops, didn't mean to clear my needinfo.
Flags: needinfo?(jgriffin)
(In reply to Henrik Skupin (:whimboo) from comment #12)
> Created attachment 8661109 [details]
> log.txt
> 
> fox2mike actually sent me some logs and it looks like we get timeout errors:
> 
> 2015-09-15T06:30:56.762957+00:00 app[worker.1]: LogTimeoutError: key:
> build.release-mozilla-release-macosx64_repack_17.1.log_uploaded, url:
> http://stage.mozilla.org/pub/mozilla.org/firefox/nightly/41.0-candidates/
> build1/logs/release-mozilla-release-macosx64_repack_17-bm84-build1-build1.
> txt.gz
> 
> The full log can be seen as attachment.
> 
> Fox2mike also restarted the service but I don't see that this helped. We are
> still failing.

Yes, I see these errors in papertrail.  This seems to be a side effect of the netapp migration.  The log url's being published to pulse by build are on http://stage.mozilla.org..., which doesn't exist any longer.  When browsing, it gets redirected to archive.mozilla.org, but apparently that doesn't happen in pulsetranslator.  I'll see if I can work around it there, but it would be better to have buildbot publish the actual log url.
Clearing my NI since jgriffin is on it.
Flags: needinfo?(mcote)
So after talking to Jonathan on IRC we figured that the issue should be located in the get_url_info() method of loghandler. There we only check for a 200 response code and wait until we reach the timeout. Given that it is a redirect we will never get this code returned:

https://github.com/mozilla/pulsetranslator/blob/master/pulsetranslator/loghandler.py#L26

I suggested to switch to the requests module so we do not have to differentiate between HTTPSConnection and HTTPConnection.

Given the importance of this issue for us I will have a look at it.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
The reason why we are failing is that we are doing a HEAD request, which returns immediately but not following the redirect. So what we would need is an additional check for 301 and retry the new location.

A patch should be ready soon.
Attached patch Patch v1 (obsolete) — Splinter Review
Here the patch which will fix this problem and also enhances what we actually log to the loghandler error log. It should give us a better insight in the future what's actually wrong with the HEAD request.
Attachment #8661423 - Flags: review?(jgriffin)
Attached patch Patch v1.1 (obsolete) — Splinter Review
Updated patch so we also log an error if the logurl is 404 or any other kind of failure.
Attachment #8661423 - Attachment is obsolete: true
Attachment #8661423 - Flags: review?(jgriffin)
Attachment #8661425 - Flags: review?(jgriffin)
Attached patch Patch v1.2 (obsolete) — Splinter Review
Well, it's time for bed. This time without the debugging code. Sorry.
Attachment #8661425 - Attachment is obsolete: true
Attachment #8661425 - Flags: review?(jgriffin)
Attachment #8661427 - Flags: review?(jgriffin)
Comment on attachment 8661427 [details] [diff] [review]
Patch v1.2

Review of attachment 8661427 [details] [diff] [review]:
-----------------------------------------------------------------

Nice, thanks

::: pulsetranslator/loghandler.py
@@ +33,5 @@
>  
> +            # In case of a redirect make another head request for the new location.
> +            if resp.status_code == 301:
> +                resp.raise_for_status()
> +                resp = requests.head(resp.headers['location'])

resp.raise_for_status() should be after the new requests.head()
Attachment #8661427 - Flags: review?(jgriffin) → review+
Flags: needinfo?(jgriffin)
Attached patch Patch v2.0Splinter Review
Fixed line ordering. I will push it in a moment.
Attachment #8661427 - Attachment is obsolete: true
Attachment #8661468 - Flags: review+
Landed patch on master as: https://github.com/mozilla/pulsetranslator/commit/9b759687fdde84830f95f9fa2904d640af510bee

Jonathan, can you please publish?
Flags: needinfo?(jgriffin)
remote: Compressing source files... done.
remote: Building source:
remote: 
remote: -----> Python app detected
remote: -----> Installing dependencies with pip
remote:        Collecting requests==2.7.0 (from -r requirements.txt (line 3))
remote:          Downloading requests-2.7.0-py2.py3-none-any.whl (470kB)
remote:        Installing collected packages: requests
remote:        Successfully installed requests-2.7.0
remote: You are using pip version 7.1.0, however version 7.1.2 is available.
remote: You should consider upgrading via the 'pip install --upgrade pip' command.
remote: 
remote: -----> Discovering process types
remote:        Procfile declares types -> worker
remote: 
remote: -----> Compressing... done, 45.6MB
remote: -----> Launching... done, v21
remote:        https://pulsetranslator.herokuapp.com/ deployed to Heroku
remote: 
remote: Verifying deploy... done.
To https://git.heroku.com/pulsetranslator.git
   15415df..9b75968  master -> master
Flags: needinfo?(jgriffin)
This caused:

 Sep 15 14:49:17 pulsetranslator app/worker.2: Traceback (most recent call last): 
Sep 15 14:49:17 pulsetranslator app/worker.2:   File "pulsetranslator/runservice.py", line 12, in <module> 
Sep 15 14:49:17 pulsetranslator app/worker.2:     from pulsetranslator import PulseBuildbotTranslator 
Sep 15 14:49:17 pulsetranslator app/worker.2:   File "/app/pulsetranslator/pulsetranslator.py", line 21, in <module> 
Sep 15 14:49:17 pulsetranslator app/worker.2:     from loghandler import LogHandler 
Sep 15 14:49:17 pulsetranslator app/worker.2:   File "/app/pulsetranslator/loghandler.py", line 13, in <module> 
Sep 15 14:49:17 pulsetranslator app/worker.2:     from .translatorexceptions import LogTimeoutError 
Sep 15 14:49:17 pulsetranslator app/worker.2: ValueError: Attempted relative import in non-package 

fixing
fixed with

remote: Compressing source files... done.
remote: Building source:
remote: 
remote: -----> Python app detected
remote: -----> Installing dependencies with pip
remote: You are using pip version 7.1.0, however version 7.1.2 is available.
remote: You should consider upgrading via the 'pip install --upgrade pip' command.
remote: 
remote: -----> Discovering process types
remote:        Procfile declares types -> worker
remote: 
remote: -----> Compressing... done, 45.6MB
remote: -----> Launching... done, v22
remote:        https://pulsetranslator.herokuapp.com/ deployed to Heroku
remote: 
remote: Verifying deploy.... done.
To https://git.heroku.com/pulsetranslator.git
   9b75968..c0aa93c  master -> master
> ValueError: Attempted relative import in non-package

I really do not understand this failure. The folder contains a __init__.py file so it should be a module directory. Also all is working fine for me locally with the relative import. Jonathan, any idea why that happens? Maybe it's how it gets published to Heroku?

The actual fix for this landed by jgriffin as:
https://github.com/mozilla/pulsetranslator/commit/c0aa93c830f16d7b5396bf3ae6b88c24b0dd5d87

Thinking more about the 301 check it might have been put into a loop until we have a status code not equal to it. But we won't have multiple redirects for our stuff so fixing it might not be that important right now.

I'm closing this bug and will keep an eye on it for the next beta release or a new build for the 41.0 release if it has to happen.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Hmm it seems to work when calling 'runtranslator', but not if invoking python runservice.py directly.
Maybe its the chained import of relative modules. Anyway I don't think it's important to spend any more time on it. Thanks for fixing it!
So it's working fine again.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: