Closed Bug 879257 (Opened 11 years ago, Closed 11 years ago)

Pulse listener is failing with connection timeouts

Categories

(Testing :: General, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: davehunt, Unassigned)

Details

(Whiteboard: [qa-automation-blocked])

The Mozmill pulse listeners on mm-ci-master.qa.scl3.mozilla.com are failing with connection timeouts. This was first noticed on June 3rd, and the pulse listeners were restarted. It was then noticed again today (June 4th).

The following are seen multiple times in the terminal:

INFO:automation:Connecting to Mozilla Pulse as "qa-auto@mozilla.com|mozmill_release|mm-ci-master1.qa.scl3.mozilla.com"...
ERROR:automation:[Errno 110] Connection timed out
Traceback (most recent call last):
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/pulsebuildmonitor/pulsebuildmonitor.py", line 100, in listen
    self.pulse.listen()
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/mozillapulse/consumers.py", line 136, in listen
    self.consumer.wait()
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/carrot/messaging.py", line 446, in wait
    it.next()
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/carrot/backends/pyamqplib.py", line 300, in consume
    self.channel.wait()
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/amqplib/client_0_8/abstract_channel.py", line 95, in wait
    self.channel_id, allowed_methods)
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/amqplib/client_0_8/connection.py", line 202, in _wait_method
    self.method_reader.read_method()
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/amqplib/client_0_8/method_framing.py", line 221, in read_method
    raise m

INFO:automation:Connecting to Mozilla Pulse as "qa-auto@mozilla.com|mozmill_release|mm-ci-master1.qa.scl3.mozilla.com"...
Exception socket.error: (107, 'Transport endpoint is not connected') in <bound method TCPTransport.__del__ of <amqplib.client_0_8.transport.TCPTransport object at 0xb65361ac>> ignored

Also, when shutting down the pulse listener, we see the following several times:

Exception socket.error: (107, 'Transport endpoint is not connected') in <bound method TCPTransport.__del__ of <amqplib.client_0_8.transport.TCPTransport object at 0xb653642c>> ignored
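
One consumer-side mitigation for drops like this is to wrap the blocking listen() call in a reconnect loop with backoff, so a single timeout does not take the listener down until someone restarts it by hand. Below is a minimal sketch of that idea, assuming a listen_once callable as a stand-in for the pulsebuildmonitor/mozillapulse listen call; it is illustrative only, not the code currently running on mm-ci-master.

    import logging
    import socket
    import time

    log = logging.getLogger("automation")

    def listen_forever(listen_once, max_backoff=300):
        # Keep a blocking listen call alive across connection timeouts,
        # reconnecting with exponential backoff. listen_once is a
        # hypothetical stand-in for the real pulse listen() call.
        backoff = 5
        while True:
            try:
                listen_once()      # blocks until the connection drops
                backoff = 5        # reset after a clean return
            except (socket.error, IOError) as exc:
                log.error("Pulse connection lost: %s; retrying in %ss",
                          exc, backoff)
                time.sleep(backoff)
                backoff = min(backoff * 2, max_backoff)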
Severity: normal → blocker
Whiteboard: [qa-automation-blocked]
Anything on the releng side for us to look into?
Not releng/relops.
Assignee: server-ops-releng → server-ops-devservices
Severity: blocker → major
Component: Server Operations: RelEng → Server Operations: Developer Services
QA Contact: arich → shyam
Known, pulse has been broken for a while. See the bug this one is dependent on.
Assignee: server-ops-devservices → shyam
Status: NEW → ASSIGNED
Depends on: 866467
This is pretty much blocking all QA-related automated testing. I'm not sure what's going on here, but the same happens for our staging instance in MV, so it is not a network issue in SCL3.

As Dave pointed out, we noticed it yesterday, but it started to fail on June 1st. Since then no further Pulse messages have made it through. The exact time might be a little earlier than 2013-06-01T14:12:18.000Z.

Shyam, the referenced bug is secured so I cannot look into it. Is there any ETA when this will work again?
Severity: major → blocker
(In reply to Henrik Skupin (:whimboo) from comment #4)

> Shyam, the referenced bug is secured so I cannot look into it. Is there any
> ETA when this will work again?

Sorry, I just opened that up. It's a problem in the code, it'll work when that's fixed.
I don't think these are related.  The server side of pulse is mostly made of two things: (a) a RabbitMQ service and (b) a Django web app.  Almost everything uses (a).  (b) is mostly static content (docs) with a couple dynamic pages.  It has been down since the end of April and is not particularly important.

(a) appears to be working, based on the dashboard, though one component of the whole system--the pulse translator--appears to be stuck.  I think I need jgriffin to look at this.

We will also fully document pulse so that future problems can be handled even if we aren't around.
Mark, does only Jonathan have access to the VM the pulse-translator runs on? When will he be back? I tried to find him yesterday on IRC, but he was always afk. Is it just a restart of the daemon? I would assume someone else could do that too.
Yeah I think he is the only one.  I don't know much about the pulsetranslator but I believe it is just one process.  Shyam, can you either give me access to pulsetranslator.ateam.phx1.mozilla.com (better for the long run) or run "/etc/init.d/pulsetranslator restart" on it?
Flags: needinfo?(shyam)
I did some work on pulsetranslator a couple of weeks back, so if something is broken please let me know and I can help get it fixed.
Would the translator make sense as a pulse shim, along with the hg shim, ftp shim, etc.?  I realize many of those shims aren't in use, but at least they have a nice slot for being managed, upgraded, monitored, etc.
No longer depends on: 866467
Dustin, I'm not entirely sure I understand you, but as far as I know (and I'm only really learning about the whole system now), the translator is a service that listens for build messages, normalizes them into a consistent format, and then republishes them to *.normalized queues. So it pretty much has to be running as a separate service somewhere.
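
To make that concrete, here is a minimal sketch of the normalize-and-republish pattern described above. The field names, the routing-key scheme, and the publish callback are assumptions for illustration, not taken from the actual pulsetranslator code.

    import json

    def normalize(raw_msg):
        # Map a raw build message onto a consistent schema.
        # The field names here are hypothetical, not the real pulse payload.
        payload = raw_msg.get("payload", {})
        return {
            "buildid": payload.get("buildid"),
            "tree": payload.get("tree"),
            "platform": payload.get("platform"),
            "buildurl": payload.get("packageUrl"),
        }

    def translate(raw_msg, publish):
        # Republish a normalized copy under a *.normalized routing key.
        # `publish` stands in for whatever AMQP publisher is in use.
        normalized = normalize(raw_msg)
        routing_key = "build.%s.%s.normalized" % (
            normalized["tree"], normalized["platform"])
        publish(routing_key, json.dumps(normalized))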
Oh, ok.  The shims are all crontasks that poll things.  They run on the app host.

    # activate some shims - these should each only be on one host!
    include webapp::pulse::shim::heartbeat
    webapp::pulse::shim::hg {
        'mozilla-central': ;
        'users/jgriffin_mozilla.com/synkme': ;
    }   
    # these will likely not be used, and are not implemented in puppet
    #include webapp::pulse::shim::bugzilla
    #include webapp::pulse::shim::ftp

Anyway, we could certainly run the translator on the app host as well, even if it is a service rather than a crontask. That would let new versions be deployed using the "normal" webapp push process, just like for the Django app. It's up to you, really; just an idea.
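
For comparison with running the translator as a service, a cron-driven shim of the kind listed above usually boils down to "remember what was seen last, poll for anything newer, publish it". A rough sketch, with fetch_pushes and publish as hypothetical stand-ins for the real pushlog fetch and pulse publisher:

    import json
    import os

    STATE_FILE = "/tmp/hg_shim_state.json"  # hypothetical location

    def poll_and_publish(fetch_pushes, publish):
        # Skeleton of a cron-driven shim: remember the last push seen,
        # fetch anything newer, and publish one message per new push.
        last_id = 0
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                last_id = json.load(f).get("last_id", 0)

        for push_id, push in sorted(fetch_pushes(since=last_id).items()):
            publish("hg.mozilla-central.push", json.dumps(push))
            last_id = max(last_id, int(push_id))

        with open(STATE_FILE, "w") as f:
            json.dump({"last_id": last_id}, f)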
Component: Server Operations: Developer Services → General
Flags: needinfo?(shyam)
Product: mozilla.org → Testing
QA Contact: shyam
Version: other → unspecified
Whee. This isn't dev services. I'll kick the service for now, Mark, and see if I can get you access.
Assignee: shyam → nobody
And done. 

[shyam@pulsetranslator.ateam.phx1 ~]$ sudo /etc/init.d/pulsetranslator restart
Stopping pulsetranslator:                                  [  OK  ]
Starting pulsetranslator: daemon --user webtools /home/webtools/apps/pulsetranslator/bin/runtranslator --daemon --pidfile /home/webtools/apps/pulsetranslator/translator.pid --logfile /home/webtools/apps/pulsetranslator/stdout.log --durable --logdir /home/webtools/apps/pulsetranslator/logs
                                                           [  OK  ]
Dustin: oh yeah, we would definitely like to transition this to a completely IT-managed system. We have plans to fix and tighten up a variety of pulse-related stuff, all of which ended up in our lap a while ago.

Shyam: thanks, I can see the translator queue being drained.  I don't know for sure if this will fix this particular bug, but it will probably fix bug 879204 at least.
The pulsetranslator being down would absolutely not cause connection timeouts for a pulse consumer, since consumers neither connect to nor have any knowledge of the pulsetranslator.

Connection timeouts mean either pulse is having problems itself, or there are some transient network problems between the consumer and pulse.
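
A quick way to tell those two cases apart from the consumer host is to try a plain TCP connection to the broker, independent of the AMQP library. A minimal check, assuming the broker is reachable at pulse.mozilla.org on the default AMQP port 5672:

    import socket

    def check_broker(host="pulse.mozilla.org", port=5672, timeout=10):
        # Crude reachability check: can we open a TCP connection at all?
        # Host and port are assumptions (default AMQP port).
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except socket.error as exc:
            print("Cannot reach %s:%d: %s" % (host, port, exc))
            return False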
Just to let you know, our mozmill-ci consumer is working again and has already received a couple of notifications. Thanks.
Severity: blocker → critical
What's left to do on this bug? Jonathan or Shyam?
There doesn't seem to be anything actionable on the pulse side.  WORKSFORME?
Marking as fixed by the restart of the daemon in comment 14.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED