Closed Bug 848681 Opened 11 years ago Closed 11 years ago

[socorro-prod] python script hanging on abrt.socket

Product/Component: Infrastructure & Operations Graveyard :: WebOps: Other
Type: task
Priority: Not set
Severity: major
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: rhelmer
Assigned: cturra

One of the Socorro cron jobs (that's been running for years ;)) has developed an issue within the last month where the scripts it calls are hanging with (from strace):

connect(4, {sa_family=AF_FILE, path="/var/run/abrt/abrt.socket"}, 27

I dug up https://partner-bugzilla.redhat.com/show_bug.cgi?id=614752 but maybe someone more qualified than me can take a look pls? 

Thanks!
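
For anyone reproducing this, a minimal sketch of a probe that attempts the same connect() seen in the strace output, but with a timeout so it cannot hang. The function name `probe_unix_socket` and the 2-second default are assumptions made for illustration; only the socket path comes from this report. On Linux, a listener with a full backlog may surface as `error: EAGAIN` rather than `timeout`, so either result would point at a daemon that is not accepting connections.

```python
import errno
import socket

# Path taken from the strace output in this report.
ABRT_SOCKET = "/var/run/abrt/abrt.socket"


def probe_unix_socket(path, timeout=2.0):
    """Try to connect to a Unix stream socket and classify the outcome.

    Hypothetical helper: returns "connected", "timeout", or
    "error: <ERRNO NAME>" instead of blocking indefinitely.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)  # bound the connect() so it cannot hang
    try:
        sock.connect(path)
        return "connected"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # e.g. ENOENT (socket file missing), ECONNREFUSED (nobody listening)
        return "error: " + errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()


if __name__ == "__main__":
    print(probe_unix_socket(ABRT_SOCKET))
```

Running this on the affected host while the cron job is stuck should show whether the abrt daemon is listening at all, which helps separate a daemon problem from a wider host problem.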
Depends on: 836671
Blocks: 836671
No longer depends on: 836671
This is still happening, any ideas?
Severity: normal → critical
grabbing the bug and dropping severity so it stops paging on-call.
Assignee: server-ops-webops → cturra
Severity: critical → major
(In reply to Robert Helmer [:rhelmer] from comment #1)
> This is still happening, any ideas?

Can you supply a hostname where you have experienced this?  Some quick research points to a socket closing down, or not being available.  I'm curious if this could be related to nic driver issues we have on some servers.
(In reply to Rick Bryce [:rbryce] from comment #3)
> (In reply to Robert Helmer [:rhelmer] from comment #1)
> > This is still happening, any ideas?
> 
> Can you supply a hostname where you have experienced this?  Some quick
> research points to a socket closing down, or not being available.  I'm
> curious if this could be related to nic driver issues we have on some
> servers.

The host in question for bug 836671 is sp-admin01.phx1.mozilla.com
(In reply to Robert Helmer [:rhelmer] from comment #4)
> (In reply to Rick Bryce [:rbryce] from comment #3)
> > (In reply to Robert Helmer [:rhelmer] from comment #1)
> > > This is still happening, any ideas?
> > 
> > Can you supply a hostname where you have experienced this?  Some quick
> > research points to a socket closing down, or not being available.  I'm
> > curious if this could be related to nic driver issues we have on some
> > servers.
> 
> The host in question for bug 836671 is sp-admin01.phx1.mozilla.com

sp-admin01 has the dreaded bnx2x driver.  I can't confirm, but I might bet real money the older firmware on the nic is to blame.  There is an update we can perform, but it would require downtime and reboots.   Again, I am not certain this is the issue, but some research into the error message points to networking.
:rbryce that's along the lines of what i was thinking also. 

*note: abrt is a daemon that watches for application crashes and collects/reports on these. 

*funny note: this is a crash report (abrt) issue for our crash reporting service (socorro) o_O
:rbryce - could you schedule this node for a nic driver review/update?
Flags: needinfo?(rbryce)
(In reply to Chris Turra [:cturra] from comment #7)
> :rbryce - could you schedule this node for a nic driver review/update?

I'm going to do the firmware upgrade tomorrow @ 10:00am PDT. I will be sending out a notice to socorro-dev@m.c shortly, announcing the time and expected impact.
Flags: needinfo?(rbryce)
(In reply to Rick Bryce [:rbryce] from comment #8)
> (In reply to Chris Turra [:cturra] from comment #7)
> > :rbryce - could you schedule this node for a nic driver review/update?
> 
> I'm going to do the firmware upgrade tomorrow @ 10:00am PDT. I will be
> sending out a notice to socorro-dev@m.c announcing the time and expected
> impact shortly.

We postponed this until the morning of April 9th
sp-admin01 has been fully upgraded. The firmware updates and OS upgrades were successful.

Linux sp-admin01.phx1.mozilla.com 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Feb 20 12:17:37 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
:rhelmer - i am going to mark this bug as r/fixed per the work :rbryce did. please re-open the bug if the abrt.socket error returns.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard