Closed Bug 848681 Opened 12 years ago Closed 12 years ago

[socorro-prod] python script hanging on abrt.socket

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rhelmer, Assigned: cturra)


One of the Socorro cron jobs (that's been running for years ;)) has developed an issue within the last month where the scripts it calls are hanging with (from strace):

    connect(4, {sa_family=AF_FILE, path="/var/run/abrt/abrt.socket"}, 27

I dug up https://partner-bugzilla.redhat.com/show_bug.cgi?id=614752 but maybe someone more qualified than me can take a look pls? Thanks!
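The strace line shows a blocking connect() on a Unix domain socket (AF_FILE is the same address family as AF_UNIX). A minimal sketch of how one might probe that socket with a timeout, so the check itself cannot hang. The helper name probe_unix_socket and the 2 second timeout are assumptions for illustration, not something the Socorro scripts are known to use.

```python
import socket

# Path taken from the strace output in this bug.
ABRT_SOCKET = "/var/run/abrt/abrt.socket"


def probe_unix_socket(path, timeout=2.0):
    """Try to connect to a Unix stream socket without blocking forever.

    Returns "ok", "timeout", or "error: <reason>".
    (Hypothetical helper, for illustration only.)
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)  # bound the connect() call
    try:
        sock.connect(path)
        return "ok"
    except socket.timeout:
        # The listener exists but did not take the connection in time.
        return "timeout"
    except OSError as exc:
        # Missing socket file, permission denied, connection refused, etc.
        return "error: %s" % exc
    finally:
        sock.close()


if __name__ == "__main__":
    print(probe_unix_socket(ABRT_SOCKET))
```

A "timeout" result would be consistent with the hang seen in strace, while an "error" result would point at a missing or refused socket instead.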
Depends on: 836671
Blocks: 836671
No longer depends on: 836671
This is still happening, any ideas?
Severity: normal → critical
grabbing the bug and dropping severity so it stops paging on-call.
Assignee: server-ops-webops → cturra
Severity: critical → major
(In reply to Robert Helmer [:rhelmer] from comment #1)
> This is still happening, any ideas?

Can you supply a hostname where you have experienced this? Some quick research points to a socket closing down, or not being available. I'm curious if this could be related to nic driver issues we have on some servers.
(In reply to Rick Bryce [:rbryce] from comment #3)
> Can you supply a hostname where you have experienced this? Some quick
> research points to a socket closing down, or not being available. I'm
> curious if this could be related to nic driver issues we have on some
> servers.

The host in question for bug 836671 is sp-admin01.phx1.mozilla.com
(In reply to Robert Helmer [:rhelmer] from comment #4)
> The host in question for bug 836671 is sp-admin01.phx1.mozilla.com

sp-admin01 has the dreaded bnx2x driver. I can't confirm, but I might bet real money the older firmware on the nic is to blame. There is an update we can perform, but it would require downtime and reboots. Again, I am not certain this is the issue, but some research into the error message points to networking.
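For anyone wanting to check which driver and firmware a NIC is running before scheduling an update, `ethtool -i <interface>` prints "key: value" lines such as driver, version, firmware-version and bus-info. A rough sketch of reading those fields from Python; the function names and the interface name "eth0" are assumptions for illustration, not details from this host.

```python
import subprocess


def parse_ethtool_info(text):
    """Parse the "key: value" lines printed by `ethtool -i` into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, since values such as bus-info
        # contain colons themselves.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def nic_info(interface):
    """Run `ethtool -i` for one interface and return the parsed fields."""
    out = subprocess.check_output(["ethtool", "-i", interface])
    return parse_ethtool_info(out.decode("utf-8", "replace"))


# Example use (interface name is hypothetical):
#   info = nic_info("eth0")
#   print(info.get("driver"), info.get("firmware-version"))
```

Comparing the reported firmware-version before and after the maintenance window would confirm the update actually took effect.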
:rbryce that's along the lines of what i was thinking also.

*note: abrt is a daemon that watches for application crashes and collects/reports on these.
*funny note: this is a crash reporter (abrt) issue for our crash reporting service (socorro) o_O
:rbryce - could you schedule this node for a nic driver review/update?
Flags: needinfo?(rbryce)
(In reply to Chris Turra [:cturra] from comment #7)
> :rbryce - could you schedule this node for a nic driver review/update?

I'm going to do the firmware upgrade tomorrow @ 10:00am PDT. I will be sending out a notice to socorro-dev@m.c announcing the time and expected impact shortly.
Flags: needinfo?(rbryce)
(In reply to Rick Bryce [:rbryce] from comment #8)
> Im going to do the firmware upgrade tomorrow @ 10:00am PDT. I will be
> sending out a notice to socorro-dev@m.c announcing the time and expected
> impact shortly.

We postponed this until the morning of April 9th.
SP-Admin01 has been fully upgraded. Firmware updates and OS upgrades were successful.

    Linux sp-admin01.phx1.mozilla.com 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Feb 20 12:17:37 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
:rhelmer - i am going to mark this bug as r/fixed per the work :rbryce did. please re-open the bug if the abrt.socket error returns.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard