Bug 817718 (Closed): opened 12 years ago, closed 11 years ago

Ongoing failure of multiple cron jobs on sp-admin01, since Dec 2

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task, P3)

x86_64
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: selenamarie, Assigned: dmaher)

Details

We have a bunch of cron jobs running on sp-admin01 that failed on Sunday afternoon, with log entries like this: 

(while accessing HBase:)
DEBUG MainThread - retry_wrapper: handled exception, timed out
DEBUG MainThread - retry_wrapper: about to retry connection

Another process was connecting to the Postgres database; it logged no errors, but it never returned: 

socorro  12043 12034  0 Dec02 ?        00:00:00 /bin/bash /data/socorro/application/scripts/crons/cron_update_adus.sh
socorro  12050 12043  0 Dec02 ?        00:00:00 /usr/bin/python2.6 /data/socorro/application/scripts/startUpdateADUs.py
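
For a process like 12050 that logs nothing and never returns, attaching strace will show which syscall it is blocked in; a quick sketch (the pid is the one from the listing above, so substitute the live one):

# Show what the hung Python process is blocked on (often a read/poll on a dead socket).
strace -p 12050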

We found some simple queries running on our prod database (select * from priorityjobs), with source IPs that looked like Zeus network addresses but don't resolve cleanly, so I had trouble figuring out much else: 

216.81.8.10.in-addr.arpa. 3600     IN      PTR     pp-zlb09.vlan81.phx.mozilla.com.

Any chance there are logs in Zeus-land that could help explain the mysterious source of these problems?
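
For reference, that PTR record corresponds to 10.8.81.216, and the lookup can be reproduced with dig; a small sketch, assuming a resolver that can see the internal phx zones:

# Reverse-resolve the suspected Zeus source address.
dig -x 10.8.81.216 +short
# pp-zlb09.vlan81.phx.mozilla.com.

# If the forward record simply doesn't exist, that would explain why the IPs "don't resolve".
dig pp-zlb09.vlan81.phx.mozilla.com +short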
Happened again today at 15:40 PT

In PostgreSQL pg_stat_activity: 

breakpad=# select * from pg_stat_activity where usename = 'monitor'; 
-[ RECORD 1 ]----+--------------------------------
datid            | 16408
datname          | breakpad
pid              | 22786
usesysid         | 40443
usename          | monitor
application_name | 
client_addr      | 10.8.70.202
client_hostname  | pp-zlb01.vlan70.phx.mozilla.com
client_port      | 9413
backend_start    | 2012-12-03 09:44:13.349377-08
xact_start       | 2012-12-03 09:44:13.354588-08
query_start      | 2012-12-03 15:40:35.914917-08
state_change     | 2012-12-03 15:40:35.915023-08
waiting          | f
state            | idle in transaction
query            | select * from priorityjobs;
-[ RECORD 2 ]----+--------------------------------
datid            | 16408
datname          | breakpad
pid              | 22787
usesysid         | 40443
usename          | monitor
application_name | 
client_addr      | 10.8.70.202
client_hostname  | pp-zlb01.vlan70.phx.mozilla.com
client_port      | 9415
backend_start    | 2012-12-03 09:44:13.351829-08
xact_start       | 
query_start      | 2012-12-03 15:39:35.852209-08
state_change     | 2012-12-03 15:39:35.852256-08
waiting          | f
state            | idle
query            | COMMIT
-[ RECORD 3 ]----+--------------------------------
datid            | 16408
datname          | breakpad
pid              | 22788
usesysid         | 40443
usename          | monitor
application_name | 
client_addr      | 10.8.70.202
client_hostname  | pp-zlb01.vlan70.phx.mozilla.com
client_port      | 9418
backend_start    | 2012-12-03 09:44:13.355104-08
xact_start       | 
query_start      | 2012-12-03 15:40:51.439235-08
state_change     | 2012-12-03 15:40:51.439276-08
waiting          | f
state            | idle
query            | COMMIT
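
Record 1 is the suspicious one: the transaction opened at 09:44 and has sat "idle in transaction" for hours, which holds locks and pins the snapshot horizon. A sketch for flagging sessions like that (the output above has the 9.2-style state/query columns, so this assumes PostgreSQL 9.2+):

# List sessions stuck "idle in transaction" for more than 5 minutes.
psql -d breakpad -c "
  SELECT pid, usename, client_addr, now() - xact_start AS xact_age, query
  FROM pg_stat_activity
  WHERE state = 'idle in transaction'
    AND now() - xact_start > interval '5 minutes';"
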
And the cron_submitter jobs are stacking up: 
[sdeckelmann@sp-admin01.phx1 socorro]$ ls -tlr /var/tmp/*.lock
-rw-r--r-- 1 socorro socorro 0 Dec  3 15:45 /var/tmp/crash-reports.allizom.org.lock
[sdeckelmann@sp-admin01.phx1 socorro]$ ps -ef | grep crash
socorro   4792  4788  0 14:00 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_submitter.sh crash-reports.allizom.org 1000 > /dev/null 2>&1
socorro   4795  4792  0 14:00 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_submitter.sh crash-reports.allizom.org 1000 > /dev/null 2>&1
socorro   4799  4795  0 14:00 ?        00:00:01 /bin/sh -c /data/socorro/application/scripts/crons/cron_submitter.sh crash-reports.allizom.org 1000 > /dev/null 2>&1
2103     32230 12082  0 15:45 pts/1    00:00:00 grep crash
[sdeckelmann@sp-admin01.phx1 socorro]$ 
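
The stacking happens because cron keeps launching fresh copies while the old ones hang. One way to make the runs mutually exclusive, sketched as a hypothetical crontab change (flock(1) from util-linux), not what the wrapper does today:

# Skip this run entirely if the previous one still holds the lock.
flock -n /var/tmp/crash-reports.allizom.org.lock \
  /data/socorro/application/scripts/crons/cron_submitter.sh crash-reports.allizom.org 1000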

And we recorded some network timeouts: 
DEBUG make_connection, timeout = 5000
DEBUG connection successful
DEBUG MainThread - retry_wrapper: handled exception, timed out
DEBUG MainThread - retry_wrapper: about to retry connection
DEBUG make_connection, timeout = 5000
DEBUG connection successful
DEBUG MainThread - retry_wrapper: handled exception, timed out
DEBUG MainThread - retry_wrapper: about to retry connection
DEBUG make_connection, timeout = 5000
DEBUG connection successful
DEBUG MainThread - retry_wrapper: handled exception, timed out

The last time that file was written to was: 
-rwxr-xr-x 1 socorro socorro 6340840 Dec  3 14:14 /var/log/socorro/cron_submitter-crash-reports.allizom.org.log
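
The "timeout = 5000" lines read as a 5-second client timeout on the HBase Thrift connection: the TCP connect succeeds, then the call stalls. A quick check of the connect half from sp-admin01 (hostname and port below are placeholders for the actual Thrift endpoint behind the ZLB):

# Bare TCP connect to the Thrift port, capped at 5 seconds.
timeout 5 nc -zv hbase-thrift.example.phx1.mozilla.com 9090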
Summary: Failure of multiple database-accessing cron jobs on Dec 2 around 16:00-18:00 → Ongoing failure of multiple cron jobs on sp-admin01, since Dec 2
Severity: normal → major
You probably want webops for Zeus issues.
Assignee: network-operations → server-ops-webops
Component: Server Operations: Netops → Server Operations: Web Operations
QA Contact: ravi → nmaul
Blocks: 817458
Going to grab this till I can reach a webop
Assignee: server-ops-webops → shyam
Dec 2 around 16:00-18:00 *might* have been scheduled maintenance for Zeus. CC'ing :dumitru who was performing the maintenance for more info/better timelines.
(In reply to Shyam Mani [:fox2mike] from comment #5)
> Going to grab this till I can reach a webop

Don't worry sir, I'm from the Webops.
Assignee: shyam → dmaher
Severity: major → normal
Status: NEW → ASSIGNED
Priority: -- → P3
Hello,

(In reply to Selena Deckelmann :selena from comment #1)
> Happened again today at 15:40 PT
> 
> In PostgreSQL pg_stat_activity: 
> 
> breakpad=# select * from pg_stat_activity where usename = 'monitor'; 

What machine, exactly, is postgres running on?
Flags: needinfo?(sdeckelmann)
Sorry, I'm unclear as to the source and destination of the problem here. Can you clarify what the cron job is trying to do, and from where? And specifically what the problem is for each of stage and prod (both are mentioned, but the output above looks like stage, and stage and prod are on two different ZLB clusters right now... ugh)
(In reply to Ashish Vijayaram [:ashish] from comment #6)
> Dec 2 around 16:00-18:00 *might* have been scheduled maintenance for Zeus.
> CC'ing :dumitru who was performing the maintenance for more info/better
> timelines.

Nope. PHX1 was completed on Nov. 30 ( https://bugzilla.mozilla.org/show_bug.cgi?id=816201#c5 ), so unrelated.
Sorry for the inexact information.  Here we go: 


Cron jobs that are failing on sp-admin01 include: 

socorro    738   732  0 00:05 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_libraries.sh > /var/log/socorro/cron_libraries.log 2>&1
socorro   2756  2747  0 Dec03 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_libraries.sh > /var/log/socorro/cron_libraries.log 2>&1
socorro  12043 12034  0 Dec02 ?        00:00:00 /bin/bash /data/socorro/application/scripts/crons/cron_update_adus.sh
socorro  18831 18824  0 Dec03 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_submitter.sh crash-reports.allizom.org 1000 > /dev/null 2>&1

(multiple instances of each are hung at this point, except for cron_update_adus)

cron_libraries.sh failing is new as of this morning!  I don't know everything it connects to.

cron_submitter.sh connects to Postgres on tp-socorro01-master01, socorro01.stage, and socorro01.dev (port 5432, database breakpad), AND to production HBase for Socorro.

cron_update_adus.sh connects to tp-socorro01-master01, port 5432.
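
A fail-fast liveness check against each of those Postgres targets would separate "can't connect" from "connects and hangs"; a sketch (PGCONNECT_TIMEOUT is standard libpq, and the socorro user below is a placeholder):

# Give up after 5 seconds instead of hanging the way the cron jobs do.
PGCONNECT_TIMEOUT=5 psql -h tp-socorro01-master01 -p 5432 -U socorro -d breakpad -c 'SELECT 1;'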

Some of these scripts are ancient and haven't failed since I started at Mozilla (September?).

I am also seeing hung commands on tp-socorro01-master01 from user 'monitor' running "select * from priorityjobs", which I imagine is a hung production processor system.
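
If those monitor backends really are wedged, they can be cleared server-side without touching the processor boxes; a sketch (pg_terminate_backend is standard, and the pid is the one from the earlier pg_stat_activity output, so use the live one):

# Terminate a single stuck backend by pid; gentler than restarting Postgres.
psql -d breakpad -c 'SELECT pg_terminate_backend(22786);'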
Flags: needinfo?(sdeckelmann)
These ganglia alerts are relevant to this problem: 

09:07 < nagios-phx1> | Tue 09:07:47 PST [183] socorro1.stage.db.phx1.mozilla.com:Ganglia - PostgreSQL Last Reports Update is CRITICAL: CHECKGANGLIA CRITICAL: 
          last_record_reports is 57804.00 (http://m.allizom.org/Ganglia+-+PostgreSQL+Last+Reports+Update)
10:17 < nagios-phx1> | Tue 10:17:38 PST [102] socorro1.stage.db.phx1.mozilla.com:Ganglia - PostgreSQL Last Reports Update is CRITICAL: CHECKGANGLIA CRITICAL: 
          last_record_reports is 62005.00 (http://m.allizom.org/Ganglia+-+PostgreSQL+Last+Reports+Update)
11:17 < nagios-phx1> | Tue 11:17:38 PST [116] socorro1.stage.db.phx1.mozilla.com:Ganglia - PostgreSQL Last Reports Update is CRITICAL: CHECKGANGLIA CRITICAL: 
          last_record_reports is 65605.00 (http://m.allizom.org/Ganglia+-+PostgreSQL+Last+Reports+Update)
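
57804 seconds is about 16 hours of no new reports on stage. If the check measures the age of the newest processed report, the equivalent query would look like this (a guess at the check's definition; date_processed as the timestamp column is an assumption about the schema):

# Seconds since the newest report row (hypothetical column name).
psql -h socorro1.stage.db.phx1.mozilla.com -d breakpad -c "
  SELECT extract(epoch FROM now() - max(date_processed)) AS last_record_reports
  FROM reports;"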
Blocks: 818413
Was there any progress made on this bug? I know that :solarce was helping out the other day; anything to report?
Flags: needinfo?
Nagios still alerting: 

00:07 < nagios-phx1> | Fri 00:07:02 PST [194] sp-admin01.phx1.mozilla.com:Socorro Admin - cron_submitter-crash-reports.allizom.org.log is CRITICAL: FILE_AGE CRITICAL: 
          /var/log/socorro/cron_submitter-crash-reports.allizom.org.log is 23949 seconds old and 15615329 bytes 
          (http://m.allizom.org/Socorro+Admin+-+cron_submitter-crash-reports.allizom.org.log)
01:17 < nagios-phx1> | Fri 01:17:01 PST [114] sp-admin01.phx1.mozilla.com:Socorro Admin - cron_submitter-crash-reports.allizom.org.log is CRITICAL: FILE_AGE CRITICAL: 
          /var/log/socorro/cron_submitter-crash-reports.allizom.org.log is 28149 seconds old and 15615329 bytes 
          (http://m.allizom.org/Socorro+Admin+-+cron_submitter-crash-reports.allizom.org.log)
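
The FILE_AGE alert is purely mtime-based, so it can be reproduced on the host directly (check_file_age is the stock Nagios plugin; the plugin path and thresholds below are examples, not the deployed config):

# Warn after 1 hour, go critical after 2 hours of no writes to the submitter log.
/usr/lib64/nagios/plugins/check_file_age -w 3600 -c 7200 \
  -f /var/log/socorro/cron_submitter-crash-reports.allizom.org.log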


And on sp-admin01, the following crons are hung: 
root      2218     1  0 Dec05 ?        00:00:01 crond
socorro  10707 10702  0 Dec06 ?        00:00:00 /bin/bash /data/socorro/application/scripts/crons/cron_update_adus.sh
socorro  11454 11453  0 Dec06 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_submitter.sh crash-reports.allizom.org 1000 > /dev/null 2>&1
socorro  11455 11454  0 Dec06 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_submitter.sh crash-reports.allizom.org 1000 > /dev/null 2>&1
socorro  11456 11455  0 Dec06 ?        00:00:01 /bin/sh -c /data/socorro/application/scripts/crons/cron_submitter.sh crash-reports.allizom.org 1000 > /dev/null 2>&1
2103     14229 14189  0 07:53 pts/0    00:00:00 grep cron
socorro  30226 30222  0 00:05 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_libraries.sh > /var/log/socorro/cron_libraries.log 2>&1
socorro  30228 30226  0 00:05 ?        00:00:00 /bin/sh /data/socorro/application/scripts/crons/cron_libraries.sh
Flags: needinfo?
Blocks: 819403
Blocks: 819881
See Also: → 771218
This is also happening: 

-[ RECORD 1 ]----+--------------------------------
datid            | 16408
datname          | breakpad
pid              | 27848
usesysid         | 40443
usename          | monitor
application_name | 
client_addr      | 10.8.70.209
client_hostname  | 
client_port      | 51091
backend_start    | 2012-12-09 11:17:04.16186+00
xact_start       | 2012-12-10 17:38:45.648979+00
query_start      | 2012-12-10 17:39:45.958193+00
state_change     | 2012-12-10 17:39:45.958315+00
waiting          | f
state            | idle in transaction
query            | select * from priorityjobs;
-[ RECORD 2 ]----+--------------------------------
Eh, never mind; comment 15 is normal behavior.
Sorry to bother y'all, but I haven't received the daily crash dumps since 2012-12-14.  They are generated by scripts/crons/cron_daily_reports.sh using scripts/startDailyUrl.py and uploaded to fs1 and a server (sisyphus.bughunter.ateam.phx1.mozilla.com) that I control. Is it a known issue covered by this bug or should I file another bug about the issue?
(In reply to Bob Clary [:bc:] from comment #17)
> Sorry to bother y'all, but I haven't received the daily crash dumps since
> 2012-12-14.  They are generated by scripts/crons/cron_daily_reports.sh using
> scripts/startDailyUrl.py and uploaded to fs1 and a server
> (sisyphus.bughunter.ateam.phx1.mozilla.com) that I control. Is it a known
> issue covered by this bug or should I file another bug about the issue?

Working on it over in bug 819881 comment 30.
The issues in this bug are largely resolved, except for timeouts to HBase. 

Those issues are being tracked in bug 827387, so closing this bug.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard