Bug 817718 (Closed): opened 12 years ago, closed 11 years ago

Ongoing failure of multiple cron jobs on sp-admin01, since Dec 2

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task, P3)

x86_64
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: selenamarie, Assigned: dmaher)

Details

We have a bunch of cron jobs running on sp-admin01 that failed on Sunday afternoon, with log entries like this: 

(while accessing HBase:)
DEBUG MainThread - retry_wrapper: handled exception, timed out
DEBUG MainThread - retry_wrapper: about to retry connection

Another process was connecting to the Postgres database; it logged no errors, but it never returned: 

socorro  12043 12034  0 Dec02 ?        00:00:00 /bin/bash /data/socorro/application/scripts/crons/cron_update_adus.sh
socorro  12050 12043  0 Dec02 ?        00:00:00 /usr/bin/python2.6 /data/socorro/application/scripts/startUpdateADUs.py
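
For a process like 12050 that logs nothing and never returns, attaching strace will show which syscall it is blocked in; a quick sketch (the pid is the one from the listing above, so substitute the live one):

# Show what the hung Python process is blocked on (often a read/poll on a dead socket).
strace -p 12050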

We found some simple queries running on our prod database (select * from priorityjobs), with source IPs that looked like Zeus network addresses but don't resolve cleanly, so I had trouble figuring out much else: 

216.81.8.10.in-addr.arpa. 3600     IN      PTR     pp-zlb09.vlan81.phx.mozilla.com.

Any chance there are logs in Zeus-land that could help explain the mysterious source of these problems?
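
For reference, that PTR record corresponds to 10.8.81.216, and the lookup can be reproduced with dig; a small sketch, assuming a resolver that can see the internal phx zones:

# Reverse-resolve the suspected Zeus source address.
dig -x 10.8.81.216 +short
# pp-zlb09.vlan81.phx.mozilla.com.

# If the forward record simply doesn't exist, that would explain why the IPs "don't resolve".
dig pp-zlb09.vlan81.phx.mozilla.com +short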
Happened again today at 15:40 PT

In PostgreSQL pg_stat_activity: 

breakpad=# select * from pg_stat_activity where usename = 'monitor'; 
-[ RECORD 1 ]----+--------------------------------
datid            | 16408
datname          | breakpad
pid              | 22786
usesysid         | 40443
usename          | monitor
application_name | 
client_addr      | 10.8.70.202
client_hostname  | pp-zlb01.vlan70.phx.mozilla.com
client_port      | 9413
backend_start    | 2012-12-03 09:44:13.349377-08
xact_start       | 2012-12-03 09:44:13.354588-08
query_start      | 2012-12-03 15:40:35.914917-08
state_change     | 2012-12-03 15:40:35.915023-08
waiting          | f
state            | idle in transaction
query            | select * from priorityjobs;
-[ RECORD 2 ]----+--------------------------------
datid            | 16408
datname          | breakpad
pid              | 22787
usesysid         | 40443
usename          | monitor
application_name | 
client_addr      | 10.8.70.202
client_hostname  | pp-zlb01.vlan70.phx.mozilla.com
client_port      | 9415
backend_start    | 2012-12-03 09:44:13.351829-08
xact_start       | 
query_start      | 2012-12-03 15:39:35.852209-08
state_change     | 2012-12-03 15:39:35.852256-08
waiting          | f
state            | idle
query            | COMMIT
-[ RECORD 3 ]----+--------------------------------
datid            | 16408
datname          | breakpad
pid              | 22788
usesysid         | 40443
usename          | monitor
application_name | 
client_addr      | 10.8.70.202
client_hostname  | pp-zlb01.vlan70.phx.mozilla.com
client_port      | 9418
backend_start    | 2012-12-03 09:44:13.355104-08
xact_start       | 
query_start      | 2012-12-03 15:40:51.439235-08
state_change     | 2012-12-03 15:40:51.439276-08
waiting          | f
state            | idle
query            | COMMIT
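
Record 1 is the suspicious one: the transaction opened at 09:44 and has sat "idle in transaction" for hours, which holds locks and pins the snapshot horizon. A sketch for flagging sessions like that (the output above has the 9.2-style state/query columns, so this assumes PostgreSQL 9.2+):

# List sessions stuck "idle in transaction" for more than 5 minutes.
psql -d breakpad -c "
  SELECT pid, usename, client_addr, now() - xact_start AS xact_age, query
  FROM pg_stat_activity
  WHERE state = 'idle in transaction'
    AND now() - xact_start > interval '5 minutes';"
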
And the cron_submitter jobs are stacking up: 
[sdeckelmann@sp-admin01.phx1 socorro]$ ls -tlr /var/tmp/*.lock
-rw-r--r-- 1 socorro socorro 0 Dec  3 15:45 /var/tmp/crash-reports.allizom.org.lock
[sdeckelmann@sp-admin01.phx1 socorro]$ ps -ef | grep crash
socorro   4792  4788  0 14:00 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_submitter.sh crash-reports.allizom.org 1000 > /dev/null 2>&1
socorro   4795  4792  0 14:00 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_submitter.sh crash-reports.allizom.org 1000 > /dev/null 2>&1
socorro   4799  4795  0 14:00 ?        00:00:01 /bin/sh -c /data/socorro/application/scripts/crons/cron_submitter.sh crash-reports.allizom.org 1000 > /dev/null 2>&1
2103     32230 12082  0 15:45 pts/1    00:00:00 grep crash
[sdeckelmann@sp-admin01.phx1 socorro]$ 
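
The stacking happens because cron keeps launching fresh copies while the old ones hang. One way to make the runs mutually exclusive, sketched as a hypothetical crontab change (flock(1) from util-linux), not what the wrapper does today:

# Skip this run entirely if the previous one still holds the lock.
flock -n /var/tmp/crash-reports.allizom.org.lock \
  /data/socorro/application/scripts/crons/cron_submitter.sh crash-reports.allizom.org 1000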

And we recorded some network timeouts: 
DEBUG make_connection, timeout = 5000
DEBUG connection successful
DEBUG MainThread - retry_wrapper: handled exception, timed out
DEBUG MainThread - retry_wrapper: about to retry connection
DEBUG make_connection, timeout = 5000
DEBUG connection successful
DEBUG MainThread - retry_wrapper: handled exception, timed out
DEBUG MainThread - retry_wrapper: about to retry connection
DEBUG make_connection, timeout = 5000
DEBUG connection successful
DEBUG MainThread - retry_wrapper: handled exception, timed out

The last time that file was written to was: 
-rwxr-xr-x 1 socorro socorro 6340840 Dec  3 14:14 /var/log/socorro/cron_submitter-crash-reports.allizom.org.log
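
The "timeout = 5000" lines read as a 5-second client timeout on the HBase Thrift connection: the TCP connect succeeds, then the call stalls. A quick check of the connect half from sp-admin01 (hostname and port below are placeholders for the actual Thrift endpoint behind the ZLB):

# Bare TCP connect to the Thrift port, capped at 5 seconds.
timeout 5 nc -zv hbase-thrift.example.phx1.mozilla.com 9090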
Summary: Failure of multiple database-accessing cron jobs on Dec 2 around 16:00-18:00 → Ongoing failure of multiple cron jobs on sp-admin01, since Dec 2
Severity: normal → major
You probably want webops for Zeus issues.
Assignee: network-operations → server-ops-webops
Component: Server Operations: Netops → Server Operations: Web Operations
QA Contact: ravi → nmaul
Blocks: 817458
Going to grab this till I can reach a webop
Assignee: server-ops-webops → shyam
Dec 2 around 16:00-18:00 *might* have been scheduled maintenance for Zeus. CC'ing :dumitru who was performing the maintenance for more info/better timelines.
(In reply to Shyam Mani [:fox2mike] from comment #5)
> Going to grab this till I can reach a webop

Don't worry sir, I'm from the Webops.
Assignee: shyam → dmaher
Severity: major → normal
Status: NEW → ASSIGNED
Priority: -- → P3
Hello,

(In reply to Selena Deckelmann :selena from comment #1)
> Happened again today at 15:40 PT
> 
> In PostgreSQL pg_stat_activity: 
> 
> breakpad=# select * from pg_stat_activity where usename = 'monitor'; 

What machine, exactly, is postgres running on?
Flags: needinfo?(sdeckelmann)
Sorry, I'm unclear as to the source and destination of the problem here. Can you clarify what the cron job is trying to do, and from where? And specifically what the problem is for each of stage and prod (both are mentioned, but the output above looks like stage, and stage and prod are on two different ZLB clusters right now... ugh)
(In reply to Ashish Vijayaram [:ashish] from comment #6)
> Dec 2 around 16:00-18:00 *might* have been scheduled maintenance for Zeus.
> CC'ing :dumitru who was performing the maintenance for more info/better
> timelines.

Nope. PHX1 was completed on Nov. 30 ( https://bugzilla.mozilla.org/show_bug.cgi?id=816201#c5 ), so unrelated.
Sorry for the inexact information.  Here we go: 


Cron jobs that are failing on sp-admin01 include: 

socorro    738   732  0 00:05 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_libraries.sh > /var/log/socorro/cron_libraries.log 2>&1
socorro   2756  2747  0 Dec03 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_libraries.sh > /var/log/socorro/cron_libraries.log 2>&1
socorro  12043 12034  0 Dec02 ?        00:00:00 /bin/bash /data/socorro/application/scripts/crons/cron_update_adus.sh
socorro  18831 18824  0 Dec03 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_submitter.sh crash-reports.allizom.org 1000 > /dev/null 2>&1

(multiple instances of each are hung at this point, except for cron_update_adus)

cron_libraries.sh failing is new as of this morning!  I don't know everything it connects to.

cron_submitter.sh connects to Postgres on tp-socorro01-master01, socorro01.stage, and socorro01.dev (port 5432, database breakpad), AND to production HBase for Socorro.

cron_update_adus.sh connects to tp-socorro01-master01, port 5432.
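
A fail-fast liveness check against each of those Postgres targets would separate "can't connect" from "connects and hangs"; a sketch (PGCONNECT_TIMEOUT is standard libpq, and the socorro user below is a placeholder):

# Give up after 5 seconds instead of hanging the way the cron jobs do.
PGCONNECT_TIMEOUT=5 psql -h tp-socorro01-master01 -p 5432 -U socorro -d breakpad -c 'SELECT 1;'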

Some of these scripts are ancient and haven't failed since I started at Mozilla (September?).

I am also seeing hung commands on tp-socorro01-master01 from user 'monitor' running "select * from priorityjobs", which I imagine is a hung production processor system.
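
If those monitor backends really are wedged, they can be cleared server-side without touching the processor boxes; a sketch (pg_terminate_backend is standard, and the pid is the one from the earlier pg_stat_activity output, so use the live one):

# Terminate a single stuck backend by pid; gentler than restarting Postgres.
psql -d breakpad -c 'SELECT pg_terminate_backend(22786);'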
Flags: needinfo?(sdeckelmann)
These ganglia alerts are relevant to this problem: 

09:07 < nagios-phx1> | Tue 09:07:47 PST [183] socorro1.stage.db.phx1.mozilla.com:Ganglia - PostgreSQL Last Reports Update is CRITICAL: CHECKGANGLIA CRITICAL: 
          last_record_reports is 57804.00 (http://m.allizom.org/Ganglia+-+PostgreSQL+Last+Reports+Update)
10:17 < nagios-phx1> | Tue 10:17:38 PST [102] socorro1.stage.db.phx1.mozilla.com:Ganglia - PostgreSQL Last Reports Update is CRITICAL: CHECKGANGLIA CRITICAL: 
          last_record_reports is 62005.00 (http://m.allizom.org/Ganglia+-+PostgreSQL+Last+Reports+Update)
11:17 < nagios-phx1> | Tue 11:17:38 PST [116] socorro1.stage.db.phx1.mozilla.com:Ganglia - PostgreSQL Last Reports Update is CRITICAL: CHECKGANGLIA CRITICAL: 
          last_record_reports is 65605.00 (http://m.allizom.org/Ganglia+-+PostgreSQL+Last+Reports+Update)
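
57804 seconds is about 16 hours of no new reports on stage. If the check measures the age of the newest processed report, the equivalent query would look like this (a guess at the check's definition; date_processed as the timestamp column is an assumption about the schema):

# Seconds since the newest report row (hypothetical column name).
psql -h socorro1.stage.db.phx1.mozilla.com -d breakpad -c "
  SELECT extract(epoch FROM now() - max(date_processed)) AS last_record_reports
  FROM reports;"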
Blocks: 818413
Was there any progress made on this bug? I know that :solarce was helping out the other day; anything to report?
Flags: needinfo?
Nagios still alerting: 

00:07 < nagios-phx1> | Fri 00:07:02 PST [194] sp-admin01.phx1.mozilla.com:Socorro Admin - cron_submitter-crash-reports.allizom.org.log is CRITICAL: FILE_AGE CRITICAL: 
          /var/log/socorro/cron_submitter-crash-reports.allizom.org.log is 23949 seconds old and 15615329 bytes 
          (http://m.allizom.org/Socorro+Admin+-+cron_submitter-crash-reports.allizom.org.log)
01:17 < nagios-phx1> | Fri 01:17:01 PST [114] sp-admin01.phx1.mozilla.com:Socorro Admin - cron_submitter-crash-reports.allizom.org.log is CRITICAL: FILE_AGE CRITICAL: 
          /var/log/socorro/cron_submitter-crash-reports.allizom.org.log is 28149 seconds old and 15615329 bytes 
          (http://m.allizom.org/Socorro+Admin+-+cron_submitter-crash-reports.allizom.org.log)
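
The FILE_AGE alert is purely mtime-based, so it can be reproduced on the host directly (check_file_age is the stock Nagios plugin; the plugin path and thresholds below are examples, not the deployed config):

# Warn after 1 hour, go critical after 2 hours of no writes to the submitter log.
/usr/lib64/nagios/plugins/check_file_age -w 3600 -c 7200 \
  -f /var/log/socorro/cron_submitter-crash-reports.allizom.org.log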


And on sp-admin01, the following crons are hung: 
root      2218     1  0 Dec05 ?        00:00:01 crond
socorro  10707 10702  0 Dec06 ?        00:00:00 /bin/bash /data/socorro/application/scripts/crons/cron_update_adus.sh
socorro  11454 11453  0 Dec06 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_submitter.sh crash-reports.allizom.org 1000 > /dev/null 2>&1
socorro  11455 11454  0 Dec06 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_submitter.sh crash-reports.allizom.org 1000 > /dev/null 2>&1
socorro  11456 11455  0 Dec06 ?        00:00:01 /bin/sh -c /data/socorro/application/scripts/crons/cron_submitter.sh crash-reports.allizom.org 1000 > /dev/null 2>&1
2103     14229 14189  0 07:53 pts/0    00:00:00 grep cron
socorro  30226 30222  0 00:05 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_libraries.sh > /var/log/socorro/cron_libraries.log 2>&1
socorro  30228 30226  0 00:05 ?        00:00:00 /bin/sh /data/socorro/application/scripts/crons/cron_libraries.sh
Flags: needinfo?
Blocks: 819403
Blocks: 819881
See Also: → 771218
This is also happening: 

-[ RECORD 1 ]----+--------------------------------
datid            | 16408
datname          | breakpad
pid              | 27848
usesysid         | 40443
usename          | monitor
application_name | 
client_addr      | 10.8.70.209
client_hostname  | 
client_port      | 51091
backend_start    | 2012-12-09 11:17:04.16186+00
xact_start       | 2012-12-10 17:38:45.648979+00
query_start      | 2012-12-10 17:39:45.958193+00
state_change     | 2012-12-10 17:39:45.958315+00
waiting          | f
state            | idle in transaction
query            | select * from priorityjobs;
-[ RECORD 2 ]----+--------------------------------
Eh, never mind; comment 15 is normal behavior.
Sorry to bother y'all, but I haven't received the daily crash dumps since 2012-12-14.  They are generated by scripts/crons/cron_daily_reports.sh using scripts/startDailyUrl.py and uploaded to fs1 and a server (sisyphus.bughunter.ateam.phx1.mozilla.com) that I control. Is it a known issue covered by this bug or should I file another bug about the issue?
(In reply to Bob Clary [:bc:] from comment #17)
> Sorry to bother y'all, but I haven't received the daily crash dumps since
> 2012-12-14.  They are generated by scripts/crons/cron_daily_reports.sh using
> scripts/startDailyUrl.py and uploaded to fs1 and a server
> (sisyphus.bughunter.ateam.phx1.mozilla.com) that I control. Is it a known
> issue covered by this bug or should I file another bug about the issue?

Working on it over in bug 819881 comment 30.
The issues in this bug are largely resolved, except for timeouts to HBase. 

Those issues are being tracked in bug 827387, so closing this bug.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard