Closed Bug 819881 Opened 7 years ago Closed 7 years ago

Socorro - No crash correlations since Dec 3 except on Dec 6 - needs manual cron re-run

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: scoobidiver, Assigned: dmaher)

References

(Blocks 1 open bug)

Details

(Whiteboard: [triaged 20121210][push interrupt])

Crash correlations have not been computed since December 3, except on December 6.

See https://crash-analysis.mozilla.com/crash_analysis/
Severity: normal → critical
This doesn't depend exactly on bug 817718, but that is the root cause of this problem.
Depends on: 817718
Need to manually re-run these.

For December 3, 4, 5, 7, 8, and 9, we need to re-run the script scripts/crons/cron_libraries.sh

The catch is that it does not accept parameters, but begins by calculating the date:
WEEK=`date -d 'last monday' '+%Y%m%d'`
DATE=`date '+%Y%m%d'`

You will need to manually change these to the appropriate Monday (December 3) and the appropriate date (from the range above) for each run. Sorry for the inconvenience.
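
One way to avoid hand-editing the script for each backfill run would be to let it take the target date as an optional argument. The following is only a sketch, assuming GNU date. The function name monday_on_or_before and the optional first argument are hypothetical and are not part of the current cron_libraries.sh. It also assumes that the intended WEEK for a Monday is that same Monday (as with December 3 above), which may not match what 'last monday' returns when the script runs on a Monday.

```shell
#!/bin/bash
# Sketch only: accept the target date as an optional argument (YYYYMMDD)
# instead of always using today. Requires GNU date.

# Print the Monday on or before the given YYYYMMDD date.
monday_on_or_before() {
    local day="$1"
    local dow
    dow=$(date -d "$day" '+%u')                 # 1 = Monday ... 7 = Sunday
    date -d "$day -$((dow - 1)) days" '+%Y%m%d'
}

# Hypothetical usage: ./cron_libraries.sh 20121205
DATE="${1:-$(date '+%Y%m%d')}"
WEEK=$(monday_on_or_before "$DATE")
echo "DATE=$DATE WEEK=$WEEK"
```

With no argument the behaviour stays close to the current script (today's date), so the regular cron entry would not need to change.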
Assignee: nobody → server-ops-webops
Component: General → Server Operations: Web Operations
Product: Socorro → mozilla.org
QA Contact: nmaul
Version: unspecified → other
Summary: No crash correlations since Dec 3 except on Dec 6 → Socorro - No crash correlations since Dec 3 except on Dec 6 - needs manual cron re-run
Assignee: server-ops-webops → eziegenhorn
Assignee: eziegenhorn → dmaher
Severity: critical → normal
Status: NEW → ASSIGNED
Priority: -- → P2
Whiteboard: [triaged 20121210][push interrupt]
I have so far been unsuccessful in actually getting this script to run. It enters the first loop (Firefox 17.0.1) and executes the psql command. This creates "/tmp/Firefox_17.0.1.log", which reveals this:

DEBUG Ooid: "0700f620-d0c5-4fee-bd95-c8df62121209"
DEBUG Ooid: "211d0aff-abe2-4da5-bb7a-ccc9e2121209"
DEBUG MainThread - retry_wrapper: handled exception, timed out
DEBUG MainThread - retry_wrapper: about to retry connection
DEBUG make_connection, timeout = 5000
DEBUG connection successful
DEBUG MainThread - retry_wrapper: handled exception, timed out

That is where it stops, hung, forever.
The psql command you're referring to is piping its output to thrift (hbase) at 10.8.81.209:9090. The "retry_wrapper" error is really coming from the hbaseclient library, not postgres.
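
Because the step can hang inside the hbaseclient retry logic, a manual run could be bounded so that it fails visibly instead of blocking forever. A minimal sketch, assuming the timeout command from GNU coreutils is available. The run_bounded helper and the limits used here are illustrative and do not appear in cron_libraries.sh.

```shell
# Sketch only: bound a long-running step so a stuck connection fails visibly.
# run_bounded is a hypothetical helper; 'timeout' is from GNU coreutils.
run_bounded() {
    local limit="$1"; shift
    timeout "$limit" "$@"
    local rc=$?
    # timeout exits with status 124 when it had to stop the command.
    if [ "$rc" -eq 124 ]; then
        echo "step timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# Example: a command that would otherwise block is stopped after 1 second.
run_bounded 1 sleep 5 || echo "exit status: $?"
```

A real limit would need to be generous, since a healthy psql phase for a single version can take over half an hour.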
Today, the first phase of the cron_libraries.sh script will run, but it hangs indefinitely on the second run. The key difference between yesterday and today is that the script does NOT hang on the psql command (which is good); it hangs on this:

$PYTHON /data/crash-data-tools/per-crash-core-count.py -p ${I} -r ${J} -f /tmp/${I}_${J}.tar > /tmp/${DATE}_${I}_${J}-core-counts.txt

This is for Firefox version 18.0. It's worth noting that for this version there are no OOIDs in the psql-generated log, so I suspect that per-crash-core-count.py can't handle having "nothing" to do.

Still working on it w/ :laura.
Never mind, the psql step is hanging forever after all.

DEBUG Ooid: "c387c476-52a7-42f6-9daa-8b3752121210"
DEBUG Ooid: "ec77f931-0b1d-445e-bcc7-ed4e22121210"
DEBUG MainThread - retry_wrapper: handled exception, timed out
DEBUG MainThread - retry_wrapper: about to retry connection
DEBUG make_connection, timeout = 5000
DEBUG connection successful
DEBUG MainThread - retry_wrapper: handled exception, timed out
Troubleshooting notes regarding the thrift side of things:

Started out by taking a look at logs on the thrift cluster. Discovered several OOMEs (out-of-memory errors) recorded on all the nodes. It looks like they might have gotten backlogged at some point and then never fully recovered (possibly due to a logjam).

We bounced the thrift process on all the nodes and things quickly recovered and looked clean for normal crashmover/monitor/processors.  We then kicked off the script in question here and it seems to be running fine now.
Daniel restarted thrift and the cron is now running apparently successfully.  Let's see if that fully solves the problem.  Check back in 24 hours.
cron_libraries.sh just finished and looks to have completed without error. The following is the stdout from the run:

[root@sp-admin01.phx1 socorro]# bash cron_libraries_819881.sh | tee /tmp/cron_libraries-`date`.log
[2012-12-11 19:08:39] Phase 1 start: Firefox
[2012-12-11 19:08:40] Phase 1.1 start: 17.0.1
[2012-12-11 19:08:40] ++ psql; generating /tmp/Firefox_17.0.1.log
[2012-12-11 19:43:03] per-crash-core-count.py; generating /tmp/20121211_Firefox_17.0.1-core-counts.txt
[2012-12-11 19:49:32] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_17.0.1-interesting-modules.txt
[2012-12-11 19:58:09] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_17.0.1-interesting-modules-with-versions.txt
[2012-12-11 20:07:56] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_17.0.1-interesting-addons.txt
[2012-12-11 20:13:12] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_17.0.1-interesting-addons-with-versions.txt
[2012-12-11 20:18:08] Phase 1.1 end: 17.0.1
[2012-12-11 20:18:08] Phase 1.1 start: 18.0
[2012-12-11 20:18:08] ++ psql; generating /tmp/Firefox_18.0.log
[2012-12-11 20:30:17] per-crash-core-count.py; generating /tmp/20121211_Firefox_18.0-core-counts.txt
[2012-12-11 20:32:08] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_18.0-interesting-modules.txt
[2012-12-11 20:34:41] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_18.0-interesting-modules-with-versions.txt
[2012-12-11 20:37:32] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_18.0-interesting-addons.txt
[2012-12-11 20:39:13] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_18.0-interesting-addons-with-versions.txt
[2012-12-11 20:40:53] Phase 1.1 end: 18.0
[2012-12-11 20:40:53] Phase 1.1 start: 16.0.2
[2012-12-11 20:40:53] ++ psql; generating /tmp/Firefox_16.0.2.log
[2012-12-11 20:49:05] per-crash-core-count.py; generating /tmp/20121211_Firefox_16.0.2-core-counts.txt
[2012-12-11 20:50:18] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_16.0.2-interesting-modules.txt
[2012-12-11 20:51:59] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_16.0.2-interesting-modules-with-versions.txt
[2012-12-11 20:53:49] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_16.0.2-interesting-addons.txt
[2012-12-11 20:54:53] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_16.0.2-interesting-addons-with-versions.txt
[2012-12-11 20:55:56] Phase 1.1 end: 16.0.2
[2012-12-11 20:55:56] Phase 1 end: Firefox
[2012-12-11 20:55:56] Phase 1 start: Thunderbird
[2012-12-11 20:55:56] Phase 1.1 start: 17.0
[2012-12-11 20:55:56] ++ psql; generating /tmp/Thunderbird_17.0.log
[2012-12-11 20:59:45] per-crash-core-count.py; generating /tmp/20121211_Thunderbird_17.0-core-counts.txt
[2012-12-11 21:00:25] per-crash-interesting-modules.py; generating /tmp/20121211_Thunderbird_17.0-interesting-modules.txt
[2012-12-11 21:01:21] per-crash-interesting-modules.py; generating /tmp/20121211_Thunderbird_17.0-interesting-modules-with-versions.txt
[2012-12-11 21:02:20] per-crash-interesting-modules.py; generating /tmp/20121211_Thunderbird_17.0-interesting-addons.txt
[2012-12-11 21:02:55] per-crash-interesting-modules.py; generating /tmp/20121211_Thunderbird_17.0-interesting-addons-with-versions.txt
[2012-12-11 21:03:30] Phase 1.1 end: 17.0
[2012-12-11 21:03:30] Phase 1.1 start: 16.0.2
[2012-12-11 21:03:30] ++ psql; generating /tmp/Thunderbird_16.0.2.log
[2012-12-11 21:06:29] per-crash-core-count.py; generating /tmp/20121211_Thunderbird_16.0.2-core-counts.txt
[2012-12-11 21:07:03] per-crash-interesting-modules.py; generating /tmp/20121211_Thunderbird_16.0.2-interesting-modules.txt
[2012-12-11 21:07:46] per-crash-interesting-modules.py; generating /tmp/20121211_Thunderbird_16.0.2-interesting-modules-with-versions.txt
[2012-12-11 21:08:29] per-crash-interesting-modules.py; generating /tmp/20121211_Thunderbird_16.0.2-interesting-addons.txt
[2012-12-11 21:08:59] per-crash-interesting-modules.py; generating /tmp/20121211_Thunderbird_16.0.2-interesting-addons-with-versions.txt
[2012-12-11 21:09:29] Phase 1.1 end: 16.0.2
[2012-12-11 21:09:29] Phase 1.1 start: 15.0.1
[2012-12-11 21:09:29] ++ psql; generating /tmp/Thunderbird_15.0.1.log
[2012-12-11 21:09:45] per-crash-core-count.py; generating /tmp/20121211_Thunderbird_15.0.1-core-counts.txt
[2012-12-11 21:09:47] per-crash-interesting-modules.py; generating /tmp/20121211_Thunderbird_15.0.1-interesting-modules.txt
[2012-12-11 21:09:50] per-crash-interesting-modules.py; generating /tmp/20121211_Thunderbird_15.0.1-interesting-modules-with-versions.txt
[2012-12-11 21:09:53] per-crash-interesting-modules.py; generating /tmp/20121211_Thunderbird_15.0.1-interesting-addons.txt
[2012-12-11 21:09:55] per-crash-interesting-modules.py; generating /tmp/20121211_Thunderbird_15.0.1-interesting-addons-with-versions.txt
[2012-12-11 21:09:56] Phase 1.1 end: 15.0.1
[2012-12-11 21:09:56] Phase 1 end: Thunderbird
[2012-12-11 21:09:56] Phase 1 start: SeaMonkey
[2012-12-11 21:09:58] Phase 1.1 start: 2.14.1
[2012-12-11 21:09:58] ++ psql; generating /tmp/SeaMonkey_2.14.1.log
[2012-12-11 21:10:18] per-crash-core-count.py; generating /tmp/20121211_SeaMonkey_2.14.1-core-counts.txt
[2012-12-11 21:10:20] per-crash-interesting-modules.py; generating /tmp/20121211_SeaMonkey_2.14.1-interesting-modules.txt
[2012-12-11 21:10:24] per-crash-interesting-modules.py; generating /tmp/20121211_SeaMonkey_2.14.1-interesting-modules-with-versions.txt
[2012-12-11 21:10:28] per-crash-interesting-modules.py; generating /tmp/20121211_SeaMonkey_2.14.1-interesting-addons.txt
[2012-12-11 21:10:31] per-crash-interesting-modules.py; generating /tmp/20121211_SeaMonkey_2.14.1-interesting-addons-with-versions.txt
[2012-12-11 21:10:33] Phase 1.1 end: 2.14.1
[2012-12-11 21:10:33] Phase 1.1 start: 2.13.2
[2012-12-11 21:10:33] ++ psql; generating /tmp/SeaMonkey_2.13.2.log
[2012-12-11 21:10:37] per-crash-core-count.py; generating /tmp/20121211_SeaMonkey_2.13.2-core-counts.txt
[2012-12-11 21:10:38] per-crash-interesting-modules.py; generating /tmp/20121211_SeaMonkey_2.13.2-interesting-modules.txt
[2012-12-11 21:10:39] per-crash-interesting-modules.py; generating /tmp/20121211_SeaMonkey_2.13.2-interesting-modules-with-versions.txt
[2012-12-11 21:10:40] per-crash-interesting-modules.py; generating /tmp/20121211_SeaMonkey_2.13.2-interesting-addons.txt
[2012-12-11 21:10:41] per-crash-interesting-modules.py; generating /tmp/20121211_SeaMonkey_2.13.2-interesting-addons-with-versions.txt
[2012-12-11 21:10:42] Phase 1.1 end: 2.13.2
[2012-12-11 21:10:42] Phase 1.1 start: 2.0.14
[2012-12-11 21:10:42] ++ psql; generating /tmp/SeaMonkey_2.0.14.log
[2012-12-11 21:10:43] per-crash-core-count.py; generating /tmp/20121211_SeaMonkey_2.0.14-core-counts.txt
[2012-12-11 21:10:43] per-crash-interesting-modules.py; generating /tmp/20121211_SeaMonkey_2.0.14-interesting-modules.txt
[2012-12-11 21:10:44] per-crash-interesting-modules.py; generating /tmp/20121211_SeaMonkey_2.0.14-interesting-modules-with-versions.txt
[2012-12-11 21:10:45] per-crash-interesting-modules.py; generating /tmp/20121211_SeaMonkey_2.0.14-interesting-addons.txt
[2012-12-11 21:10:46] per-crash-interesting-modules.py; generating /tmp/20121211_SeaMonkey_2.0.14-interesting-addons-with-versions.txt
[2012-12-11 21:10:47] Phase 1.1 end: 2.0.14
[2012-12-11 21:10:47] Phase 1 end: SeaMonkey
[2012-12-11 21:10:47] Phase 1 start: Camino
[2012-12-11 21:10:47] Phase 1.1 start: 2.1.2
[2012-12-11 21:10:47] ++ psql; generating /tmp/Camino_2.1.2.log
[2012-12-11 21:10:50] per-crash-core-count.py; generating /tmp/20121211_Camino_2.1.2-core-counts.txt
[2012-12-11 21:10:50] per-crash-interesting-modules.py; generating /tmp/20121211_Camino_2.1.2-interesting-modules.txt
[2012-12-11 21:10:51] per-crash-interesting-modules.py; generating /tmp/20121211_Camino_2.1.2-interesting-modules-with-versions.txt
[2012-12-11 21:10:52] per-crash-interesting-modules.py; generating /tmp/20121211_Camino_2.1.2-interesting-addons.txt
[2012-12-11 21:10:53] per-crash-interesting-modules.py; generating /tmp/20121211_Camino_2.1.2-interesting-addons-with-versions.txt
[2012-12-11 21:10:54] Phase 1.1 end: 2.1.2
[2012-12-11 21:10:54] Phase 1.1 start: 2.0.4
[2012-12-11 21:10:54] ++ psql; generating /tmp/Camino_2.0.4.log
[2012-12-11 21:10:54] per-crash-core-count.py; generating /tmp/20121211_Camino_2.0.4-core-counts.txt
[2012-12-11 21:10:54] per-crash-interesting-modules.py; generating /tmp/20121211_Camino_2.0.4-interesting-modules.txt
[2012-12-11 21:10:55] per-crash-interesting-modules.py; generating /tmp/20121211_Camino_2.0.4-interesting-modules-with-versions.txt
[2012-12-11 21:10:55] per-crash-interesting-modules.py; generating /tmp/20121211_Camino_2.0.4-interesting-addons.txt
[2012-12-11 21:10:56] per-crash-interesting-modules.py; generating /tmp/20121211_Camino_2.0.4-interesting-addons-with-versions.txt
[2012-12-11 21:10:56] Phase 1.1 end: 2.0.4
[2012-12-11 21:10:56] Phase 1.1 start: 2.0.7
[2012-12-11 21:10:56] ++ psql; generating /tmp/Camino_2.0.7.log
[2012-12-11 21:10:57] per-crash-core-count.py; generating /tmp/20121211_Camino_2.0.7-core-counts.txt
[2012-12-11 21:10:57] per-crash-interesting-modules.py; generating /tmp/20121211_Camino_2.0.7-interesting-modules.txt
[2012-12-11 21:10:57] per-crash-interesting-modules.py; generating /tmp/20121211_Camino_2.0.7-interesting-modules-with-versions.txt
[2012-12-11 21:10:58] per-crash-interesting-modules.py; generating /tmp/20121211_Camino_2.0.7-interesting-addons.txt
[2012-12-11 21:10:58] per-crash-interesting-modules.py; generating /tmp/20121211_Camino_2.0.7-interesting-addons-with-versions.txt
[2012-12-11 21:10:59] Phase 1.1 end: 2.0.7
[2012-12-11 21:10:59] Phase 1 end: Camino
[2012-12-11 21:10:59] Phase 2 start: Firefox
[2012-12-11 21:10:59] Phase 2.1 start: 18.0
[2012-12-11 21:10:59] ++ psql; generating /tmp/Firefox_18.0.log
[2012-12-11 21:19:25] per-crash-core-count.py; generating /tmp/20121211_Firefox_18.0-core-counts.txt
[2012-12-11 21:21:19] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_18.0-interesting-modules.txt
[2012-12-11 21:23:53] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_18.0-interesting-modules-with-versions.txt
[2012-12-11 21:26:46] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_18.0-interesting-addons.txt
[2012-12-11 21:28:26] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_18.0-interesting-addons-with-versions.txt
[2012-12-11 21:30:05] Phase 2.1 end: 18.0
[2012-12-11 21:30:05] Phase 2.1 start: 19.0a2
[2012-12-11 21:30:05] ++ psql; generating /tmp/Firefox_19.0a2.log
[2012-12-11 21:30:50] per-crash-core-count.py; generating /tmp/20121211_Firefox_19.0a2-core-counts.txt
[2012-12-11 21:30:57] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_19.0a2-interesting-modules.txt
[2012-12-11 21:31:06] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_19.0a2-interesting-modules-with-versions.txt
[2012-12-11 21:31:16] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_19.0a2-interesting-addons.txt
[2012-12-11 21:31:22] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_19.0a2-interesting-addons-with-versions.txt
[2012-12-11 21:31:28] Phase 2.1 end: 19.0a2
[2012-12-11 21:31:28] Phase 2.1 start: 20.0a1
[2012-12-11 21:31:28] ++ psql; generating /tmp/Firefox_20.0a1.log
[2012-12-11 21:32:28] per-crash-core-count.py; generating /tmp/20121211_Firefox_20.0a1-core-counts.txt
[2012-12-11 21:32:36] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_20.0a1-interesting-modules.txt
[2012-12-11 21:32:47] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_20.0a1-interesting-modules-with-versions.txt
[2012-12-11 21:32:58] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_20.0a1-interesting-addons.txt
[2012-12-11 21:33:05] per-crash-interesting-modules.py; generating /tmp/20121211_Firefox_20.0a1-interesting-addons-with-versions.txt
[2012-12-11 21:33:12] Phase 2.1 end: 20.0a1
[2012-12-11 21:33:12] Phase 2 end: Firefox
[2012-12-11 21:33:12] find /tmp -name 20121211\* -type f -size +500k | xargs gzip -9
[2012-12-11 21:33:58] mkdir /mnt/crashanalysis/crash_analysis/20121211
[2012-12-11 21:33:58] cp /tmp/20121211* /mnt/crashanalysis/crash_analysis/20121211/
[2012-12-11 21:33:59] rm -f /tmp/20121211*
Each 20121211_Firefox_<n>* file (n=16.0.2, 17.0.1, 18) in https://crash-analysis.mozilla.com/crash_analysis/20121211/ returns a 403 Forbidden error.
Failed again today, so we'll need to re-run cron_libraries.sh (and fix the permissions if needed).

cron_submitter is also failing.  I also have cronmail from cron_daily_matviews and cron_bugzilla and cron_daily_adus, so we should see if those have recovered and if not re-run them by hand.

Another thrift restart might be in order.  (deinspanjer?)
https://crash-analysis.mozilla.com/crash_analysis/ is missing the newest rounds of CSV (for yesterday) and correlations (for today) - there are multiple directories missing completely there, btw.
Laura, was the failure still Thrift timeouts?

tmary, could you check whether we can determine any possible underlying cause in the Thrift layer?

We might need to step up the plans to try to get some better debugging into the Python Thrift client layer to sort this out.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #12)
> https://crash-analysis.mozilla.com/crash_analysis/ is missing the newest
> rounds of CSV (for yesterday) and correlations (for today) - there are
> multiple directories missing completely there, btw.

Is there a separate bug for the daily CSV dump?

The status on both correlations and CSV is:

* working on backfilling now
* watching for any network or thrift problems (still trying to isolate the problem)
* adding debugging and backfill support to these jobs

Right now these need to be done by hand and it's somewhat laborious.
(In reply to Robert Helmer [:rhelmer] from comment #14)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #12)
> > https://crash-analysis.mozilla.com/crash_analysis/ is missing the newest
> > rounds of CSV (for yesterday) and correlations (for today) - there are
> > multiple directories missing completely there, btw.
> 
> Is there a separate bug for the daily CSV dump?
> 
> The status on both correlations and CSV is:

CSVs have been backfilled and pushed out for missing days. Correlations should be done tomorrow.

Hopefully everything will run as expected overnight for today's report.
(In reply to Robert Helmer [:rhelmer] from comment #14)
> Is there a separate bug for the daily CSV dump?

I was told to just mention the CSVs here. We can have a separate bug if you like, though.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #16)
> (In reply to Robert Helmer [:rhelmer] from comment #14)
> > Is there a separate bug for the daily CSV dump?
> 
> I was told to just mention the CSVs here. We can have a separate bug if
> you like, though.

Nope fine w/ me, I will use this bug to report status :)
(In reply to Robert Helmer [:rhelmer] from comment #15)
> CSVs have been backfilled and pushed out for missing days.
They are still missing for December 3 and 4.
This morning's sitrep:
* https://crash-analysis.mozilla.com/crash_analysis/ is missing reports for 1213

Errors from cronmail overnight:
* error (lock): lock already exists for cron_update_adus pid 23074
* Cron job startReportsClean exited non-zero: 1
*  /bin/sh: /data/bin/cron_daily_reports.sh: No such file or directory
* Cron job startDailyMatviews exited non-zero: 9

phrawtzy: can we see which of these jobs later ran successfully, and re-run anything that didn't by hand?  I already know we need to do the following:
* run cron_libraries for 12/3, 12/4 and 12/13.  (Wonder why this one didn't send cron mail)
(In reply to Laura Thomson :laura from comment #19)
> This morning's sitrep:
> * https://crash-analysis.mozilla.com/crash_analysis/ is missing reports for
> 1213
> 
> Errors from cronmail overnight:
> * error (lock): lock already exists for cron_update_adus pid 23074
> * Cron job startReportsClean exited non-zero: 1
> *  /bin/sh: /data/bin/cron_daily_reports.sh: No such file or directory

I don't see this for production (but I do for stage). Running this by hand I don't get any errors, but no output either. Debugging it now.
I figured out the problem(s) with cron_reports.sh (which generates the daily CSV report), and why they have anything to do with the totally separate cron_libraries.sh (correlation reports):

1) cron_reports.sh does not create its own output directory on crash_analysis; if cron_libraries fails, then it can't write
2) cron_reports.sh logs in such a way that we don't see problems in the main log, cron mail, or even running w/ bash -x

Going to fix both of these and get it checked into the IT repo. We'll pull this into the main Socorro repo as a separate bug.
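
The first problem could be addressed by having the report job create its dated output directory itself, rather than relying on cron_libraries.sh having run first. A minimal sketch follows. The ensure_output_dir name is hypothetical, the directory layout is taken from the cron_libraries.sh output earlier in this bug (/mnt/crashanalysis/crash_analysis/<date>), and this is not the actual change that was checked in.

```shell
# Sketch only: make sure the dated output directory exists before writing.
# ensure_output_dir is a hypothetical helper, not the actual fix.
ensure_output_dir() {
    local base="$1" day="$2"
    # -p creates missing parents and succeeds if the directory already exists.
    mkdir -p "$base/$day" || {
        echo "cannot create $base/$day" >&2
        return 1
    }
    echo "$base/$day"
}

# Example against a scratch directory instead of /mnt/crashanalysis:
scratch=$(mktemp -d)
ensure_output_dir "$scratch/crash_analysis" 20121213
```

Writing the failure message to stderr, instead of only to a private log file, would also let cron mail surface the error, which relates to the second problem.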
(In reply to Robert Helmer [:rhelmer] from comment #21)
> I figured out the problem(s) with cron_reports.sh (which generates the daily
> CSV report), and why they have anything to do with the totally separate
> cron_libraries.sh (correlation reports):
> 
> 1) cron_reports.sh does not create its own output directory on
> crash_analysis, if cron_libraries fails then it can't write
> 2) cron_reports.sh logs in such a way that we don't see problems in the main
> log, cron mail, or even running w/ bash -x
> 
> Going to fix both of these and get it checked into the IT repo. We'll pull
> this into the main Socorro repo as a separate bug.

I've also added a trivial backfill feature to this script, testing it now for rerunning 12/03 and 12/04.
(In reply to Robert Helmer [:rhelmer] from comment #22)
> (In reply to Robert Helmer [:rhelmer] from comment #21)
> > I figured out the problem(s) with cron_reports.sh (which generates the daily
> > CSV report), and why they have anything to do with the totally separate
> > cron_libraries.sh (correlation reports):
> > 
> > 1) cron_reports.sh does not create its own output directory on
> > crash_analysis, if cron_libraries fails then it can't write
> > 2) cron_reports.sh logs in such a way that we don't see problems in the main
> > log, cron mail, or even running w/ bash -x
> > 
> > Going to fix both of these and get it checked into the IT repo. We'll pull
> > this into the main Socorro repo as a separate bug.
> 
> I've also added a trivial backfill feature to this script, testing it now
> for rerunning 12/03 and 12/04.

OK daily CSV done for 12/03, 12/04 and 12/13.
I don't know how many of these are related to bug 822106 but here's what I see this morning:
* cron_status is giving errors but appears to be running (looking at https://crash-stats.mozilla.com/status).  Might be taking longer than 5 minutes to run, but that would seem odd.  Error is:
error (lock): lock already exists for cron_status pid 31847
* cron_bugzilla is reporting lock errors, last time at 11.11am ET, message:
error (lock): lock already exists for cron_bugzilla pid 28682
* cron_ftp_scraper is reporting lock errors, last time at 11.05am ET, message:
error (lock): lock already exists for cron_ftpscraper pid 25538
* cron_duplicates at 11.02 ET:
error (lock): lock already exists for cron_duplicates pid 23685
* cron_reportsclean at 10.45 ET:
error (lock): lock already exists for cron_reportsclean pid 13258
* cron_daily_reports at around 6.55 ET:
/bin/sh: /data/bin/cron_daily_reports.sh: No such file or directory
and on /data/bin/cron_daily_reports.sh >> /var/log/socorro/cron_daily_reports.log
/bin/sh: /data/bin/cron_daily_reports.sh: Permission denied
* cron_daily_matviews at 5.00 ET
error (lock): lock already exists for cron_daily_matviews pid 13259
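
A "lock already exists" message does not by itself say whether the earlier run is still going or has died and left its lock behind. Checking the pid reported in the message can tell the two apart. This is a sketch only: pid_is_running and check_lock_pid are hypothetical helpers and do not reflect how the Socorro cron wrapper manages its locks.

```shell
# Sketch only: report whether the pid named in a lock error is still alive.
# Run as the job's own user or as root; for another user's process,
# kill -0 fails with a permission error even if the process exists.
pid_is_running() {
    # kill -0 sends no signal; it only tests whether the pid can be signalled.
    kill -0 "$1" 2>/dev/null
}

check_lock_pid() {
    local job="$1" pid="$2"
    if pid_is_running "$pid"; then
        echo "$job: pid $pid is still running (slow job?)"
    else
        echo "$job: pid $pid is gone (stale lock?)"
    fi
}

# Example with this shell's own pid, which is always running:
check_lock_pid example_job "$$"
```

For the errors above this would be called as, for example, check_lock_pid cron_status 31847 on the admin host.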

On the bright side, cron_libraries.sh appears to have run, because the output is up on crash-analysis.  It did report this error in logging though, at 3.50am:
Cron <socorro@sp-admin01> /data/socorro/application/scripts/crons/cron_libraries.sh >> /var/log/socorro/cron_libraries.log
find: `/tmp/ssh-IvjMpz9481': Permission denied
find: `/tmp/socorro-install-29324-pdH': Permission denied
find: `/tmp/ssh-MBSAu10078': Permission denied
find: `/tmp/ssh-eQOLEe6397': Permission denied
find: `/tmp/ssh-IbkxaQ4548': Permission denied
find: `/tmp/socorro-install-30477-r3S': Permission denied
find: `/tmp/atop.d': Permission denied
find: `/tmp/hsperfdata_infrasec': Permission denied
find: `/tmp/socorro-install-30431-38m': Permission denied
find: `/tmp/socorro-install-12739-bRN': Permission denied
find: `/tmp/ssh-VPCZt11503': Permission denied
find: `/tmp/socorro-install-13837-vuJ': Permission denied
find: `/tmp/ssh-VEGSEm9925': Permission denied
(In reply to Laura Thomson :laura from comment #24)
> On the bright side, cron_libraries.sh appears to have run, because the
> output is up on crash-analysis.

Of course, that output is also missing entries because of bug 822102 and probably should be recreated once (re)processing has caught up.
(In reply to Laura Thomson :laura from comment #24)
> I don't know how many of these are related to bug 822106 but here's what I
> see this morning:

I'll take a look at the various locking complaints. Things could be running more slowly than usual for some reason.

> /bin/sh: /data/bin/cron_daily_reports.sh: No such file or directory
> and on /data/bin/cron_daily_reports.sh >>
> /var/log/socorro/cron_daily_reports.log
> /bin/sh: /data/bin/cron_daily_reports.sh: Permission denied


/data/bin/cron_daily_reports.sh is not executable and should be.


> On the bright side, cron_libraries.sh appears to have run, because the
> output is up on crash-analysis.  It did report this error in logging though,
> at 3.50am:
> Cron <socorro@sp-admin01>
> /data/socorro/application/scripts/crons/cron_libraries.sh >>
> /var/log/socorro/cron_libraries.log
> find: `/tmp/ssh-IvjMpz9481': Permission denied
> find: `/tmp/socorro-install-29324-pdH': Permission denied
> find: `/tmp/ssh-MBSAu10078': Permission denied
> find: `/tmp/ssh-eQOLEe6397': Permission denied
> find: `/tmp/ssh-IbkxaQ4548': Permission denied
> find: `/tmp/socorro-install-30477-r3S': Permission denied
> find: `/tmp/atop.d': Permission denied
> find: `/tmp/hsperfdata_infrasec': Permission denied
> find: `/tmp/socorro-install-30431-38m': Permission denied
> find: `/tmp/socorro-install-12739-bRN': Permission denied
> find: `/tmp/ssh-VPCZt11503': Permission denied
> find: `/tmp/socorro-install-13837-vuJ': Permission denied
> find: `/tmp/ssh-VEGSEm9925': Permission denied

This is nothing new unfortunately, it does not do a very good job of cleaning up after itself.
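
The "Permission denied" noise comes from find descending into other users' private directories under /tmp. Since the script's own output files sit directly in /tmp, limiting the search depth would avoid that. A sketch, assuming GNU find and xargs. list_large_outputs is a hypothetical name and this is not a committed change.

```shell
# Sketch only: find this run's large output files without descending into
# subdirectories of the search directory (GNU find).
list_large_outputs() {
    local dir="$1" day="$2"
    find "$dir" -maxdepth 1 -name "${day}*" -type f -size +500k -print0
}

# The script's compression step could then become something like:
#   list_large_outputs /tmp "$DATE" | xargs -0 -r gzip -9
```

The null-delimited output paired with xargs -0 keeps odd file names intact, and -r skips gzip entirely when nothing matches.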
(In reply to Robert Helmer [:rhelmer] from comment #26)
> > /bin/sh: /data/bin/cron_daily_reports.sh: No such file or directory
> > and on /data/bin/cron_daily_reports.sh >>
> > /var/log/socorro/cron_daily_reports.log
> > /bin/sh: /data/bin/cron_daily_reports.sh: Permission denied
> 
> 
> /data/bin/cron_daily_reports.sh is not executable and should be.

Completed (on sp-admin01.phx1).
Hi, I haven't wanted to pollute this bug but I haven't gotten any daily crash dumps (%Y%m%d-crashdata.csv.gz) since 2012-12-13 on either fs1.corpdmz.scl3.mozilla.com in /data/security_group/crash_urls/ or on sisyphus.bughunter.ateam.phx1.mozilla.com in /work/mozilla/crash-reports/. I assume this is due to bug 817718.

rhelmer: These appear to be generated in scripts/crons/cron_daily_reports.sh using scripts/startDailyUrl.py. Would it be possible to backfill the daily dumps before everyone bails for Christmas?
(In reply to Jake Maul [:jakem] from comment #27)
> (In reply to Robert Helmer [:rhelmer] from comment #26)
> > > /bin/sh: /data/bin/cron_daily_reports.sh: No such file or directory
> > > and on /data/bin/cron_daily_reports.sh >>
> > > /var/log/socorro/cron_daily_reports.log
> > > /bin/sh: /data/bin/cron_daily_reports.sh: Permission denied
> > 
> > 
> > /data/bin/cron_daily_reports.sh is not executable and should be.
> 
> Completed (on sp-admin01.phx1).

Looks like this is still not executable :(
(In reply to Bob Clary [:bc:] from comment #28)
> Hi, I haven't wanted to pollute this bug but I haven't gotten any daily
> crash dumps (%Y%m%d-crashdata.csv.gz) since 2012-12-13 on either
> fs1.corpdmz.scl3.mozilla.com in /data/security_group/crash_urls/ or on
> sisyphus.bughunter.ateam.phx1.mozilla.com in /work/mozilla/crash-reports/. I
> assume this is due to bug 817718.
> 
> rhelmer: These appear to be generated in scripts/crons/cron_daily_reports.sh
> using scripts/startDailyUrl.py. Would it be possible to backfill the daily
> dumps before everyone bails for Christmas?

Sure thing, will start generating these now.
(In reply to Robert Helmer [:rhelmer] from comment #30)
> (In reply to Bob Clary [:bc:] from comment #28)
> > Hi, I haven't wanted to pollute this bug but I haven't gotten any daily
> > crash dumps (%Y%m%d-crashdata.csv.gz) since 2012-12-13 on either
> > fs1.corpdmz.scl3.mozilla.com in /data/security_group/crash_urls/ or on
> > sisyphus.bughunter.ateam.phx1.mozilla.com in /work/mozilla/crash-reports/. I
> > assume this is due to bug 817718.
> > 
> > rhelmer: These appear to be generated in scripts/crons/cron_daily_reports.sh
> > using scripts/startDailyUrl.py. Would it be possible to backfill the daily
> > dumps before everyone bails for Christmas?
> 
> Sure thing, will start generating these now.

Done, how does it look now?
All the missing CSV files seem to be back. Thanks!
yes, thanks!
(In reply to Robert Helmer [:rhelmer] from comment #29)
> (In reply to Jake Maul [:jakem] from comment #27)
> > (In reply to Robert Helmer [:rhelmer] from comment #26)
> > > > /bin/sh: /data/bin/cron_daily_reports.sh: No such file or directory
> > > > and on /data/bin/cron_daily_reports.sh >>
> > > > /var/log/socorro/cron_daily_reports.log
> > > > /bin/sh: /data/bin/cron_daily_reports.sh: Permission denied
> > > 
> > > 
> > > /data/bin/cron_daily_reports.sh is not executable and should be.
> > 
> > Completed (on sp-admin01.phx1).
> 
> Looks like this is still not executable :(

Fixed in Puppet

notice: /File[/data/bin/cron_daily_reports.sh]/mode: mode changed '0644' to '0755'

[root@sp-admin01.phx1 bin]# ls -lah /data/bin/cron_daily_reports.sh
-rwxr-xr-x 1 root root 1.2K Dec 14 11:16 /data/bin/cron_daily_reports.sh

The next run should work properly, ping me on IRC if it does not
OK thanks everybody! This *should* work without hand-holding again, please reopen if not.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
The csv and module files are missing for December 19.

In addition, there are three unexpected files in the parent directory: https://crash-analysis.mozilla.com/crash_analysis/
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Scoobidiver from comment #36)
> The csv and module files are missing for December 19.

Hmm I see them here, am I missing something? :)
https://crash-analysis.mozilla.com/crash_analysis/20121219/
 
> In addition, there are three unexpected files in the parent directory:
> https://crash-analysis.mozilla.com/crash_analysis/

Judging by the dates on these files this was probably due to debugging, I just cleaned this up.
(In reply to Robert Helmer [:rhelmer] from comment #37)
> (In reply to Scoobidiver from comment #36)
> > The csv and module files are missing for December 19.
> Hmm I see them here, am I missing something? :)
> https://crash-analysis.mozilla.com/crash_analysis/20121219/
It was built 26 hours after the other files, but that seems to be usual.

Anyway, two module files are missing: https://crash-analysis.mozilla.com/crash_analysis/modulelist/
(In reply to Scoobidiver from comment #38)
> (In reply to Robert Helmer [:rhelmer] from comment #37)
> > (In reply to Scoobidiver from comment #36)
> > > The csv and module files are missing for December 19.
> > Hmm I see them here, am I missing something? :)
> > https://crash-analysis.mozilla.com/crash_analysis/20121219/
> > It was built 26 hours after the other files, but that seems to be usual.
> 
> Anyway, two module files are missing:
> https://crash-analysis.mozilla.com/crash_analysis/modulelist/

Hmm I see this as the latest which I'd expect:
	20121218-modulelist.txt	19-Dec-2012 17:35 

This job doesn't run until 17:00 (5 PM) Pacific, so I'd expect to see yesterday's data get uploaded in about 7 hours.

Not sure why it runs so late, we could look into it and possibly change that.
It worked for me. ssh fs1.corpdmz.scl3.mozilla.com 'ls /data/security_group/crash_urls/' shows 20121219-crashdata.csv.gz and sisyphus.bughunter.ateam.phx1.mozilla.com also got 20121219-crashdata.csv.gz.
The issues reported in comment 36 are unrelated to the original issue.
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard