Closed
Bug 836671
Opened 12 years ago
Closed 12 years ago
Crash correlations one day out of two or none since January 18
Categories
(Socorro :: General, task)
Socorro
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: scoobidiver, Assigned: rhelmer)
References
Details
Since January 18, crash correlation folders for even days contain only a zipped csv file.
There's also January 12's case.
Assignee | ||
Updated•12 years ago
|
Assignee: nobody → rhelmer
Status: NEW → ASSIGNED
Assignee | ||
Comment 1•12 years ago
|
||
Looks like it's a problem related to locking:
[2013-01-28 08:05:01] lock cron_libraries
error (lock): lock already exists for cron_libraries pid 24770
[2013-01-29 08:05:01] lock cron_libraries
warn (lock): removing lock for cron_libraries, 24770 not running
This means that it skipped the 28th because PID 24770 was running, but the next day the lock was still found but 24770 was not running.
Assignee | ||
Comment 2•12 years ago
|
||
This is still happening, going to investigate further today.
Reporter | ||
Comment 3•12 years ago
|
||
Since February 26, there have been also no correlations on odd days.
Severity: normal → critical
Summary: No crash correlations for even days since January 18 → No crash correlations for even days since January 18 and for all days since February 25
Comment 5•12 years ago
|
||
related to crontabber?
Assignee | ||
Comment 6•12 years ago
|
||
(In reply to Laura Thomson :laura from comment #5)
> related to crontabber?
Shouldn't be, this script has not been moved over and it started happening before we deployed crontabber.
I'll take a look, sorry this dropped off my radar.
Assignee | ||
Comment 7•12 years ago
|
||
OK looks like the current problem is that the script is hanging and holding up the lockfile:
socorro 25938 25934 0 Feb26 ? 00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_libraries.sh >> /var/log/socorro/cron_libraries.log
socorro 25941 25938 0 Feb26 ? 00:00:00 /bin/bash /data/socorro/application/scripts/crons/cron_libraries.sh
socorro 26219 26214 0 Jan12 ? 00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_libraries.sh >> /var/log/socorro/cron_libraries.log
socorro 26224 26219 0 Jan12 ? 00:00:00 /bin/sh /data/socorro/application/scripts/crons/cron_libraries.sh
Assignee | ||
Comment 8•12 years ago
|
||
Working on backfilling correlation reports now, if there's still trouble with even days (original bug here) then I'll tackle that next.
Assignee | ||
Comment 9•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #8)
> Working on backfilling correlation reports now, if there's still trouble
> with even days (original bug here) then I'll tackle that next.
This is taking an inordinate amount of time to run for 2012-01-18, on the "fetching from hbase" step. Looking closer, I see that it's spending hours on:
/usr/bin/python2.6 /data/crash-data-tools/per-crash-core-count.py -p Firefox -r 9.0.1 -f /tmp/Firefox_9.0.1.tar
I noticed that there are a lot of old scripts related to this job, I've terminated them all and am trying again. I'll debug further if it gets stuck again.
Assignee | ||
Comment 10•12 years ago
|
||
ISTR this job taking 2-3 hours, so comment 9 may be jumping the gun - if 2012-01-18 completes ok then I will backfill backwards in time, so the most recent days are done first, since this will take a while...
Assignee | ||
Comment 11•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #9)
> This is taking an inordinate amount of time to run for 2012-01-18, on the
> "fetching from hbase" step. Looking closer, I see that it's spending hours
> on:
>
> /usr/bin/python2.6 /data/crash-data-tools/per-crash-core-count.py -p Firefox
> -r 9.0.1 -f /tmp/Firefox_9.0.1.tar
This is definitely hanging, strace reveals:
connect(4, {sa_family=AF_FILE, path="/var/run/abrt/abrt.socket"}, 27
This may be https://partner-bugzilla.redhat.com/show_bug.cgi?id=614752
Assignee | ||
Comment 12•12 years ago
|
||
One thing I notice so far is that the initial pull from hbase is logging a lot of "ooid not found" errors, it could simply be that there's nothing to process here and python isn't exiting correctly for some reason.
I am going to try backfilling backwards in time, starting yesterday.
Assignee | ||
Updated•12 years ago
|
Assignee | ||
Comment 13•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #10)
> ISTR this job taking 2-3 hours, so comment 9 may be jumping the gun - if
> 2012-01-18 completes ok then I will backfill backwards in time, so the most
> recent days are done first, since this will take a while...
OOPS would help if I did 2013 not 2012, starting over again, will check it in the morning.
Assignee | ||
Comment 14•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #13)
> (In reply to Robert Helmer [:rhelmer] from comment #10)
> > ISTR this job taking 2-3 hours, so comment 9 may be jumping the gun - if
> > 2012-01-18 completes ok then I will backfill backwards in time, so the most
> > recent days are done first, since this will take a while...
>
> OOPS would help if I did 2013 not 2012, starting over again, will check it
> in the morning.
This completed successfully, but no output was copied to crash-analysis. Doing some further debugging.
Reporter | ||
Comment 15•12 years ago
|
||
There are correlations from March 10th: https://crash-analysis.mozilla.com/crash_analysis/20130310/
Reporter | ||
Comment 16•12 years ago
|
||
... but none for March 11th.
Reporter | ||
Updated•12 years ago
|
Summary: No crash correlations for even days since January 18 and for all days since February 25 → No crash correlations for even days since January 18, for all days from February 25 to March 9, for odd days since March 11
Reporter | ||
Comment 17•12 years ago
|
||
Again no crash correlations at all.
Summary: No crash correlations for even days since January 18, for all days from February 25 to March 9, for odd days since March 11 → No crash correlations for even days from Jan 18 to Feb 24, for odd days from Mar 11 to 19, for all days from Feb 25 to Mar 9 and since Mar 21
Assignee | ||
Comment 18•12 years ago
|
||
Correlations is currently running:
socorro 13840 13835 0 Mar22 ? 00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_libraries.sh > /var/log/socorro/cron_libraries.log 2>&1
socorro 13841 13840 0 Mar22 ? 00:00:00 /bin/bash /data/socorro/application/scripts/crons/cron_libraries.sh
We're likely hitting some bug in the correlations scripts, since it's hanging so often.
Assignee | ||
Comment 19•12 years ago
|
||
Hang looks like bug 848681, pinged on that.
Assignee | ||
Comment 20•12 years ago
|
||
Terminated all hanging processes, attempting backfill again starting with 2013-03-25
Assignee | ||
Comment 21•12 years ago
|
||
Finally got a good run for the 25th:
https://crash-analysis.mozilla.com/crash_analysis/20130325/
Continuing to attempt backfill, will see if it hangs again...
Comment 22•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #21)
> Finally got a good run for the 25th:
>
> https://crash-analysis.mozilla.com/crash_analysis/20130325/
>
> Continuing to attempt backfill, will see if it hangs again...
How are things looking? This seems to be getting in the way of Scoobidiver & KaiRo's stability investigations, from what I've seen in other bugs.
Assignee | ||
Comment 23•12 years ago
|
||
(In reply to Alex Keybl [:akeybl] from comment #22)
> (In reply to Robert Helmer [:rhelmer] from comment #21)
> > Finally got a good run for the 25th:
> >
> > https://crash-analysis.mozilla.com/crash_analysis/20130325/
> >
> > Continuing to attempt backfill, will see if it hangs again...
>
> How are things looking? This seems to be getting in the way of Scoobidiver &
> KaiRo's stability investigations, from what I've seen in other bugs.
26th is done, doing 27th now. The way I am doing backfill right now creates a lock that blocks the normal job from running, but I will separate it out so that doesn't happen.
I think the root cause here is bug 848681, going to raise the severity on that since there hasn't been any action yet.
Assignee | ||
Comment 24•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #23)
> (In reply to Alex Keybl [:akeybl] from comment #22)
> > (In reply to Robert Helmer [:rhelmer] from comment #21)
> > > Finally got a good run for the 25th:
> > >
> > > https://crash-analysis.mozilla.com/crash_analysis/20130325/
> > >
> > > Continuing to attempt backfill, will see if it hangs again...
> >
> > How are things looking? This seems to be getting in the way of Scoobidiver &
> > KaiRo's stability investigations, from what I've seen in other bugs.
>
> 26th is done, doing 27th now. The way I am doing backfill right now creates
> a lock that blocks the normal job from running, but I will separate it out
> so that doesn't happen.
This is done and tested, attempting to backfill all missing days (going backwards in time). This should not interfere with regularly scheduled correlations reports moving forward.
We'll need to keep an eye out for hanging processes until bug 848681 is fixed. Really we should have an alert watching these output dirs, I'll file that.
Assignee | ||
Comment 25•12 years ago
|
||
Filed bug 857253 for monitoring these directories, so we don't miss this type of problem again.
Reporter | ||
Comment 26•12 years ago
|
||
Crash correlations are now available from March 15th to April 1st but still none on April 2nd and 3rd.
Assignee | ||
Comment 27•12 years ago
|
||
(In reply to Scoobidiver from comment #26)
> Crash correlations are now available from March 15th to April 1st but still
> none on April 2nd and 3rd.
3rd being blank is expected (correlations run for the previous UTC day), but 2nd is not. Backfill is still happening (filling in all missing days for 2013).
I will check on status of the 2nd and rerun right now (takes a few hours)
Assignee | ||
Comment 28•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #27)
> I will check on status of the 2nd and rerun right now (takes a few hours)
Running (in addition to already-running backfill)
![]() |
||
Comment 29•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #27)
> 3rd being blank is expected (correlations run for the previous UTC day)
Actually, I noticed that we always created correlations for the current day, even though that always looked strange for me. We might actually place them into the "wrong" directory, but AFAIK even the UI might depend on that.
Assignee | ||
Comment 30•12 years ago
|
||
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #29)
> (In reply to Robert Helmer [:rhelmer] from comment #27)
> > 3rd being blank is expected (correlations run for the previous UTC day)
>
> Actually, I noticed that we always created correlations for the current day,
> even though that always looked strange for me. We might actually place them
> into the "wrong" directory, but AFAIK even the UI might depend on that.
oh interesting, thanks! yes looking at the script I see how that would be the case...
Reporter | ||
Updated•12 years ago
|
Summary: No crash correlations for even days from Jan 18 to Feb 24, for odd days from Mar 11 to 19, for all days from Feb 25 to Mar 9 and since Mar 21 → Crash correlations one day out of two or none since January 18
Reporter | ||
Comment 31•12 years ago
|
||
Since April 6, we have been in the no correlation period.
Assignee | ||
Comment 32•12 years ago
|
||
(In reply to Scoobidiver from comment #31)
> Since April 6, we have been in the no correlation period.
Backfill looks ok, but the regular nightly job has been hanging since the 7th - nothing too useful from strace:
Process 19080 attached - interrupt to quit
recvfrom(5, ^C <unfinished ...>
Process 19080 detached
This might be a case of bug 848681
Bug 857253 should detect this kind of thing in the future.
Assignee | ||
Comment 33•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #32)
> (In reply to Scoobidiver from comment #31)
> > Since April 6, we have been in the no correlation period.
>
> Backfill looks ok, but the regular nightly job has been hanging since the
> 7th - nothing too useful from strace:
Terminated this hanging job and starting a backfill for April 6th through today
Reporter | ||
Comment 34•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #33)
> starting a backfill for April 6th through today
For those correlation files, I get "You don't have permission to access ... on this server."
Assignee | ||
Comment 35•12 years ago
|
||
(In reply to Scoobidiver from comment #34)
> (In reply to Robert Helmer [:rhelmer] from comment #33)
> > starting a backfill for April 6th through today
> For those correlation files, I get "You don't have permission to access ...
> on this server."
Fixed this for the reports that are done so far
Reporter | ||
Comment 36•12 years ago
|
||
Correlations from April 12 are almost all empty.
Reporter | ||
Comment 37•12 years ago
|
||
(In reply to Scoobidiver from comment #36)
> Correlations from April 12 are almost all empty.
Same for April 14.
Assignee | ||
Comment 38•12 years ago
|
||
(In reply to Scoobidiver from comment #36)
> Correlations from April 12 are almost all empty.
(In reply to Scoobidiver from comment #37)
> (In reply to Scoobidiver from comment #36)
> > Correlations from April 12 are almost all empty.
> Same for April 14.
OK well that's new behavior :/ at least it's not hanging...
I'll try backfilling these days and do some debugging if they still come out empty.
Assignee | ||
Comment 39•12 years ago
|
||
Backfill for 201304{14..11} running now
Assignee | ||
Comment 40•12 years ago
|
||
Backfill still running, not sure if it is significant yet but seeing this traceback for some days/products/versions:
Traceback (most recent call last):
File "/data/crash-data-tools/per-crash-interesting-modules.py", line 115, in <module>
signame = signame + "|" + crash["reason"]
TypeError: coercing to Unicode: need string or buffer, NoneType found
Reporter | ||
Comment 41•12 years ago
|
||
The result is the same after the backfill. Is there another side effect of Socorro 41?
Reporter | ||
Comment 42•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #40)
> signame = signame + "|" + crash["reason"]
> TypeError: coercing to Unicode: need string or buffer, NoneType found
Is it related to the new processor?
Assignee | ||
Comment 43•12 years ago
|
||
(In reply to Scoobidiver from comment #42)
> (In reply to Robert Helmer [:rhelmer] from comment #40)
> > signame = signame + "|" + crash["reason"]
> > TypeError: coercing to Unicode: need string or buffer, NoneType found
> Is it related to the new processor?
I don't see any obvious reason, or any code changes that would account for this, but it's certainly possible. I have added some debugging and am attempting to backfill 20130409 now.
Assignee | ||
Comment 44•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #43)
> (In reply to Scoobidiver from comment #42)
> > (In reply to Robert Helmer [:rhelmer] from comment #40)
> > > signame = signame + "|" + crash["reason"]
> > > TypeError: coercing to Unicode: need string or buffer, NoneType found
> > Is it related to the new processor?
>
> I don't see any obvious reason, or any code changes that would account for
> this, but it's certainly possible. I have added some debugging and am
> attempting to backfill 20130409 now.
That seems to have worked:
https://crash-analysis.mozilla.com/crash_analysis/20130409/
Going to try 09 through 17th now.
Reporter | ||
Comment 45•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #44)
> That seems to have worked:
> https://crash-analysis.mozilla.com/crash_analysis/20130409/
It's an illusion. Data in correlations doesn't have the right format except for core counts. See https://crash-analysis.mozilla.com/crash_analysis/20130409/20130409_Firefox_20.0-interesting-addons.txt.gz and https://crash-analysis.mozilla.com/crash_analysis/20130409/20130409_Firefox_20.0-interesting-modules.txt.gz.
So it's a regression in Socorro 41.
Assignee | ||
Comment 46•12 years ago
|
||
(In reply to Scoobidiver from comment #45)
> (In reply to Robert Helmer [:rhelmer] from comment #44)
> > That seems to have worked:
> > https://crash-analysis.mozilla.com/crash_analysis/20130409/
> It's an illusion. Data in correlations doesn't have the right format except
> for core counts. See
> https://crash-analysis.mozilla.com/crash_analysis/20130409/
> 20130409_Firefox_20.0-interesting-addons.txt.gz and
> https://crash-analysis.mozilla.com/crash_analysis/20130409/
> 20130409_Firefox_20.0-interesting-modules.txt.gz.
> So it's a regression in Socorro 41.
Ah! Thanks for catching that, I only checked that the files were not 0-byte. I have now fixed this so that it goes to stderr and doesn't get in the output.
I'll rerun 20130409 and add more debugging to see exactly what's going on.
Assignee | ||
Comment 47•12 years ago
|
||
OK - so I've bisected this down, and what's going on is that crash "reason" seems to be missing for crashes that the correlations are picking up starting on 2013-04-09.
I tried pulling some processed crashes randomly from crash-stats and they seem to have "reason" in the JSON, so not sure what the deal is here. I am progressively adding more debugging to my copy of the correlation scripts, and will report back once I have more info.
Lars, do you have any ideas on the above? Thanks!
Flags: needinfo?(lars)
Assignee | ||
Comment 49•12 years ago
|
||
OK I believe that the issues we've been seeing since 2014-04-09 are bug 867342, and the previous problems were bugs that have now been resolved.
Bug 867345 will track backfilling all these 0-byte correlations once the underlying bug 867342 is fixed.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•11 years ago
|
Flags: needinfo?(lars)
You need to log in
before you can comment on or make changes to this bug.
Description
•