Closed Bug 836671 Opened 7 years ago Closed 7 years ago

Crash correlations one day out of two or none since January 18

Categories

(Socorro :: General, task, critical)

task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: scoobidiver, Assigned: rhelmer)

References

Details

Since January 18, crash correlation folders for even days contain only a zipped csv file.
There's also January 12's case.
Assignee: nobody → rhelmer
Status: NEW → ASSIGNED
Looks like it's a problem related to locking:

[2013-01-28 08:05:01] lock cron_libraries
error (lock): lock already exists for cron_libraries pid 24770
[2013-01-29 08:05:01] lock cron_libraries
warn (lock): removing lock for cron_libraries, 24770 not running

This means that it skipped the 28th because PID 24770 was running, but the next day the lock was still found but 24770 was not running.
This is still happening, going to investigate further today.
Since February 26, there have been also no correlations on odd days.
Severity: normal → critical
Summary: No crash correlations for even days since January 18 → No crash correlations for even days since January 18 and for all days since February 25
Duplicate of this bug: 848202
related to crontabber?
(In reply to Laura Thomson :laura from comment #5)
> related to crontabber?

Shouldn't be, this script has not been moved over and it started happening before we deployed crontabber.

I'll take a look, sorry this dropped off my radar.
OK looks like the current problem is that the script is hanging and holding up the lockfile:

socorro  25938 25934  0 Feb26 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_libraries.sh >> /var/log/socorro/cron_libraries.log
socorro  25941 25938  0 Feb26 ?        00:00:00 /bin/bash /data/socorro/application/scripts/crons/cron_libraries.sh
socorro  26219 26214  0 Jan12 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_libraries.sh >> /var/log/socorro/cron_libraries.log
socorro  26224 26219  0 Jan12 ?        00:00:00 /bin/sh /data/socorro/application/scripts/crons/cron_libraries.sh
Depends on: 848419
Working on backfilling correlation reports now, if there's still trouble with even days (original bug here) then I'll tackle that next.
(In reply to Robert Helmer [:rhelmer] from comment #8)
> Working on backfilling correlation reports now, if there's still trouble
> with even days (original bug here) then I'll tackle that next.

This is taking an inordinate amount of time to run for 2012-01-18, on the "fetching from hbase" step. Looking closer, I see that it's spending hours on:

/usr/bin/python2.6 /data/crash-data-tools/per-crash-core-count.py -p Firefox -r 9.0.1 -f /tmp/Firefox_9.0.1.tar

I noticed that there are a lot of old scripts related to this job, I've terminated them all and am trying again. I'll debug further if it gets stuck again.
ISTR this job taking 2-3 hours, so comment 9 may be jumping the gun - if 2012-01-18 completes ok then I will backfill backwards in time, so the most recent days are done first, since this will take a while...
(In reply to Robert Helmer [:rhelmer] from comment #9)
> This is taking an inordinate amount of time to run for 2012-01-18, on the
> "fetching from hbase" step. Looking closer, I see that it's spending hours
> on:
> 
> /usr/bin/python2.6 /data/crash-data-tools/per-crash-core-count.py -p Firefox
> -r 9.0.1 -f /tmp/Firefox_9.0.1.tar

This is definitely hanging, strace reveals:

connect(4, {sa_family=AF_FILE, path="/var/run/abrt/abrt.socket"}, 27

This may be https://partner-bugzilla.redhat.com/show_bug.cgi?id=614752
One thing I notice so far is that the initial pull from hbase is logging a lot of "ooid not found" errors, it could simply be that there's nothing to process here and python isn't exiting correctly for some reason.

I am going to try backfilling backwards in time, starting yesterday.
Blocks: 848681
No longer blocks: 848681
Depends on: 848681
(In reply to Robert Helmer [:rhelmer] from comment #10)
> ISTR this job taking 2-3 hours, so comment 9 may be jumping the gun - if
> 2012-01-18 completes ok then I will backfill backwards in time, so the most
> recent days are done first, since this will take a while...

OOPS would help if I did 2013 not 2012, starting over again, will check it in the morning.
(In reply to Robert Helmer [:rhelmer] from comment #13)
> (In reply to Robert Helmer [:rhelmer] from comment #10)
> > ISTR this job taking 2-3 hours, so comment 9 may be jumping the gun - if
> > 2012-01-18 completes ok then I will backfill backwards in time, so the most
> > recent days are done first, since this will take a while...
> 
> OOPS would help if I did 2013 not 2012, starting over again, will check it
> in the morning.

This completed successfully, but no output was copied to crash-analysis. Doing some further debugging.
There are correlations from March 10th: https://crash-analysis.mozilla.com/crash_analysis/20130310/
... but none for March 11th.
Summary: No crash correlations for even days since January 18 and for all days since February 25 → No crash correlations for even days since January 18, for all days from February 25 to March 9, for odd days since March 11
Again no crash correlations at all.
Summary: No crash correlations for even days since January 18, for all days from February 25 to March 9, for odd days since March 11 → No crash correlations for even days from Jan 18 to Feb 24, for odd days from Mar 11 to 19, for all days from Feb 25 to Mar 9 and since Mar 21
Correlations is currently running:

socorro  13840 13835  0 Mar22 ?        00:00:00 /bin/sh -c /data/socorro/application/scripts/crons/cron_libraries.sh > /var/log/socorro/cron_libraries.log 2>&1
socorro  13841 13840  0 Mar22 ?        00:00:00 /bin/bash /data/socorro/application/scripts/crons/cron_libraries.sh

We're likely hitting some bug in the correlations scripts, since it's hanging so often.
Hang looks like bug 848681, pinged on that.
Terminated all hanging processes, attempting backfill again starting with 2013-03-25
Finally got a good run for the 25th:

https://crash-analysis.mozilla.com/crash_analysis/20130325/

Continuing to attempt backfill, will see if it hangs again...
(In reply to Robert Helmer [:rhelmer] from comment #21)
> Finally got a good run for the 25th:
> 
> https://crash-analysis.mozilla.com/crash_analysis/20130325/
> 
> Continuing to attempt backfill, will see if it hangs again...

How are things looking? This seems to be getting in the way of Scoobidiver & KaiRo's stability investigations, from what I've seen in other bugs.
(In reply to Alex Keybl [:akeybl] from comment #22)
> (In reply to Robert Helmer [:rhelmer] from comment #21)
> > Finally got a good run for the 25th:
> > 
> > https://crash-analysis.mozilla.com/crash_analysis/20130325/
> > 
> > Continuing to attempt backfill, will see if it hangs again...
> 
> How are things looking? This seems to be getting in the way of Scoobidiver &
> KaiRo's stability investigations, from what I've seen in other bugs.

26th is done, doing 27th now. The way I am doing backfill right now creates a lock that blocks the normal job from running, but I will separate it out so that doesn't happen.

I think the root cause here is bug 848681, going to raise the severity on that since there hasn't been any action yet.
(In reply to Robert Helmer [:rhelmer] from comment #23)
> (In reply to Alex Keybl [:akeybl] from comment #22)
> > (In reply to Robert Helmer [:rhelmer] from comment #21)
> > > Finally got a good run for the 25th:
> > > 
> > > https://crash-analysis.mozilla.com/crash_analysis/20130325/
> > > 
> > > Continuing to attempt backfill, will see if it hangs again...
> > 
> > How are things looking? This seems to be getting in the way of Scoobidiver &
> > KaiRo's stability investigations, from what I've seen in other bugs.
> 
> 26th is done, doing 27th now. The way I am doing backfill right now creates
> a lock that blocks the normal job from running, but I will separate it out
> so that doesn't happen.

This is done and tested, attempting to backfill all missing days (going backwards in time). This should not interfere with regularly scheduled correlations reports moving forward.

We'll need to keep an eye out for hanging processes until bug 848681 is fixed. Really we should have an alert watching these output dirs, I'll file that.
Filed bug 857253 for monitoring these directories, so we don't miss this type of problem again.
Crash correlations are now available from March 15th to April 1st but still none on April 2nd and 3rd.
(In reply to Scoobidiver from comment #26)
> Crash correlations are now available from March 15th to April 1st but still
> none on April 2nd and 3rd.

3rd being blank is expected (correlations run for the previous UTC day), but 2nd is not. Backfill is still happening (filling in all missing days for 2013).

I will check on status of the 2nd and rerun right now (takes a few hours)
(In reply to Robert Helmer [:rhelmer] from comment #27)
> I will check on status of the 2nd and rerun right now (takes a few hours)

Running (in addition to already-running backfill)
(In reply to Robert Helmer [:rhelmer] from comment #27)
> 3rd being blank is expected (correlations run for the previous UTC day)

Actually, I noticed that we always created correlations for the current day, even though that always looked strange for me. We might actually place them into the "wrong" directory, but AFAIK even the UI might depend on that.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #29)
> (In reply to Robert Helmer [:rhelmer] from comment #27)
> > 3rd being blank is expected (correlations run for the previous UTC day)
> 
> Actually, I noticed that we always created correlations for the current day,
> even though that always looked strange for me. We might actually place them
> into the "wrong" directory, but AFAIK even the UI might depend on that.

oh interesting, thanks! yes looking at the script I see how that would be the case...
Summary: No crash correlations for even days from Jan 18 to Feb 24, for odd days from Mar 11 to 19, for all days from Feb 25 to Mar 9 and since Mar 21 → Crash correlations one day out of two or none since January 18
Since April 6, we have been in the no correlation period.
(In reply to Scoobidiver from comment #31)
> Since April 6, we have been in the no correlation period.

Backfill looks ok, but the regular nightly job has been hanging since the 7th - nothing too useful from strace:

Process 19080 attached - interrupt to quit
recvfrom(5, ^C <unfinished ...>
Process 19080 detached

This might be a case of bug 848681
Bug 857253 should detect this kind of thing in the future.
(In reply to Robert Helmer [:rhelmer] from comment #32)
> (In reply to Scoobidiver from comment #31)
> > Since April 6, we have been in the no correlation period.
> 
> Backfill looks ok, but the regular nightly job has been hanging since the
> 7th - nothing too useful from strace:

Terminated this hanging job and starting a backfill for April 6th through today
(In reply to Robert Helmer [:rhelmer] from comment #33)
> starting a backfill for April 6th through today
For those correlation files, I get "You don't have permission to access ... on this server."
(In reply to Scoobidiver from comment #34)
> (In reply to Robert Helmer [:rhelmer] from comment #33)
> > starting a backfill for April 6th through today
> For those correlation files, I get "You don't have permission to access ...
> on this server."

Fixed this for the reports that are done so far
Correlations from April 12 are almost all empty.
(In reply to Scoobidiver from comment #36)
> Correlations from April 12 are almost all empty.
Same for April 14.
(In reply to Scoobidiver from comment #36)
> Correlations from April 12 are almost all empty.

(In reply to Scoobidiver from comment #37)
> (In reply to Scoobidiver from comment #36)
> > Correlations from April 12 are almost all empty.
> Same for April 14.

OK well that's new behavior :/ at least it's not hanging...

I'll try backfilling these days and do some debugging if they still come out empty.
Backfill for 201304{14..11} running now
Backfill still running, not sure if it is significant yet but seeing this traceback for some days/products/versions:

Traceback (most recent call last):
  File "/data/crash-data-tools/per-crash-interesting-modules.py", line 115, in <module>
    signame = signame + "|" + crash["reason"]
TypeError: coercing to Unicode: need string or buffer, NoneType found
The result is the same after the backfill. Is there another side effect of Socorro 41?
(In reply to Robert Helmer [:rhelmer] from comment #40)
>     signame = signame + "|" + crash["reason"]
> TypeError: coercing to Unicode: need string or buffer, NoneType found
Is it related to the new processor?
(In reply to Scoobidiver from comment #42)
> (In reply to Robert Helmer [:rhelmer] from comment #40)
> >     signame = signame + "|" + crash["reason"]
> > TypeError: coercing to Unicode: need string or buffer, NoneType found
> Is it related to the new processor?

I don't see any obvious reason, or any code changes that would account for this, but it's certainly possible. I have added some debugging and am attempting to backfill 20130409 now.
(In reply to Robert Helmer [:rhelmer] from comment #43)
> (In reply to Scoobidiver from comment #42)
> > (In reply to Robert Helmer [:rhelmer] from comment #40)
> > >     signame = signame + "|" + crash["reason"]
> > > TypeError: coercing to Unicode: need string or buffer, NoneType found
> > Is it related to the new processor?
> 
> I don't see any obvious reason, or any code changes that would account for
> this, but it's certainly possible. I have added some debugging and am
> attempting to backfill 20130409 now.

That seems to have worked:

https://crash-analysis.mozilla.com/crash_analysis/20130409/

Going to try 09 through 17th now.
(In reply to Robert Helmer [:rhelmer] from comment #44)
> That seems to have worked:
> https://crash-analysis.mozilla.com/crash_analysis/20130409/
It's an illusion. Data in correlations doesn't have the right format except for core counts. See https://crash-analysis.mozilla.com/crash_analysis/20130409/20130409_Firefox_20.0-interesting-addons.txt.gz and https://crash-analysis.mozilla.com/crash_analysis/20130409/20130409_Firefox_20.0-interesting-modules.txt.gz.
So it's a regression in Socorro 41.
(In reply to Scoobidiver from comment #45)
> (In reply to Robert Helmer [:rhelmer] from comment #44)
> > That seems to have worked:
> > https://crash-analysis.mozilla.com/crash_analysis/20130409/
> It's an illusion. Data in correlations doesn't have the right format except
> for core counts. See
> https://crash-analysis.mozilla.com/crash_analysis/20130409/
> 20130409_Firefox_20.0-interesting-addons.txt.gz and
> https://crash-analysis.mozilla.com/crash_analysis/20130409/
> 20130409_Firefox_20.0-interesting-modules.txt.gz.
> So it's a regression in Socorro 41.

Ah! Thanks for catching that, I only checked that the files were not 0-byte. I have now fixed this so that it goes to stderr and doesn't get in the output.

I'll rerun 20130409 and add more debugging to see exactly what's going on.
OK - so I've bisected this down, and what's going on is that crash "reason" seems to be missing for crashes that the correlations are picking up starting on 2013-04-09.

I tried pulling some processed crashes randomly from crash-stats and they seem to have "reason" in the JSON, so not sure what the deal is here. I am progressively adding more debugging to my copy of the correlation scripts, and will report back once I have more info.

Lars, do you have any ideas on the above? Thanks!
Flags: needinfo?(lars)
Duplicate of this bug: 867144
OK I believe that the issues we've been seeing since 2014-04-09 are bug 867342, and the previous problems were bugs that have now been resolved.

Bug 867345 will track backfilling all these 0-byte correlations once the underlying bug 867342 is fixed.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Flags: needinfo?(lars)
You need to log in before you can comment on or make changes to this bug.