Closed Bug 947021 Opened 12 years ago Closed 11 years ago

Crash correlations one day out of two or none since Nov. 21st

Categories

(Socorro :: General, task)

task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hectorz, Assigned: rhelmer)

References

Details

Attachments

(2 files)

+++ This bug was initially created as a clone of Bug #836671 +++ This is happening again since Nov. 21st.
Assignee: nobody → rhelmer
Status: NEW → ASSIGNED
Hm so looks like: 1) we overwrite the log every day instead of appending 2) the script was hanging today (which gets hidden because of #1) I am running the script in debug mode in screen right now, will see exactly where it is hanging.
Bug 961141 makes it difficult to see how long things are taking (although we can infer the total elapsed time from the delta between the cron start time and the timestamp on the output file). However, it looks like the processed JSON export from hbase is taking a while and is quite large, if I am reading it right then the days it actually works, it's taking close to 13 hours (!) where it used to take around 1. No idea yet on why it only seems to produce output files every other day, having the logging fixed should help with that though.
Depends on: 961141
OK the primary problem seems to be that correlation reports are taking way longer to run than they used to - the one I kicked off manually in comment 1 is still running, this used to only take around an hour.
OK did some profiling and figured out what's going on - we increased the size and complexity of the processed JSON a few months back (gradually replacing the pipe-delimited "dump" string), and hit some pathological case in the built-in Python 2.6 json library. This causes the correlation scripts (such as the core-count) to take ~13 hours instead of ~15 minutes. I have a copy in prod that now uses a third-party JSON parser written in C (ujson), which brings us back to the previous "normal" time. I re-ran today's reports and am starting in on a backfill, starting yesterday and going back in time until Nov 21st (per comment 0).
(In reply to Robert Helmer [:rhelmer] from comment #4) > OK did some profiling and figured out what's going on - we increased the > size and complexity of the processed JSON a few months back (gradually > replacing the pipe-delimited "dump" string), and hit some pathological case > in the built-in Python 2.6 json library. This causes the correlation scripts > (such as the core-count) to take ~13 hours instead of ~15 minutes. > > I have a copy in prod that now uses a third-party JSON parser written in C > (ujson), which brings us back to the previous "normal" time. I re-ran > today's reports and am starting in on a backfill, starting yesterday and > going back in time until Nov 21st (per comment 0). Not sure if this is related, but https://crash-analysis.mozilla.com/crash_analysis/20140122/ returns 403 Forbidden for me.
(In reply to Hector Zhao [:hectorz] from comment #5) > (In reply to Robert Helmer [:rhelmer] from comment #4) > > OK did some profiling and figured out what's going on - we increased the > > size and complexity of the processed JSON a few months back (gradually > > replacing the pipe-delimited "dump" string), and hit some pathological case > > in the built-in Python 2.6 json library. This causes the correlation scripts > > (such as the core-count) to take ~13 hours instead of ~15 minutes. > > > > I have a copy in prod that now uses a third-party JSON parser written in C > > (ujson), which brings us back to the previous "normal" time. I re-ran > > today's reports and am starting in on a backfill, starting yesterday and > > going back in time until Nov 21st (per comment 0). > > Not sure if this is related, but > https://crash-analysis.mozilla.com/crash_analysis/20140122/ returns 403 > Forbidden for me. Yep related and fixed in bug 962876 And in fact looking in there now all the reports are 0-byte... that didn't happen when I ran this manually before :( Might need to dig into this a bit more.
Ah I see the 0-byte file problem, re-running... I think there's a bug in the way this script sets permissions too, will check it out when it finishes.
OK it is finally working: https://crash-analysis.mozilla.com/crash_analysis/20140122/ Backfill is ongoing. There does appear to be a bug still with permissions, will take a look at that as part of this bug. I have a local fix for this running in socorro user's homedir, triggered by socorro user's crontab, in production - this needs to be checked in before this bug is closed, and the crontab entry removed when the fix ships.
(In reply to Robert Helmer [:rhelmer] from comment #8) > OK it is finally working: > https://crash-analysis.mozilla.com/crash_analysis/20140122/ Thanks! It seems the non-gzipped version (0 in size) will be kept on server, but no big deal.
(In reply to Hector Zhao [:hectorz] from comment #9) > (In reply to Robert Helmer [:rhelmer] from comment #8) > > OK it is finally working: > > https://crash-analysis.mozilla.com/crash_analysis/20140122/ > > Thanks! > > It seems the non-gzipped version (0 in size) will be kept on server, but no > big deal. Huh, I will remove those, thanks for pointing that out
(In reply to Robert Helmer [:rhelmer] from comment #10) > (In reply to Hector Zhao [:hectorz] from comment #9) > > (In reply to Robert Helmer [:rhelmer] from comment #8) > > > OK it is finally working: > > > https://crash-analysis.mozilla.com/crash_analysis/20140122/ > > > > Thanks! > > > > It seems the non-gzipped version (0 in size) will be kept on server, but no > > big deal. > > Huh, I will remove those, thanks for pointing that out Actually even better I just replaced them with the proper version. I think it was an artifact from the previous bad run that was not replaced due to a permissions bug (which I believe to have fixed in the version now running)
(In reply to Robert Helmer [:rhelmer] from comment #11) > > Actually even better I just replaced them with the proper version. I think > it was an artifact from the previous bad run that was not replaced due to a > permissions bug (which I believe to have fixed in the version now running) Ok, I've updated my script to prefer the gzipped one when both exist.
dmajor just pointed out that https://crash-stats.mozilla.com/report/list?product=Firefox&signature=RtlInterlockedFlushSList+|+winmm_buffer_thread is showing "null" for correlations, turned out reports for the past few days are 0-byte files :( This is because the old unfixed version of this script is still running in production, and acquiring a lock so the new fixed one I am running can't run. I've changed the lock name and am re-running it now, will ask for the current one to be disabled until the fix is shipped (it'll run out of the socorro user's crontab until then)
Depends on: 964496
Attached patch bug947021.diffSplinter Review
use ujson instead
Attachment #8366292 - Flags: review?(peterbe)
Attachment #8366297 - Flags: review?(peterbe)
Commits pushed to master at https://github.com/mozilla/socorro https://github.com/mozilla/socorro/commit/8fdd97269ec57a60966fcecbac97dbb02fcf85fe bug 947021 - json is pathologically slow on our new processed structure, provide ujson for correlation reports instead https://github.com/mozilla/socorro/commit/31a2a14247c75319f6d0ba86277065178cd67990 Merge pull request #1829 from rhelmer/bug947021-correlations-json-slow bug 947021 - json is pathologically slow on our new processed structure,...
Attachment #8366297 - Flags: review?(peterbe)
Attachment #8366292 - Flags: review?(peterbe) → review+
Blocks: 1004532
Blocks: 1004534
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: