Closed
Bug 947021
Opened 12 years ago
Closed 11 years ago
Crash correlations one day out of two or none since Nov. 21st
Categories
(Socorro :: General, task)
Socorro
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: hectorz, Assigned: rhelmer)
References
Details
Attachments
(2 files)
+++ This bug was initially created as a clone of Bug #836671 +++
This is happening again since Nov. 21st.
Assignee | ||
Updated•12 years ago
|
Assignee: nobody → rhelmer
Assignee | ||
Updated•12 years ago
|
Status: NEW → ASSIGNED
Assignee | ||
Comment 1•12 years ago
|
||
Hm so looks like:
1) we overwrite the log every day instead of appending
2) the script was hanging today (which gets hidden because of #1)
I am running the script in debug mode in screen right now, will see exactly where it is hanging.
Assignee | ||
Comment 2•12 years ago
|
||
Bug 961141 makes it difficult to see how long things are taking (although we can infer the total elapsed time from the delta between the cron start time and the timestamp on the output file).
However, it looks like the processed JSON export from hbase is taking a while and is quite large, if I am reading it right then the days it actually works, it's taking close to 13 hours (!) where it used to take around 1.
No idea yet on why it only seems to produce output files every other day, having the logging fixed should help with that though.
Depends on: 961141
Assignee | ||
Comment 3•12 years ago
|
||
OK the primary problem seems to be that correlation reports are taking way longer to run than they used to - the one I kicked off manually in comment 1 is still running, this used to only take around an hour.
Assignee | ||
Comment 4•12 years ago
|
||
OK did some profiling and figured out what's going on - we increased the size and complexity of the processed JSON a few months back (gradually replacing the pipe-delimited "dump" string), and hit some pathological case in the built-in Python 2.6 json library. This causes the correlation scripts (such as the core-count) to take ~13 hours instead of ~15 minutes.
I have a copy in prod that now uses a third-party JSON parser written in C (ujson), which brings us back to the previous "normal" time. I re-ran today's reports and am starting in on a backfill, starting yesterday and going back in time until Nov 21st (per comment 0).
Reporter | ||
Comment 5•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #4)
> OK did some profiling and figured out what's going on - we increased the
> size and complexity of the processed JSON a few months back (gradually
> replacing the pipe-delimited "dump" string), and hit some pathological case
> in the built-in Python 2.6 json library. This causes the correlation scripts
> (such as the core-count) to take ~13 hours instead of ~15 minutes.
>
> I have a copy in prod that now uses a third-party JSON parser written in C
> (ujson), which brings us back to the previous "normal" time. I re-ran
> today's reports and am starting in on a backfill, starting yesterday and
> going back in time until Nov 21st (per comment 0).
Not sure if this is related, but https://crash-analysis.mozilla.com/crash_analysis/20140122/ returns 403 Forbidden for me.
Assignee | ||
Comment 6•12 years ago
|
||
(In reply to Hector Zhao [:hectorz] from comment #5)
> (In reply to Robert Helmer [:rhelmer] from comment #4)
> > OK did some profiling and figured out what's going on - we increased the
> > size and complexity of the processed JSON a few months back (gradually
> > replacing the pipe-delimited "dump" string), and hit some pathological case
> > in the built-in Python 2.6 json library. This causes the correlation scripts
> > (such as the core-count) to take ~13 hours instead of ~15 minutes.
> >
> > I have a copy in prod that now uses a third-party JSON parser written in C
> > (ujson), which brings us back to the previous "normal" time. I re-ran
> > today's reports and am starting in on a backfill, starting yesterday and
> > going back in time until Nov 21st (per comment 0).
>
> Not sure if this is related, but
> https://crash-analysis.mozilla.com/crash_analysis/20140122/ returns 403
> Forbidden for me.
Yep related and fixed in bug 962876
And in fact looking in there now all the reports are 0-byte... that didn't happen when I ran this manually before :( Might need to dig into this a bit more.
Assignee | ||
Comment 7•12 years ago
|
||
Ah I see the 0-byte file problem, re-running... I think there's a bug in the way this script sets permissions too, will check it out when it finishes.
Assignee | ||
Comment 8•12 years ago
|
||
OK it is finally working:
https://crash-analysis.mozilla.com/crash_analysis/20140122/
Backfill is ongoing. There does appear to be a bug still with permissions, will take a look at that as part of this bug.
I have a local fix for this running in socorro user's homedir, triggered by socorro user's crontab, in production - this needs to be checked in before this bug is closed, and the crontab entry removed when the fix ships.
Reporter | ||
Comment 9•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #8)
> OK it is finally working:
> https://crash-analysis.mozilla.com/crash_analysis/20140122/
Thanks!
It seems the non-gzipped version (0 in size) will be kept on server, but no big deal.
Assignee | ||
Comment 10•12 years ago
|
||
(In reply to Hector Zhao [:hectorz] from comment #9)
> (In reply to Robert Helmer [:rhelmer] from comment #8)
> > OK it is finally working:
> > https://crash-analysis.mozilla.com/crash_analysis/20140122/
>
> Thanks!
>
> It seems the non-gzipped version (0 in size) will be kept on server, but no
> big deal.
Huh, I will remove those, thanks for pointing that out
Assignee | ||
Comment 11•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #10)
> (In reply to Hector Zhao [:hectorz] from comment #9)
> > (In reply to Robert Helmer [:rhelmer] from comment #8)
> > > OK it is finally working:
> > > https://crash-analysis.mozilla.com/crash_analysis/20140122/
> >
> > Thanks!
> >
> > It seems the non-gzipped version (0 in size) will be kept on server, but no
> > big deal.
>
> Huh, I will remove those, thanks for pointing that out
Actually even better I just replaced them with the proper version. I think it was an artifact from the previous bad run that was not replaced due to a permissions bug (which I believe to have fixed in the version now running)
Reporter | ||
Comment 12•12 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #11)
>
> Actually even better I just replaced them with the proper version. I think
> it was an artifact from the previous bad run that was not replaced due to a
> permissions bug (which I believe to have fixed in the version now running)
Ok, I've updated my script to prefer the gzipped one when both exist.
Assignee | ||
Comment 13•12 years ago
|
||
dmajor just pointed out that https://crash-stats.mozilla.com/report/list?product=Firefox&signature=RtlInterlockedFlushSList+|+winmm_buffer_thread is showing "null" for correlations, turned out reports for the past few days are 0-byte files :(
This is because the old unfixed version of this script is still running in production, and acquiring a lock so the new fixed one I am running can't run. I've changed the lock name and am re-running it now, will ask for the current one to be disabled until the fix is shipped (it'll run out of the socorro user's crontab until then)
Assignee | ||
Comment 15•12 years ago
|
||
Attachment #8366297 -
Flags: review?(peterbe)
Comment 16•12 years ago
|
||
Commits pushed to master at https://github.com/mozilla/socorro
https://github.com/mozilla/socorro/commit/8fdd97269ec57a60966fcecbac97dbb02fcf85fe
bug 947021 - json is pathologically slow on our new processed structure, provide ujson for correlation reports instead
https://github.com/mozilla/socorro/commit/31a2a14247c75319f6d0ba86277065178cd67990
Merge pull request #1829 from rhelmer/bug947021-correlations-json-slow
bug 947021 - json is pathologically slow on our new processed structure,...
Updated•12 years ago
|
Attachment #8366297 -
Flags: review?(peterbe)
Updated•12 years ago
|
Attachment #8366292 -
Flags: review?(peterbe) → review+
Assignee | ||
Updated•11 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•