Closed Bug 630948 Opened 13 years ago Closed 13 years ago

Please restore the dataset to Socorro staging, to allow continued WebQA tests

Categories

(Socorro :: General, task, P1)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: stephend, Assigned: rhelmer)


Details

WebQA needs https://crash-stats.stage.mozilla.com to have as much data as possible (at least as much as we had while developing and running our automation tests for 1.7.6) to run against the 1.7.7 milestone (and beyond).

Not sure what needs to happen on the IT/Socorro dev side, but Jabba and Rob would know.

From IRC, looks like we at least fixed a typo in submit_dump_to_staging.sh, and rhelmer was going to land a fix to puppet SVN and re-enable the job in the crontab.
Assignee: server-ops → jdow
Jabba, I know you're busy with a ton of other stuff (as always), but it'd really help us get our automation tests going/passing again if we could get a more complete dataset; thanks!
From what I can tell, the old cronjob is functioning normally again. I think it just grabs random crashes and submits them to staging. I'll give this to rob to see if there is a reason that the data isn't there like it used to be. Perhaps we are throttling in stage?
Assignee: jdow → rhelmer
Component: Server Operations → Socorro
Product: mozilla.org → Webtools
QA Contact: mrz → socorro
Reassigning since I am out this week, and presumably you do not want to wait that long :)

Laura, I think the only thing needed to unblock QA is probably to take at least one crash for each product from prod and insert it into stage.

(In reply to comment #2)
> From what I can tell, the old cronjob is functioning normally again. I think it
> just grabs random crashes and submits them to staging. I'll give this to rob to
> see if there is a reason that the data isn't there like it used to be. Perhaps
> we are throttling in stage?

I don't think we throttle these products, though it would be worth checking.

Most likely, these products are such a small subset of overall crashes that we just have not happened to pull them yet; I see crashes from today in stage for Firefox so I assume the cron job is functioning correctly.
Assignee: rhelmer → laura
I'm not sure that just "one crash for each product" will keep our automation happy, as I think it tests multiple views per product, some of which might depend on more data, but David can best answer that.  (In reply to comment 3.)
This is a blocker for 1.7.7; I'm unable to see whether bugs like bug 633235 (which require pagination, and thus, larger datasets than what is available on staging right now) are regressions or misconfigurations in prod (likely the former).
Severity: major → blocker
Target Milestone: --- → 1.7.7
(In reply to comment #4)
> I'm not sure that just "one crash for each product" will keep our automation
> happy, as I think tests multiple views per product, some of which might depend
> on more data, but David can best answer that.  (In reply to comment 3.)

In that case, we could pull a larger one-time set of crashes and insert them to stage... we developed some tools to do this as part of the load-testing exercise pre-PHX (I think we pulled 10% of crashes over a period of 2 weeks).

Laura, any reason not to do this?
(In reply to comment #6)
> (In reply to comment #4)
> > I'm not sure that just "one crash for each product" will keep our automation
> > happy, as I think tests multiple views per product, some of which might depend
> > on more data, but David can best answer that.  (In reply to comment 3.)
> 
> In that case, we could pull a larger one-time set of crashes and insert them to
> stage... we developed some tools to do this as part of the load-testing
> exercise pre-PHX (I think we pulled 10% of crashes over a period of 2 weeks).
> 
> Laura, any reason not to do this?

Go ahead.
Assignee: laura → rhelmer
Going to use the code we developed in bug 619814 to pull a subset of crashes, and the latest trunk submitter.py to insert them into staging.

Going to aim for 10% of total crashes over a 10-day period, which was about 240k when we did this for the PHX transition.

ETA tomorrow to have this inserting into staging, may take a bit longer if staging can't keep up with the load (very likely). I suspect that QA's needs will be satisfied before this is complete.
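As a quick sanity check on those figures (the ~240k total is from the PHX exercise mentioned above), a 10-day window works out to 24k crashes per date, which matches the per-date count passed to the export script in the next comment:

```shell
# Back-of-the-envelope check: ~240k crashes spread over a 10-day
# window comes out to 24k crashes per date.
total=240000
days=10
per_day=$((total / days))
echo "$per_day"   # prints 24000
```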
Status: NEW → ASSIGNED
Priority: -- → P1
For future reference, pulling the crashes looks like:

"""
. /etc/socorro/socorrorc
. /etc/socorro/socorro-monitor.conf
for date in 11020{6..9} 1102{10..16}; do
  mkdir -p /tmp/test/${date}
  python ${APPDIR}/socorro/storage/hbaseClient.py -h $hbaseHost export_sampled_crashes_tarball_for_dates 24000 ${date} /tmp/test/${date} crashes-${date}.tar.gz > ${date}.log 2>&1 &
done
wait
"""

10 day period, 10 simultaneous hbase connections, 24k crashes each.
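One detail worth calling out in that snippet: the date list relies on bash brace expansion to generate the YYMMDD values for 2011-02-06 through 2011-02-16, so no explicit date loop is needed:

```shell
# Under bash, 11020{6..9} expands to 110206..110209 and
# 1102{10..16} expands to 110210..110216, covering the whole range.
dates=$(echo 11020{6..9} 1102{10..16})
echo "$dates"
```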
Crashes are being inserted and processed, but it looks like we're still missing data for Thunderbird, SeaMonkey and Fennec.

I'm going to check to make sure that the dataset I extracted from production has at least a few crashes for each of these. It's taking a very long time to insert, which I guess should not be a huge surprise :)

I may prioritize the products with missing data though, to be able to close this bug sooner.
Actually it looks like the bigger problem is that all throttling is set to 10% for all products/versions on staging, and many current releases have not been updated (since this is a manual process and not synced from production, as I understand it).

I'll do this manually for now and see if it helps, it's going to be an ongoing issue until we make branch data sources more automatic and remove the need for throttling (not sure if we have bugs for either of these issues yet, although I know they've both been discussed).
(In reply to comment #11)
> I'll do this manually for now and see if it helps, it's going to be an ongoing
> issue until we make branch data sources more automatic and remove the need for
> throttling (not sure if we have bugs for either of these issues yet, although I
> know they've both been discussed).

This helped a bit, but I realized that the active date range is going to be all wrong, and it'll take far too long to go through and sync this to production by hand.

Since we're taking production crashes anyway, we should just import this data from prod to staging. I think that'd give us the most consistent experience on stage. We've been talking about using a full database copy for staging soon, which will be a better long-term solution for this problem.

Also just to clarify, I believe that we now have enough crash data in stage, it's just a matter of getting the web UI to display it. Versions are always going to be a moving target, and since branch data source info is updated by hand in production we need to get that over to staging somehow. If this seems wrong, please speak up :)
(In reply to comment #12)

> Since we're taking production crashes anyway, we should just import this data
> from prod to staging. I think that'd give us the most consistent experience on
> stage. We've been talking about using a full database copy for staging soon,
> which will be a better long-term solution for this problem.
> 
> Also just to clarify, I believe that we now have enough crash data in stage,
> it's just a matter of getting the web UI to display it. Versions are always
> going to be a moving target, and since branch data source info is updated by
> hand in production we need to get that over to staging somehow. If this seems
> wrong, please speak up :)

Copying a full prod database over to staging is *precisely* what we do for AMO, and makes the most sense; for Socorro, are there privacy risks?  Since prod is public (except for stuff behind Admin), and the same functionality exists on both, I'm guessing there aren't.

This would solve two problems:

1) Not enough (and of the right type) of data, on staging
2) We wouldn't have to, by hand, monkey with /admin/branch_data_sources, would we?
(In reply to comment #13)
> Copying a full prod database over to staging is *precisely* what we do for AMO,
> and makes the most sense; for Socorro, are there privacy risks?  Since prod is
> public (except for stuff behind Admin), and the same functionality exists on
> both, I'm guessing there aren't.

We currently take crashes from prod so I don't think it's different in this case.

> This would solve two problems:
> 
> 1) Not enough (and of the right type) of data, on staging
> 2) We wouldn't have to, by hand, monkey with /admin/branch_data_sources, would
> we?

Right. This is the plan for the new staging (from repurposed SJC prod hardware). The HBase situation will be a little more complicated, but we can cross that bridge (with Metrics' help) when we come to it :)

For the purposes of the current bug, I think we can get #1 close enough (as close as it's ever been anyway), and for #2 I am going to dump the table in prod and import it to stage (I'll file a separate IT bug for this).
(In reply to comment #14)
> For the purposes of the current bug, I think we can get #1 close enough (as
> close as it's ever been anyway), and for #2 I am going to dump the table in
> prod and import it to stage (I'll file a separate IT bug for this).

I've dumped productdims and product_visibility tables from prod, but have not asked them to be imported to stage yet. I'll test this out today and file the aforementioned IT bug.
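For reference, a minimal sketch of that dump-and-import step, assuming standard PostgreSQL tooling; the hostnames and database name here are placeholders, not the real prod/stage values:

```shell
# Hypothetical helper (hosts and DB name are placeholders): dump only the
# two branch-data tables from prod, then load the dump into stage.
sync_branch_tables() {
  pg_dump -h prod-db-host -t productdims -t product_visibility breakpad \
    > branch_data.sql
  psql -h stage-db-host breakpad < branch_data.sql
}
```

As with any prod-to-stage import, this should be tried against a scratch database first; a later comment notes the author was wary of exactly that kind of breakage.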
Downgrading from blocker to normal, since we're now passing all automated tests except for Camino. I'll take a closer look at this; it seems to work manually after a little tweaking.

(In reply to comment #15)
> I've dumped productdims and product_visibility tables from prod, but have not
> asked them to be imported to stage yet. I'll test this out today and file the
> aforementioned IT bug.

I'm a little wary of breaking things by pushing this in, so I'm going to see if I can find a lighter-weight solution to get this unblocked (by tweaking branch data sources), since we're passing all but one test now.
Severity: blocker → normal
Are any tests still failing? stephend said in IRC that this one was:
http://viewvc.svn.mozilla.org/vc/projects/socorro_qa/test_smoke_tests.py?r1=83057&r2=83096&pathrev=83096&sortby=date

But if I follow the instructions manually for test_that_advanced_search_view_signature_for_camino_crash(), then it looks OK: click Advanced, then "Filter Crash Reports", and data is returned (no "no data found").
Yes, still failing; always checkable on our Hudson instance, here: http://qa-selenium.mv.mozilla.com:8080/job/socorro/
(In reply to comment #18)
> Yes, still failing; always checkable on our Hudson instance, here:
> http://qa-selenium.mv.mozilla.com:8080/job/socorro/

Though, I should say, I'm not sure those are lack-of-data issues -- Matt, can you look into them?
(In reply to comment #19)
> (In reply to comment #18)
> > Yes, still failing; always checkable on our Hudson instance, here:
> > http://qa-selenium.mv.mozilla.com:8080/job/socorro/
> 
> Though, I should say, I'm not sure those are lack-of-data issues -- Matt, can
> you look into them?

Thanks for the (re-)link, yeah I am taking a look at these, the error looks different than when it was the only failure, and there are more (10) that might actually be code or some other issue.
(In reply to comment #20)
> (In reply to comment #19)
> > (In reply to comment #18)
> > > Yes, still failing; always checkable on our Hudson instance, here:
> > > http://qa-selenium.mv.mozilla.com:8080/job/socorro/
> > 
> > Though, I should say, I'm not sure those are lack-of-data issues -- Matt, can
> > you look into them?
> 
> Thanks for the (re-)link, yeah I am taking a look at these, the error looks
> different than when it was the only failure, and there are more (10) that might
> actually be code or some other issue.

Looked at the test failures and discussed in IRC; it looks like there are two types of failures, and they are failing the same way on all products. I think they are due to code changes. Closing this one out; reopen if this is wrong.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
The new bug tracking bringing staging back up is bug 641176.
Verified FIXED; thanks:

http://qa-selenium.mv.mozilla.com:8080/job/socorro/755/console:

--------------------- generated xml file: socorrotests.xml ---------------------
==================== 46 passed, 1 skipped in 192.96 seconds ====================
Status: RESOLVED → VERIFIED
Component: Socorro → General
Product: Webtools → Socorro