Closed Bug 630948 Opened 13 years ago Closed 13 years ago

Please restore the dataset to Socorro staging, to allow continued WebQA tests

Categories

(Socorro :: General, task, P1)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: stephend, Assigned: rhelmer)


Details

WebQA needs https://crash-stats.stage.mozilla.com to have as much data as possible (at least as much as we had while developing and running our automation tests for 1.7.6) to run against the 1.7.7 milestone (and beyond).

Not sure what needs to happen on the IT/Socorro dev side, but Jabba and Rob would know.

From IRC, looks like we at least fixed a typo in submit_dump_to_staging.sh, and rhelmer was going to land a fix to puppet SVN and re-enable the job in the crontab.
Assignee: server-ops → jdow
Jabba, I know you're busy with a ton of other stuff (as always), but it'd really help us get our automation tests going/passing again if we could get a more complete dataset; thanks!
From what I can tell, the old cronjob is functioning normally again. I think it just grabs random crashes and submits them to staging. I'll give this to rob to see if there is a reason that the data isn't there like it used to be. Perhaps we are throttling in stage?
Assignee: jdow → rhelmer
Component: Server Operations → Socorro
Product: mozilla.org → Webtools
QA Contact: mrz → socorro
Reassigning since I am out this week, and presumably you do not want to wait that long :)

Laura, I think the only thing needed to unblock QA is probably to take at least one crash for each product from prod and insert it into stage.

(In reply to comment #2)
> From what I can tell, the old cronjob is functioning normally again. I think it
> just grabs random crashes and submits them to staging. I'll give this to rob to
> see if there is a reason that the data isn't there like it used to be. Perhaps
> we are throttling in stage?

I don't think we throttle these products, though it would be worth checking.

Most likely, these products are such a small subset of overall crashes that we just have not happened to pull them yet; I see crashes from today in stage for Firefox so I assume the cron job is functioning correctly.
Assignee: rhelmer → laura
I'm not sure that just "one crash for each product" will keep our automation happy, as I think it tests multiple views per product, some of which might depend on more data, but David can best answer that.  (In reply to comment 3.)
This is a blocker for 1.7.7; I'm unable to see whether bugs like bug 633235 (which require pagination, and thus, larger datasets than what is available on staging right now) are regressions or misconfigurations in prod (likely the former).
Severity: major → blocker
Target Milestone: --- → 1.7.7
(In reply to comment #4)
> I'm not sure that just "one crash for each product" will keep our automation
> happy, as I think tests multiple views per product, some of which might depend
> on more data, but David can best answer that.  (In reply to comment 3.)

In that case, we could pull a larger one-time set of crashes and insert them to stage... we developed some tools to do this as part of the load-testing exercise pre-PHX (I think we pulled 10% of crashes over a period of 2 weeks).

Laura, any reason not to do this?
(In reply to comment #6)
> (In reply to comment #4)
> > I'm not sure that just "one crash for each product" will keep our automation
> > happy, as I think tests multiple views per product, some of which might depend
> > on more data, but David can best answer that.  (In reply to comment 3.)
> 
> In that case, we could pull a larger one-time set of crashes and insert them to
> stage... we developed some tools to do this as part of the load-testing
> exercise pre-PHX (I think we pulled 10% of crashes over a period of 2 weeks).
> 
> Laura, any reason not to do this?

Go ahead.
Assignee: laura → rhelmer
Going to use the code we developed in bug 619814 to pull a subset of crashes, and the latest trunk submitter.py to insert them into staging.

Going to aim for 10% of total crashes over a 10-day period, which was about 240k when we did this for the PHX transition.

ETA tomorrow to have this inserting into staging, may take a bit longer if staging can't keep up with the load (very likely). I suspect that QA's needs will be satisfied before this is complete.
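As a quick sanity check on those figures (the ~240k total is from the PHX exercise mentioned above), a 10-day window works out to 24k crashes per date, which matches the per-date count passed to the export script in the next comment:

```shell
# Back-of-the-envelope check: ~240k crashes spread over a 10-day
# window comes out to 24k crashes per date.
total=240000
days=10
per_day=$((total / days))
echo "$per_day"   # prints 24000
```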
Status: NEW → ASSIGNED
Priority: -- → P1
For future reference, pulling the crashes looks like:

"""
. /etc/socorro/socorrorc
. /etc/socorro/socorro-monitor.conf
for date in 11020{6..9} 1102{10..16}; do
  mkdir -p /tmp/test/${date}
  python ${APPDIR}/socorro/storage/hbaseClient.py -h $hbaseHost export_sampled_crashes_tarball_for_dates 24000 ${date} /tmp/test/${date} crashes-${date}.tar.gz > ${date}.log 2>&1 &
done
wait
"""

10 day period, 10 simultaneous hbase connections, 24k crashes each.
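One detail worth calling out in that snippet: the date list relies on bash brace expansion to generate the YYMMDD values for 2011-02-06 through 2011-02-16, so no explicit date loop is needed:

```shell
# Under bash, 11020{6..9} expands to 110206..110209 and
# 1102{10..16} expands to 110210..110216, covering the whole range.
dates=$(echo 11020{6..9} 1102{10..16})
echo "$dates"
```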
Crashes are being inserted and processed, but it looks like we're still missing data for Thunderbird, SeaMonkey and Fennec.

I'm going to check to make sure that the dataset I extracted from production has at least a few crashes for each of these. It's taking a very long time to insert, which I guess should not be a huge surprise :)

I may prioritize the products with missing data though, to be able to close this bug sooner.
Actually it looks like the bigger problem is that all throttling is set to 10% for all products/versions on staging, and many current releases have not been updated (since this is a manual process and not synced from production, as I understand it).

I'll do this manually for now and see if it helps, it's going to be an ongoing issue until we make branch data sources more automatic and remove the need for throttling (not sure if we have bugs for either of these issues yet, although I know they've both been discussed).
(In reply to comment #11)
> I'll do this manually for now and see if it helps, it's going to be an ongoing
> issue until we make branch data sources more automatic and remove the need for
> throttling (not sure if we have bugs for either of these issues yet, although I
> know they've both been discussed).

This helped a bit, but I realized that the active date range is going to be all wrong, and it'll take far too long to go through and sync this to production by hand.

Since we're taking production crashes anyway, we should just import this data from prod to staging. I think that'd give us the most consistent experience on stage. We've been talking about using a full database copy for staging soon, which will be a better long-term solution for this problem.

Also just to clarify, I believe that we now have enough crash data in stage, it's just a matter of getting the web UI to display it. Versions are always going to be a moving target, and since branch data source info is updated by hand in production we need to get that over to staging somehow. If this seems wrong, please speak up :)
(In reply to comment #12)

> Since we're taking production crashes anyway, we should just import this data
> from prod to staging. I think that'd give us the most consistent experience on
> stage. We've been talking about using a full database copy for staging soon,
> which will be a better long-term solution for this problem.
> 
> Also just to clarify, I believe that we now have enough crash data in stage,
> it's just a matter of getting the web UI to display it. Versions are always
> going to be a moving target, and since branch data source info is updated by
> hand in production we need to get that over to staging somehow. If this seems
> wrong, please speak up :)

Copying a full prod database over to staging is *precisely* what we do for AMO, and makes the most sense; for Socorro, are there privacy risks?  Since prod is public (except for stuff behind Admin), and the same functionality exists on both, I'm guessing there aren't.

This would solve two problems:

1) Not enough (and of the right type) of data, on staging
2) We wouldn't have to, by hand, monkey with /admin/branch_data_sources, would we?
(In reply to comment #13)
> Copying a full prod database over to staging is *precisely* what we do for AMO,
> and makes the most sense; for Socorro, are there privacy risks?  Since prod is
> public (except for stuff behind Admin), and the same functionality exists on
> both, I'm guessing there aren't.

We currently take crashes from prod so I don't think it's different in this case.

> This would solve two problems:
> 
> 1) Not enough (and of the right type) of data, on staging
> 2) We wouldn't have to, by hand, monkey with /admin/branch_data_sources, would
> we?

Right. This is the plan for the new staging (from repurposed SJC prod hardware). The HBase situation will be a little more complicated, but we can cross that bridge (with Metrics' help) when we come to it :)

For the purposes of the current bug, I think we can get #1 close enough (as close as it's ever been anyway), and for #2 I am going to dump the table in prod and import it to stage (I'll file a separate IT bug for this).
(In reply to comment #14)
> For the purposes of the current bug, I think we can get #1 close enough (as
> close as it's ever been anyway), and for #2 I am going to dump the table in
> prod and import it to stage (I'll file a separate IT bug for this).

I've dumped productdims and product_visibility tables from prod, but have not asked them to be imported to stage yet. I'll test this out today and file the aforementioned IT bug.
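For reference, a minimal sketch of that dump-and-import step, assuming standard PostgreSQL tooling; the hostnames and database name here are placeholders, not the real prod/stage values:

```shell
# Hypothetical helper (hosts and DB name are placeholders): dump only the
# two branch-data tables from prod, then load the dump into stage.
sync_branch_tables() {
  pg_dump -h prod-db-host -t productdims -t product_visibility breakpad \
    > branch_data.sql
  psql -h stage-db-host breakpad < branch_data.sql
}
```

As with any prod-to-stage import, this should be tried against a scratch database first; a later comment notes the author was wary of exactly that kind of breakage.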
Downgrading from blocker to normal, since we're now passing all automated tests except for Camino. I'll take a closer look at this; it seems to work manually after a little tweaking.

(In reply to comment #15)
> I've dumped productdims and product_visibility tables from prod, but have not
> asked them to be imported to stage yet. I'll test this out today and file the
> aforementioned IT bug.

I'm a little wary of breaking things by pushing this in, so I'm going to see if I can find a lighter-weight solution to get this unblocked (by tweaking branch data sources), since we're passing all but one test now.
Severity: blocker → normal
Are any tests still failing? stephend said in IRC that this one was:
http://viewvc.svn.mozilla.org/vc/projects/socorro_qa/test_smoke_tests.py?r1=83057&r2=83096&pathrev=83096&sortby=date

But if I follow the instructions manually for test_that_advanced_search_view_signature_for_camino_crash(), then it looks OK: click Advanced, then "Filter Crash Reports", and data is returned (no "no data found").
Yes, still failing; always checkable on our Hudson instance, here: http://qa-selenium.mv.mozilla.com:8080/job/socorro/
(In reply to comment #18)
> Yes, still failing; always checkable on our Hudson instance, here:
> http://qa-selenium.mv.mozilla.com:8080/job/socorro/

Though, I should say, I'm not sure those are lack-of-data issues -- Matt, can you look into them?
(In reply to comment #19)
> (In reply to comment #18)
> > Yes, still failing; always checkable on our Hudson instance, here:
> > http://qa-selenium.mv.mozilla.com:8080/job/socorro/
> 
> Though, I should say, I'm not sure those are lack-of-data issues -- Matt, can
> you look into them?

Thanks for the (re-)link, yeah I am taking a look at these, the error looks different than when it was the only failure, and there are more (10) that might actually be code or some other issue.
(In reply to comment #20)
> (In reply to comment #19)
> > (In reply to comment #18)
> > > Yes, still failing; always checkable on our Hudson instance, here:
> > > http://qa-selenium.mv.mozilla.com:8080/job/socorro/
> > 
> > Though, I should say, I'm not sure those are lack-of-data issues -- Matt, can
> > you look into them?
> 
> Thanks for the (re-)link, yeah I am taking a look at these, the error looks
> different than when it was the only failure, and there are more (10) that might
> actually be code or some other issue.

Looked at the test failures and discussed in IRC; it looks like there are two types of failures, and they are failing the same way on all products. I think they are due to code changes. Closing this one out; reopen if this is wrong.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
The new bug tracking bringing staging back up is bug 641176.
Verified FIXED; thanks:

http://qa-selenium.mv.mozilla.com:8080/job/socorro/755/console:

--------------------- generated xml file: socorrotests.xml ---------------------
==================== 46 passed, 1 skipped in 192.96 seconds ====================
Status: RESOLVED → VERIFIED
Component: Socorro → General
Product: Webtools → Socorro