Closed Bug 973944 Opened 10 years ago Closed 10 years ago

b2g datazilla hamachi shows no data since Feb 10, 2014

Categories

(Firefox OS Graveyard :: General, defect)

Platform: ARM, Gonk (Firefox OS)
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bkelly, Unassigned)

It appears datazilla for b2g master on hamachi has been busted for the last week.  Can someone from a-team take a look at what is going on?
I did restart the b2g-3 device earlier today because it was hit by bug https://bugzilla.mozilla.org/show_bug.cgi?id=971747 and restarted a job.

b2g-0 is still hit by it but I cannot restart it because I don't know the password.
Actually b2g-3 looks to have fallen foul of bug 971747 again. The UI tests are not hit by this bug anymore but it seems the perf tests are.
I think this is a manifestation of bug 971605.  I've just bumped b2gperf and we'll see if that fixes it.
Depends on: 974092
That problem fixed, new problem found: bug 973822.  Will fix with bug 974092.
Do we have some monitoring in place that could send emails every time a b2gperf build fails?
(In reply to Anthony Ricaud (:rik) from comment #6)
> Do we have some monitoring in place that could send emails every time a
> b2gperf build fails?

It's in the works and discussed weekly as part of the Signal from Noise meetings.
(In reply to Ben Kelly [:bkelly] from comment #7)
> (In reply to Anthony Ricaud (:rik) from comment #6)
> > Do we have some monitoring in place that could send emails every time a
> > b2gperf build fails?
> 
> It's in the works and discussed weekly as part of the Signal from Noise
> meetings.

In the interim, we can have Jenkins send an e-mail for every failing build, but this generates a lot of noise.
This seems to be working now.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
(In reply to Jonathan Griffin (:jgriffin) from comment #8)
> (In reply to Ben Kelly [:bkelly] from comment #7)
> > (In reply to Anthony Ricaud (:rik) from comment #6)
> > > Do we have some monitoring in place that could send emails every time a
> > > b2gperf build fails?
> > 
> > It's in the works and discussed weekly as part of the Signal from Noise
> > meetings.
> 
> In the interim, we can have Jenkins send an e-mail for every failing build,
> but this generates a lot of noise.

These are already sent to the webqa-ci mailing list [1], so you can subscribe, but you will want to configure filters because there's a lot of noise.

[1] https://mail.mozilla.org/listinfo/webqa-ci
(In reply to Jonathan Griffin (:jgriffin) from comment #8)
> In the interim, we can have Jenkins send an e-mail for every failing build,
> but this generates a lot of noise.
Why is that a lot of noise? Failing builds are a signal. It's not ok to lose a week of data because no one was warned.

My question here is "how can we make sure we do better next time?".
(In reply to Anthony Ricaud (:rik) from comment #11)
> (In reply to Jonathan Griffin (:jgriffin) from comment #8)
> > In the interim, we can have Jenkins send an e-mail for every failing build,
> > but this generates a lot of noise.
> Why is that a lot of noise? Failing builds are a signal. It's not ok to lose
> a week of data because no one was warned.

Failing builds do not mean that no results are gathered and submitted to DataZilla, so from that point of view there could be noise. I usually monitor these jobs myself (including the existing email notifications); however, I have been on PTO for the last two weeks.

> My question here is "how can we make sure we do better next time?".

I have high hopes for the data ingestion alerts that are being worked on for datazilla, which will send an alert when the rate of data received drops. I'm not sure of the bug or ETA for this though.
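
As a rough sketch of what such an alert could check (this is not the datazilla implementation; the function name `ingestion_dropped`, the one-day window, the seven-window baseline, and the 0.5 threshold are all assumptions made for illustration), one could compare the number of results received in the most recent window with the average over the preceding windows:

```python
from datetime import datetime, timedelta


def ingestion_dropped(timestamps, now, window=timedelta(days=1),
                      baseline_windows=7, min_ratio=0.5):
    """Return True if the latest window received far fewer results than usual.

    Hypothetical helper: `timestamps` are the arrival times of results,
    `window` is the size of one bucket, and `min_ratio` is the fraction of
    the baseline average below which ingestion counts as having dropped.
    """
    recent_start = now - window
    baseline_start = recent_start - window * baseline_windows

    # Count results in the latest window and in the baseline period before it.
    recent = sum(1 for t in timestamps if recent_start <= t <= now)
    baseline = sum(1 for t in timestamps if baseline_start <= t < recent_start)

    if baseline == 0:
        # No history to compare against, so there is nothing to alert on.
        return False

    baseline_avg = baseline / baseline_windows
    return recent < baseline_avg * min_ratio
```

A check like this would flag an outage after roughly one window without data, rather than after a week, though it would still not say which underlying bug caused the drop.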

I'm a little surprised that this wasn't noticed for so long, so if anybody monitoring the results sees such an outage I would encourage them to raise a bug sooner.

One possible reason this went unnoticed could be that datazilla shows the available data spanning the available width of the chart, so a lack of recent results might not be clear until there are no results in the last week (the default range). I wonder if there's a way in the datazilla UI we could make lack of recent data more obvious?
Flags: needinfo?(jeads)
In this case, we did notice the problem; however, we misattributed its cause.  Bug 971747 popped up around the same time as a couple of unrelated problems, some of which had similar effects.  We didn't notice the unrelated problems as quickly as we might have, because we assumed they were instances of bug 971747, and were waiting on help from devs to figure out what was going on with that.

I agree that the Datazilla alerting system which is in development will help here, but it won't eliminate the kind of confused identity problem we encountered in this particular case.
(In reply to Dave Hunt (:davehunt) from comment #12)
> One possible reason this went unnoticed could be that datazilla shows the
> available data spanning the available width of the chart, so a lack of
> recent results might not be clear until there are no results in the last
> week (the default range). I wonder if there's a way in the datazilla UI we
> could make lack of recent data more obvious?
Yes, that, totally! I think the rightmost part of datazilla should be the current time and not the last result.
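
To illustrate the idea (a sketch only; `chart_time_range` and `visible_points` are hypothetical names and not part of the datazilla code, and the seven-day span is taken from the default range mentioned above), the x-axis could be anchored to the current time instead of to the newest data point, so that an outage shows up as an empty stretch on the right of the chart:

```python
from datetime import datetime, timedelta


def chart_time_range(now, span=timedelta(days=7)):
    """Return (start, end) for the x-axis, with the right edge fixed at `now`.

    Hypothetical helper: the range depends only on the current time, not on
    the timestamp of the last result received.
    """
    return now - span, now


def visible_points(points, now, span=timedelta(days=7)):
    """Keep only the (timestamp, value) points that fall inside the range."""
    start, end = chart_time_range(now, span)
    return [(t, v) for t, v in points if start <= t <= end]
```

With the range computed this way, a week without results would leave the chart empty instead of stretching older data across the full width.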
(In reply to Anthony Ricaud (:rik) from comment #14)
> (In reply to Dave Hunt (:davehunt) from comment #12)
> > One possible reason this went unnoticed could be that datazilla shows the
> > available data spanning the available width of the chart, so a lack of
> > recent results might not be clear until there are no results in the last
> > week (the default range). I wonder if there's a way in the datazilla UI we
> > could make lack of recent data more obvious?
> Yes, that, totally! I think the rightmost part of datazilla should be the
> current time and not the last result.

I've raised bug 974860.
Flags: needinfo?(jeads)