Bug 390349 - returning builds are often hidden
Status: RESOLVED FIXED
Product: Webtools Graveyard
Classification: Graveyard
Component: Tinderbox
Version: Trunk
Platform: All All
Importance: -- minor
Target Milestone: ---
Assigned To: Aki Sasaki [:aki]
Mentors:
Depends on:
Blocks: 482802
Reported: 2007-07-31 12:15 PDT by Nick Thomas [:nthomas]
Modified: 2014-06-16 14:00 PDT
CC List: 11 users
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
don't autoignore builds that fall off the waterfall (2.58 KB, patch)
2009-02-26 14:02 PST, Aki Sasaki [:aki]
preserve ignore/scrape/warning states for inactive builds (5.18 KB, patch)
2009-02-26 19:35 PST, Aki Sasaki [:aki]
add current column & list all admin-able builds (5.30 KB, patch)
2009-03-04 00:23 PST, cls
aki: review+
trunk version of cls' patch (5.94 KB, patch)
2009-03-10 19:18 PDT, Aki Sasaki [:aki]

Description Nick Thomas [:nthomas] 2007-07-31 12:15:51 PDT
When a new build reports to a tinderbox tree it will often be hidden, so that it must then be enabled in the admin pages. It seems to be fairly random, but the right thing to do from a Mozilla point of view is default to show all new builds.

For example, I just restarted some builds that had been active previously but had been idle for the last 9 days (well off the waterfall page). Two out of two were hidden.
Comment 1 timeless 2007-07-31 23:47:35 PDT
are you sure people didn't disable it because it was yellow for so long that they didn't like seeing it?
Comment 2 Nick Thomas [:nthomas] 2007-08-01 01:08:21 PDT
I really really doubt it. Builds that were stopped would have been yellow for 12 hours then fallen off the waterfall.
Comment 3 cls 2007-08-01 08:53:37 PDT
If the build wasn't showing up in the admin interface as "Active", then someone explicitly disabled it.  We default to the Active state if the build name doesn't show up in the ignore_builds hash.
http://lxr.mozilla.org/mozilla/source/webtools/tinderbox/admintree.cgi#185
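The default described above can be sketched as follows. This is a Python illustration, not the Perl in admintree.cgi; the function name is hypothetical, and treating a stored value of 0 as "not ignored" is an assumption.

```python
# Illustrative sketch only; not the actual admintree.cgi code.
def is_active(build_name, ignore_builds):
    """A build counts as Active unless ignore_builds holds a true value for it.

    ignore_builds stands in for the hash loaded from the tree's ignore file.
    """
    return not ignore_builds.get(build_name, 0)


ignore_builds = {"Linux fx-linux-1 Dep Release": 1}
print(is_active("Linux fx-linux-1 Dep Release", ignore_builds))  # explicitly hidden
print(is_active("some brand new build", ignore_builds))          # never listed
```

Under this reading, a build that has never been written to the ignore hash is always shown.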

Also, I added a new build to my local tree yesterday and it popped up without me needing to explicitly add it via the admin interface.  FWIW, it didn't show up the minute I started it.  I had to wait a cycle for the build mail to be processed.  
Comment 4 Nick Thomas [:nthomas] 2007-08-01 09:19:46 PDT
It's not very fair to close this on the basis of one test, given that I've stated there is some randomness involved. This problem is something that more than one person on the build team has noticed; regrettably, we can't provide a lot of information because it doesn't occur that often. Suggestions of things to check, or test scenarios, would be welcome.

I should also clarify my original report, where the summary refers to new builds but the example is for a build that is coming back after an absence. It's the latter case which is what I'd like to focus on, because (IIRC) it's more common.  Does anything happen to $treedata->{ignore_builds} when a build falls off the waterfall, or is it cleaned up at some later time ?

> I had to wait a cycle for the build mail to be processed.  

A build cycle or a tinderbox server one ? Do you mean that it appears after the build end message but not after the build start one ? I didn't wait for the build cycle to finish, but the yellow building state was displayed as soon as I enabled the builds.
Comment 5 cls 2007-08-01 09:44:47 PDT
Actually, I was closing it on the basis of the code in the tree.  The test was just due diligence.  To be fair, it turns out that tinderbox.m.o isn't running the latest server code but the old code had the same behavior. http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&file=admintree.cgi&branch=&root=/cvsroot&subdir=mozilla/webtools/tinderbox&command=DIFF_FRAMESET&rev1=1.31&rev2=1.32

I thought that I was fairly clear with the opening statement of my previous comment:  If you had to explicitly set the build as Active, then someone had explicitly set it as not Active at some point.  There's no randomness there.

The ignore_builds file is never cleaned up.  Once a build is added to the ignore file, it lives there until the file is rewritten by someone changing the Active state of any build in that tree.

The mails sent to the tinderbox server are processed by a cron job.  That is the cycle to which I'm referring.  justdave used to have those mails being processed every minute. The interval may have changed since then.
Comment 6 Dave Miller [:justdave] (justdave@bugzilla.org) 2007-08-01 09:46:32 PDT
It's still one minute.
Comment 7 Nick Thomas [:nthomas] 2008-05-19 14:58:53 PDT
Here's some data to illustrate the weirdness. Compare

http://tinderbox.mozilla.org/showbuilds.cgi?tree=MozillaRelease&maxdate=1210647116&hours=24&noignore=1
http://tinderbox.mozilla.org/showbuilds.cgi?tree=MozillaRelease&maxdate=1210647116&hours=24

These machines are only shown on the first link:
Linux fx-linux-1 Dep Fx-Trunk-l10n-Release	
Linux fx-linux-1 Dep Release	
MacOSX Darwin 8.8.4 bm-xserve10 Dep Fx-Trunk-l10n-Release	
MacOSX Darwin 8.8.4 bm-xserve10 Dep Release	
WINNT 5.2 fx-win32-19-s2 Dep Fx-Trunk-l10n-Release	
WINNT 5.2 fx-win32-19-s2 Dep Release

Then look at the admin page
  http://tinderbox.mozilla.org/admintree.cgi?tree=MozillaRelease
and there's no sign of those builds there.

I don't know if the issue lies within tinderbox, or the way it's run at Mozilla, but hopefully this might spark a conversation about it.
Comment 8 Nick Thomas [:nthomas] 2008-05-19 15:00:25 PDT
Forgot to say that the 6 builds listed in comment #7 only run once every few weeks, as they are release builds. The other ones run 24/7 doing nightlies.
Comment 9 Nick Thomas [:nthomas] 2008-09-03 17:11:17 PDT
This problem is still biting us. What further information can we provide to help diagnose the problem ?
Comment 10 Nick Thomas [:nthomas] 2008-09-03 17:14:53 PDT
Belay that, we should give the changes picked up in bug 446547 a chance first.
Comment 11 cls 2008-09-03 18:58:10 PDT
Re: comment #7, I'm still not seeing the weirdness.  The only difference between the first & second URLs is "&noignore=1".  That means that the 6 builds that show up in the first URL but not the second are explicitly marked as ignored.  IOW, they're listed in ignorebuilds.pl which only happens because someone put them there.
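The effect of "&noignore=1" described above can be sketched as follows. This is a Python illustration rather than the Perl in showbuilds.cgi, and the function and argument names are hypothetical.

```python
# Illustrative sketch only; names are hypothetical.
def visible_builds(build_names, ignore_builds, noignore=False):
    """Return the builds a waterfall view would show.

    With noignore set, the ignore list is bypassed entirely; otherwise any
    build with a true entry in ignore_builds is filtered out.
    """
    if noignore:
        return list(build_names)
    return [name for name in build_names if not ignore_builds.get(name, 0)]


builds = ["Linux nightly", "Linux fx-linux-1 Dep Release"]
ignored = {"Linux fx-linux-1 Dep Release": 1}
print(visible_builds(builds, ignored))                 # release build filtered out
print(visible_builds(builds, ignored, noignore=True))  # both builds listed
```

Any build that appears only in the noignore view must therefore have an entry in the ignore list.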

The data used to determine the builds listed on the admintree.cgi page comes from the tb_load_data() routine of tbglobals.pl which calls load_buildlog().  That routine only lists builds between maxdate & mindate - 2 hrs.  

The maxdate on the first 2 URLs starts from "Mon May 12 19:51:56 2008".  The nonexistent timestamp on the 3rd URL means that maxdate is "now".  Given that the time windows are different, those URLs are unlikely to return the same set of builds.
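A rough sketch of the time window in question, in Python rather than the Perl of tbglobals.pl. The function name, the 12-hour default, and the assumption that mindate is simply maxdate minus the requested hours are all illustrative; only the extra 2 hours and the "now" default come from the description above.

```python
import time


# Illustrative sketch only; not the actual load_buildlog() logic.
def build_window(maxdate=None, hours=12, now=None):
    """Return (earliest, latest) epoch seconds of builds that would be listed."""
    if now is None:
        now = int(time.time())
    if maxdate is None:
        maxdate = now                      # no maxdate in the URL means "now"
    mindate = maxdate - hours * 3600       # assumed definition of mindate
    return (mindate - 2 * 3600, maxdate)   # window reaches 2 hrs before mindate


# The two showbuilds.cgi URLs pin maxdate; the admin page URL does not.
print(build_window(maxdate=1210647116, hours=24))
print(build_window())
```

Because the admin page window ends at the current time, builds that last reported a week earlier fall outside it even though the pinned showbuilds.cgi URLs still list them.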
Comment 12 David Baron :dbaron: ⌚️UTC-7 2009-01-27 16:10:00 PST
So the underlying cause of this that I've definitely seen in the past is that when somebody hides a build because it's misbehaving, the checkbox for unhiding it disappears once the machine hasn't reported for 12 hours.  So when the machine starts reporting again, the fact that it was hidden hasn't been in the tinderbox admin UI since it last reported.  (And I think it's not uncommon for build machine admins to hide tinderboxes when they take them down for maintenance, although I have told people not to do that a bunch of times.)


Perhaps a solution for this would be to change the machine-hiding UI such that it automatically shows *all* machines that are currently hidden as unchecked, whether or not they've reported in the last 12 hours, but only skips the not-recently-reported machines for the ones that would be checked (as showing).
Comment 13 David Baron :dbaron: ⌚️UTC-7 2009-01-27 16:11:18 PST
(That said, I would *strongly* advise people at Mozilla to *never* use the machine hiding UI, because it also breaks with "show me the waterfall 80 hours ago").
Comment 14 Ben Hearsum (:bhearsum) 2009-01-29 09:20:31 PST
I'm reopening this based on comment #12.
Comment 15 cls 2009-01-30 08:10:57 PST
The first part of the proposal is to automatically add the machines from ignorebuilds.pl to the tree table on the admin page and mark them as not Active. First, returning machines will automatically be treated this way (the original problem).  Second, with this change, you can never remove trees from ignorebuilds.pl unless you mark them as active.  Right now, they'll fall off if they aren't in the 12 hr window the next time the file is modified. 

The second part doesn't make any sense to me.  "but only skips the not-recently-reported machines for the ones that would be checked"?  So, you'd want to ignore the ignorebuilds.pl entries for recently reported machines?  Since there are no timestamps in the ignorebuilds.pl file, there's no way to implement that unless you want to use the timestamp of the ignorebuilds.pl file itself which seems like a bad idea since the other timestamps come from the client.

That seems like a whole lot of trouble to resolve just to get people to realize that they have to unhide a build that someone hid.
Comment 16 Ben Hearsum (:bhearsum) 2009-01-30 08:14:09 PST
(In reply to comment #15)
> That seems like a whole lot of trouble to resolve just to get people to realize
> that they have to unhide a build that someone hid.

The whole point here is that they were _not_ hidden by someone. They dropped off for some period of time, reported again and were _not_ shown.

Seems pointless to re-open this again, though, so I won't.
Comment 17 cls 2009-01-30 09:11:01 PST
(In reply to comment #16)
> The whole point here is that they were _not_ hidden by someone. They dropped
> off for some period of time, reported again and were _not_ shown.

You have yet to show a case where they were not hidden by someone.  _You_ may not know who that someone is but _someone_ did it.  There are no real logs or even timestamps kept for tinderbox actions other than the standard apache log so I don't see how you can say with 100% certainty that they were not hidden by someone.

The code clearly shows that the only way to hide a tree is to do it explicitly.  dbaron's statements clearly show that when this happened in the past, it was because someone hid the build.

And as I stated earlier, the ignorebuilds.pl file is not automatically cleared.  So when your build "dropped off" and came back hidden, it was because it was still marked as ignored.
Comment 18 David Baron :dbaron: ⌚️UTC-7 2009-01-30 09:24:44 PST
(In reply to comment #17)
> The code clearly shows that the only way to hide a tree is to do it explicitly.
>  dbaron's statements clearly show that when this happened in the past, it was
> because someone hid the build.

Yes, but it's a disaster because the hiding UI doesn't reflect what's been hidden.

I'm planning to file a separate bug requesting that the hiding UI be hidden from tinderbox.mozilla.org, both because of this bug, and because hiding hides a tinderbox over all past views.
Comment 19 cls 2009-01-30 09:34:18 PST
(In reply to comment #18)
> Yes, but it's a disaster because the hiding UI doesn't reflect what's been
> hidden.

Correct. The admin UI only reflects current trees.  You could show ignored builds for the entire history of the tree but that has the potential to get lengthy depending upon how often you trim your trees.

> I'm planning to file a separate bug requesting that the hiding UI be hidden
> from tinderbox.mozilla.org, both because of this bug, and because hiding hides
> a tinderbox over all past views.

Seems like overkill.  Hiding trees has its uses.  It just sounds like it's being misused.  And does &noignore=1 not work when looking at past views?
Comment 20 Aki Sasaki [:aki] 2009-02-24 17:17:49 PST
Looking at admin_builds ( http://mxr.mozilla.org/webtools/source/tinderbox/doadmin.cgi#171 ), I noticed that it looks at build.dat and sets $active_buildnames{$bname} = 0, then gets the active list from the submitted form.

However, if a build has been inactive, it's not listed in the form.  This causes inactive builds to become ignored, and they no longer reappear when they become active.

This seems to have broken in revision 1.28 of doadmin.cgi.
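The failure mode can be sketched like this. It is a Python paraphrase of the behaviour described above, not the Perl in doadmin.cgi, and the function and argument names are hypothetical.

```python
# Illustrative sketch of the described bug; not the actual doadmin.cgi code.
def buggy_ignore_list(build_dat_names, form_active_names):
    """Rebuild the ignore list the way the faulty admin_builds is described.

    Every build named in build.dat starts at 0 (inactive); only builds whose
    checkbox was submitted in the form are set back to 1 (active).
    """
    active = {name: 0 for name in build_dat_names}
    for name in form_active_names:
        active[name] = 1
    # Anything still at 0 is written out as ignored.
    return {name: 1 for name, is_on in active.items() if not is_on}


# "release" is still in build.dat but has fallen off the waterfall, so the
# form never offered a checkbox for it.
print(buggy_ignore_list(["nightly", "release"], ["nightly"]))
```

No one unchecked the release build; it was ignored simply because it had no checkbox to submit.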
Comment 21 Aki Sasaki [:aki] 2009-02-26 14:02:14 PST
Created attachment 364390 [details] [diff] [review]
don't autoignore builds that fall off the waterfall

Since we currently set $active_buildnames{$bname} to 0 for all builds in build.dat, builds that fall off the waterfall are often added to the ignore list inadvertently.

This patch creates %active_buildnames from a list of all builds on the waterfall + all builds currently in ignorebuilds.pl.

An alternate approach would be to include the builds in ignorebuilds.pl to the checkbox list, but that could get messy in a hurry.  I think this is the right approach.

Tested locally.
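One reading of this approach, sketched in Python rather than the Perl of the patch. The names are hypothetical and the details of the attachment may differ.

```python
# Illustrative sketch only; the actual patch may differ in detail.
def patched_ignore_list(waterfall_names, old_ignore, form_active_names):
    """Rebuild the ignore list from waterfall builds plus existing ignore entries.

    Builds that are merely present in build.dat, but are neither on the
    waterfall nor already ignored, are never considered.
    """
    candidates = set(waterfall_names) | set(old_ignore)
    active = {name: 0 for name in candidates}
    for name in form_active_names:
        active[name] = 1
    return {name: 1 for name, is_on in active.items() if not is_on}


result = patched_ignore_list(
    waterfall_names=["nightly"],
    old_ignore={"hidden old build": 1},
    form_active_names=["nightly"],
)
print(result)  # the old ignore entry is kept; nothing new is added
```

A build that has fallen off the waterfall without ever being hidden is not a candidate, so it cannot be added to the ignore list by accident.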
Comment 22 Aki Sasaki [:aki] 2009-02-26 14:13:57 PST
(Steps to reproduce:

1. Create a test tinderbox page
2. Put a build off the waterfall in build.dat.  You can either do this by sending a build to the test page and then waiting for it to fall off, or by adding a line to build.dat like this:

1226012310|1226012000|bug 454055|unix|busted|1226012000.1226012607.3994.gz|

3. rm $::tree_dir/$tree/ignorebuilds.pl , or otherwise note that the build 'bug 454055' is not in there.

4. Go to the admin page for the test tree.  Note that the build 'bug 454055' is not in the ignore form.  Make no changes, and click "Change build configuration"

5. Open $::tree_dir/$tree/ignorebuilds.pl.  Note that the build 'bug 454055' is now ignored.)
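For reference, the sample build.dat line from step 2 can be picked apart as below. This is a Python sketch; that the third pipe-delimited field is the build name is inferred from the steps above, and the meaning of the other fields is not assumed here.

```python
# Illustrative sketch; field meanings other than the build name are not assumed.
def build_name_from_line(line):
    """Return the third pipe-delimited field, taken here to be the build name."""
    fields = line.rstrip("\n").split("|")
    return fields[2]


sample = "1226012310|1226012000|bug 454055|unix|busted|1226012000.1226012607.3994.gz|"
print(build_name_from_line(sample))  # bug 454055
```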
Comment 23 Aki Sasaki [:aki] 2009-02-26 14:36:34 PST
Comment on attachment 364390 [details] [diff] [review]
don't autoignore builds that fall off the waterfall

Actually, I'm going to forget all the scrape/warning settings for builds off the waterfall.  This isn't any worse than it is now, but I'll see if that's easily fixed.
Comment 24 Aki Sasaki [:aki] 2009-02-26 19:35:33 PST
Created attachment 364464 [details] [diff] [review]
preserve ignore/scrape/warning states for inactive builds

Instead of getting a list of all builds from build.dat, zeroing them out, and then un-zeroing the active clicked builds, this patch:

1. gets the currently saved settings from ignorebuilds.pl, scrapebuilds.pl, and warningbuilds.pl
2. "zeros out" all active builds (actually sets to 1 for $ignore_builds)
3. toggles all clicked active builds
4. Data::Dumper::Dump()s to the appropriate files.

The main side effect I've noticed is that you find a lot of builds set to 0 in the above three files.  This doesn't harm anything, but if it's annoying I can

    delete($ignore_builds->{$build_name});

rather than

    $ignore_builds->{$build_name} = 0;

I have yet to find a perl5 install without a default install of Data::Dumper, but we can revert to the long print() method if the dependency is an issue.
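The four steps can be sketched for the ignore list as follows. This is a Python paraphrase, not the patch itself; the function and argument names are hypothetical, and returning the dictionary stands in for step 4's Data::Dumper output.

```python
# Illustrative sketch of the described steps; not the actual patch.
def update_ignore(saved_ignore, listed_builds, checked_active, prune=False):
    """Return the new ignore mapping for one admin form submission.

    saved_ignore   -- entries previously stored in the ignore file (step 1)
    listed_builds  -- builds that had a row in the admin form
    checked_active -- listed builds whose Active box was ticked
    prune          -- drop un-ignored builds instead of storing 0
    """
    ignore = dict(saved_ignore)          # step 1: start from the saved state
    for name in listed_builds:           # step 2: listed builds default to ignored
        ignore[name] = 1
    for name in checked_active:          # step 3: ticked builds are not ignored
        if prune:
            ignore.pop(name, None)
        else:
            ignore[name] = 0
    return ignore                        # step 4 would write this to the file


saved = {"old release build": 1}
print(update_ignore(saved, ["nightly", "debug"], ["nightly"]))
print(update_ignore(saved, ["nightly", "debug"], ["nightly"], prune=True))
```

Builds with no row in the form keep whatever state they had, which is the point of the change. The prune flag mirrors the delete() alternative mentioned above.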
Comment 25 cls 2009-03-03 21:59:27 PST
(In reply to comment #22)
> (Steps to reproduce:
> 
> 1. Create a test tinderbox page
> 2. Put a build off the waterfall in build.dat.  You can either do this by
> sending a build to the test page and then waiting for it to fall off, or by
> adding a line to build.dat like this:
> 
> 1226012310|1226012000|bug 454055|unix|busted|1226012000.1226012607.3994.gz|
> 
> 3. rm $::tree_dir/$tree/ignorebuilds.pl , or otherwise note that the build 'bug
> 454055' is not in there.
> 
> 4. Go to the admin page for the test tree.  Note that the build 'bug 454055' is
> not in the ignore form.  Make no changes, and click "Change build
> configuration"
> 
> 5. Open $::tree_dir/$tree/ignorebuilds.pl.  Note that the build 'bug 454055' is
> now ignored.)

The above steps do not reproduce the problem for me using branch tbox1_20080527_cls_branch .  

We have a tree that has 3 regular builds and 3 one-time builds that are kicked off each morning @ 6am by a cron job.  

Every day, I see the builds appear in the morning and disappear in the evening (PST).  

I verified that the builds do not appear in the ignorebuilds.pl though older builds that I hid last month do appear.  

When I pull up the admin page (currently at 9:45pm), all 6 builds still appear in the admin table.

If I click 'Change build configuration', none of the 6 builds appear in the ignorebuilds.pl file.  The file still contains all of the old entries but the timestamp on the file has changed.

Now, I just did try the explicit step of adding a fake build with an ancient timestamp and that exhibits the behavior of adding the fake build to the ignore file.  Hrmph.
Comment 26 Aki Sasaki [:aki] 2009-03-03 22:39:19 PST
(In reply to comment #25)
> Now, I just did try the explicit step of adding a fake build with an ancient
> timestamp and that exhibits the behavior of adding the fake build to the ignore
> file.  Hrmph.

This would lead me to believe that at some point >16hrs the build is removed from the admin table but still exists in build.dat.

Also, if you remove one of the old, hidden builds from ignorebuilds.pl, it should be re-ignored/unscraped/unwarned on submit if it still exists in build.dat.

I suppose if we find the threshold (24hrs? 7days?) where builds disappear from the admin table, an alternate fix could be to ignore any builds in build.dat where |timestamp < time() - threshold| when populating ignorebuilds.pl.

Or if the log cleanup is trimming the admin table, we could possibly trim build.dat at the same time.
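The threshold idea could look roughly like this. It is a Python sketch with hypothetical names; the (timestamp, name) pairs stand in for parsed build.dat entries, and the threshold value is a placeholder since the real cutoff is unknown.

```python
# Illustrative sketch of the threshold alternative; names are hypothetical.
def recent_build_names(entries, threshold_seconds, now):
    """Keep only builds whose timestamp is within threshold_seconds of now.

    entries is an iterable of (timestamp, build_name) pairs.
    """
    cutoff = now - threshold_seconds
    return {name for timestamp, name in entries if timestamp >= cutoff}


now = 1236000000
entries = [(1226012000, "bug 454055"), (now - 60, "nightly")]
print(recent_build_names(entries, threshold_seconds=24 * 3600, now=now))
```

Only the names returned here would be considered when populating ignorebuilds.pl, so a stale build.dat entry could no longer be ignored by accident.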
Comment 27 Aki Sasaki [:aki] 2009-03-03 23:56:02 PST
Comment on attachment 364464 [details] [diff] [review]
preserve ignore/scrape/warning states for inactive builds

Also, if you otherwise approve of this patch, I should submit a new one without eval(), in case someone embeds perl in their build name.
Comment 28 cls 2009-03-04 00:23:44 PST
Created attachment 365394 [details] [diff] [review]
add current column & list all admin-able builds

Quick definitions:
* Current builds: builds returned by tb_load_data() (default: now - 14 hrs)
* Active builds: builds not in ignorebuilds.pl
* Scrape builds: builds in scrapebuilds.pl
* Warning builds: builds in warnings.pl 

Changes:
* All of the above build types are listed in the admin table
* The admin table adds a 4th disabled column that indicates whether the build is a Current build
* Builds are only added to the state files (ignore/scrape/warn) if they are listed in the admin table.
   - This allows builds to be removed from those state files
   - Once removed, the builds no longer show up in the admin table unless they are Current.
* Build.dat is no longer scanned for builds so the page should load noticeably faster for large trees.
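How the admin table rows are assembled under these definitions can be sketched as below. This is Python rather than the Perl of the patch, and the function and argument names are hypothetical.

```python
# Illustrative sketch only; not the actual admintree.cgi code.
def admin_rows(current, ignored, scrape, warning):
    """Return (build_name, is_current) rows for the admin table.

    A build gets a row if it is Current or appears in any state file.
    build.dat is not consulted.
    """
    names = []
    for group in (current, ignored, scrape, warning):
        for name in group:
            if name not in names:
                names.append(name)
    return [(name, name in current) for name in names]


rows = admin_rows(
    current=["nightly"],
    ignored=["old release build"],
    scrape=["nightly", "perf box"],
    warning=[],
)
for name, is_current in rows:
    print(name, "(current)" if is_current else "(not current)")
```

A build that is not Current and has been removed from all three state files no longer gets a row, which matches the removal behaviour listed above.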
Comment 29 Aki Sasaki [:aki] 2009-03-04 09:36:55 PST
Awesome.  I'll pound on this for a bit.
Comment 30 Aki Sasaki [:aki] 2009-03-04 14:57:05 PST
Comment on attachment 365394 [details] [diff] [review]
add current column & list all admin-able builds

>+        push @names, $aname if (!grep(/^$aname$/, @names));

Hunh.  I always built a hash and used keys() to get a unique array.
TMTOWTDI =)
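The two ways of de-duplicating, translated loosely into Python for comparison. Both function names are hypothetical. One difference from Perl is worth noting: keys() on a Perl hash returns names in no particular order, while dict.fromkeys preserves first-seen order.

```python
# Illustrative comparison; both functions return the unique names in first-seen order.
def unique_by_scan(names):
    """Append a name only if it is not already in the list (like the grep check)."""
    result = []
    for name in names:
        if name not in result:
            result.append(name)
    return result


def unique_by_hash(names):
    """Use dictionary keys to drop duplicates (like building a hash)."""
    return list(dict.fromkeys(names))


names = ["nightly", "debug", "nightly", "perf box", "debug"]
print(unique_by_scan(names))
print(unique_by_hash(names))
```

The scan is quadratic in the number of names, which is unlikely to matter for an admin table. In the Perl version, the name is also interpolated into a regular expression, so a build name containing regex metacharacters might not match itself unless it is quoted with \Q...\E.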

It's a little bit odd that you have to mark the checkboxes |(_) X _ _| to make the build name disappear from the list, but that's far from the quirkiest thing about tinderbox 1.

Played with this a bit, looks good.  r=aki

How do we get this into t.m.o?
Comment 31 cls 2009-03-04 15:29:32 PST
(In reply to comment #30)
> It's a little bit odd that you have to mark the checkboxes |(_) X _ _| to make
> the build name disappear from the list, but that's far from the quirkiest thing
> about tinderbox 1.

Yeah.  I thought about masking that quirk with a delete column but I'm not sure it's worth the effort.  I'm also not crazy about the 'active' vs 'current' naming but the legend should clear up that potential confusion.

> How do we get this into t.m.o?

I believe t.m.o syncs to trunk so reed or justdave need to apply the patch to trunk and update t.m.o.
Comment 32 cls 2009-03-10 18:52:47 PDT
Checked into tbox1_20080527_cls_branch

Checking in admintree.cgi;
/cvsroot/mozilla/webtools/tinderbox/admintree.cgi,v  <--  admintree.cgi
new revision: 1.37.2.6; previous revision: 1.37.2.5
done
Checking in doadmin.cgi;
/cvsroot/mozilla/webtools/tinderbox/doadmin.cgi,v  <--  doadmin.cgi
new revision: 1.35.2.7; previous revision: 1.35.2.6
done
Comment 33 Aki Sasaki [:aki] 2009-03-10 19:18:09 PDT
Created attachment 366740 [details] [diff] [review]
trunk version of cls' patch

If needed, since the above didn't apply cleanly to trunk.
Comment 34 Reed Loden [:reed] (use needinfo?) 2009-03-11 14:13:38 PDT
Checking in admintree.cgi;
/cvsroot/mozilla/webtools/tinderbox/admintree.cgi,v  <--  admintree.cgi
new revision: 1.41; previous revision: 1.40
done
Checking in doadmin.cgi;
/cvsroot/mozilla/webtools/tinderbox/doadmin.cgi,v  <--  doadmin.cgi
new revision: 1.40; previous revision: 1.39
done
Comment 35 Markus Stange [:mstange] 2009-03-13 02:23:55 PDT
*** Bug 483057 has been marked as a duplicate of this bug. ***
