Closed Bug 321404 Opened 19 years ago Closed 19 years ago

Tinderbox build graphs are returning error 500

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bugzilla-mozilla-20000923, Assigned: morgamic)

Details

Every tinderbox build graph I try to view is failing to load the image, for example:

http://build-graphs.mozilla.org/graph/query.cgi?testname=xulwinopen&tbox=comet.mozilla.org&autoscale=1&days=7&avg=1&showpoint=2005:12:23:13:28:09,453

The image on that page is getting a 500 Internal Server Error instead of anything useful, even though the data itself appears to be available.
That's on purpose.  The script has issues.
Status: UNCONFIRMED → NEW
Depends on: 321234
Ever confirmed: true
Oh great, another security issue killing off a useful developer tool. Wheee.
OS: Windows Server 2003 → All
Hardware: PC → All
Assignee: server-ops → justdave
*** Bug 323007 has been marked as a duplicate of this bug. ***
Is there anything I can do to help get this moving?  Without the graphs it's very difficult to spot perf regressions without watching tinderbox continuously...
*** Bug 323749 has been marked as a duplicate of this bug. ***
Severity: normal → major
This is a major problem for Firefox development.  These graphs are very important.  Is there anything I can do to help?
Severity: major → blocker
It's unacceptable for this to be unfixed for so long.  I'm escalating now.

/be
In particular, this bug should have higher priority than anything to do with news.mozilla.org.

/be
Morgamic - can you take this?
I can give it a shot.  Is there a staging environment for Tinderbox I can use and/or get access to in order to test patches on graph/query.cgi?
Mike is currently working on a fix...
Assignee: justdave → morgamic
It looks like some of the regexp matches were causing die's where it should have just set the respective parameter to '', and there were a couple of cases where there were missing ;'s.
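
As a rough illustration of the "allow a null param" idea (sketched in Python rather than the Perl used by query.cgi, and not the actual patch), a validator can treat a missing or empty parameter as '' and reject only values that are present but malformed. The helper name `clean_param` and the patterns are hypothetical.

```python
import re

def clean_param(params, name, pattern):
    """Return params[name] if it fully matches pattern, or '' if absent/empty.

    Raises ValueError only when a value is present and does not match.
    """
    value = params.get(name, "")
    if value == "":
        return ""  # a missing parameter is not an error
    if re.fullmatch(pattern, value) is None:
        raise ValueError(f"Unexpected value for parameter {name!r} supplied")
    return value

# Hypothetical patterns, for illustration only.
query = {"testname": "xulwinopen", "tbox": "comet.mozilla.org", "days": "7"}
testname = clean_param(query, "testname", r"[\w.-]+")
avg = clean_param(query, "avg", r"\d+")  # absent, so ''
```

The point of the design is that only a malformed value aborts the request; an omitted optional parameter falls through to the script's defaults.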

That said, even if I fix syntax errors and update these regexps to allow for a null param, the majority of the time I get no graph, and digging revealed what gnuplot is giving based on the passed $cmds:

gnuplot> reset
gnuplot> set term png color
Terminal type set to 'png'
Options are 'small color'
gnuplot> set output "/tmp/gnuplot.9270"
gnuplot> set title  "comet.mozilla.org xulwinopen"
gnuplot> set key graph 0.1,0.95 reverse spacing .75 width -18
gnuplot> set linestyle 1 lt 3 lw 1 pt 7 ps .5
gnuplot> set linestyle 2 lt 3 lw 1 pt 7 ps 1
gnuplot> set linestyle 3 lt 3 lw 1
gnuplot> set linestyle 4 lt 8 lw 1 pt 7 ps 3
gnuplot> set data style points
gnuplot> set timefmt "%Y:%m:%d:%H:%M:%S"
gnuplot> set xdata time
gnuplot> set xrange ["2006:01:19:12:36:11" : "2006:01:26:12:36:11"]
gnuplot> set yrange [ 0 : ]
gnuplot> set ylabel "xulwinopen (ms)"
gnuplot> set timestamp "Generated: %d/%b/%y %H:%M" 0,-1
gnuplot> set nokey
gnuplot> set grid
gnuplot> plot "db/xulwinopen/comet.mozilla.org" using 1:2 with lines, "db/xulwinopen/point.9270" using 1:2 with points ls 4, "db/xulwinopen/comet.mozilla.org_avg" using 1:2 with lines ls 3
         all points undefined!

This is an example -- the all points undefined error occurs when there is no data associated with the given test/machine/daterange.
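
A quick way to confirm that diagnosis is to count how many rows of a datafile fall inside the requested xrange before calling gnuplot. The sketch below is Python, not part of the graph scripts, and it assumes each row is whitespace-separated with the timestamp in column 1 and the value in column 2 (which is what `using 1:2` suggests, but the real file layout was not shown here).

```python
from datetime import datetime

TIMEFMT = "%Y:%m:%d:%H:%M:%S"  # same format string as gnuplot's "set timefmt"

def points_in_range(lines, start, end):
    """Count rows whose timestamp lies in [start, end] and whose value is numeric."""
    lo = datetime.strptime(start, TIMEFMT)
    hi = datetime.strptime(end, TIMEFMT)
    count = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or truncated row
        try:
            stamp = datetime.strptime(fields[0], TIMEFMT)
            float(fields[1])
        except ValueError:
            continue  # unparseable row; skip it rather than guess
        if lo <= stamp <= hi:
            count += 1
    return count
```

A count of zero for the window in the xrange above would be consistent with the "all points undefined" message.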

So this leads me to wonder if there is a problem with how data gets synced to axolotl from the build systems?  Does anybody know what would cause the delivery of this data to be interrupted?
Status: NEW → ASSIGNED
Upon further investigation, I learned that Tinderbox accesses axolotl via HTTP.  When a build fires off, it accesses a URL, defined by $tmpurl, that points to collect.cgi in graph/.

For example:
my $tmpurl = "http://$Settings::results_server/graph/collect.cgi";
$tmpurl .= "?value=$value&data=$data_plus_co_time&testname=$testname&tbox=$tbox";

What this means is that in order for build information to be properly inserted as a datapoint in the build graph datafile (/db/$tbox), collect.cgi needs to:

1) be accessible over HTTP by the build machine
2) work correctly
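
For reference, the request a build machine makes can be sketched as follows. This is Python rather than the Perl of the build scripts; the parameter names come from the $tmpurl example above, while the function name and the use of URL-encoding are assumptions (the snippet above concatenates the values unescaped).

```python
from urllib.parse import urlencode

def collect_url(results_server, value, data, testname, tbox):
    """Build the collect.cgi URL with the four parameters from the $tmpurl example."""
    query = urlencode({
        "value": value,
        "data": data,
        "testname": testname,
        "tbox": tbox,
    })
    return f"http://{results_server}/graph/collect.cgi?{query}"

# Hypothetical values, for illustration only.
url = collect_url("build-graphs.mozilla.org", 453, "2006:01:26:12:36:11",
                  "xulwinopen", "comet.mozilla.org")
```

Fetching such a URL by hand from a build machine is one way to check both conditions at once.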

In our case, we know that access was not the problem, because the build IP space is allowed access to the graph/ directory on axolotl.

In the second case, it was apparent that the patch meant to fix security holes in *.cgi (particularly for sanitizing GET parameters and opening files properly in rawdata.cgi and graph.cgi to disallow malicious inputs and/or injection) caused a syntax error on the last line of the cleansing for collect.cgi (see attachment 206844):
> +die "Unexpected value for parameter 'data' supplied"
> +     unless $data =~ /^(?:\d+:?)*$/

My conclusion is that because of this syntax error, for the period between the date when the patch was applied (probably Jan 10th) to now, data has not been updated for build graphs.

This would explain the errors in gnuplot that report missing data and/or "All points undefined."

The plan for resolving this:

1) Update/fix *.cgi to allow for null inputs, and re-verify all scripts to ensure there are no parse/syntax errors
2) Verify that build data is once again being inserted into /graph/db/$tbox
3) Document Tinderbox's dependency on graph/collect.cgi so this doesn't happen in the future
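
Step 1 could be partly automated with Perl's compile-only switch, `perl -c`, which checks syntax without running the script body (BEGIN blocks and `use` statements are still executed). A minimal Python wrapper, with hypothetical function names and no claim about how the scripts are actually deployed:

```python
import subprocess

def perl_syntax_ok(path):
    """Run `perl -c` on one script and report whether it compiled cleanly."""
    result = subprocess.run(["perl", "-c", path], capture_output=True, text=True)
    return result.returncode == 0

def find_broken_scripts(paths, check=perl_syntax_ok):
    """Return the subset of paths that fail the syntax check."""
    return [p for p in paths if not check(p)]
```

Running this over graph/*.cgi after each patch would catch a missing semicolon before it silently stops data collection.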

Thoughts?
So in other words performance data is not actually being collected and hasn't been since Jan 10?

If so we should probably close the tree, especially since tinderbox itself seems to lose such data after some time in my experience...

Is there any way we can re-scrape the tinderbox logs since data collection stopped to restore the data?
It should be possible to scrape the apache logs for past data from the build systems, and create entries retro-actively to restore lost data.
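
A sketch of what that scrape might look like, in Python. It assumes the access log records each request line with its full query string, and that the build machines used GET as in the $tmpurl example; the function name is hypothetical and this is not the script that was actually used.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Matches the quoted request line of an Apache access-log entry.
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/[\d.]+"')

def recover_points(log_lines):
    """Extract (tbox, testname, data, value) from logged collect.cgi requests."""
    points = []
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue
        url = urlsplit(match.group(1))
        if not url.path.endswith("/graph/collect.cgi"):
            continue
        params = parse_qs(url.query)
        try:
            points.append((params["tbox"][0], params["testname"][0],
                           params["data"][0], params["value"][0]))
        except KeyError:
            continue  # request lacked one of the expected parameters
    return points
```

Because the build machines kept sending requests even while collect.cgi was failing, the failed requests themselves carry the lost datapoints.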

Someone will look into doing this soon, and hopefully we can get the data back this way.  :)

BTW, the graphs should be back up, pending a review on the patch for bug 321234.
This has been re-enabled.  We are working to get back historical data...should have it back in the next few days.
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
The historical data should be back for this bug.  I was not however able to retrieve the average data.  I didn't look into this much, it may be easy to do.
Just having the red graph is probably fine unless the other is dead-easy.  Thank you for fixing!
Jeremy > *!
Is bug 321234 going to be opened and/or the fixes checked into CVS?
(In reply to comment #20)
> Is bug 321234 going to be opened and/or the fixes checked into CVS?
> 

Yes, as soon as the webtools/security team reviews the code for cvs check-in.
So... I have some bad news.  :(  The graphs are up and there is data up to the middle of the day on Jan 27 (so up to comment 16 or comment 17 on this bug).  None of the data from after that has made it into the graphs.  Should I file a separate bug on this?
As pointed out in comment 22, collect.cgi apparently isn't working at all now...

Going to need oremj's magic to fill in the blanks again after we get it working, too.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #23)
> As pointed out in comment 22, collect.cgi apparently isn't working at all
> now...
> 
> Going to need oremj's magic to fill in the blanks again after we get it
> working, too.
> 

Is it really broken, or were file permissions possibly altered by the fill-in-the-blanks process?  It was working right up until the approximate time that the blanks got filled in, from what I can see.

Just something to check.
Very possible that this is a permission problem from when I patched...  I fixed the permissions; we'll see if the data starts appearing again.
Permissions were the problem. I'll fill in the missing data and reset the permissions and it should be all fixed.
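
A check along these lines could flag the problem right after a backfill. It is a Python sketch under the assumption that the datafiles must be owned by, and writable for, the account the CGI runs as; the function name is hypothetical, and the uid and directory would need to be supplied by whoever runs it.

```python
import os
import stat

def suspect_files(directory, expected_uid):
    """List files not owned by expected_uid or lacking the owner write bit."""
    suspects = []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            path = os.path.join(root, name)
            info = os.stat(path)
            owner_can_write = bool(info.st_mode & stat.S_IWUSR)
            if info.st_uid != expected_uid or not owner_can_write:
                suspects.append(path)
    return suspects
```

An empty result for the db directory would suggest collect.cgi can append to every datafile again.
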
Graphs look good right now -- resolving.

If anybody notices strange or completely nonsensical graph information, please holler.
Status: REOPENED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard