Closed Bug 637091 Opened 13 years ago Closed 12 years ago

Refactor and improve the performance of generate_update_statistics

Categories

(Camino Graveyard :: General, defect)

All
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: alqahira, Assigned: stuart.morgan+bugzilla)

References

()

Details

Attachments

(1 file)

Right now, generate_update_statistics is a bunch of grep invocations (as many as 6) for each statistic, even though the vast majority of these greps are repeated for every stat.  This is slow and inefficient.

At the very least, we should factor out the 

grep "$date" $file | grep -Ev "$badips" | grep "Sparkle/"

and 

grep "$date" $file | grep -Ev "$badips" | grep "Sparkle/" | grep " /update-check?os="

things that are common and used all over and only run those once, stashing the results in a variable, rather than when doing every stat.

Stuart volunteered to rewrite the script in A Real Language like Ruby; Sam maybe volunteered to rewrite the script in A Real Language like Python.  I don't really care which, just as long as 

1) The script is faster and more performant;
2) It doesn't regress bug 582113 again;
3) I can still make trivial changes like adding new bad IPs or adding new versions on my own, without having to bug people; and
4) It doesn't break my wrapper droplet that lets me mass-generate results.
Assignee: nobody → stuart.morgan+bugzilla
Status: NEW → ASSIGNED
This is stage of one of the new script. It creates output that is identical* to the old script.

1) Check. Currently on the data I was using it was ~1/3 faster. That's for the fully compatible version where we only process one day though. It's doing initial parsing of all the days, so we can change it later to do all the days at once to get even more of a win over running the old script once with each day.
2) Check. In fact, it now accepts arbitrary numbers of files, and they can have spaces.
3) Check (I hope). Pretty much everything that needs tweaking is at the top of the file, and the format should be pretty clear**
4) Check? The params are the same and the output is almost identical, so I expect it'll be fine. I'll tweak if necessary (but, see my soon-to-be-written follow-up comment)

* A few counts are slightly higher with this script; I haven't tracked down exactly why, but I suspect it's differences between the regexes and my parsing. The numbers are close enough that I'm not concerned about debugging exactly why they aren't identical. Also, the language counts at the end aren't all lined up. That's easily fixable if we want to fix it.

** $by_version_and_os_keys is a bit opaque because I was trying to avoid duplicate work on updates like adding a new OS. The format is: each line is an array, the first element being the Camino major version, and the second being an array of OS major versions. It re-uses the $by_os_keys, but trims off 10.3 for post-1.6; [1..-1 is ruby for "everything from index 1 (the second element) to the last index.

Landed as http://hg.mozilla.org/camino/rev/086034bb94a5
So now that we have a real language to work with, we can easily change it in a couple of efficiency-boosting ways:
- Generate multiple days at once (determining what days are in the log dynamically)
- Make it output tab-separated values matching Docs, so we can (I believe) pipe the output to pbcopy and then paste one or more lines right into docs all at once

We should talk about the current flow and the ideal flow, so we can proceed as makes sense.
(In reply to comment #2)
> So now that we have a real language to work with, we can easily change it in a
> couple of efficiency-boosting ways:
> - Generate multiple days at once (determining what days are in the log
> dynamically)

Given that logs always have a split day, and that I have typically been running this mid-week (the night before meetings, when Tuesday's data is still incomplete), it sounds like dynamically determining the days in the log would be sub-optimal: 1) it'll do extra work for a) days already processed last week and b) days not yet completed, and 2) we'll always have to feed it two logs to get the split day processed, so there will be extra overhead from having two files when it tries to process the other 10-13 days.

> - Make it output tab-separated values matching Docs, so we can (I believe) pipe
> the output to pbcopy and then paste one or more lines right into docs all at
> once

This sounds good, although the current Docs structure is bad in a lot of ways (I'd like the newest foo to always be at the left, rather than the right; the languages should have the shipped languages first, in alpha order, and then any other languages, so that I don't have to move columns every time because we're no longer getting en-GB pings, etc.).

My current process is to run the script for a week's worth of data (I have an AppleScript that provides the date arguments, in a loop) and output to a .txt file.  I transform the .txt file into a .tsv with some regexes and open in NeoOffice to move things around to match the sub-optimal Docs layout.  (I also keep a local record in case something catastrophic happens to the Docs version, and I try but usually forget to download the Docs copy.)
> Given that logs always have a split day

I thought about; it would be pretty simple for the script to trim off the first and last days in whatever data set it had.

> and that I have typically been running this mid-week

So it sounds like it should either take a start and end date, or it should take a start date, and figure out the end date dynamically. Would one of those work for you?

> I'd like the newest foo to always be at the left, rather than the right

Agreed, that's seemed odd to me a number of times. If I re-order the columns and update the script at the same time will that be fine, or do I need to coordinate with you?

> the languages should have the shipped languages first, in alpha order, and then any
> other languages, so that I don't have to move columns every time because we're
> no longer getting en-GB pings, etc.).

So there are two ways we could do this:
1) like you suggested, and hard-code a list of the shipping languages in the script so they get the right order.
2) hard-code all languages in the script, have it output 0 when it doesn't see any pings, and have it print something to the console if it sees a language it doesn't expect.

2 isn't that much harder than 1, so whichever you prefer.

> I transform the .txt file into a .tsv with some regexes and open in
> NeoOffice to move things around to match the sub-optimal Docs layout.

Yeah, we can totally eliminate this step.
(In reply to comment #4)
> So it sounds like it should either take a start and end date, or it should take
> a start date, and figure out the end date dynamically. Would one of those work
> for you?

Start and end seems like it would be the best, since (in effect) that's what I'm currently doing from my AS.

> > I'd like the newest foo to always be at the left, rather than the right
> 
> Agreed, that's seemed odd to me a number of times. If I re-order the columns
> and update the script at the same time will that be fine, or do I need to
> coordinate with you?

If you re-order and update, that should be fine.

> > the languages should have the shipped languages first, in alpha order, and then any
> > other languages, so that I don't have to move columns every time because we're
> > no longer getting en-GB pings, etc.).
> 
> So there are two ways we could do this:
> 1) like you suggested, and hard-code a list of the shipping languages in the
> script so they get the right order.
> 2) hard-code all languages in the script, have it output 0 when it doesn't see
> any pings, and have it print something to the console if it sees a language it
> doesn't expect.
> 
> 2 isn't that much harder than 1, so whichever you prefer.

I'm not clear on these.  Does 1 still output other languages?  What are the "all languages" we're hard-coding in 2?

I'd rather not have any languages ever show up as columns in the sheet that don't get hits (so no 0s), and I mostly only care about the languages we ship (but I do like to see when it appears we have unofficial l10ns out there, and which ones/how many).  So, pending the answer to my questions, it seems like 1 would be my preferred option.
(In reply to comment #5)
> Start and end seems like it would be the best, since (in effect) that's what
> I'm currently doing from my AS.

Landed as http://hg.mozilla.org/camino/rev/cea6099de251

You can still give just one date, but you can also give startdate,enddate to generate a range. Also, the dates are now parsed, so you can use other forms (e.g., 3/21,3/26 will work). The old format is fine though. I changed the output format slightly; the date is output in a different form, and it now has a blank line and enclosing ============= banners, so each day stands out. Easily changed if that's a problem, but soon we'll have the ready-to-paste format instead.

For a rough idea of the benefit here: Doing one day takes me ~30 seconds, but 7 days only takes ~60.
OS: Mac OS X → Windows 7
I tried to run this on Apr 02 data, and 

1) My Mac is far less responsive while this is going on (I couldn't do anything, even switch programs, until the script started to error out), and 

2) It fails:
../../192branch/mozilla/camino/tools/stats/generate_update_statistics:135:in `print_spreadsheet_stats': undefined method `count' for #<Array:0x13c19a2c> (NoMethodError)
	from ../../192branch/mozilla/camino/tools/stats/generate_update_statistics:101:in `main'
	from ../../192branch/mozilla/camino/tools/stats/generate_update_statistics:99:in `each'
	from ../../192branch/mozilla/camino/tools/stats/generate_update_statistics:99:in `main'
	from ../../192branch/mozilla/camino/tools/stats/generate_update_statistics:279

The first version also fails with a similar error:

../../192branch/mozilla/camino/tools/stats/generate_update_statistics:136:in `print_spreadsheet_stats': undefined method `count' for #<Array:0x13c19b58> (NoMethodError)
	from ../../192branch/mozilla/camino/tools/stats/generate_update_statistics:102:in `main'
	from ../../192branch/mozilla/camino/tools/stats/generate_update_statistics:265
OS: Windows 7 → Mac OS X
Sorry, count requires a more recent version of ruby; I didn't realize that. Replaced with length.
(In reply to comment #8)
> Sorry, count requires a more recent version of ruby; I didn't realize that.
> Replaced with length.

Making that change locally, I got identical results* for April 2 with the old bash script and the new ruby one, and better responsiveness from the new one than in comment 7. :)

* Can we output '0' instead of 'nil'?

As for time:

Old: 25.185u 5.646s 0:24.88 123.8%	0+0k 0+11io 14pf+0w
New: 30.828u 6.164s 0:37.84  97.7%	0+0k 2+6io   1pf+0w

The new one seems to be marginally slower (so definitely not ~1/3 faster) here, but if the big win is that doing a week takes only twice as long as a single day, overall it'll be fine, and the slowdown in single day is small enough to not really be noticeable, I think.

I'll test with some other datasets later.
Whoops,(In reply to comment #9)
> * Can we output '0' instead of 'nil'?

Whoops, fixed. There were no 0's in the date I was testing.

I suspect the time issue (and the responsiveness issue) is that my script is keeping state, so it takes a fair amount of memory. If you have memory pressure, it'll be slower. I'll think about whether I can restructure it to be less memory-hungry without giving up flexibility.
For March 30-31 (all in one file):

Old-wrapper: 396.322u 53.172s 7:23.55 101.3%	0+0k 2+152io  76pf+0w (7m)
Old-wrapper (no time):                                                (1m)
Old-manual:   74.834u 11.661s 1:18.25 110.5%	0+0k 0+14io    0pf+0w (2m)
             325.833u 42.250s 5:58.70 102.6%	0+0k 0+12io    0pf+0w (6m)
New-wrapper:  63.719u 12.365s 1:30.76  83.8%	0+0k 0+123io 168pf+0w (1m)
New-wrapper (no time):                                                (1m)
New-manual:   34.725u  5.929s 0:40.78  99.6%	0+0k 0+12io    0pf+0w (1m)

The "wrapper" data includes some amount of interacting with the wrapper UI each time, and runs the command twice (once for each day) because I haven't modified my wrapper AppleScript yet to support the new syntax.

The "m" numbers in parens at the end are times calculated by the difference between created and modified times of the output file; I'm not sure if they're  directly comprable between new and old scripts (the old script should write the date essentially at start; the new one seems to wait a while before writing the date, but I'm not sure what sort of processing it's doing then).  Also, since there are no seconds and I didn't start right as the minute rolled, they could be off some, but they're good enough to tell me a couple of things about my weird data.

Also, there's something very odd going on with the 31 Mar data and the old script (the second line in "Old-manual"); I ran the 31st twice and still got outrageous results the second time!  It also probably accounts for part of the huge time with the wrapper script (but I have no idea why the "no time" run is so fast; a "no time" run of the manual script for the 31st still takes forever!).  There are ~24K bad pings on the 31st.

Anyway, the new script seems to be doing quite well, and I didn't notice any responsiveness issues tonight parsing those two dates out of an 86 MB file.

I did tonight (whereas I didn't in comment 9) notice data discrepancies like you mentioned in comment 1 (slightly higher in the new version); is it the difference between "205.202.243.3[0-9]" and "205.202.243.3.", or some other regex parsing?
3[0-9] and 3. should be identical as long as we aren't getting garbage IP addresses.

Like I said before, I don't know for certain because they were close enough I was confident that there wasn't a glaring problem. What I suspect though (and I guess would be easy enough to verify if we want) is that it's due to other overmatching in the IP filter. For example:
  63.245.210.43
  72.10.105.23
  75.135.19.219
  97.119.20.170
  97.119.24.98
in the old script would all match the corresponding IP addresses starting with 1 (so, 163.245...). The second would match .23, but also .230 through .239.
I'm glad the perf is looking okay :) I'll try to resist the temptation to spend time optimizing the part that doesn't really matter, and work on improving the workflow instead.

I guess we should call this bug good, and open a new bug about changing the output format to be paste-able?
(In reply to Stuart Morgan from comment #13)
> I'm glad the perf is looking okay :) I'll try to resist the temptation to
> spend time optimizing the part that doesn't really matter, and work on
> improving the workflow instead.
> 
> I guess we should call this bug good, and open a new bug about changing the
> output format to be paste-able?

I think so, since you opened bug 647941 back in April.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.