Closed Bug 863602 Opened 11 years ago Closed 11 years ago

5 partial update mar urls not tested in final verification check in Firefox 21.0b3

Categories

(Release Engineering :: Release Automation: Other, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: pmoore)

References

Details

Attachments

(1 file)

Log file was too large to attach - instead you can see it here:

http://people.mozilla.org/~pmoore/firefox21.0b3_final_verification_log.txt

Pete
Please note, this line does exist in the log file:

Wed Apr 17 05:13:49 PDT 2013:  Successfully extracted mar url and mar file size for patch type 'partial' from https://aus3.mozilla.org/update/1/Firefox/21.0/20130401192816/WINNT_x86-msvc/ku/releasetest/update.xml?force=1

This should have returned the mar file url:
http://download.mozilla.org/?product=firefox-21.0b3-partial-21.0b1&os=win&lang=ku&force=1

This, in turn, should have redirected to the url:
http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/21.0b3/update/win32/ku/firefox-21.0b1-21.0b3.partial.mar

I am currently investigating why this was not the case. It is strange because there were no errors. I will keep this ticket posted with progress.
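For reference, a manual spot-check along these lines can confirm what the update snippet and the bouncer url actually serve (hypothetical commands, not part of the verification harness - the attribute layout of update.xml is assumed):

    # fetch the update snippet and show the partial patch element (URL and size attributes)
    curl -s 'https://aus3.mozilla.org/update/1/Firefox/21.0/20130401192816/WINNT_x86-msvc/ku/releasetest/update.xml?force=1' \
        | grep -o '<patch [^>]*type="partial"[^>]*>'

    # follow the bouncer url and show where it redirects to
    curl -sIL 'http://download.mozilla.org/?product=firefox-21.0b3-partial-21.0b1&os=win&lang=ku&force=1' \
        | grep -i '^location:'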
I believe I have finally found the cause of this issue: a concurrency bug in the code. After turning on shell debugging and running several tests, I was able to reproduce a similar failure while capturing a 221 MB log file (3.5 million lines of output!). Thankfully vim handles such large log files with ease! :)

Since the log file is so long, I will just show here the lines that are critical to understanding the cause of the error.

Line 25 of get-update-xml.sh writes the mar url and the file size to standard out:
    echo "${mar_url_and_size}"

This script is forked by the xargs command, which runs multiple (by default 128) separate processes. Since the subprocesses write to standard out with a simple echo command, the output of the xargs command is the concatenation of the output of all the individual subprocesses. The collected output should therefore contain the ${mar_url_and_size} value from each of the child processes.
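To illustrate the fan-out, here is a minimal sketch (not the actual invocation from the verification script; it assumes each worker is started with a single argument, and the output file name is made up) - every worker that xargs spawns inherits the same standard output:

    # all 128 workers share the parent's stdout, so their echoes land in one stream
    printf '%s\n' km hi-IN ko hy-AM lg \
        | xargs -n 1 -P 128 ./get-update-xml.sh > collected-mar-urls.txt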

Here is a subset of this output, taken from the debug log:

pmoore@hazel:~/git/tools/release master $ cat newoutput2.log | sed -n '594550,594949p' | grep "^+ echo 'http:"
+ echo 'http://download.mozilla.org/?product=firefox-21.0b3-complete&os=win&lang=km&force=1 27046714'
+ echo 'http://download.mozilla.org/?product=firefox-21.0b3-complete&os=win&lang=hi-IN&force=1 26789119'
+ echo 'http://download.mozilla.org/?product=firefox-21.0b3-complete&os=win&lang=ko&force=1 26758540'
+ echo 'http://download.mozilla.org/?product=firefox-21.0b3-complete&os=win&lang=hy-AM&force=1 26786915'
+ echo 'http://download.mozilla.org/?product=firefox-21.0b3-complete&os=win&lang=lg&force=1 26754427'

Here we see the output of five different instances of get-update-xml.sh, running in parallel, passing their output back to the standard out of the xargs command that forked them.

Later on in the log file, we see the output that the xargs command actually received back from the get-update-xml.sh child processes it created (this output was written to a file, which I printed later in the log):

pmoore@hazel:~/git/tools/release master $ cat newoutput2.log | sed -n '3471714,3471718p'
http://download.mozilla.org/?product=firefox-21.0b3-complete&os=win&lang=km&force=1 27046714
http://download.mozilla.org/?product=firefox-21.0b3-complete&os=win&lang=ko&force=1 26758540
19
http://download.mozilla.org/?product=firefox-21.0b3-complete&os=win&lang=hy-AM&force=1 26786915
http://download.mozilla.org/?product=firefox-21.0b3-complete&os=win&lang=lg&force=1 26754427

The suspicious '19' on a line of its own provides the evidence of the problem. Compare these lines to the original ones that were echoed above. Essentially, the first debug output shows what was scheduled to be written to standard out, and the second debug output shows what actually ended up there - and they are not the same.

The reason is that these two lines were written to the standard output of the xargs process starting *at the same point* (rather than one after the other):

+ echo 'http://download.mozilla.org/?product=firefox-21.0b3-complete&os=win&lang=hi-IN&force=1 26789119'
+ echo 'http://download.mozilla.org/?product=firefox-21.0b3-complete&os=win&lang=ko&force=1 26758540'

The first line is written first, but the second line is written afterwards, overwriting the content of the first line. However, the first line was 3 characters longer, so the final three characters of its string (1, 9, newline) remain in the standard output even after the second line has been written. This is why we see the mysterious "19" on a line all of its own - it is the last three characters of the original string. The other characters of the original url are replaced by the second url, which was three characters shorter.

In this instance, since the original url was longer than the second one, we can see evidence that the original url was written to standard out but partially replaced, because a fragment of it remains. Had it been the other way around, with the second url being the longer one, we would not have seen any trace of the first url at all. This is what I believe happened in the case of the five missing urls that this ticket relates to. I will now investigate to test that hypothesis.
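The same kind of partial overwrite can be reproduced in isolation with a toy example (hypothetical - it is not taken from the verification code, and it forces both writers to start at offset 0 of the same file):

    # truncate the target file, then let two writers race from offset 0
    : > race.txt
    ( exec 1<> race.txt; echo 'lang=hi-IN&force=1 26789119' ) &   # longer line
    ( exec 1<> race.txt; echo 'lang=ko&force=1 26758540' ) &      # shorter line
    wait
    cat race.txt   # if the shorter write lands second, a stray "19" line survives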

This bug certainly needs to be fixed, regardless of whether it is the same issue as the one raised in this ticket (which I am fairly sure it is). To do this, I will simply write to a temporary file rather than to standard out, so that the individual writers do not interfere with each other, since each will have its own file handle. The temporary files can then be collated serially afterwards, to ensure that they are all included.
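A minimal sketch of that approach (the file names and paths here are hypothetical, not taken from the actual patch):

    # in the worker (get-update-xml.sh): write to a private temp file instead of stdout
    mar_tmpfile="$(mktemp "${TMPDIR:-/tmp}/mar_url_and_size.XXXXXX")"
    echo "${mar_url_and_size}" > "${mar_tmpfile}"

    # in the driver, after xargs has returned: collate the temp files serially
    cat "${TMPDIR:-/tmp}"/mar_url_and_size.* > collected-mar-urls.txt
    rm -f "${TMPDIR:-/tmp}"/mar_url_and_size.*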

Lesson learned: be careful when forking processes that write to a shared stream (e.g. standard out or standard error), since the subprocesses may write in parallel and can overwrite each other's output.
Looks like it might be a job for GNU parallel instead of xargs...

http://www.gnu.org/software/parallel/man.html#differences_between_xargs_and_gnu_parallel

GNU parallel should provide the same functionality while avoiding the garbling of subprocess output... at least according to the manual!
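For example, a hypothetical drop-in replacement for the xargs call could look like this - GNU parallel's default --group behaviour buffers each job's output and only releases it when that job finishes, so lines from different workers cannot collide:

    # --group (the default) holds each job's stdout until the job has completed
    printf '%s\n' km hi-IN ko hy-AM lg \
        | parallel -j 128 ./get-update-xml.sh {} > collected-mar-urls.txt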
Hey Ben,

This patch makes sure that the parallel processes forked by xargs no longer write to shared streams (neither shared files nor shared standard out/standard error). Instead, each process writes to its own unique temporary files, and when xargs completes, those temporary files are processed (and eventually deleted).

Please note that, due to the synchronisation problems of subprocesses logging concurrently, no output is produced while xargs is running - so it may appear that nothing is happening. However, as soon as xargs completes, the "buffered logs" are flushed, so the full logging history is still visible. In other words, the logging output from xargs is simply postponed until xargs completes.
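The overall shape of that postponed logging is roughly the following (a hypothetical sketch, not the patch itself - the log file naming is made up):

    # each worker appends its log lines to its own file while running;
    # once xargs has returned, the driver replays and removes them
    xargs -n 1 -P 128 ./get-update-xml.sh < update-xml-urls.txt
    cat "${TMPDIR:-/tmp}"/get_update_xml.*.log
    rm -f "${TMPDIR:-/tmp}"/get_update_xml.*.log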

Please note the patch has been created against the changes introduced in bug 863428 - therefore the patch from bug 863428 should be applied before this patch.

Thanks,
Pete
Attachment #740100 - Flags: review?(bhearsum)
Depends on: 863428
Comment on attachment 740100 [details] [diff] [review]
Patch to be applied after patch from bug 863428

Review of attachment 740100 [details] [diff] [review]:
-----------------------------------------------------------------

Seems okay. In the future I think something like this is complex enough to warrant a Python rewrite, though.
Attachment #740100 - Flags: review?(bhearsum)
Attachment #740100 - Flags: review+
Attachment #740100 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering