Closed Bug 1165335 Opened 7 years ago Closed 6 years ago

get_bugs_for_search_term doesn't handle utf8 search terms (parse-log UnicodeEncodeError: 'ascii' codec can't encode characters in position 17-28: ordinal not in range(128))

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/556195-eb34d411-fb0a-11e4-94c4-c81f66b8ceca

File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/log_parser/tasks.py", line 34, in parse_log
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/log_parser/utils.py", line 102, in post_log_artifacts
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/log_parser/utils.py", line 77, in _retry
File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/celery/app/task.py", line 660, in retry
File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/celery/utils/__init__.py", line 242, in maybe_reraise
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/log_parser/utils.py", line 95, in post_log_artifacts
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/log_parser/utils.py", line 47, in extract_text_log_artifacts
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/error_summary.py", line 219, in get_error_summary_artifacts
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/error_summary.py", line 47, in get_error_summary
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/error_summary.py", line 188, in get_bugs_for_search_term
File "/usr/lib64/python2.7/urllib.py", line 1332, in urlencode
Attached file live_backing.log
[2015-05-15 07:00:22,545: ERROR/Worker-66] Failed to download/parse log for fx-team 44f1bba0-39e2-4ed4-9ae7-96c0eac3d128/0 (https://queue.taskcluster.net/v1/task/RPG7oDniTtSa55bA6sPRKA/runs/0/artifacts/public/logs/live_backing.log): 'ascii' codec can't encode characters in position 17-28: ordinal not in range(128)

[2015-05-15 07:00:22,551: ERROR/MainProcess] Task parse-log[5495384c-fbbd-4a2d-aa81-929edd2693fd] raised unexpected: UnicodeEncodeError('ascii', u'AssertionError: "\u013f\u0227\u0227\u019e\u0260\u016d\u016d\u0227\u0227\u0260\u1e17\u1e17" == "Language"', 17, 29, 'ordinal not in range(128)')
Traceback (most recent call last):
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/newrelic-2.50.0.39/newrelic/hooks/application_celery.py", line 66, in wrapper
    return wrapped(*args, **kwargs)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/celery/app/trace.py", line 437, in __protected_call__
    return self.run(*args, **kwargs)
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/log_parser/tasks.py", line 34, in parse_log
    check_errors
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/log_parser/utils.py", line 102, in post_log_artifacts
    _retry(e)
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/log_parser/utils.py", line 77, in _retry
    retry_task.retry(exc=e, countdown=(1 + retry_task.request.retries) * 60)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/celery/app/task.py", line 660, in retry
    maybe_reraise()
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/celery/utils/__init__.py", line 242, in maybe_reraise
    reraise(exc_info[0], exc_info[1], exc_info[2])
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/log_parser/utils.py", line 95, in post_log_artifacts
    job_guid, check_errors)
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/log_parser/utils.py", line 47, in extract_text_log_artifacts
    artifact_list.extend(get_error_summary_artifacts(artifact_list))
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/error_summary.py", line 219, in get_error_summary_artifacts
    "blob": json.dumps(get_error_summary(all_errors))
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/error_summary.py", line 47, in get_error_summary
    bugscache_uri
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/error_summary.py", line 188, in get_bugs_for_search_term
    query_string = urllib.urlencode(params)
  File "/usr/lib64/python2.7/urllib.py", line 1332, in urlencode
    v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 17-28: ordinal not in range(128)


Bad log attached.
The issue is with our urlencode() of the error line that's been found, when we try to query the bugscache.

We should either manually convert the string to ascii, use the Django urlencode() (which I think has better handling for this), or else try using the Python requests library, in case it has more intelligent handling.
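As a rough illustration of the first option, here is a minimal sketch that encodes text values explicitly before calling urlencode(). It encodes to UTF-8 rather than ASCII, which is an assumption about what the bugscache endpoint accepts, and `build_query_string` and `TEXT_TYPE` are hypothetical names that do not come from the Treeherder code.

```python
try:
    from urllib import urlencode  # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3


def build_query_string(params):
    """Encode text values as UTF-8 bytes, then URL-encode the mapping.

    Hypothetical helper for illustration; not the Treeherder implementation.
    """
    encoded = {}
    for key, value in params.items():
        if isinstance(value, TEXT_TYPE):
            # Encoding explicitly means urlencode() never has to fall back
            # on an implicit ASCII conversion of a unicode value.
            value = value.encode("utf-8")
        encoded[key] = value
    return urlencode(encoded)


query_string = build_query_string(
    {"search": u'AssertionError: "\u013f\u0227" == "Language"'}
)
print(query_string)
```

Each non-ASCII character ends up as percent-encoded UTF-8 bytes in the query string, so the server must decode the parameter as UTF-8 for the search term to round-trip.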
Priority: P3 → P2
Summary: parse-log exceptions:UnicodeEncodeError: 'ascii' codec can't encode characters in position 17-28: ordinal not in range(128) → get_bugs_for_search_term doesn't handle utf8 search terms (parse-log UnicodeEncodeError: 'ascii' codec can't encode characters in position 17-28: ordinal not in range(128))
+1 to using requests.
I was going to say I was pretty sure the line / search term that caused the issue was:

AssertionError: "Ŀȧȧƞɠŭŭȧȧɠḗḗ" == "Language"

But I can't repro when using that in a simple testcase locally:

>>> import urllib
>>> params = { 'search': 'AssertionError: "Ŀȧȧƞɠŭŭȧȧɠḗḗ" == "Language"' }
>>> s = urllib.urlencode(params)
>>> params
{'search': 'AssertionError: "\x8e\xa8\xd4\xf5\xd4\xf5\x92z\x90 \x8f\xf0\x8f\xf0\xd4\xf5\xd4\xf5\x90 \xa0\xf7-\xa0\xf7-" == "Language"'}
>>> s
'search=AssertionError%3A+%22%8E%A8%D4%F5%D4%F5%92z%90+%8F%F0%8F%F0%D4%F5%D4%F5%90+%A0%F7-%A0%F7-%22+%3D%3D+%22Language%22'
>>>
Though I can if I stick the 'u' in front of the string:

>>> params = { 'search': u'AssertionError: "Ŀȧȧƞɠŭŭȧȧɠḗḗ" == "Language"' }
>>> s = urllib.urlencode(params)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\urllib.py", line 1338, in urlencode
    v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 17-23: ordinal not in range(128)
>>>
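The difference between the two sessions comes down to the `str(v)` call inside Python 2's urlencode(): a byte string passes through unchanged, while `str()` on a unicode object encodes it with the default ASCII codec. The sketch below performs that ASCII encode explicitly so that it also runs on Python 3. It uses the string from the worker log above; the variable names are illustrative only.

```python
# The unicode value reported in the worker log's UnicodeEncodeError.
text = (
    u'AssertionError: "\u013f\u0227\u0227\u019e\u0260\u016d\u016d'
    u'\u0227\u0227\u0260\u1e17\u1e17" == "Language"'
)

failure = None
try:
    # Python 2's str(unicode_value) does the equivalent of this encode.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # start/end delimit the run of characters the codec could not encode.
    failure = (exc.encoding, exc.start, exc.end)

print(failure)

# Encoding to UTF-8 succeeds, because UTF-8 can represent every code point.
utf8_bytes = text.encode("utf-8")
print(len(utf8_bytes))
```

The reported range covers the twelve non-ASCII characters that follow the opening quote, which corresponds to "position 17-28" in the production error message.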
Depends on: 1165356
This will be fixed by bug 1165356.
Status: NEW → RESOLVED
Closed: 7 years ago
No longer depends on: 1165356
Resolution: --- → DUPLICATE
Duplicate of bug: 1165356
Reopening since bug 1188661 is the same as this issue and happening often on some TC jobs, so we don't want to wait until bug 1165356 is fixed.
Assignee: nobody → emorley
Blocks: 1165356
Status: RESOLVED → REOPENED
Priority: P2 → P1
Resolution: DUPLICATE → ---
Status: REOPENED → ASSIGNED
Duplicate of this bug: 1188661
Attachment #8642521 - Flags: review?(cdawson) → review+
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/5a0aa0cdad3a4ee383e6a5afb1c5c545ecf0acff
Bug 1165335 - Remove unused etl.common.retrieve_api_content()

https://github.com/mozilla/treeherder/commit/fad76032a5a3cca48ac564262cd45c7288fdd3d8
Bug 1165335 - Switch from urllib to requests for bugscache API query

urllib isn't handling the unicode found in some log lines correctly,
whereas requests does. This prevents UnicodeEncodeError exceptions when
making the request to the bugscache API to find the bug suggestions for
these log lines.
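
For illustration only, a minimal sketch of how requests builds a query string from a unicode parameter. The code from the commit is not reproduced here, and `BUGSCACHE_URI` is a hypothetical placeholder rather than the real endpoint. Preparing the request without sending it is enough to inspect the resulting URL.

```python
import requests

# Hypothetical endpoint for illustration; not the real bugscache URI.
BUGSCACHE_URI = "https://treeherder.example/api/bugscache/"

params = {"search": u'AssertionError: "\u013f\u0227" == "Language"'}

# prepare() builds the final URL locally; no network request is made.
prepared = requests.Request("GET", BUGSCACHE_URI, params=params).prepare()
print(prepared.url)
```

In real use the same `params` dict would be passed to `requests.get(BUGSCACHE_URI, params=params)`, ideally with a `timeout` argument so a slow response cannot stall the log parser.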
Status: ASSIGNED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED
Duplicate of this bug: 1166958
Wes: this looks like a new bug. The log parsing had failed for some reason (I don't know why). I went into the DB directly and changed the status in the job_log_url table from "failed" to "pending". I then went into TH, where it showed as pending, and a second later it finished.

Would you open a new bug? I think the way to handle this is that, if the user clicks on a job with a failed parse status, we retry the parse anyway. It's low overhead and may just kick things into working. Maybe we should also figure out WHY it fails sometimes, though that could be a timeout or something. The bug might also mention that we should retry parsing automatically if it fails (if we don't already).
New bug please :-)
Flags: needinfo?(emorley)