Closed Bug 1469010 Opened 6 years ago Closed 6 years ago

[machinery] In TM, add support for querying strings longer than 255 characters

Categories

(Webtools Graveyard :: Pontoon, enhancement, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mathjazz, Assigned: jotes)

Attachments

(3 files)

Pontoon's internal translation memory doesn't work for strings longer than 255 characters due to a limitation of the Postgres levenshtein function.

I checked the limitation and it's only a #define[1] that's meant to protect memory and CPU from abuse[2], so that nobody asks for the Levenshtein distance of a big blob. It can be increased; they just used a "reasonable" varchar-like value.

I don't know which system you use, but on Debian (and derivatives) you can patch that, recompile PostgreSQL (just the server), and the limitation is gone. It takes a couple of minutes. I just did it.

# install the build dependencies and download the source
$ apt-get build-dep postgresql-9.6
$ apt-get source postgresql-9.6

# go into the directory with the source
$ cd DIR_WITH_SOURCE

# change the #define (MAX_LEVENSHTEIN_STRLEN); I set it to 512
$ vi src/backend/utils/adt/levenshtein.c

# build and install
$ dpkg-buildpackage -rfakeroot -uc -b
$ cd ..
$ dpkg -i NAME_OF_PACKAGE.deb

# restart the server

I tested the attached file before and after the patch:

* Result before the patch:

 levenshtein
-------------
           2
(1 row)

ERROR:  levenshtein argument exceeds maximum length of 255 characters
 levenshtein
-------------
           5
(1 row)


* Result after the patch:

 levenshtein 
-------------
           2
(1 row)

 levenshtein 
-------------
           5
(1 row)

 levenshtein 
-------------
           5
(1 row)

[1] https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/levenshtein.c#L26
[2] https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/levenshtein.c#L122

Hi Eduardo!

First of all, thanks for your input. I also considered modification of this plugin myself some time ago.

However, as far as I know, Pontoon is hosted on Heroku, and as far as I remember it's not possible to add a custom precompiled extension there (feel free to correct me if my knowledge is out of date).

Would it be possible to run just those searches (the ones involving source strings longer than 255 characters) on an external copy of the database, at Mozilla for example, where you could add the extension? If they account for less than 1%, the impact would, I guess, not be very noticeable.

Also, searching and comparing only the first 255 characters might still be helpful, though a 100% match would have to be downgraded to some other value so that translators know it's not a true 100% match. That could work for many strings, and it's better than an empty Machinery box.
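
The truncation idea could look roughly like this in Python. This is only a sketch under assumptions: the cap of 99, the quality formula (1 - distance / longer length, as a percentage), and all function names are made up for illustration and are not taken from Pontoon.

```python
LIMIT = 255            # Postgres levenshtein argument limit
CAPPED_QUALITY = 99.0  # hypothetical ceiling for matches based on truncated input


def edit_distance(a, b):
    """Levenshtein distance using a two-row dynamic-programming table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def truncated_quality(source, query, limit=LIMIT):
    """Compare only the first `limit` characters of each string.

    If either string had to be cut, the result is capped below 100 so a
    translator can tell it is not a verified full match.
    """
    was_truncated = len(source) > limit or len(query) > limit
    a, b = source[:limit], query[:limit]
    longest = max(len(a), len(b))
    if longest == 0:
        quality = 100.0
    else:
        quality = (1 - edit_distance(a, b) / longest) * 100
    if was_truncated:
        quality = min(quality, CAPPED_QUALITY)
    return quality
```

With this, two long strings that share their first 255 characters would show up at 99 rather than 100, while short strings are scored as usual.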

I don't know the internals, so those ideas don't necessarily make sense. Just trying to help.

IMHO, the fastest way to solve this issue is to measure the Levenshtein ratio in Python (only for strings longer than 255 characters).
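
A minimal sketch of what that Python-side calculation might look like. The function names and the ratio definition (1 - distance / length of the longer string) are assumptions for illustration, not necessarily what Pontoon uses.

```python
POSTGRES_LEVENSHTEIN_LIMIT = 255  # longer arguments make Postgres raise an error


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance computed with two rows of the dynamic-programming table."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def needs_python_fallback(text: str) -> bool:
    """True when the string is too long for the Postgres function."""
    return len(text) > POSTGRES_LEVENSHTEIN_LIMIT
```

The pure-Python loop is quadratic in string length, so it would only be applied to the few entries above the limit, with everything else still handled in the database.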

It may take some time before we find a better storage for the Translation Memory (or change the algorithm), and that probably goes beyond the scope of this bug.

:mathjazz what do you think about this approach?
Flags: needinfo?(m)
SGTM.
Flags: needinfo?(m)
Assignee: nobody → poke
Status: NEW → ASSIGNED
Commit pushed to master at https://github.com/mozilla/pontoon

https://github.com/mozilla/pontoon/commit/ce9db869a6a91afa293fbd9f5c489a779d0d5c0d
Fix bug 1469010 - Process all strings above 255 characters length in (#1121)

* Fix bug 1469010 - Process all strings above 255 characters length in
Python.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
I've deployed this patch to stage, and TM returns 500 for any input I try:
https://mozilla-pontoon-staging.herokuapp.com/machinery/

Internal Server Error: /translation-memory/ 
Traceback (most recent call last): 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/core/handlers/exception.py", line 41, in inner 
    response = get_response(request) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/core/handlers/base.py", line 249, in _legacy_get_response 
    response = self._get_response(request) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response 
    response = self.process_exception_by_middleware(e, request) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response 
    response = wrapped_callback(request, *callback_args, **callback_kwargs) 
  File "/app/.heroku/python/lib/python2.7/site-packages/newrelic/hooks/framework_django.py", line 544, in wrapper 
    return wrapped(*args, **kwargs) 
  File "/app/pontoon/machinery/views.py", line 62, in translation_memory 
    .minimum_levenshtein_ratio(text) 
  File "/app/pontoon/base/models.py", line 3030, in minimum_levenshtein_ratio 
    max_dist, 
  File "/app/pontoon/base/models.py", line 2934, in postgres_levenshtein_ratio 
    output_field=models.DecimalField() 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 945, in annotate 
    clone.query.add_annotation(annotation, alias, is_summary=False) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/sql/query.py", line 973, in add_annotation 
    summarize=is_summary) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/expressions.py", line 217, in resolve_expression 
    for expr in c.get_source_expressions() 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/expressions.py", line 411, in resolve_expression 
    c.lhs = c.lhs.resolve_expression(query, allow_joins, reuse, summarize, for_save) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/expressions.py", line 411, in resolve_expression 
    c.lhs = c.lhs.resolve_expression(query, allow_joins, reuse, summarize, for_save) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/expressions.py", line 548, in resolve_expression 
    c.source_expressions[pos] = arg.resolve_expression(query, allow_joins, reuse, summarize, for_save) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/expressions.py", line 411, in resolve_expression 
    c.lhs = c.lhs.resolve_expression(query, allow_joins, reuse, summarize, for_save) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/expressions.py", line 411, in resolve_expression 
    c.lhs = c.lhs.resolve_expression(query, allow_joins, reuse, summarize, for_save) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/expressions.py", line 471, in resolve_expression 
    return query.resolve_ref(self.name, allow_joins, reuse, summarize) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/sql/query.py", line 1477, in resolve_ref 
    self.get_initial_alias(), reuse) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/sql/query.py", line 1417, in setup_joins 
    names, opts, allow_many, fail_on_missing=True) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/sql/query.py", line 1352, in names_to_path 
    "Choices are: %s" % (name, ", ".join(available))) 
FieldError: Cannot resolve keyword 'source_length' into field. Choices are: entity, entity_id, id, locale, locale_id, project, project_id, source, target, translation, translation_id
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Commit pushed to master at https://github.com/mozilla/pontoon

https://github.com/mozilla/pontoon/commit/173b38f0389ffa1e2ed7d57db1e0bbcceb1fe3b1
Fix bug 1469010: Calculate source length in the quality expression (#1134)
Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Webtools → Webtools Graveyard