Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check pofile string delimiters #1151

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

rtobar
Copy link

@rtobar rtobar commented Nov 17, 2024

This PR adds checks to the pofile parser code to validate that message strings are correctly delimited by double quotes. Keeping with the current design, an error is only raised if requested, otherwise a warning is printed, the faulty lines are corrected and parsing goes on.

I found this issue while processing a pofile used in the Spanish translation of the CPython documentation. One of our files was incorrectly written, and from all our tooling only the msgcat tool of GNU's gettext package complained, while babel, polib and others didn't. See python/python-docs-es#2873, izimobil/polib#161 and https://git.afpy.org/AFPy/powrap/pulls/4 for further reference.

While implementing this change I found that the _NormalizedString class not only was used to contain message lines, but also participated in the parsing process (and hid some parsing as well). I thus broke down my changes into three separate commits:

  • I first clarified the usage of the current _NormalizedString class across the codebase (see details in commit).
  • I then added the double quote delimitation check logic I wanted to add to the parser
  • Now that all strings have the same form, I more formally constrained how _NormalizedString behaves

Along the way I also implemented three small quality-of-life changes. They are included as the first three commits of this PR, happy to submit these separately if required:

  • Avoid re-compiling a regular expression
  • Remove a duplicate test assertion
  • Perform a better assertion in a particular test, allegedly what was intended in the first place

rtobar added a commit to python/python-docs-es that referenced this pull request Nov 18, 2024
En `library/re.po` había una entrada que no estaba delineada
correctamente con comillas dobles (si ven el diff entero es la última
entrada en el diff, o pueden ver simplemente el primer commit de este
PR). Esto hacía que `powrap --check` se saltara el archivo y no lo
validara. Esto, a su vez, ocurría porque la utilidad `msgcat` de
`gettext` identificaba el error de sintaxis, y fallaba al ser ejecutada.
`powrap` no consideraba esos errores al momento de calcular el exit code
del proceso, y por lo tanto el archivo no sólo seguía siendo inválido,
sino que tampoco era verificado. De igual forma, el archivo no podía ser
wrapeado correctamente usando `powrap library/re.po`.

Ya abrí un PR contra `powrap` para cambiar este comportamiento en
https://git.afpy.org/AFPy/powrap/pulls/4 (actualización: el PR ya fue
mergeado, y una nuevs versión de powrap fue publicada, pornlo que
también actualicé en este PR nuestra dependencia de powrap, además del
pre-commit hook de powrap).

Por otro lado, el resto de nuestras herramientas *no* consideraban este
archivo como inválido, Esto es porque `polib` no hacía la validación
correspondiente, e incorrectamente parseaba la entrada. También abrí un
PR contra polib para esto en izimobil/polib#161.
Actualización: en el intertanto también me di cuenta de que el paquete
`babel` sufre del mismo problema, yo incorrectamente había asumido que
babel dependía de polib; PR creada contra babel:
python-babel/babel#1151.

Después de corregir el error de sintaxis, ejecuté powrap de tal manera
que ahora `library/re.po` está bien formateado.

---------

Signed-off-by: Rodrigo Tobar <[email protected]>
@rtobar
Copy link
Author

rtobar commented Nov 21, 2024

Gentle ping, at least to kick off CI and check if there's any obvious mistakes to be fixed

Copy link

codecov bot commented Dec 9, 2024

Codecov Report

Attention: Patch coverage is 95.45455% with 1 line in your changes missing coverage. Please review.

Project coverage is 91.26%. Comparing base (740ee3d) to head (1a9a142).
Report is 7 commits behind head on master.

Files with missing lines Patch % Lines
babel/messages/pofile.py 95.45% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1151      +/-   ##
==========================================
- Coverage   91.27%   91.26%   -0.02%     
==========================================
  Files          27       27              
  Lines        4619     4634      +15     
==========================================
+ Hits         4216     4229      +13     
- Misses        403      405       +2     
Flag Coverage Δ
macos-12-3.10 ?
macos-12-3.11 ?
macos-12-3.12 ?
macos-12-3.13-dev ?
macos-12-3.8 ?
macos-12-3.9 ?
macos-12-pypy3.10 ?
ubuntu-22.04-3.10 90.05% <95.45%> (-0.02%) ⬇️
ubuntu-22.04-3.11 89.98% <95.45%> (-0.02%) ⬇️
ubuntu-22.04-3.12 90.20% <95.45%> (-0.02%) ⬇️
ubuntu-22.04-3.13-dev 89.72% <95.45%> (-0.01%) ⬇️
ubuntu-22.04-3.8 89.98% <95.45%> (-0.02%) ⬇️
ubuntu-22.04-3.9 89.98% <95.45%> (-0.02%) ⬇️
ubuntu-22.04-pypy3.10 90.05% <95.45%> (-0.02%) ⬇️
windows-2022-3.10 90.17% <95.45%> (-0.02%) ⬇️
windows-2022-3.11 90.10% <95.45%> (-0.02%) ⬇️
windows-2022-3.12 90.32% <95.45%> (-0.02%) ⬇️
windows-2022-3.13-dev 89.84% <95.45%> (-0.02%) ⬇️
windows-2022-3.8 90.09% <95.45%> (-0.02%) ⬇️
windows-2022-3.9 90.09% <95.45%> (-0.02%) ⬇️
windows-2022-pypy3.10 90.17% <95.45%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@akx akx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! Sorry for the delay and thanks for the contribution! ❤️

Some comments within – very nice work on the very detailed commit messages; they were a joy to read!

babel/messages/pofile.py Outdated Show resolved Hide resolved
babel/messages/pofile.py Outdated Show resolved Hide resolved
babel/messages/pofile.py Show resolved Hide resolved
babel/messages/pofile.py Outdated Show resolved Hide resolved
babel/messages/pofile.py Show resolved Hide resolved
Copy link
Author

@rtobar rtobar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @akx for the review, it's much appreciated.

I'm far from computers.at the moment, so I won't be pushing anything until next week, at least. In the meantime please see my replies. I'll also attend to the coverage report that was raised, I thought I had covered all bases with the tests.

Do you prefer force-pushes (I know I do) or further commits?

babel/messages/pofile.py Outdated Show resolved Hide resolved
babel/messages/pofile.py Outdated Show resolved Hide resolved
babel/messages/pofile.py Show resolved Hide resolved
babel/messages/pofile.py Outdated Show resolved Hide resolved
babel/messages/pofile.py Show resolved Hide resolved
While the re module caches some of the latest compilations, it's better
form to not rely on it doing so.

Signed-off-by: Rodrigo Tobar <[email protected]>
The exact same check is performed a few lines above.

Signed-off-by: Rodrigo Tobar <[email protected]>
Since Python 2.3 sorted() has been guaranteed to be stable. The comment
was wrong, and thus it makes sense to do the full assertion as clearly
intended.

Signed-off-by: Rodrigo Tobar <[email protected]>
The _NormalizeString helper class mixes some responsibilities, not only
acting as a container for potentially multiple lines of a single string
message, but also doing and hiding some of the parsing of such strings.
"Doing" because it performs a .strip() on all incoming strings in order
to remove any whitespace before/after them, and "hiding" because when
invoking the "denormalize" method, each line is slices to remove the
first and last element, which are implicitly assumed to be the string
delimiters (double quotes, in principle).

These multiple roles have already led to confusion within the codebase
as to how this class is supposed to be used. Its existing unit test
doesn't provide strings with proper delimiters (and thus calling
.denormalize() on these objects would return unexpected results -- empty
strings in all cases). Similarly, missing msgstr instances also result
in a call to _NormalizeString(""), which does work, but is conceptually
incorrect, as the empty string is somethiing that _NormalizeString
should never see coming in.

This commit changes all the places where confusing usage of the
_NormalizeString class happens. In particular, the existing unit test's
strings are now always delimited by double quotes (so calling
.denormalize on them would yield the expected value). A number of new
unit tests have also been added exercising the denormalize() method,
which includes unescaping escaped characters. Finally, the construction
of an empty string message has been simplified to _NormalizeString().

Signed-off-by: Rodrigo Tobar <[email protected]>
Strings should be delimited on both ends by double quotes, but this
is currently not being been detected, and content is simply being
incorrectly trimmed.

This commit adds a check for each string to verify it starts and ends
with a double quote character, issuing a warning/error if that's not the
case (and fixing it as appropriate).

A few new test cases have been added to check that the lack of double
quotes to delimit strings issues errors as expected.

Signed-off-by: Rodrigo Tobar <[email protected]>
Now that all strings given as inputs to _NormalizeString have been
verified (or corrected) to be correctly delimited with double quotes,
there's no reason to continue doing an internal strip anymore. Moreover,
we can express this internal constraint with an assertion to avoid
issues in the future.

Signed-off-by: Rodrigo Tobar <[email protected]>
@rtobar rtobar force-pushed the check-string-delimiters branch from 1a9a142 to ff5213b Compare December 17, 2024 08:37
@rtobar
Copy link
Author

rtobar commented Dec 17, 2024

@akx thanks for the patience .I finally found some time today to work on this. I opted for adding yet more checks (and tests) for the msgstr plural entries, hence ensuring their messages are as well-formed as the rest. I re-based on top of the latest master and push-forced, and I marked all conversations as resolved, mostly to keep track of what I needed to work on, so don't necessarily interpret that as me not being open to further suggestions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants