Handle ZWNBSP the same way as Word Joiner #145

Open
dhouck opened this issue Jan 1, 2025 · 1 comment

Comments

@dhouck

dhouck commented Jan 1, 2025

wink-nlp version: 2.3.0
wink-eng-lite-web-model: 1.8.0

The Unicode Zero-Width Non-Breaking Space character (ZWNBSP, U+FEFF) is now only supposed to be used as a Byte-Order Mark. However, it previously had the same job as the Word Joiner character (U+2060) and is still occasionally used that way. Unicode recommends treating a ZWNBSP that is not at the start of the file the same way as a Word Joiner.

Currently, the old ZWNBSP character is not output in the token stream, similar to #135. For example, I had a text with the date range 1830<U+FEFF>–<U+FEFF>1832, and the output did not include the U+FEFF characters at all. When I replace the deprecated U+FEFF characters with U+2060 Word Joiners, all characters are correctly reproduced in the output stream.
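
The manual replacement described above can be sketched as a small pre-processing step that runs before the text reaches the tokenizer. This is only a rough workaround under stated assumptions, not a proposed fix inside wink-nlp: `normalizeZwnbsp` is a hypothetical helper name, and the sketch assumes that a U+FEFF at index 0 is a Byte-Order Mark that should be left alone.

```javascript
// Hypothetical helper, not part of wink-nlp: a caller-side workaround sketch.
const ZWNBSP = "\uFEFF";      // Zero-Width Non-Breaking Space / Byte-Order Mark
const WORD_JOINER = "\u2060"; // Word Joiner

function normalizeZwnbsp(text) {
  // Assumption: a leading U+FEFF is a Byte-Order Mark, so keep it unchanged.
  const hasBom = text.startsWith(ZWNBSP);
  const body = hasBom ? text.slice(1) : text;
  // Every other U+FEFF is treated as a Word Joiner.
  const replaced = body.split(ZWNBSP).join(WORD_JOINER);
  return hasBom ? ZWNBSP + replaced : replaced;
}

// The date range from this report, written with escapes so the
// invisible characters are explicit (\u2013 is the en dash).
const input = "1830\uFEFF\u2013\uFEFF1832";
const normalized = normalizeZwnbsp(input);
```

The result has the same length as the input, so character offsets into the original text stay valid. The normalized string is what would then be passed to the tokenizer in place of the raw text.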

Note: I found this bug while debugging an issue in another project which uses Wink, and I donʼt know much about Wink myself. I expect this is enough information to identify the issue, but if not then I might need extra help to provide more useful information.

@rachnachakraborty
Member

Happy New Year 2025!

Thank you for detailing your observations on Zero-Width Non-Breaking Space handling.

We will take a while to address this. We may come back to you for clarifications.

Shall keep you posted.

Best,
Rachna
