Burmese script marking character placement problem with Harfbuzz / Speedata #52
The question was asked...
The screen shots are from both a browser and from a PDF file created by Speedata / Harfbuzz. But I can assure you that the Padauk font results in the same error in both the Speedata produced PDF and in the FF browser. And that the Microsoft mmrtext.ttf font correctly displays in both the Speedata PDF and the FF browser. |
The Speedata developer also commented...
See original comment here harfbuzz/harfbuzz#4784 (reply in thread) |
It looks like with this GitHub issue, we don't have threads like with the previous discussion. I assume this is a difference between issues and discussions, but if not, I am happy to enable a setting to have threaded issues. With OpenType, for marks that are positioned (which, AFAICT, includes the two smaller glyphs under a base glyph in all the examples above), the advance width of the mark glyph is set to zero, even if the mark glyph originally has width. The solution is to add an advance width to the cluster. We do this in other parts of the font (like for the solution for another medial form); we just missed this situation. Does that clarify the situation for @alerque and @pgundlach ? Adding an advance width might be needed regardless of whether stacking marks (which is what I think mmrtext.ttf is doing) or a ligature is used. If the base glyph is wide enough that the stacking marks or ligature do not extend to the right of the advance width of the base glyph, then no advance width needs to be added. |
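The guard-space rule in the comment above can be put as a small calculation. This is only a sketch of the arithmetic, not code from Padauk: the function name `extra_advance` and the font-unit numbers are hypothetical, and the real fix lives in the font's OpenType lookups.

```python
def extra_advance(base_advance: int, cluster_right_edge: int) -> int:
    """Extra advance width (font units) to add to a cluster.

    base_advance:       advance width of the base glyph.
    cluster_right_edge: rightmost x extent of the stacked marks or ligature,
                        measured from the base glyph's origin.
    Marks have zero advance, so anything drawn past the base advance
    would otherwise overlap the next glyph or spill into the margin.
    """
    return max(0, cluster_right_edge - base_advance)


# Hypothetical values: marks extend 200 units past a 1000-unit base.
print(extra_advance(1000, 1200))  # 200
# Base is wide enough, so nothing needs to be added.
print(extra_advance(1000, 900))   # 0
```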
Would it be simplest to just look at how mmrtext.ttf handles it and do the same? It should be in your Win11 fonts. |
I know how mmrtext.ttf does it, the question is how much work it would be to replicate that behaviour in Padauk. You can help by providing examples (you can save them up, you don't need to make a post each time you find one) of where stacking needs to be. I need the codepoints (including the base character) of the characters used, so either copy the text into your post or use something like Sploot! to see the codepoints in the text. I suspect all the examples you will find will be of the form CHCHC where C is a consonant and H is U+1039 MYANMAR SIGN VIRAMA. In some of the posts you mention using a browser, and sometimes that gets clarified as Firefox. It would help if you always specify Firefox (or Chrome, or Edge, or Safari) and the OS (such as Windows 11, macOS, Android, etc). |
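For anyone who prefers a script to a tool like Sploot!, here is a minimal Python sketch that lists the codepoints of a string using only the standard library. The helper name `describe` is made up for this example, and the sample string is the RA + VIRAMA pair written with escapes so it survives copy and paste.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per character in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


# U+101B followed by U+1039, written as escapes.
for line in describe("\u101B\u1039"):
    print(line)
# U+101B MYANMAR LETTER RA
# U+1039 MYANMAR SIGN VIRAMA
```

Pasting the output next to a screenshot gives the codepoints (including the base character) asked for above.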
Your comments about different shaping depending on where in the line the word is make sense. I just cannot replicate that behaviour, which means I am unable to test whether any fix I make resolves the original issue. In my testing the difference in shaping is due to whether Graphite or OpenType shaping is used. Using what is essentially the Padauk 5.001 font (before the recent fix to Graphite) with XeTeX from TeX Live 2023 on Ubuntu 24.04: [images: Graphite, OpenType]. With Padauk 5.002 (only two zeros, just like the two zeros in 5.001 above): [images: Graphite, OpenType]. The Graphite and OpenType shaping are still a bit different. The OpenType has more space around the second medial. This is why I suggest testing with different browsers. If you use Padauk 5.001 from Google fonts, that font has had the Graphite tables removed, so you should get the OpenType shaping everywhere (even with Firefox). I understand the OpenType shaping is still not ideal, but it is more readable than either Graphite shaping example. In this case, the solution seems clear (stack the medials, which are marks in OpenType), so maybe I don't need to understand the position-in-line part of the issue. But it is still a puzzle to me. |
I think you are correct that if we focus on stacking all the needed medials under the associated letter, we are on track to the solution. As for Speedata PDF versus Firefox versus any other browser, I frankly do not see any consequential difference in the rendering when using the same font. The software does not make the difference; the font does, whether Padauk5.001, Padauk5.002, or Microsoft mmrtext.ttf. The Google fonts are out of the equation for me. I know they don't work properly and I have much greater hope talking with you to get Padauk working. Microsoft mmrtext.ttf is useful because it does work in all the browsers and Speedata, and so maybe it can help us get Padauk working. Concerning the medial that overflows into the margin: do not make too big a deal about that. It only happens in Speedata/Harfbuzz PDFs because only that application has a rigid hard line margin, and Speedata does not have enough information from the font to be aware that the medial is in the margin. Browser displays do not have full justification on the right margin, so the problem is not as big in browser display. But it is a problem in the PDF. I will try to begin building a list of problem medials, though that seems a challenge. I will have to tediously compare the text rendering using mmrtext.ttf versus Padauk5.002 to locate the problems. I guess it would not be possible to study the mmrtext.ttf font itself and read its tables to locate all the exceptions we need? Maybe that kind of reverse engineering is not allowed with a copyrighted font? Alternatively, maybe there is a Myanmar / Burmese speaker who can simply list out all the cases that we need to know about. I am in touch with the Sanskrit Bible maintainer and will direct him to this issue page. Maybe he can help us. I know you want just one comprehensive list of medials and I will see what I can do. However, until then here are a few more cases to learn from, all from Matthew 4. 
The github.com comment in FF falls back to mmrtext.ttf from my Win11 box. This word in Matthew 4:3 is displayed differently by all three web pages above. ဘဝေသ္တရှျာဇ္ဉယာ This word from Matthew 4:12 likewise is different in all three webpages above. တဒွါရ္တ္တာံ This word from Matthew 4:13, same story... သီမ္နောရ္မဓျဝရ္တ္တီ I think you are saying we need a comprehensive list of all the letters that have these double, triple, and quadruple medials associated with them. That could be a short list or perhaps very long. |
Looking at mmrtext.ttf to see what medials it handles might not be allowed, as you mention. But I can make a page of all possible medials and use that font and see what it handles. For now, the data you have found is helpful. No need to look at lots of data to visually compare. I was thinking that you might search the text with a program to find all the examples of CHCHC where C is a consonant and H is U+1039 MYANMAR SIGN VIRAMA. Well, I forgot about characters that have names starting with MYANMAR CONSONANT SIGN MEDIAL. I will have to think about that. What do you mean by Myanmar / Burmese? Myanmar is a script and a country. Burmese is a people group and a language. Since other Myanmar script fonts do not handle two medials (except for mmrtext.ttf) I would guess the Burmese language does not need to have two medials, but I guess the Sanskrit language in Myanmar script does. So a Burmese speaker might not know the answer we need. How are you counting the medials? For CHCHC, that results in a base character and two medials below. I would call that a double medial. Would you call that a double? For each of the examples above, plus the original example, how many medials do you count? |
Oh maybe I can help with that and a little regex. Let me work on getting the unique consonant characters. I did find the sequence counts... If you know regex... I was using Myanmar and Burmese as synonyms. Sorry for the confusion. Yes you are right a Burmese speaker may not understand these medials nor the Sanskrit language. I have just contacted a Burmese fluent friend and a Sanskrit fluent associate to see what we can learn. Yes I would call CHCHC a double also. Though some of the medial characters look like 4 little accents, so I thought maybe there were as many as four. But my regex showed that three is the max. Is there any other "glue" unicode that I need to search for beyond \x{1039} ? |
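A hedged sketch of the kind of regex count discussed here, in Python. The regex actually used in this thread is not shown, so this is an assumption about its shape: a base letter followed by one or more (U+1039 VIRAMA + Myanmar character) pairs, where the number of VIRAMAs is the stacking depth. The base class (U+1000..U+1020 plus U+1050 and U+1051) and the name `stack_depths` are choices made for this example.

```python
import re
from collections import Counter

VIRAMA = "\u1039"

# Assumed pattern: one base letter, then one or more VIRAMA + character pairs.
# CHC gives depth 1, CHCHC depth 2 (a "double"), and so on.
STACK = re.compile("[\u1000-\u1020\u1050\u1051](?:\u1039[\u1000-\u109F])+")


def stack_depths(text: str) -> Counter:
    """Count how many stacks of each depth occur in text."""
    return Counter(m.group().count(VIRAMA) for m in STACK.finditer(text))


# RA-VIRAMA-DA-VIRAMA-DHA (a double) and KA-VIRAMA-KA (a single).
sample = "\u101B\u1039\u1012\u1039\u1013 \u1000\u1039\u1000"
print(stack_depths(sample))
```

This only looks for U+1039 as the "glue". The dedicated MYANMAR CONSONANT SIGN MEDIAL characters are separate codepoints and would need their own search.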
My bad on the scope of things that move when paired under a 101B. I've reduced it now and it should also look better in graphite at least. I've also added the sequence to our sanskrit test. |
I built a comprehensive medial sequence checker for the Sanskrit Burmese script NT. All medial sequences, defined as a character preceded by \x{1039}, are listed here https://stage.aionianbible.org/Debug/Sanskrit-Burmese. Let me know if I can do anything to improve the tool. You might need to clear your browser cache. |
Thanks for this. I wonder if I might be so bold as to ask you for a simpler form of this data: one string per line that I can run through a rendering test. I only need the 3 and 2 medials lists. Each line is a single string which is the test string. TIA |
Okay that is added. Visit the same page and refresh, https://stage.aionianbible.org/Debug/Sanskrit-Burmese. There is a list of all the unique sequences only and also a list of the unique sequences in a sample context, meaning the sequence plus one character on either side. This helps the font to display the medials better, though even the extra character doesn't always result in the proper display because more of the word is needed for the font to render it properly in some cases. Also note that the Microsoft font still proves to be the best. However, I noticed that in some cases when I display the sequence even the Microsoft font is not displaying the medials properly because it needs the whole context of the word. My debug page only displays the sequence and the sequence plus one character on either side. Let me know if you need to see the context of the entire word and I can see what I can do. That would be much harder though. |
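The two list formats described here (unique sequences, and each sequence plus one character on either side) could be produced along these lines. This is a sketch under assumptions, not the debug page's actual code: the pattern and the function name `sequences_with_context` are hypothetical.

```python
import re

# Assumed definition of a sequence: a Myanmar-block character followed by
# one or more (U+1039 VIRAMA + Myanmar-block character) pairs.
SEQ = re.compile("[\u1000-\u109F](?:\u1039[\u1000-\u109F])+")


def sequences_with_context(text: str, pad: int = 1) -> list[str]:
    """Return unique sequences, each widened by `pad` characters per side."""
    found = set()
    for m in SEQ.finditer(text):
        start = max(0, m.start() - pad)
        found.add(text[start:m.end() + pad])
    return sorted(found)


# One test string per line, ready to feed to a rendering test.
word = "\u1019\u101B\u1039\u1012\u1039\u1013\u102C"
for line in sequences_with_context(word):
    print(line)
```

Calling it with `pad=0` gives the bare sequences. A larger `pad` is a cheap way to approximate whole-word context when one extra character is not enough.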
My debug script has shown that the Sanskrit Burmese text uses the medial unicodes... 1000, 1001, 1002, 1003, 1005, 1006, 1007, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 1036, 1038, 1050, 1051, 100b, 100c, 100d, 100f, 101c, 101e Note that there are holes in the numeric sequence. I don't know about medials and what characters are included, but it may be that my Sanskrit text does not include all the medials that the font should handle. |
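To see what the holes are, the list above can be compared against a codepoint range with Python's `unicodedata`. This is a sketch: treating U+1000..U+1020 as the range to check is an assumption made for this example, and whether a font needs to handle the missing ones is a separate question.

```python
import unicodedata

# Hex codepoints exactly as reported in the comment above.
REPORTED = """1000 1001 1002 1003 1005 1006 1007 1009 1010 1011 1012 1013
1014 1015 1016 1017 1018 1019 1036 1038 1050 1051 100b 100c 100d 100f
101c 101e"""

used = sorted(int(h, 16) for h in REPORTED.split())


def name(cp: int) -> str:
    """Unicode character name for a codepoint."""
    return unicodedata.name(chr(cp), "<unnamed>")


# Assumed range of interest: U+1000 through U+1020 inclusive.
holes = [cp for cp in range(0x1000, 0x1021) if cp not in used]

for cp in holes:
    print(f"U+{cp:04X} {name(cp)}")
```

The same `name` helper is useful on the reported list itself, for example to confirm that U+1036 is MYANMAR SIGN ANUSVARA rather than a consonant.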
Thank you for the fantastic test data, that is very helpful. I don't think we need the whole word at this point. Different languages might have different medials, just like different languages in Latin script vary as to what diacritics exist and go with what letter. A medial is part of a consonant cluster. Consonants have an inherent vowel sound, that is why the name (and roughly the sound) of the first character is KA, not K. To get K, you have two characters, KA, VIRAMA. So the character sequence ရ္ဒ္ဓ (RA-VIRAMA-DA-VIRAMA-DHA) sounds roughly like R-D-DHA. More details are in the Unicode standard. Your first example of a triple medial ends with U+1039 MYANMAR SIGN VIRAMA, U+1036 MYANMAR SIGN ANUSVARA. I would have expected a consonant in place of the ANUSVARA, so I suspect that sequence is a typo. The stacking medials have now been improved, you can download the latest build of the font. It will still say version 5.002. |
Okay your latest build is loaded in my test page, https://stage.aionianbible.org/Debug/Sanskrit-Burmese. Seems like the original problem sequence is corrected. Though maybe others from the unabridged list that still need work? |
For further complication here is what my Burmese speaking friend says... ===
A word is typically structured with one or more consonants with one or more vowels and medials. A missing vowel and or medial can change the meaning of a word. It is possible for some words to have two consonants stacked. The space between words are not critical but it is not good to add a space between a consonant, vowels and medials. Also, you would not want to break consonants from vowels for wrapping to the next line. === I am currently corresponding with her for more explanation about better placement of the medials. |
This document may also help, https://www.loc.gov/catdir/cpso/romanization/burmese.pdf |
All the needed medials for Sanskrit were added. The positioning can be improved, but you can test the font as is. |
In my initial tests of build #613, the medial placement is excellent with the Speedata / Harfbuzz rendering engine. Medials are stacked nicely without clobbering one another and none overflow into the margin. There are a few cases where the medials could be nudged a bit to better align with each other and with their letter, and one case where medials did clash. For example, [image]. Here is the whole document with Padauk5.002 [image]. Now strangely the HTML was not as good. When tested in FF, Edge, and Chrome the medials were better than before, but still not as good as the Microsoft font or as good as the rendering in Speedata / Harfbuzz. You can see examples yourself online. Matthew 4:12 middle word is a good example. When there are multiple medials, some slide too far to the right. https://stage.aionianbible.org/Bibles/Sanskrit---Burmese-Script/Matthew/4 The debug tool is also still available, though I removed Padauk5.001 from the display and show Padauk5.002 only. Thanks for all the good work and let me know how I can help further. |
Any progress with further repair to the font? |
Yes, I improved the stacking of small medials (in OpenType, but not Graphite, so Firefox will not have as good shaping as other browsers for stacking small medials). Your 4th example above and Matt 4:12 should be fixed in OpenType. The first 3 examples still need to be fixed in OpenType, but work in Graphite (so in this case, Firefox should shape the text better than other browsers). I increased the version to 5.003. Please include screenshots (as you have done in many cases). When you write "as good as the rendering in Speedata / Harfbuzz" I cannot see those results. Even if I have the same programs as you do, differences in operating systems and different versions of fonts and applications can make the display different for different people. Please also include the codepoints (or the actual text) that were used to generate the screenshots. |
Side topic with my distro packager hat on: was there supposed to have been a Version 5.002 release for this project? I don't see tags/release artifacts for that. |
Version 5.002 was not intended to be a release (nor is 5.003 or anything a release from now until version 5.100). The 5.003 is just to make this GitHub issue easier to understand so we can refer to a version of the font. |
Okay roger that, fair enough. The OpenType spec not having anything other than XXX.YYY for versioning makes it hard to do anything semantically useful with development and pre releases. |
Starting to look great. And yes as noted above FF fails in some medial placement until Graphite is fixed, but Chrome / Edge are mostly good. Visit this page to compare Padauk with MS Myanmar Text font, https://stage.aionianbible.org/Debug/Sanskrit-Burmese. The page compares the Microsoft Myanmar Text font with Padauk 5.003. HOWEVER, to see the Microsoft font you need to open the browser on a Windows box that has the Myanmar Text font. And perhaps inspect the HTML to confirm that Myanmar Text is loading properly.
I can also confirm that no medials are spilling into the margins in the Speedata / Harfbuzz PDF production at https://stageresources.aionianbible.org/Holy-Bible---Sanskrit---Burmese-Script---Aionian-Edition.pdf. The medial overflow into the margin first alerted me to the problem. So the medials are no longer floating in the margins. However, there may still be some work to fine tune medial placement as shown in the webpage above and also specify medial placement in Graphite (if I understand correctly). Anything else I can do to help this to the finish line? |
For item 30 above you need more context (which one form of your test data had). The larger context is
The first three characters form a kinzi, which is a form of U+1004 MYANMAR LETTER NGA. The kinzi is described in a paragraph in the Myanmar section of chapter 15 of the Unicode Standard. All this works in the OpenType code, but not the Graphite code until ea4e1c4. With the current test data (which is the same as above but without the first and last characters) the kinzi is not going to form (the NGA is needed). The VOWEL SIGN E at the end should not affect the kinzi. |
I think 30 is a sequence only found in John 6:50-51 as " ဘုင်္က္တ္တေ " |
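A small sketch of why the trimmed test string cannot form a kinzi. In the Unicode Standard the kinzi is encoded as U+1004 NGA, U+103A ASAT, U+1039 VIRAMA. The codepoint transcription of the John 6:50-51 word below is my own reading of the text quoted above and should be treated as an assumption; `has_kinzi` is a hypothetical helper.

```python
# Kinzi as encoded in Unicode: NGA + ASAT + VIRAMA.
KINZI = "\u1004\u103A\u1039"

# Assumed transcription of the word quoted above:
# BHA, VOWEL SIGN U, NGA, ASAT, VIRAMA, KA, VIRAMA, TA, VIRAMA, TA, VOWEL SIGN E
WORD = "\u1018\u102F\u1004\u103A\u1039\u1000\u1039\u1010\u1039\u1010\u1031"


def has_kinzi(text: str) -> bool:
    """True if text contains the full three-codepoint kinzi sequence."""
    return KINZI in text


print(has_kinzi(WORD))                        # True
# Without the NGA only ASAT + VIRAMA remain, so no kinzi can form.
print(has_kinzi(WORD.replace("\u1004", "")))  # False
```

A context window that cuts off the NGA therefore tests something different from the word as it appears in the text.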
The lookups for adding extra guard space need to ignore narrow width non-spacing vowels that are above the base characters. The list of such vowels needs to include all possible characters that would be found in a situation requiring extra guard space. There may be more characters that need to be added to this list. This list might be able to be merged with @classUpperVowel. With this addition of another vowel, duplicates of this vowel in text will no longer be shown with a dotted circle, but only in the DFLT shaper, not the much more commonly used mym2 shaper. If desired, such checking could be added to the mym2 lookups. This issue was found while improving the selection of different forms of RA to handle the clashes (the medial is too close to the short RA) in the first three images at #52 (comment). In the case of the forms of RA extra guard space might not be needed, since the correct form of RA has the correct APs to position the medial correctly. The next commit or so will improve the selection of RA forms.
Any further improvement expected? |
Yes. I think I have fixed most issues, the version has been increased to 5.004. Let me know how it looks. |
Looking great and perhaps production ready for me. There are cases where Padauk seems better than MS Myanmar Text and vice versa. Also the Padauk medials generally seem smaller, though maybe that is better, but they are a little harder to read. Not sure. See my debugger at https://stage.aionianbible.org/Debug/Sanskrit-Burmese. When this page is viewed from a Windows box with MS Myanmar Text available you will be able to compare it with Padauk 5.004. A quick check and I saw minor concerns at line 30, 88, 91, 95, and 112. A final thorough comparison of Padauk with Myanmar Text from this webpage ought to get us to the finish line if we are not there already. As an aside I am talking with John Hudson of https://www.tiro.com/ and they are working to get permission to provide extended license permission for the MS Myanmar Text font which he helped to develop. Though likely a fee for that, so I am thankful for all the efforts to get Padauk working properly. |
The small medials in Padauk fit nicely in a wrap (U+103C MYANMAR CONSONANT SIGN MEDIAL RA) due to their, as you noted, quite small size. In the image below, Padauk Book is on the left, Myanmar Text on the right. The medials might be a bit easier to see if you are using Padauk Book instead of Padauk. For item 30, I thought I had addressed that in my comment above. Was I not correct, or was something I wrote not clear? |
RE: item 30, yes I see. Sorry about the repeated concern. And yes, I might try Padauk Book. Please post back if you do make further improvements. So thankful for your good work on this. |
I think I fixed all the remaining clashes. The version is now 5.005. I am now looking into line 91 (which is the same issue I think as line 112) where the medials for U+100B MYANMAR LETTER TTA look very different depending on the font. The glyph in Padauk is similar to the glyph in Pyidaungsu, but very different from the other fonts. So I am not sure that Padauk is wrong. |
5.005 is looking wonderful. Thank you so much for all the effort. My debug page will remain here https://stage.aionianbible.org/Debug/Sanskrit-Burmese. And for further study my Burmese texts remain here https://www.aionianbible.org/Bibles/Sanskrit---Burmese-Script, https://www.aionianbible.org/Bibles/Myanmar---Burmese-Common-Bible, https://www.aionianbible.org/Bibles/Myanmar---Myanmar-Burmese-Judson, and https://www.aionianbible.org/Bibles/Myanmar---Burmese-Judson. Thank you again. |
This harfbuzz/harfbuzz#4784 discussion now moved here and below.
Last relevant comments copied below...
Thanks for working on this. I see that the unicode marks no longer clobber each other, though they are not stacked properly yet either. I also see that there are other unicode marks affected, or that should be affected, by this new rule as well. This might be a much bigger deal.
Here is a picture of the culprit word using Padauk50002 with marks not stacked and rolling into the margin.
Here it is properly displayed with mmrtext.ttf
I am also noticing other errors in words in Matthew 4 with problems.
Another word from Matthew 4 with Padauk50002
Yet here it is displayed in mmrtext.ttf
A summary of the problem is that when there are multiple marking characters the mmrtext.ttf font stacks them so they are all there without clobbering each other. However, Padauk50001 allowed marks to clobber each other, and now Padauk50002 prevents clobbering, but does not stack the marks with the associated letter.
What info can I provide to help?