Nested link and orphan characters #431

gzagatti · 2022-10-24T15:30:40Z

This PR fixes issue #430.

To fix this issue, I do not allow nested markups in regular links. However, in the future we need to deal with markup on the description of the link as per the syntax specification.

Second, I created a list of orphan characters that might appear inside of a markup that does not allow nested markups. Once out of such markup, I skip any orphan character with an end capture.

Let's say we have the following:

zoo =and *boo and= bar* and *goo zar*

Then Treesitter would collect the following: =and *boo and=, *boo and= bar* and *goo zar*. In this case, we should skip *boo and= bar*.

Similarly, if we have:

This is =a *1 b= and *ab cd* foo

Then Treesitter would collect the following =a *1 b=, *1 b= and * and *ab cd*. In this case, we should skip *1 b= and *.

Feel free to test against additional corner cases. All highlights are correct in the file below:

* Circle

Let a circle with radius \( r \), then the perimeter is given by \(2\pi r\) and the area by \( π * r^2 \).

Some examples:
 - Small circle \( 1 \), \( A = \pi \)
 - Big circle \(1000 \), \(A = \pi * 1000^2\)
 - this is /slanted/
 - a \test
 - a \goof{test}

One more thing \(\).

\( a + b \) and foo

\( a \) b

The book is on the table \enlargethispage{2\baselineskip} and \( \mathbb{R} + 1 \) and \foo{bar} 

And \foo{ bar } and \foo{bar } and \foo{bar} and \foo{} and \foo{bar and foo}

hello \foo{bar} and (\(1 - 2\)) and \alpha and \boo and \foo. And (\foo)

Foobar \(n\)-th character, -th should not be higlighted. 

This should only highlight =\ab= : \ab12

This should not be math: =\( 1 + 1\)= ~\(1 + 1\)~ 

This should only be verbatim: =*1*=

This should only be latex: \( 1 + =*2*= \) \(a ~2~ \)

This should not get higlighted: \2ab

T : *\( 1 + *2* \)*

This is *important* and ~foo(x) = x + 1~ foo /bar/

Bold and verbatim *=abc=* slanted and verbatim /=foo=/ bold and slanted */foo/* bold, slanted and verbatim */=abc=/*

This is \( *1 + /1/ \) and *ab =a= cd* and *fo /f f f/ =abc=  and =abc *ab*= a* *=ab=*

abc =a *a* b=

This is =a *1  b= and *ab cd* foo

This is =a *1 b= and* b

This is =a *1* b= and *ab cd* foo

This is [[https://foobar/foo]] foo/

kristijanhusak · 2022-10-24T15:49:45Z

@gzagatti on previous PR you wrote this:

Allow for proper nested markup highlight. The syntax specification says text markup except for code and verbatim should accept any elements from the standard set, which sort of contradicts their own specification that says that text markup should only be preceded by blank space or a set of pre-defined characters (perhaps we can open an issue on the syntax specification upstream). This ensures that =foo= and foo =bar= foo have the same highlight. Also, Emacs and Pandoc export =foo= as bolded code.

Can you give me a better example of this? I don't seem to understand what is actually nestable and what is not.
I did some tests in Emacs orgmode, and it seems everything is nesting almost everywhere.
For example, this text:

zoo =and *boo /and= bar/ test* and *goo zar*

*foo =bar= foo*

Renders like this in emacs:

Second and has bold, italic and verbatim applied to self. Shouldn't some of the markup be missing from it because it's also a verbatim?

Also, second line has a bolded verbatim, which means verbatim is nesting inside the bold.
I also tried =foo *bold* bar= and it did a bold inside verbatim.

I don't think we need to do all this tracking of nested entries, we just need to figure out what to clear or not in this loop https://github.com/nvim-orgmode/orgmode/blob/master/lua/orgmode/colors/markup_highlighter.lua#L320-L328 .

When two entries are matched, this loop clears all "opened" markers that are found between them. If we figure out what we can leave here, other changes might not be necessary. Of course, for links we need to do some additional work, currently I'm talking about markup.

kristijanhusak · 2022-10-24T16:48:51Z

I did a more detailed comparison with Emacs, and I noticed few things:

Latex \word{something} is highlighted only up to { in Emacs, while we highlight the whole thing up to }
Line 22, This should only highlight =\ab= : \ab12 also highlights 1 after ab in Emacs, while we don't highlight numbers
Line 32, bolded math, in Emacs it's bolded only up to 2, while we bold the whole thing
Last line, Emacs bolds a inside verbatim, we don't.

Also, on line 28, if you give it some spacing around 2 in the latex, it will apply bold to it, and even combination of bold + italic if you use non-verbatim (In Emacs):

I think we can solve some of the nested highlights (like verbatim inside latex) by setting a proper priority to a extmark. LateX should always have greater priority over verbatim and code, but not greater priority over other markup.

gzagatti · 2022-10-25T03:09:26Z

Thanks for the detailed comments.

First let me go back to the syntax specification. It says that a text markup is composed of the following objects:

PRE MARKER BORDER BODY BORDER MARKER POST

None of the elements above are separated by whitespace characters.

The pattern BORDER BODY BORDER can be denoted as CONTENT. The CONTENT depends on the markup type:

If the markup is code or verbatim, all its content should be considered plain text.
Otherwise, all of its content should be considered as a series of objects from the standard set, i.e. it can contain nested org content.

BORDER should be a non-whitespace character.

Now let me try to tackle some examples to illustrate some points --- all of the images are obtained after exporting the org file to html using Emacs:

In =a + 1= , a + 1 should be considered verbatim.

In =(*foo 1 2)= bar*, (*foo 1 2) which could easily be Lisp code should be considered verbatim. Next, bar* should be plain text. However Emacs will parse this text incorrectly, as it will take ( as verbatim, *foo 12) as bold verbatim and bar* as bold.

In zoo =and *boo /and= bar/ test* and *goo zar* the same logic as in the previous bullet point applies, so and *boo /and should be verbatim, goo zar should be bold, no other markup should apply. However, Emacs will again parse this text incorrectly.

In abc =a *a* b= the same logic as in the previous bullet point applies, so a *a* b should be verbatim, no other markup should apply. However, Emacs will again parse this text incorrectly.

Now, let's pick a nestable markup:

In *a =foo= b*, a =foo=b should be bold and =foo= should be bold and verbatim.

In *=foo=*, =foo= should be bold and verbatim. You might say that this is wrong, because =foo= was preceded by * which is not included in the PRE markers. However, in this case we should consider that = was preceded by the beginning of a line and thus the markup is correct.

Moving on to LaTex markup. Here the specification says that a LaTex fragment should follow on the the patterns below:

\NAME BRACKETS
\( CONTENTS \)
\[ CONTENTS \]

None of the elements above are separated by whitespace characters. But note that CONTENTS can begin with whitespaces.

Fragments like \foo and \foo{bar} follow the first pattern above, since BRACKETS is optional.

NAME consists of alphabetic characters, so it excludes numeric characters. Therefore \ab12 should only highlight \ab. Again, in this case the Emacs implementation is incorrect.

The reason why numeric characters are excluded is because in LaTex macros can only be composed of alphabetic characters. The LaTex wiki states:

Name your new command \wbalTwo and not \wbal2 as digits cannot be used to name macros — invalid characters will error out at compile-time.

Next, concerning \foo{bar}. It is a matter of personal taste whether to include the text {bar} in the highlight or not. Since the syntax specification says that this is just plain text, I prefer to highlight to make this clear to the user.

Now, we can consider some additional examples --- all of the images are obtained after exporting the org file to html using Emacs:

In \( a + 1 \), a + 1 should be LaTex.

In \( 1 + = *2* = \) \(a ~2~ \), both 1 + 2 = *2* = and a ~2~ should be LaTex. No additional markup should apply because LaTex markup is non-nestable. Again, Emacs will parse this text incorrectly. Note that ~ is a special character in LaTex so it does not show up in the exported image.

Finally, there should be no priority of verbatim inside of LaTex because is non-nestable. So what appears as verbatim in LaTex fragment should be parsed as plain text. The same applies for LaTex in verbatim, and for any markup inside non-nestable markups.

In summary, Emacs parses some in-line markup incorrectly, specifically it fails when there are overlapping markups like:

*foo /ab cd* ef/

The specification says that the content of a nestable markup should contain elements of the standard set of objects, i.e. it can contain nested org content. It is thus ambiguous with regard to the above construction. I would argue that all nested markup should be closed within its parent. So, to obtain the desired result above we should have:

*foo /ab cd/* /ef/

Fortunately, all html exports from Emacs agree to that. The implementation in this PR is much closer to the markup in the exported html files and follow the syntax specification more strictly.

I hope that this clarifies most of the points you raised above. Let me know if anything else is unclear. I will then try to tackle:

I don't think we need to do all this tracking of nested entries, we just need to figure out what to clear or not in this loop https://github.com/nvim-orgmode/orgmode/blob/master/lua/orgmode/colors/markup_highlighter.lua#L320-L328 .

kristijanhusak · 2022-10-25T08:37:00Z

Thank you for the detailed explanation and examples.

One sentence caught my attention:

However Emacs will parse this text incorrectly, as it will take ( as verbatim, foo 12) as bold verbatim and bar as bold

Does this mean that html export is more correct than Emacs itself? I'm trying to figure out what source should we trust the most?

Emacs
Syntax specification
Exports

My thought was that first there was an Emacs implementation, from which syntax specification and exports were created.
Some of the things in the html export does really make sense, like having "raw strings" in the verbatim/code, but Emacs just doesn't behave that way. I'm not familiar with Emacs/Orgmode history to know how things were developed, and in which order.

I did try to export some of these into PDF via org-latex-to-pdf, and results are similar, so it seems that Emacs does really have less correct highlighting than others.

gzagatti · 2022-10-25T09:33:13Z

I am also not very familiar with Emacs/Orgmode history. I have been using it for about a year now and have converted all my notes to orgmode after having tried many different markup languages.

According to the authors of the syntax specification:

This document describes and comments on Org syntax as it is currently read by its parser (org-element.el) and, therefore, by the export framework. This is intended as a technical document for developers and those particularly interested in the syntax. Most users will be better served by the Org manual.

My interpretation is that the syntax tries to follow the export rules as close as possible and should be the source of truth for developers. Therefore, I would set my trust in the following order: (1) syntax specification, (2) exports and (3) Emacs.

Also, according to org-element.el the syntax can be represented as a list. So, overlapping markups are not correct.

Therefore, when the specification says that the content of a markup can be a series of objects from the standard set, it means that all elements must be fully contained in the markup.

When writing this, I have thought about these additional corner cases:

This *Ab *ab cd*

This *Ab *ab* cd*

This *Ab *ab* cd**

This *Ab *ab cd**

I think I should be able to tackle them by tweaking this PR.

kristijanhusak · 2022-10-25T09:59:43Z

Ok, lets follow the syntax specification then. It does make more sense.

I think I should be able to tackle them by tweaking this PR.

Would you be willing to do some refactorings, that I also did as part of #432 ?
I have few things on my mind:

Merge latex_result and result since it's the same thing, just don't apply concealing for latex (I did this in my PR)
Track ranges per line, and then merge them at the end into a single result to iterate and apply extmarks.
Do not track nesting, but instead clear things once we come to the end of the line.

I think if we do things per line we could potentially handle nesting and conflicts more easily. Basically in your edge case example:
This *Ab *ab cd*, we could go from first * to last *, and once we hit the end of line, we can figure out that second * is basically not a valid marker. Of course, this is theory, implementation is more complex.
I don't have much experience writing parsers, so I'm open for suggestions.

If you think it's not possible to do it that way, lets just try to avoid string.gmatch and similar regex stuff as much as possible, since this code runs on every redraw, and I would love to have it as performant as possible.

gzagatti · 2022-10-25T10:02:54Z

I will try to follow your suggestions and see if I can implement until next week.

pleshevskiy · 2022-10-25T10:17:57Z

I used to solve a similar problem with brackets

(()(()))
 ^
 open_stack = ( (

  ^ 
  open_stack = (
   we close one bracket so we pop last bracket
   from the stack and wrap whole text between

kristijanhusak

I did some refactoring to the markup highlighter file to make it a bit more readable. It was a bit hard to read due to all types being in single loop.
Please adapt your changes to it, and remove the highlight group changes to headlines.

kristijanhusak · 2023-11-06T20:38:18Z

DOCS.md

@@ -1196,14 +1196,6 @@ For adding/changing TODO keyword colors see [org-todo-keyword-faces](#org_todo_k
  * `OrgTSCheckboxChecked`: A checkbox checked with either `[x]` or `[X]`
  * `OrgTSCheckboxHalfChecked`: A checkbox checked with `[-]`
  * `OrgTSCheckboxUnchecked`: A empty checkbox
-  * `OrgTSHeadlineLevel1`: Headline at level 1


Lets leave all these highlight changes for a separate PR.

kristijanhusak · 2023-11-06T20:38:40Z

ftplugin/org.vim

-setlocal foldmethod=expr
-setlocal foldexpr=nvim_treesitter#foldexpr()
-setlocal foldtext=OrgmodeFoldText()
+" setlocal fillchars+=fold:\ 


This needs to be uncommented

kristijanhusak · 2023-11-06T20:38:58Z

lua/orgmode/colors/markup_highlighter.lua

@@ -213,6 +213,15 @@ local function load_deps()
  vim.treesitter.query.add_predicate('org-is-valid-latex-range?', is_valid_latex_range)
 end

+-- a function that splits a string on '.'
+local function split(string)


There is a vim.split helper function that you can use instead.

kristijanhusak · 2024-02-14T20:14:15Z

Latest master changes should address all of these issues. If you find anything else please open up a separate issue. thanks!

gzagatti added 5 commits October 24, 2022 17:37

partial implementation of nested markup in links.

052cb8f

deal with orphan markups.

68f5d43

linting.

a5024f7

alternate orphan.

08d0c73

deal with orphaned markups.

b8ef244

gzagatti mentioned this pull request Oct 24, 2022

Nested markup and orphan characters #430

Closed

kristijanhusak mentioned this pull request Oct 24, 2022

Fix nesting markup in links #432

Closed

joaomsa and others added 11 commits March 9, 2023 15:49

Add level and is_archived to OrgHeadline (nvim-orgmode#437)

91669e5

[docgen] Update doc/orgmode_api.txt

7ec04ca

Calendar next/prev supports count prefix (nvim-orgmode#438)

7cc01f4

Merge remote-tracking branch 'origin/master' into nested-link

c9889c6

fix issues merging with master.

824aa58

fix merge issues part ii.

8463554

Merge remote-tracking branch 'origin/master' into nested-link

9c8073f

do not set foldlevel in ftplugin

0b7d984

Merge remote-tracking branch 'origin/master' into nested-link

377a823

uses treesitter fold from neovim instead

e90e31e

fix injection language, allow spell highlight

2513880

kristijanhusak requested changes Nov 6, 2023

View reviewed changes

kristijanhusak force-pushed the master branch 2 times, most recently from 9094379 to abff0dc Compare November 6, 2023 20:52

kristijanhusak closed this Feb 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nested link and orphan characters #431

Nested link and orphan characters #431

gzagatti commented Oct 24, 2022

kristijanhusak commented Oct 24, 2022

kristijanhusak commented Oct 24, 2022

gzagatti commented Oct 25, 2022

kristijanhusak commented Oct 25, 2022

gzagatti commented Oct 25, 2022

kristijanhusak commented Oct 25, 2022

gzagatti commented Oct 25, 2022

pleshevskiy commented Oct 25, 2022 •

edited

Loading

kristijanhusak left a comment

kristijanhusak Nov 6, 2023

kristijanhusak Nov 6, 2023

kristijanhusak Nov 6, 2023

kristijanhusak commented Feb 14, 2024

Nested link and orphan characters #431

Nested link and orphan characters #431

Conversation

gzagatti commented Oct 24, 2022

kristijanhusak commented Oct 24, 2022

kristijanhusak commented Oct 24, 2022

gzagatti commented Oct 25, 2022

kristijanhusak commented Oct 25, 2022

gzagatti commented Oct 25, 2022

kristijanhusak commented Oct 25, 2022

gzagatti commented Oct 25, 2022

pleshevskiy commented Oct 25, 2022 • edited Loading

kristijanhusak left a comment

Choose a reason for hiding this comment

kristijanhusak Nov 6, 2023

Choose a reason for hiding this comment

kristijanhusak Nov 6, 2023

Choose a reason for hiding this comment

kristijanhusak Nov 6, 2023

Choose a reason for hiding this comment

kristijanhusak commented Feb 14, 2024

pleshevskiy commented Oct 25, 2022 •

edited

Loading