Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nested link and orphan characters #431

Closed
wants to merge 16 commits into from

Conversation

gzagatti
Copy link
Contributor

This PR fixes issue #430.

To fix this issue, I do not allow nested markups in regular links. However, in the future we need to deal with markup on the description of the link as per the syntax specification.

Second, I created a list of orphan characters that might appear inside of a markup that does not allow nested markups. Once out of such markup, I skip any orphan character with an end capture.

Let's say we have the following:

zoo =and *boo and= bar* and *goo zar*

Then Treesitter would collect the following: =and *boo and=, *boo and= bar* and *goo zar*. In this case, we should skip *boo and= bar*.

Similarly, if we have:

This is =a *1 b= and *ab cd* foo

Then Treesitter would collect the following =a *1 b=, *1 b= and * and *ab cd*. In this case, we should skip *1 b= and *.

Feel free to test against additional corner cases. All highlights are correct in the file below:

* Circle

Let a circle with radius \( r \), then the perimeter is given by \(2\pi r\) and the area by \( π * r^2 \).

Some examples:
 - Small circle \( 1 \), \( A = \pi \)
 - Big circle \(1000 \), \(A = \pi * 1000^2\)
 - this is /slanted/
 - a \test
 - a \goof{test}

One more thing \(\).

\( a + b \) and foo

\( a \) b

The book is on the table \enlargethispage{2\baselineskip} and \( \mathbb{R} + 1 \) and \foo{bar} 

And \foo{ bar } and \foo{bar } and \foo{bar} and \foo{} and \foo{bar and foo}

hello \foo{bar} and (\(1 - 2\)) and \alpha and \boo and \foo. And (\foo)

Foobar \(n\)-th character, -th should not be higlighted. 

This should only highlight =\ab= : \ab12

This should not be math: =\( 1 + 1\)= ~\(1 + 1\)~ 

This should only be verbatim: =*1*=

This should only be latex: \( 1 + =*2*= \) \(a ~2~ \)

This should not get higlighted: \2ab

T : *\( 1 + *2* \)*

This is *important* and ~foo(x) = x + 1~ foo /bar/

Bold and verbatim *=abc=* slanted and verbatim /=foo=/ bold and slanted */foo/* bold, slanted and verbatim */=abc=/*

This is \( *1 + /1/ \) and *ab =a= cd* and *fo /f f f/ =abc=  and =abc *ab*= a* *=ab=*

abc =a *a* b=

This is =a *1  b= and *ab cd* foo

This is =a *1 b= and* b

This is =a *1* b= and *ab cd* foo

This is [[https://foobar/foo]] foo/

@kristijanhusak
Copy link
Member

@gzagatti on previous PR you wrote this:

Allow for proper nested markup highlight. The syntax specification says text markup except for code and verbatim should accept any elements from the standard set, which sort of contradicts their own specification that says that text markup should only be preceded by blank space or a set of pre-defined characters (perhaps we can open an issue on the syntax specification upstream). This ensures that =foo= and foo =bar= foo have the same highlight. Also, Emacs and Pandoc export =foo= as bolded code.

Can you give me a better example of this? I don't seem to understand what is actually nestable and what is not.
I did some tests in Emacs orgmode, and it seems everything is nesting almost everywhere.
For example, this text:

zoo =and *boo /and= bar/ test* and *goo zar*

*foo =bar= foo*

Renders like this in emacs:
screenshot

Second and has bold, italic and verbatim applied to self. Shouldn't some of the markup be missing from it because it's also a verbatim?

Also, second line has a bolded verbatim, which means verbatim is nesting inside the bold.
I also tried =foo *bold* bar= and it did a bold inside verbatim.

I don't think we need to do all this tracking of nested entries, we just need to figure out what to clear or not in this loop https://github.com/nvim-orgmode/orgmode/blob/master/lua/orgmode/colors/markup_highlighter.lua#L320-L328 .

When two entries are matched, this loop clears all "opened" markers that are found between them. If we figure out what we can leave here, other changes might not be necessary. Of course, for links we need to do some additional work, currently I'm talking about markup.

@kristijanhusak
Copy link
Member

I did a more detailed comparison with Emacs, and I noticed few things:

screenshot

  • Latex \word{something} is highlighted only up to { in Emacs, while we highlight the whole thing up to }
  • Line 22, This should only highlight =\ab= : \ab12 also highlights 1 after ab in Emacs, while we don't highlight numbers
  • Line 32, bolded math, in Emacs it's bolded only up to 2, while we bold the whole thing
  • Last line, Emacs bolds a inside verbatim, we don't.

Also, on line 28, if you give it some spacing around 2 in the latex, it will apply bold to it, and even combination of bold + italic if you use non-verbatim (In Emacs):

screenshot2

I think we can solve some of the nested highlights (like verbatim inside latex) by setting a proper priority to a extmark. LateX should always have greater priority over verbatim and code, but not greater priority over other markup.

@gzagatti
Copy link
Contributor Author

Thanks for the detailed comments.

First let me go back to the syntax specification. It says that a text markup is composed of the following objects:

PRE MARKER BORDER BODY BORDER MARKER POST

None of the elements above are separated by whitespace characters.

The pattern BORDER BODY BORDER can be denoted as CONTENT. The CONTENT depends on the markup type:

  1. If the markup is code or verbatim, all its content should be considered plain text.

  2. Otherwise, all of its content should be considered as a series of objects from the standard set, i.e. it can contain nested org content.

BORDER should be a non-whitespace character.

Now let me try to tackle some examples to illustrate some points --- all of the images are obtained after exporting the org file to html using Emacs:

  • In =a + 1= , a + 1 should be considered verbatim.

image

  • In =(*foo 1 2)= bar*, (*foo 1 2) which could easily be Lisp code should be considered verbatim. Next, bar* should be plain text. However Emacs will parse this text incorrectly, as it will take ( as verbatim, *foo 12) as bold verbatim and bar* as bold.

image

  • In zoo =and *boo /and= bar/ test* and *goo zar* the same logic as in the previous bullet point applies, so and *boo /and should be verbatim, goo zar should be bold, no other markup should apply. However, Emacs will again parse this text incorrectly.

image

  • In abc =a *a* b= the same logic as in the previous bullet point applies, so a *a* b should be verbatim, no other markup should apply. However, Emacs will again parse this text incorrectly.

image

Now, let's pick a nestable markup:

  • In *a =foo= b*, a =foo=b should be bold and =foo= should be bold and verbatim.

image

  • In *=foo=*, =foo= should be bold and verbatim. You might say that this is wrong, because =foo= was preceded by * which is not included in the PRE markers. However, in this case we should consider that = was preceded by the beginning of a line and thus the markup is correct.

image

Moving on to LaTex markup. Here the specification says that a LaTex fragment should follow on the the patterns below:

\NAME BRACKETS
\( CONTENTS \)
\[ CONTENTS \]

None of the elements above are separated by whitespace characters. But note that CONTENTS can begin with whitespaces.

Fragments like \foo and \foo{bar} follow the first pattern above, since BRACKETS is optional.

NAME consists of alphabetic characters, so it excludes numeric characters. Therefore \ab12 should only highlight \ab. Again, in this case the Emacs implementation is incorrect.

The reason why numeric characters are excluded is because in LaTex macros can only be composed of alphabetic characters. The LaTex wiki states:

Name your new command \wbalTwo and not \wbal2 as digits cannot be used to name macros — invalid characters will error out at compile-time.

Next, concerning \foo{bar}. It is a matter of personal taste whether to include the text {bar} in the highlight or not. Since the syntax specification says that this is just plain text, I prefer to highlight to make this clear to the user.

Now, we can consider some additional examples --- all of the images are obtained after exporting the org file to html using Emacs:

  • In \( a + 1 \), a + 1 should be LaTex.

image

  • In \( 1 + = *2* = \) \(a ~2~ \), both 1 + 2 = *2* = and a ~2~ should be LaTex. No additional markup should apply because LaTex markup is non-nestable. Again, Emacs will parse this text incorrectly. Note that ~ is a special character in LaTex so it does not show up in the exported image.

image

Finally, there should be no priority of verbatim inside of LaTex because is non-nestable. So what appears as verbatim in LaTex fragment should be parsed as plain text. The same applies for LaTex in verbatim, and for any markup inside non-nestable markups.

In summary, Emacs parses some in-line markup incorrectly, specifically it fails when there are overlapping markups like:

*foo /ab cd* ef/

The specification says that the content of a nestable markup should contain elements of the standard set of objects, i.e. it can contain nested org content. It is thus ambiguous with regard to the above construction. I would argue that all nested markup should be closed within its parent. So, to obtain the desired result above we should have:

*foo /ab cd/* /ef/

Fortunately, all html exports from Emacs agree to that. The implementation in this PR is much closer to the markup in the exported html files and follow the syntax specification more strictly.

I hope that this clarifies most of the points you raised above. Let me know if anything else is unclear. I will then try to tackle:

I don't think we need to do all this tracking of nested entries, we just need to figure out what to clear or not in this loop https://github.com/nvim-orgmode/orgmode/blob/master/lua/orgmode/colors/markup_highlighter.lua#L320-L328 .

@kristijanhusak
Copy link
Member

Thank you for the detailed explanation and examples.

One sentence caught my attention:

However Emacs will parse this text incorrectly, as it will take ( as verbatim, foo 12) as bold verbatim and bar as bold

Does this mean that html export is more correct than Emacs itself? I'm trying to figure out what source should we trust the most?

  • Emacs
  • Syntax specification
  • Exports

My thought was that first there was an Emacs implementation, from which syntax specification and exports were created.
Some of the things in the html export does really make sense, like having "raw strings" in the verbatim/code, but Emacs just doesn't behave that way. I'm not familiar with Emacs/Orgmode history to know how things were developed, and in which order.

I did try to export some of these into PDF via org-latex-to-pdf, and results are similar, so it seems that Emacs does really have less correct highlighting than others.

@gzagatti
Copy link
Contributor Author

I am also not very familiar with Emacs/Orgmode history. I have been using it for about a year now and have converted all my notes to orgmode after having tried many different markup languages.

According to the authors of the syntax specification:

This document describes and comments on Org syntax as it is currently read by its parser (org-element.el) and, therefore, by the export framework. This is intended as a technical document for developers and those particularly interested in the syntax. Most users will be better served by the Org manual.

My interpretation is that the syntax tries to follow the export rules as close as possible and should be the source of truth for developers. Therefore, I would set my trust in the following order: (1) syntax specification, (2) exports and (3) Emacs.

Also, according to org-element.el the syntax can be represented as a list. So, overlapping markups are not correct.

Therefore, when the specification says that the content of a markup can be a series of objects from the standard set, it means that all elements must be fully contained in the markup.

When writing this, I have thought about these additional corner cases:

This *Ab *ab cd*

This *Ab *ab* cd*

This *Ab *ab* cd**

This *Ab *ab cd**

I think I should be able to tackle them by tweaking this PR.

@kristijanhusak
Copy link
Member

Ok, lets follow the syntax specification then. It does make more sense.

I think I should be able to tackle them by tweaking this PR.

Would you be willing to do some refactorings, that I also did as part of #432 ?
I have few things on my mind:

  1. Merge latex_result and result since it's the same thing, just don't apply concealing for latex (I did this in my PR)
  2. Track ranges per line, and then merge them at the end into a single result to iterate and apply extmarks.
  3. Do not track nesting, but instead clear things once we come to the end of the line.

I think if we do things per line we could potentially handle nesting and conflicts more easily. Basically in your edge case example:
This *Ab *ab cd*, we could go from first * to last *, and once we hit the end of line, we can figure out that second * is basically not a valid marker. Of course, this is theory, implementation is more complex.
I don't have much experience writing parsers, so I'm open for suggestions.

If you think it's not possible to do it that way, lets just try to avoid string.gmatch and similar regex stuff as much as possible, since this code runs on every redraw, and I would love to have it as performant as possible.

@gzagatti
Copy link
Contributor Author

I will try to follow your suggestions and see if I can implement until next week.

@pleshevskiy
Copy link

pleshevskiy commented Oct 25, 2022

I used to solve a similar problem with brackets

(()(()))
 ^
 open_stack = ( (

  ^ 
  open_stack = (
   we close one bracket so we pop last bracket
   from the stack and wrap whole text between

Copy link
Member

@kristijanhusak kristijanhusak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some refactoring to the markup highlighter file to make it a bit more readable. It was a bit hard to read due to all types being in single loop.
Please adapt your changes to it, and remove the highlight group changes to headlines.

@@ -1196,14 +1196,6 @@ For adding/changing TODO keyword colors see [org-todo-keyword-faces](#org_todo_k
* `OrgTSCheckboxChecked`: A checkbox checked with either `[x]` or `[X]`
* `OrgTSCheckboxHalfChecked`: A checkbox checked with `[-]`
* `OrgTSCheckboxUnchecked`: A empty checkbox
* `OrgTSHeadlineLevel1`: Headline at level 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets leave all these highlight changes for a separate PR.

setlocal foldmethod=expr
setlocal foldexpr=nvim_treesitter#foldexpr()
setlocal foldtext=OrgmodeFoldText()
" setlocal fillchars+=fold:\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be uncommented

@@ -213,6 +213,15 @@ local function load_deps()
vim.treesitter.query.add_predicate('org-is-valid-latex-range?', is_valid_latex_range)
end

-- a function that splits a string on '.'
local function split(string)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a vim.split helper function that you can use instead.

@kristijanhusak kristijanhusak force-pushed the master branch 2 times, most recently from 9094379 to abff0dc Compare November 6, 2023 20:52
@kristijanhusak
Copy link
Member

Latest master changes should address all of these issues. If you find anything else please open up a separate issue. thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants