SPEC 12: Formatting mathematical expressions #326

Open
wants to merge 23 commits into
base: main

tupui
Member

@tupui tupui commented Jun 7, 2024

This follows the proposal on the forum https://discuss.scientific-python.org/t/spec-12-formatting-mathematical-expressions

See other linked discussions as well.

@tupui tupui added the New SPEC label Jun 7, 2024
@tupui
Member Author

tupui commented Jun 7, 2024

cc @mdhaber @stefanv @jarrodmillman @j-bowhay and add yourself as co-authors of course 😉

-a
```

- Within a group, if operators with different priorities are used, add whitespace around the operators with the lowest priority(ies).
Contributor

@mdhaber mdhaber Jun 7, 2024

According to the definition of "group" above, it is not possible for operators within a group to have different priority.

I believe the definition of "group" was supposed to be something more like a sequence of operations that relies on implicit order of operations rules. Examples include:

  • a logical line
  • operations within parentheses
  • the expression and for/if clauses of a list comprehension

Comment on lines 28 to 30
We define a _group_ as a collection of operators having the same priority.
e.g. `a + b + c` is a single group, `a + b * c` is composed of two groups `a`
and `b * c`. A group is also a collection delimited with parenthesis.

Comment on lines 77 to 80
(
a/b
+ c*d
)
Contributor

@mdhaber mdhaber Jun 7, 2024

I think this should show an example where splitting the line makes sense. We would not want to suggest that a / b + c*d should be split across four lines. Consider referring to PEP8 "Should a Line Break Before or After a Binary Operator".

I'm not sure if the use of the term "logical block" is correct. This is a single logical line split across multiple physical lines.
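For reference, the PEP 8 section mentioned above recommends breaking before binary operators. The snippet below reproduces the shape of PEP 8's own example; the numeric values are placeholders assumed only so that it runs.

```python
# Placeholder values, assumed only so that the snippet runs.
gross_wages = 50_000.0
taxable_interest = 120.0
dividends = 300.0
qualified_dividends = 100.0
ira_deduction = 2_000.0
student_loan_interest = 500.0

# One logical line split across several physical lines. Each operator
# starts a continuation line, next to the operand it applies to.
income = (gross_wages
          + taxable_interest
          + (dividends - qualified_dividends)
          - ira_deduction
          - student_loan_interest)
```

Here the split is justified by the number of terms; a short expression such as `a/b + c*d` would stay on one line.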


```
a/d*b**c
a*(b**c)/d
Contributor

This doesn't appear to follow the rule.

Member Author

How so?

Comment on lines 63 to 64
If this is technically an issue (e.g. restriction on the AST), add
parenthesis or spaces.
Contributor

@mdhaber mdhaber Jun 7, 2024

This rule appears to conflict with the previous rule about space around operators * and /.

I think the reason for an exception - if there needs to be one - should be more explicit. I remember discussing in person that linting tools might check the AST to ensure that it is not modified by an auto-correction, but this is not something that the user will necessarily be thinking about. A user might be ordering sequences of operations in a particular way to get floating point arithmetic to do what they want. If they are tempted to break the rules to do so:

  • the linter does not have to be able to make the correction automatically
  • the user is welcome to use parentheses
  • the user is welcome to declare an exception to the rule with noqa

I think the rules need to be more complete before we can assess whether there is a need for an exception, though.
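A minimal sketch of the kind of AST check described above, assuming a tool compares the parsed source before and after a proposed correction. The helper name `same_ast` is hypothetical and not part of any linter's API.

```python
import ast


def same_ast(before: str, after: str) -> bool:
    """Return True if both sources parse to the same AST.

    ast.dump() omits line/column attributes by default, so whitespace and
    redundant parentheses do not affect the comparison.
    """
    return ast.dump(ast.parse(before)) == ast.dump(ast.parse(after))


# Whitespace-only change: the AST is unchanged, so it is safe to automate.
whitespace_only = same_ast("a * b ** c / d", "a*b**c/d")

# Reordering operands changes the AST, so a tool would have to skip it
# (or leave it to the user, parentheses, or a noqa comment).
reordered = same_ast("a*b**c/d", "a/d*b**c")

# Why the order can matter to the author: float addition is not associative.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)
```

With this check, `whitespace_only` is true and `reordered` is false, and the two float sums differ in the last digit.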

Member Author

@tupui tupui Jun 8, 2024

A general thought: rules could conflict if we ask for them to be applied one after the other in a strict order.

The principle of these rules is that users should not have to care about them at all, or even learn them. The linters are here for that. If a user does something in a certain order for arithmetic reasons and there is a reordering happening, then we can ask tools to either not reorder or provide a skip for a given rule. See the last point in the notes.

(a*b) * (c*d)
```

- Operators within a group are ordered from the lowest to the highest priority.
Contributor

I'm not sure "group" here is being used in the same sense as elsewhere. I remember discussing in person the idea that a/d*b**c is preferable to a*b**c/d unless there are explicit parentheses, like (a*b**c)/d, but it would not be wrong to do a/d*b**c + e/f*g**h, yet the plus comes after higher priority operators.

```

- There is no space before and after operators `*` and `/`. Only exception is if the expression consist of a single operator linking two groups with more than one
element.
Contributor

There seems to be another exception below.
https://github.com/scientific-python/specs/pull/326/files#r1631763770

According to "Only exception is if the expression consist of a single operator linking two groups", there is no exception for:

(a*b)*(c*d)*(e*f)

because there are three groups? Or do you mean that you need a space when any binary operator is linking two explicit "groups" enclosed by parentheses?

Member Author

In your example it means no spaces because we have 3 groups.

And if we have 2 groups then each group must have at least one operator in it (ie not just a variable or single number).
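A minimal sketch of the spacing this reply describes, assuming the draft rule is read literally. The variable values are placeholders, and the draft wording was still under discussion at this point.

```python
# Placeholder values, assumed only so that the snippet runs.
a, b, c, d, e, f = 1.0, 2.0, 3.0, 4.0, 5.0, 6.0

# Exactly two parenthesized groups, each containing an operator:
# the single linking operator gets surrounding spaces.
two_groups = (a*b) * (c*d)

# Three groups: per the reply above, the exception does not apply,
# so there are no spaces around the linking operators.
three_groups = (a*b)*(c*d)*(e*f)

# Two bare operands are not "groups" in this sense: no spaces.
bare_operands = a*b
```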

Comment on lines 105 to 106
i = i + 1
submitted += 1
Contributor

Why are these bad?

Member Author

They are good; for now I just copy-pasted the whole block from Black. But yes, we should remove the correct ones from there.

Member Author

@tupui tupui left a comment

Thanks Matt. If you are going to be an author, feel free to just go ahead and commit changes 😉

@tupui
Member Author

tupui commented Aug 11, 2024

Hi @charliermarsh @ambv 👋 I had pinged you both on Twitter some years back about a standard for mathematical equations. At the time you both said this could interest you for both Ruff and Black if we, the Scientific Python community, could come to an agreement.

To be clear, I am of course in no position to ask for any commitment from you and I just hope that you find the topic interesting enough.

This is getting into shape and we have a preliminary draft we would like to present to you 😃

Any comments would greatly help and in the end this can also only work if both Ruff and Black would be able and willing to implement this specification 🙏

Thanks again both!

@charliermarsh

Awesome, I look forward to reading + engaging here.

@mdhaber
Contributor

mdhaber commented Aug 11, 2024

Thanks @charliermarsh. I think the main question right now is whether this is close to being precise enough to be implementable. For example, I don't have much background with ASTs, so perhaps you can suggest ideas that would replace notions like the "implicit subexpressions" I attempted to define. Once we have a better understanding of how to write this standard so that it's implementable, we will want to get feedback from a wider audience about adjustments to the particular rules.

surrounding whitespace. For example, prefer `-x**4` over `- (x ** 4)`.
2. Always surround non-PEMDAS operators with whitespace, and always make the priority of
non-PEMDAS operators explicit. For example, prefer `(x == y) or (w == t)` over
`x==y or w==t`.[^1]

Does this imply that we should add parentheses here (as in: is this an exception to Rule 0)?

Contributor

Yup, this is intended to be an example of "otherwise specified".

levels increases and when multiple non-PEMDAS operators are involved. Portions of this
acronym, namely MD and AS, will be used below to refer to the corresponding operators.

## Implementation

Should any of these rules differ based on the expression type? E.g., all of the examples below use Name nodes (like x, y, etc.). What if the expression uses calls or subscripts or similar? Like f() ** 2?

Contributor

We might want to add examples, but no - to keep things simple, I didn't consider changing the rules based on that.
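To make the question concrete, here is a small sketch using the standard `ast` module to show what kind of node sits on the left of `**` in a few cases. The helper name `power_base_kind` is hypothetical; the point is only that a rule written in terms of operators would apply to all of these alike.

```python
import ast


def power_base_kind(source: str) -> str:
    """Return the AST node type name of the left operand of a `**` expression."""
    node = ast.parse(source, mode="eval").body
    if not (isinstance(node, ast.BinOp) and isinstance(node.op, ast.Pow)):
        raise ValueError(f"not a power expression: {source!r}")
    return type(node.left).__name__


# Same operator, different operand node types.
kinds = {
    src: power_base_kind(src)
    for src in ("x ** 2", "f() ** 2", "a[0] ** 2", "obj.attr ** 2")
}
```

The resulting mapping pairs the four sources with `Name`, `Call`, `Subscript`, and `Attribute` respectively.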

A "subexpression" is subset of an expression that is either explicit or could
be made explicit (i.e. with parentheses) without affecting the order of
operations. In the example above, `j` and `range(1, i + 1)` can also be
referred to as explicit subexpressions of the whole expression, and `1` and

"...of the whole expression"

Assuming I'm reading this correctly, I would be careful here as for j in range(1, i + 1) is not an "expression" in the sense of the Python AST. for is a kind of "statement".

Contributor

Can we correct this as simply as:

Suggested change
referred to as explicit subexpressions of the whole expression, and `1` and
referred to as explicit subexpressions of the whole statement, and `1` and

?
I think the point I was trying to make would still be made.
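The statement/expression distinction raised here can be checked directly with the standard `ast` module. This is only an illustration of the terminology, not something the SPEC text itself contains.

```python
import ast

# A `for` loop needs a body to parse; `pass` is used as a placeholder.
tree = ast.parse("for j in range(1, i + 1):\n    pass")
loop = tree.body[0]

# The loop as a whole is a statement node, not an expression node.
is_statement = isinstance(loop, ast.stmt)
is_expression = isinstance(loop, ast.expr)

# Its target (`j`) and iterable (`range(1, i + 1)`) are expressions.
parts_are_expressions = (
    isinstance(loop.target, ast.expr) and isinstance(loop.iter, ast.expr)
)
```

So `j` and `range(1, i + 1)` are expressions contained in a `for` statement, which matches the suggested wording "of the whole statement".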

explicit as in `x + (y*z)` without changing the order of operations. However, `x + y`
would not be a subexpression because `(x + y)*z` would change the order of operations.
Note that `x + y*z` as a whole may also be referred to as a "subexpression" rather than
an "expression" even though `(x + y*z)` is not a proper subset of the whole.

What does "proper subset of the whole" mean? Sorry!

Contributor

A proper subset of a set $A$ is a subset of $A$ that is not equal to $A$.

This is saying that even though an expression like (x + y*z) might be the entire expression in question, we might still refer to it as a "subexpression" when the distinction is not important.

Does this need a clarification, or do you think it would be clear to those familiar with the term "proper subset"?

Thanks, this makes sense! I think it will be clear to those that are familiar with the term.
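One possible way to relate the "subexpression" wording to the AST, offered as a sketch rather than as the SPEC's definition: every binary-operation node in the parsed tree corresponds to a piece that could be parenthesized without changing the order of operations. The helper name `binop_subexpressions` is hypothetical, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


def binop_subexpressions(source: str) -> list[str]:
    """List the binary-operation nodes of an expression as source text.

    Note: ast.unparse() normalizes whitespace, so the strings returned
    do not preserve the spacing of the input.
    """
    tree = ast.parse(source, mode="eval")
    return [ast.unparse(n) for n in ast.walk(tree) if isinstance(n, ast.BinOp)]


subs = binop_subexpressions("x + y*z")
```

For `x + y*z` this yields two entries, the whole expression and `y * z`. There is no entry for `x + y`, in line with the statement that `x + y` is not a subexpression here.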

@mdhaber
Contributor

mdhaber commented Sep 12, 2024

> How do you identify mathematical expressions? Does it solely depend on the use of an operator (compare, binary, ...)?

I think it would depend solely on the use of the operators. In that example of file paths, the formatting would be the same with these rules as what you've shown - spaces around the division operator since it is a "simple" (as opposed to compound) expression. There may be cases in which these rules would suggest funny things for overloaded operators, but my guess is that in most cases, it would just mean a # noqa here and there.
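A short sketch of the file-path case, assuming `pathlib` is the kind of overloaded `/` being discussed. The path and file name are made up for illustration.

```python
import ast
from pathlib import PurePosixPath

# A "simple" expression: a single operator, so the spaces are kept.
root = PurePosixPath("/srv")
data_file = root / "data.csv"

# Syntactically, path joining and numeric division are the same operator
# node, so a tool that looks only at operators cannot tell them apart.
path_op = type(ast.parse('root / "data.csv"', mode="eval").body.op).__name__
math_op = type(ast.parse("a / b", mode="eval").body.op).__name__
```

Both `path_op` and `math_op` are `Div`, which is why the rules would treat the two uses identically.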

> One goal of Black's and Prettier's formatting style is to be predictable and easy to understand (to avoid cases where you're like: Uhhm, why did it do this?). That's why they try to avoid context-specific formatting where e.g. an expression gets formatted differently when embedded in a different context (exceptions apply).

Sure. There are costs to the simplicity, though. PEP8 explicitly recommends "If operators with different priorities are used, consider adding whitespace around the operators with the lowest priority(ies)". For the sake of simplicity (presumably), Black/E225 just ignore this recommendation. The downside is that some find this makes the resulting expression more difficult to read. The goal of this SPEC is to capture the spirit of the PEP8 recommendation in rules. They will not be as simple as "add whitespace around all operators", but some may find that the benefits outweigh this cost.
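For comparison, the snippet below sets two of the expressions PEP 8 gives under that recommendation next to the same expressions with whitespace around every binary operator, which is the uniform style the comment above attributes to Black/E225. The variable values are placeholders.

```python
# Placeholder values, assumed only so that the snippet runs.
x, y, a, b = 3.0, 4.0, 5.0, 2.0

# Spacing shown in PEP 8: whitespace only around the lowest-priority operator.
hypot2 = x*x + y*y
c = (a+b) * (a-b)

# Uniform spacing: whitespace around every binary operator.
hypot2_uniform = x * x + y * y
c_uniform = (a + b) * (a - b)
```

The two styles are equivalent to the interpreter; the disagreement is only about which one reads better.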

> Have you looked at how other languages like Rust, Dart, JS, format mathematical expressions?

  • Rust rules are here. e.g. "Do include spaces around binary ops." "Use parentheses liberally; do not necessarily elide them due to precedence."
  • Dart rules are here. e.g. "Spaces around binary and ternary operators." "Place binary operators on the preceding line in a multi-line expression." "No spaces around unary operators."
  • JS rules are here. e.g. "a single internal ASCII space also appears in the following places only...4. On both sides of any binary or ternary operator."

So, for the whitespace question, they all have essentially the same recommendation as Black and E225: always surround operators with whitespace. Others may have different goals, but one of my motivations here was to avoid that blanket rule. I think one of the reasons LaTeX is preferred by many for mathematical documents is that it goes to some lengths to make the whitespace look (subjectively) "good". We aren't going that far here (and can't, as we're assuming a monospaced font), but it's a small step in this direction.

> My biggest concern with Ruff's and Black's formatting is the nested expression...

Because of 8) "If line breaks must occur within a compound subexpression, the break should be placed before the operator with lowest priority." and "If there are multiple candidates, include the break at the first opportunity.", these (and some other variations) would satisfy the rules prescribed here:

(aaaaaaaaa
 + bbbbbbbbbbbb*ccccccccccccccc*ddddddddddddddd*xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 + yyyyyyyyyyyy + zzzzzzzzzzzzzz)

(
 aaaaaaaaa
 + bbbbbbbbbbbb*ccccccccccccccc*ddddddddddddddd*xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 + yyyyyyyyyyyy
 + zzzzzzzzzzzzzz
)

@lucascolley
Contributor

> We can, and should, get this in now and it would be marked as a draft at a minimum.

Is this ready to go in as draft @stefanv ?

@stefanv
Member

stefanv commented Sep 28, 2024

@lucascolley Thanks for the ping. Yes, it's definitely in good enough shape to shop around for discussion.

Note that there is one unclosed LaTeX expression, which would be good to fix first.

Some personal remarks:

  • It feels like there could be more clarity between (1) and (5) around when brackets may be / should be inserted to improve an expression.
  • a + x*y**3 is rather hard to parse for my eyes, would probably have preferred a + x * y**3, but could be just me, and I suppose we'll get used to it. It's kind of funny to see how the pendulum swings. In the early days, everything was compressed: a+x*y**3, then spaces started creeping in, eventually expressions were fully spaced: a + x * y ** 3. Now the tide is rolling back again :)
  • The definition section is a bit daunting, with all the descriptions of expressions, subexpressions, etc.; may be worth thinking about how to simplify that introduction, or move it to the end (most of the language is fairly intuitive, where it is used, so should be OK to have a glossary for reference?)
  • Intro: "This leads to individual interpretation and styles which may conflict with those of others."; that sentence can perhaps be clarified along the lines of: "This leads to varying, even conflicting, mathematical expression styles across the ecosystem."

Anyone on here should feel free to move the PR out of draft status and merge, when they're ready. Or give me the thumbs-up and I'll do so.

@lucascolley
Contributor

lucascolley commented Sep 28, 2024

> a + x*y**3 is rather hard to parse for my eyes

I shared this concern, but after speaking to Matt I think it's fine considering the specification in (5) that there must be just a single ** in the rightmost position if there is one. I think it's difficult to parse only because I'm used to expecting the operators to appear in any order, rather than having all expressions formatted in a standard way.

And I think the rules don't forbid developers to include extra parentheses in cases like this if they would like to be extra clear.

@lucascolley
Contributor

lucascolley commented Sep 28, 2024

Thanks again @MichaReiser for the points you raised above. I've had a little more time to think about them and I think you did a great job of explaining the challenges of adopting this in Ruff.

>   • We decided not to implement rules that conflict with the formatter because formatted code should never raise new lint violations. This property is essential to integrate the formatter into check.
>   • Some of the proposed style guide rules do conflict with the formatter.
>     The style guide conflicts with existing rules, and we can't just remove them. We have users who depend on them, even if the rules might go beyond what PEP-8 specifies. Arguably, we should never have added those rules, but here we are.

One idea that I'm not sure has been considered yet would be a completely separate mode of the formatter + linter. The start point would be implementing this specification as lint rules and being able to format code to comply with it, but in isolation (just this spec). If that could be achieved, then any rules and formatting that do not conflict with this spec could then be added into this separate mode.

This separate mode would never be able to offer a superset of ruff's regular functionality. But it could still be enough for us.

>   • Implementing a formatting style guide is a multi-week, if not multi-month, project. It also increases the complexity to push any new formatting changes because they now need to fit into and be tested with multiple style guides.

Of course, the main problem is always time. At least with this suggestion, the separate mode wouldn't block new formatting changes from integrating with "normal" ruff. That work would have to happen at some point if we wanted those changes in the separate mode, but in theory it could be done at a way later point.

>   • I'm very hesitant about introducing a new style guide into the formatter because it would undo some of Black's accomplishments in establishing a widely agreed upon style guide.

Makes sense, although the context for this discussion is that Black’s accomplishments have been largely unable to reach the Scientific Python community due to concerns like this and the formatting of arrays. So we would at least be extending past Black’s accomplishments in some ways :)

Now I can see a lot of arguments for why a completely separate mode like this wouldn’t belong in Ruff’s API but be a separate package which uses Ruff’s machinery under the hood. Maybe that is what we’re looking for.


Still, I think there would be value in translating the non-conflicting parts of this spec to additional lint rules. I’m just trying to be ambitious :)

@mdhaber
Contributor

mdhaber commented Sep 29, 2024

> Note that there is one unclosed LaTeX expression, which would be good to fix first.

Done.

> It feels like there could be more clarity between (1) and (5) around when brackets may be / should be inserted to improve an expression.

Can you give an example expression?

> a + x*y**3 is rather hard to parse for my eyes

Perhaps you're happy with Black's rules for mathematical expressions, and that's fine. I got the sense that others weren't, though, so the rules here attempt to balance the conflicting desires:

  • include whitespace, which makes it easier to distinguish multi-character variables from one another and from operators, and
  • remove whitespace to visually reinforce the order of operations (as typeset math does and PEP8 calls for)

without requiring the addition of lots of parentheses in expressions involving the common PEMDAS operators.

When I type `$a+x \cdot y^3$`, it renders as $a+x \cdot y^3$, adding more space around the addition than the multiplication, and none for the exponentiation. (And of course, we usually omit the $\cdot$ operator unless there's a reason to add it for emphasis, so $xy$ usually renders without any space.) There might be something to be said for including different amounts of whitespace for different priority operators, e.g. x + y * x**3, and I'd be happy to suggest a set of rules based on that. But assuming that's not in the cards and we only get one whitespace around operators, I preferred to use it to show that the multiplication happens before the addition even though the addition operation appears first on the line.

That's the rationale, but other viewpoints (e.g. "whitespace is only omitted around the highest-priority operator of a compound expression") are perfectly valid, too, so I don't think we'll be able to reason out which is best. I think it may be a matter of either adopting a mostly-baked proposal like this one or putting forth equally complete alternatives and voting. Or maybe we accept something like this as a draft and vote on specific changes like that.

> The definition section is a bit daunting...

Ah, right, Pamphile mentioned that before, and we discussed moving it to a reference section at the end. I'll do that.

> Intro...

Yeah, I'll take another look at that.

> And I think the rules don't forbid developers to include extra parentheses in cases like this if they would like to be extra clear.

The first rule is "do not add extraneous parentheses", but the last is "Any of the preceding rules may be broken if there is a clear reason to do so." So if the rules lead you to an expression that is particularly difficult to parse, break them. But a + (x * y**3) is not recommended because the rules would permit a + x*y**3.

(Another set of rules I considered would be based on "Whitespace is only omitted around the highest-priority operator of a compound expression" and something like "Expressions with more than two operator priority levels must make subexpressions with two operator priority levels explicit." This would mean you would have to write a + (x * y**3). But I didn't think people would want all those parentheses.)

@lucascolley
Contributor

> The first rule is "do not add extraneous parentheses", but the last is "Any of the preceding rules may be broken if there is a clear reason to do so."

I guess I was alluding to the possibility that a formatter wouldn't add parentheses by default, but also wouldn't take them away if they are already there.

@mdhaber mdhaber marked this pull request as ready for review October 12, 2024 19:59
@stefanv
Member

stefanv commented Oct 12, 2024

pre-commit.ci autofix

@stefanv
Member

stefanv commented Oct 12, 2024

  • It feels like there could be more clarity between (1) and (5) around when brackets may be / should be inserted to improve an expression.

Sorry, that should have been (0) and (5).

It's not super intuitive to me when you can / should / must use brackets; but it may well be that the rules are comprehensive.

Wouldn't it be OK to always allow stylistic brackets, for aesthetic grouping, or grouping that is logical to the author?

[This discussion is not a blocker to merging the PR, btw.]

@mdhaber
Contributor

mdhaber commented Oct 12, 2024

Yes, see rule 9.

@stefanv
Member

stefanv commented Oct 13, 2024

> Yes, see rule 9.

"Any of the preceding rules may be broken if there is a clear reason to do so."

@mdhaber
Contributor

mdhaber commented Oct 13, 2024

Yes? Because you asked:

> Wouldn't it be OK to always allow stylistic brackets, for aesthetic grouping, or grouping that is logical to the author?

It even shows examples of adding parentheses that aren't necessary.

@stefanv
Member

stefanv commented Oct 13, 2024

I just posted the quote so others don't have to look it up.

Contributor

@mdhaber mdhaber left a comment

@stefanv Math still didn't render. Any thoughts?

image

@stefanv
Member

stefanv commented Oct 14, 2024

> @stefanv Math still didn't render. Any thoughts?

image

Requires a theme update. Fix has already been made. /cc @jarrodmillman

6 participants