Preserve html entities and multiple spaces #59

Open
wants to merge 3 commits into
base: master

brondsem
Contributor

@brondsem brondsem commented Nov 9, 2012

A few commits to preserve HTML entities and multiple spaces, and to fix the general escaping that occurs with backticks. More details are in the commit messages:

This allows multiple sequential &nbsp; entities to still be
multiple spaces, rather than getting collapsed.

Within `code` blocks, neither a literal space nor a &nbsp; works,
so a Unicode nbsp char is used, which seems to work in many markdown
renderers. This fixes the output of the google doc code section.

@brondsem
Contributor Author

Hey, just checking on this. Wondering if this is merge-able or if anything should be changed?

@aaronsw
Owner

aaronsw commented Dec 13, 2012

Sorry, somehow this got lost in the shuffle. I don't think most users of a program like html2text want HTML in their output, so I'm not comfortable merging a patch that will cause HTML to appear in the output by default.

What's your motivation here?

@brondsem
Contributor Author

In the first commit, HTML entities are used so that if your source HTML content is about HTML tags and entities, they will stay escaped and not "devolve" to actual tags and entities. For example, &amp;copy; or &lt;b&gt;foo&lt;/b&gt; will no longer turn into &copy; and <b>foo</b> (which render very differently from how the original HTML renders).
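
To illustrate the difference, here is a minimal sketch using only the Python standard library. It is not the code from the commit; `naive_text` and `preserving_text` are hypothetical names chosen for this example, and the input is the text as it appears in the HTML source.

```python
import html


def naive_text(source: str) -> str:
    """Decode entities once, as an HTML parser does, and emit the result as-is."""
    return html.unescape(source)


def preserving_text(source: str) -> str:
    """Decode entities, then re-escape &, < and > so that literal markup
    in the original stays literal when the markdown is rendered."""
    return html.escape(html.unescape(source), quote=False)


for sample in ("&amp;copy;", "&lt;b&gt;foo&lt;/b&gt;"):
    print(naive_text(sample), "vs", preserving_text(sample))
```

With the naive version the output contains `&copy;` and `<b>foo</b>`, which a markdown renderer will turn into a copyright sign and bold text. The re-escaped version keeps them as visible text.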

The second commit doesn't add HTML to the markdown output.

The third commit preserves &nbsp; from the HTML into the markdown. This is illustrated in the GoogleDocMassDownload files, in which there were already two spaces between "human" and "being". Previously, that was getting collapsed into one space. Now it'll preserve the two spaces. The downside to this is illustrated in "nbsp.md", in which the &nbsp; entities from the HTML are carried through to the markdown unnecessarily. They could be a regular space and everything would render consistently with the original HTML render. Perhaps this should go under the "escape snob" flag.
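
Roughly, the behaviour is something like the sketch below. This is a simplified stand-in rather than the patch itself; `collapse_whitespace`, `to_markdown_spaces` and the `in_code` parameter are hypothetical names used only here.

```python
import re

NBSP = "\u00a0"  # the character that &nbsp; decodes to


def collapse_whitespace(text: str) -> str:
    """Collapse runs of ordinary whitespace to one space.

    The character class is explicit because Python's \\s also matches
    U+00A0 in str patterns, which would swallow the nbsp characters.
    """
    return re.sub(r"[ \t\r\n]+", " ", text)


def to_markdown_spaces(text: str, in_code: bool = False) -> str:
    """Keep non-breaking spaces: as the Unicode character inside code,
    and as the &nbsp; entity everywhere else."""
    collapsed = collapse_whitespace(text)
    if in_code:
        return collapsed
    return collapsed.replace(NBSP, "&nbsp;")


print(to_markdown_spaces("human " + NBSP + "being"))
print(repr(to_markdown_spaces("if x:\n" + NBSP * 4 + "pass", in_code=True)))
```

A regular space followed by a non-breaking space survives as two visible spaces, and an indented code line keeps its four leading non-breaking spaces.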

My overall rationale for this is that we're importing a large amount of content into a markdown-based system, so we want to maintain accuracy to the original content. Specifically, we're using this within SourceForge as we upgrade projects from our legacy platform to our new platform. Lots of SourceForge forum and ticket content is technical, so there are literal HTML entities we need to preserve, as well as code snippets that have lines indented with many spaces (consecutive &nbsp; entities).

Thanks

pombredanne pushed a commit to pombredanne/html2text that referenced this pull request Oct 10, 2015
Support for image sizing using raw html

Thanks @smblackburn