Preserve html entities and multiple spaces #59

brondsem · 2012-11-09T20:26:01Z

A few commits to address preservation of html entities and multiple spaces, and to fix general escaping that occurs with backticks. More details are in the commit messages.

This allows multiple sequential `&nbsp;` entities to remain multiple spaces, rather than getting collapsed. Within `code` blocks, neither a literal space nor an `&nbsp;` works, so a unicode nbsp char is used, which seems to work in many markdown renderers. This fixes the output of the google doc code section.

brondsem · 2012-12-13T20:18:18Z

Hey, just checking on this. Wondering if this is merge-able or if anything should be changed?

aaronsw · 2012-12-13T20:28:45Z

Sorry, somehow this got lost in the shuffle. I don't think most users of a program like html2text want HTML in their output, so I'm not comfortable merging a patch that will cause HTML to appear in the output by default.

What's your motivation here?

brondsem · 2012-12-13T22:04:02Z

In the first commit, HTML entities are used so that if your source HTML content is about HTML tags and entities, they will stay escaped and not "devolve" to actual tags and entities. For example, text that displays as a literal `&copy;` or `<b>foo</b>` will no longer turn into © and a bold "foo" (which render very differently from what the original HTML renders as).
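To illustrate the escaping idea, here is a minimal sketch, not the code from the commit. The helper name `escape_text` is hypothetical, and it relies only on the standard library's `html.escape`, with quote escaping turned off so that only `&`, `<` and `>` are touched.

```python
import html


def escape_text(text):
    """Hypothetical helper: re-escape &, < and > in text decoded from HTML.

    Without this step, literal "&copy;" or "<b>foo</b>" in the source text
    would be emitted raw and then interpreted by the markdown renderer.
    """
    # quote=False leaves " and ' alone; only &, < and > are escaped.
    return html.escape(text, quote=False)


print(escape_text("&copy; or <b>foo</b>"))
# &amp;copy; or &lt;b&gt;foo&lt;/b&gt;
```

Text inside backticks would be skipped by such a helper, which is what the second commit's code flag is for.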

The second commit doesn't add HTML to the markdown output.

The third commit preserves `&nbsp;` from the HTML into the markdown. This is illustrated in the GoogleDocMassDownload files, in which there were already two spaces between "human" and "being". Previously, that was getting collapsed into one space. Now it'll preserve the two spaces. The downside to this is illustrated in "nbsp.md", in which the `&nbsp;` entities from the HTML are carried through to the markdown unnecessarily. They could be a regular space and everything would render consistently with the original HTML render. Perhaps this should go under the "escape snob" flag.
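The behaviour described here can be sketched roughly as follows. This is an assumption-laden illustration, not the patch itself: `normalize_spaces` and its `in_code` parameter are hypothetical names, and the real change lives inside html2text's parser state rather than a standalone function.

```python
import re

NBSP = "\u00a0"  # the character that &nbsp; decodes to


def normalize_spaces(text, in_code=False):
    """Hypothetical helper: collapse ordinary whitespace but keep nbsp."""
    # Use an explicit character class: in Python 3, \s also matches U+00A0,
    # which would collapse the very spaces we want to preserve.
    text = re.sub(r"[ \t\r\n]+", " ", text)
    if in_code:
        # Inside code, keep the raw unicode nbsp character.
        return text
    # Outside code, emit the entity so the renderer keeps the space.
    return text.replace(NBSP, "&nbsp;")


print(normalize_spaces("human \u00a0being"))
# human &nbsp;being
```

Gating the entity output behind an option, as suggested for the "escape snob" flag, would only require skipping the final `replace` when that option is off.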

My overall rationale for this is that we're importing a large amount of content into a markdown-based system, so we want to maintain accuracy to the original content. Specifically, we're using this within SourceForge as we upgrade projects from our legacy platform to our new platform. Much of the SourceForge forum and ticket content is technical, so there are literal HTML entities we need to preserve, as well as code snippets that have lines indented with many spaces (consecutive `&nbsp;` entities).

Thanks

brondsem added 3 commits November 9, 2012 13:15

escape &<> so that entities don't disappear during conversion

b76cbe3

set code flag properly so that escaping is not done within backticks

08e0168

smblackburn mentioned this pull request Apr 10, 2015

Retain escaping of html except within code or pre tags. Alir3z4/html2text#57

Merged

pombredanne pushed a commit to pombredanne/html2text that referenced this pull request Oct 10, 2015

Merge pull request aaronsw#59 from smblackburn/master

9e2991c

Support for image sizing using raw html Thanks @smblackburn

theSage21 mentioned this pull request Jul 11, 2016

unexpanded < > & Alir3z4/html2text#109

Closed
