> A semantic line break SHOULD occur after an […] em dash (—).
I agree with this; however, it means that no existing markup language supports semantic line breaks, because every last one of them just turns the break into a space—and em dashes are, in most locales, not to be surrounded by spaces. Consequently, you’ll end up with a stray space if you do this.
My irritation at being unable to break after an em dash (which I want to do quite frequently) was one of the things that sent me down the path of designing my own lightweight markup language (LML), to fix this and other problems I observe with existing LMLs. I’ve been using it for all my personal writing for something like four years now (though a fair bit has changed since then), and I expect to finally have a functioning parser before the end of this year.
One of the other fun complications of this kind of line break in source code is languages that don’t have a word divider—inserting a space at all is incorrect in them.
CSS presently just leaves such decisions UA-defined <https://drafts.csswg.org/css-text-4/#line-break-transform>:
> any remaining segment break is either transformed into a space (U+0020) or removed depending on the context before and after the break. The rules for this operation are UA-defined in this level.
My LML currently turns a segment break into a space, except when the line ends with an en or em dash that is not itself preceded by a colon or a space. I haven’t got anything in place for languages with no word separator yet, but it is unusually well-suited to such languages.
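For illustration only, here is a minimal Python sketch of that joining rule as I read it from the description above. The function name `join_segment_breaks` and the assumption that each line arrives already stripped of trailing whitespace are mine, not taken from the actual LML parser.

```python
DASHES = ("\u2013", "\u2014")  # en dash, em dash


def join_segment_breaks(lines):
    """Join the source lines of one paragraph into rendered text.

    Hypothetical helper: assumes `lines` holds the lines of a single
    paragraph, each without trailing whitespace.
    """
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        if i == len(lines) - 1:
            break  # no segment break after the last line
        ends_with_dash = line.endswith(DASHES)
        before_dash = line[-2:-1]  # character before the dash, "" if none
        spaced_dash = before_dash in (" ", ":")
        if ends_with_dash and not spaced_dash:
            continue  # closed-up dash: drop the segment break entirely
        out.append(" ")  # default: segment break becomes a space
    return "".join(out)
```

A line ending in a closed-up dash is glued to the next line, while a dash that already has a space or colon before it keeps the usual space after it. No language metadata is consulted; the author's own spacing before the dash is the only signal.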
I'm a dyed-in-the-wool, responsive, readable, internationalizable, accessible, standards-based, enshyenist:
Instead of using an unbreakable em dash to rigidly and unbreakably connect two phrases by their last and first words, I prefer using an en dash, followed by a shy hyphen, and then another en dash, to elegantly hyphenate words connected by em dashes when they don't fit on the line. ;)
–­–
Few fonts will render this nicely; the dashes are unlikely to join. Also if it does break at the soft hyphen, you’ve got an extraneous hyphen added on the first line.
If I were doing that, I’d probably use a zero-width space instead of a soft hyphen. Same break opportunity, removes the extraneous undesirable hyphen if it breaks, but introduces a new word boundary so that wordwise selection can now split your wonky dash. Therefore I suggest <span style=user-select:all>–​–</span> because if you’re going to do something ridiculous you might as well embrace the ridiculosity.
More folks should define their own lightweight markup languages! It’s fun and makes your writing and notes feel more like your own.
I created a convention for defining sub-notes (with frontmatter) in a Markdown note and have found it really helpful over the past few years.
I used to do this with RST, though a backslash is needed at the end of the line to escape the newline.
I don’t like reStructuredText’s backslash behaviour, because it means two completely different things. Or arguably three. Normally it means to interpret the next character literally, but if it’s followed by whitespace (typically space or newline) it instead removes that next character. Except… not entirely in the case of newline, because it’s character-level markup, and at the end of a block it just does nothing. In
you might expect to get “a b” or an error, but actually you get a single-item definition list with term “a” and definition “b”, just the same as if you had omitted the backslash.
A far more logical meaning of a trailing backslash is to escape the newline, meaning, in HTML terms, insert <br>. That’s what I chose in my LML, and I later learned CommonMark chose that too.
> meaning of a trailing backslash is to escape the newline
That's what it does in this example. Don't have to use other cases, and don't believe I did.
In hindsight “escape” was a poor choice of word, but I did explain it and you omitted that from your quote: “meaning, in HTML terms, insert <br>”. And that’s not what reStructuredText does. Rather, at the end of a line, backslash acts like a line continuation character (… that only works in certain circumstances). That behaviour is commonly found in programming languages, at least inside string literals, but such languages aren’t using backslash as “escape the next character”; they have a fixed set of escape sequences like \n or \uXXXX.
> em dashes are, in most locales, not to be surrounded by a space
This is definitely not the case for at least French and Russian, which means markup renderers now have to guess text language or force authors to declare such in some metadata header. And it gets even more complicated with inclusion of block quotes in different languages.
It’s not hard and doesn’t need language awareness; I described how to detect it: if there’s no space before an end-of-line em dash, suppress the segment-break-replacing space.
There seem to be some locales or styles that use asymmetric spacing. From the Zen of Python—note different spacing based on context and position within the sentence:
You have missed a joke: https://bugs.python.org/issue3364.
Unicode has U+200B ZERO WIDTH SPACE for that purpose. In HTML and hence Markdown you can also use `<wbr>`. If you’re using a custom setup anyway, you can have it be inserted automatically by regex replacement, as a pre-rendering step.
I think you’ve misunderstood something? This is about suppressing the turning of a segment break into a space, not about line break opportunities.
> Unicode has U+200B ZERO WIDTH SPACE for that purpose.
ZWSP is not at all “for that purpose”. If you mean this:
Well, I am mildly surprised to find that no extra space is added in Gecko or Blink. But in WebKit, a space is still added; for this is part of the “UA-defined” bit I quoted.
And if you’re willing to do preprocessing, you can just merge the lines, that’d actually work.
> In HTML and hence Markdown you can also use `<wbr>`.
I fail to see how <wbr> is relevant.
Indeed, I skimmed a bit and misread “unable to break” to mean that you wanted a line-break opportunity but the renderer didn’t allow for it when a letter is directly following an em dash. But it’s the other way around, you want a line break in the source after an em dash to not translate into a space in the rendering. This would likewise be possible to handle by regex replacement as a pre-rendering step.
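As a rough sketch of what such a pre-rendering pass could look like in Python: the name `merge_dash_breaks` is hypothetical, and the pattern assumes the same "no space or colon before the dash" rule described upthread. It works on raw source text and knows nothing about code blocks, lists, or headings, so it would need guarding in a real pipeline.

```python
import re

# An en or em dash at end of line, not preceded by whitespace or a colon,
# followed by exactly one newline and then non-blank text.
_DASH_BREAK = re.compile(r"(?<![\s:])([\u2013\u2014])\n(?=\S)")


def merge_dash_breaks(source: str) -> str:
    """Remove the newline after a closed-up dash so no space is rendered."""
    return _DASH_BREAK.sub(r"\1", source)
```

Blank lines between paragraphs are left alone because the lookahead requires a non-whitespace character right after the newline.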
More generally, I see markup languages and the details of how they are rendered as largely orthogonal. You don’t necessarily need to invent a different markup language in order to adjust the rendering.
> More generally, I see markup languages and the details of how they are rendered as largely orthogonal. You don’t necessarily need to invent a different markup language in order to adjust the rendering.
There’s not much to a markup language beyond how it’s rendered. If you don’t ever want to render it to something other than plain text, just write plain text however you desire. The reason for choosing a particular markup language is to express intended semantics (for plain-text and rendered use), and to render it. The semantics aspect is legitimate, so I won’t say the language and rendering are identical or parallel, but they’re definitely nothing like orthogonal. If you’re using a CommonMark pipeline, any preprocessing you do means you’re not actually writing in CommonMark, but an incompatible variant of it. You may well deem it worthwhile, but it’s no longer the same markup language.