CLRF vs LF strikes again. Partly at least.

I wonder why even have a max line length limit in the first place? I.e. is this for a technical reason or just display related?

Wait, now we have to deal with Carriage Line Return Feeds too?

I wonder if the person who had the idea of virtualizing the typewriter carriage knew how much trouble they would cause over time.

Yeah, and using two bytes for a single line termination (or separation or whatever)? Why make things more complicated and take more space at the same time?

Remember that back in the mists of time, computers used typewriter-esque machines for user interaction and text output. You had to send a CR followed by an LF to go to the next line on the physical device. Storing both characters in the file meant the OS didn't need to insert any additional characters when printing. Having two separate characters let you do tricks like overstriking (just send CR, no LF).
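
The overstrike trick is easy to play with even now: on a modern terminal a bare CR just moves the cursor back to column 0 without advancing, so the next write lands on top of the previous one (you overwrite rather than overstrike, but it's the same CR-without-LF mechanism). A minimal Python sketch, assuming output goes to a terminal rather than a file:

    import sys
    import time

    # '\r' alone returns to column 0 without a line feed,
    # so each write replaces the previous one on the same line.
    for pct in range(0, 101, 20):
        sys.stdout.write(f"\rprogress: {pct:3d}%")
        sys.stdout.flush()
        time.sleep(0.2)
    sys.stdout.write("\n")  # only now do we advance to the next line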

True, but I don’t think there was a common reason to ever send a linefeed without going back to the beginning. Were people printing lots of vertical pipe characters at column 70 or something?

It would’ve been far less messy to make printers process linefeed like \n acts today, and omit the redundant CR. Then you could still use CR for those overstrike purposes but have a 1-byte universal newline character, which we almost finally have today now that Windows mostly stopped resisting the inevitable.

> now that Windows mostly stopped resisting the inevitable

I've been trying to get Visual Studio to stop mucking with line endings and encodings for years. I've searched and set all the relevant settings I could find, including using a .editorconfig file, but it refuses to be consistent. Someone please tell me I'm wrong and there's a way to force LF and UTF-8 no-BOM for all files all the time. I can't believe how much time I waste on this, mainly so diffs are clean.
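
For reference, what I mean is a completely standard .editorconfig, something like this at the repo root (charset = utf-8 is the no-BOM variant; utf-8-bom is the one with a BOM), and it still doesn't stick:

    root = true

    [*]
    # utf-8 here means "no BOM"; utf-8-bom is the variant with one
    charset = utf-8
    end_of_line = lf
    insert_final_newline = true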

Ugh, I didn't realize it was still that bad.

How far can you get with setting core.autocrlf on your machine? See https://git-scm.com/book/en/v2/Customizing-Git-Git-Configura...
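
Concretely, something along these lines, assuming you want LF everywhere including the working tree; the .gitattributes line lives in the repo itself, so it wins over whatever contributors have configured locally:

    # per machine: commit LF, never convert to CRLF on checkout
    git config --global core.autocrlf input

    # per repo, in a .gitattributes file at the root:
    * text=auto eol=lf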

As I understand it (this may be apocryphal but I've seen it in multiple places) the print head on simple-minded output devices didn't move fast enough to get all the way back over to the left before it started to output the next character. Making LF a separate character to be issued after CR meant that the line feed would happen while the carriage was returning, and then it's ready to print the next character. This lets you process incoming characters at a consistent rate; otherwise you'd need some way to buffer the characters that arrived while the CR was happening.

Now, if you want to use CR by itself for fancy overstriking etc. you'd need to put something else into the character stream, like a space followed by a backspace, just to kill time.

I don't think that's right. Not saying that to argue, more to discuss this because it's fun to think about.

In any event, wouldn't you have to either buffer or use flow-control to pause receiving while a CR was being processed? You wouldn't want to start printing the next line's characters in reverse while the carriage was going back to the beginning.

My suspicion is there was a committee that was more bent on purity than practicality that day, and they were opposed to the idea of having CR for "go to column 0" and newline for "go to column 0 and also advance the paper", even though it seems extremely unlikely you'd ever want "advance the paper without going to column 0" (which you could still emulate with newline + tab or newline + 43 spaces for those exceptional cases).

I've seen this explanation multiple times through the years, but as I said it's entirely possible it was just a post-hoc thing somebody came up with. But as you said, it's fun to argue/think about, so here's some more. I'm talking about the ASR-33 because they're the archetypal printing terminal in my mind.

If you look at the schematics for an ASR-33, there's just 2 transistors in the whole thing (https://drive.google.com/file/d/1acB3nhXU1Bb7YhQZcCb5jBA8cer...). Even the serial decoding is done electromechanically (per https://www.pdp8online.com/asr33/asr33.shtml), and the only "flow control" was that if you sent XON, the teletype would start the paper tape reader -- there was no way, as far as I can tell, for the teletype to ask the sender to pause while it processes a CR.

These things ran at 110 baud. If you can't do flow control, your only option if CR takes more than 1/10th of a second is to buffer... but if you can't do flow control, and the computer continues to send you stuff at 110 baud, you can't get that buffer emptied until the computer stops sending, so each subsequent CR will fill your buffer just a little bit more until you're screwed. You need the character following CR (which presumably takes about 2/10ths of a second) to be a non-printing character... so splitting out LF as its own thing gives you that and allows for the occasional case where doing a linefeed without a carriage return is desirable.
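
Putting rough numbers on that, assuming the usual 11-bit frame (1 start + 7 data + 1 parity + 2 stop bits):

    # Back-of-the-envelope timing for a 110 baud teletype like the ASR-33.
    baud = 110
    bits_per_char = 11                                     # 1 start + 7 data + 1 parity + 2 stop

    char_time = bits_per_char / baud                       # 0.1 s per character on the wire
    print(f"{char_time * 1000:.0f} ms per character")      # 100 ms
    print(f"{2 * char_time * 1000:.0f} ms during CR+LF")   # 200 ms for the carriage to get home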

Curious Marc (https://www.curiousmarc.com/mechanical/teletype-asr-33) built a current loop adapter for his ASR-33, and you'll note that one of the features is "Pin #32: Send extra NUL character after CR (helps to not loose first char of new line)" -- so I'd guess that on his old and probably worn-out machine, even sending LF after CR doesn't buy enough time and the next character sometimes gets "lost" unless you send a filler NUL.

Now, I haven't really used serial communications in anger for over a decade, and I've never used a printing terminal, so somebody with actual experience is welcome to come in and tell me I'm wrong.

That's fascinating! They got a lot of mileage out of those 2 transistors, didn't they?

But see, that's why I think there has to be more to it. That extra LF character wouldn't be enough to satisfy the timing requirements, so you'd also need to send NUL to appropriately pad the delay time. And come to think of it, the delay time would be proportional to the column the carriage was on when you sent the CR, wouldn't it? I guess it's possible that it always went to the end but that seems unlikely, not least because if that were true then you'd never need to send CR at all, just send NUL or space until you calculated it was at EOL.

I haven't seen them other than in the submission - but if the length matches up, it may be that they were processed from raw email; the RFC defines a length to wrap at.

Edit: yes, I think that's most likely what it is (and it's SHOULD 78ch; MUST 998ch) - I was forgetting that it also specifies the CRLF usage; it's not (necessarily) related to Windows at all here as described in TFA.

Here it is in my 'notmuch-more' email lib: https://github.com/OJFord/amail/blob/8904c91de6dfb5cba2b279f...

> it's not (necessarily) related to Windows at all here as described in TFA.

The article doesn't claim that it's Windows-related. The article is very clear in explaining that the spec requires =CRLF (3 characters), then mentions (in passing) that CRLF is the typical line ending on Windows, then speculates that someone replaced the two-character CRLF with a one-character newline, as on Unix or other OSs.

Ok yeah, I may have misinterpreted that bit in the article. It would be a totally reasonable assumption if you didn't happen to know that about email, though; it wasn't a judgement regardless.

I am just wondering how it is a good idea for a server to insert some characters into a user's input. If a colleague were to propose this, I'd laugh in his face.

It's just so hacky, I can't believe it's a real-life solution.

“Insert characters”?

Consider converting the original text (maintaining the author’s original line wrapping and indentation) to base64. Has anything been “inserted” into the text? I would suggest not. It has been encoded.

Now consider an encoding that leaves most of the text readable, translates some things based on a line length limit, and some other things based on transport limitations (e.g. passing through 7-bit systems.) As long as one follows the correct decoding rules, the original will remain intact - nothing “inserted.” The problem is someone just knowledgeable enough to be aware that email is human readable but not aware of the proper decoding has attempted to “clean up” the email for sharing.
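
A quick sketch of that round trip with Python's quopri module (quoted-printable, the encoding at issue here); on the wire the soft break is the literal =CRLF from the article, quopri just writes it with a bare \n:

    import quopri

    original = ("A deliberately long line that the encoder has to wrap, because "
                "quoted-printable keeps encoded lines under 76 characters so they "
                "survive old mail transports.\n").encode("utf-8")

    encoded = quopri.encodestring(original)   # inserts '=' soft line breaks
    decoded = quopri.decodestring(encoded)    # strips them back out

    print(encoded.decode("ascii"))
    assert decoded == original                # nothing was really "inserted"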

Okay, it does sound better from this POV. Still weird, as it's a client/UI concern, not something a server is supposed to do; what's next, adding "bold" tags on the title? Lol

SMTP is a line-oriented protocol. The server processes one line at a time, and needs to understand headers.

Infinite line length = infinite buffer. Even worse, QP is 7-bit (because SMTP started out ASCII-only), so bytes >127 get encoded as three bytes (an equals sign, then two hex digits), meaning a 500-byte non-ASCII UTF-8 line balloons to roughly 1500 bytes.
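
To make the size blow-up concrete, a small sketch with Python's quopri module (every byte above 127 comes out as an equals sign plus two hex digits):

    import quopri

    line = ("å" * 250).encode("utf-8")     # 250 characters, 500 bytes of UTF-8
    encoded = quopri.encodestring(line)

    print(line[:2])                        # b'\xc3\xa5'
    print(encoded[:6].decode("ascii"))     # =C3=A5  (three encoded bytes per raw byte)
    print(len(line), len(encoded))         # 500 vs roughly 1500, plus a few soft line breaks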

It all made sense at the time. Not so much these days when 7-bit pipes only exist because they always have.

When you post a comment on HN, the server inserts HTML tags into your input. Isn't that essentially the same thing?

No, because there is a clear separation between the content and the envelope. You wouldn't expect the post office to open your physical letters and write routing instructions to the postmen for delivery.

But I agree with the sibling comment: it makes more sense when it's called "encoding" instead of "inserting chars into the original stream".

> You wouldn't expect the post office to open your physical letters and write routing instructions to the postmen for delivery.

Digital communication is based on the postmen reading, transcribing, and copying your letters. There is a reason why digital communication is treated differently than letters by the law, and why the legally mandated secrecy for letters doesn't apply to emails.

It's called escaping, and almost every protocol has it. HN must convert the & symbol to &amp; for displaying in HTML. Many wire protocols like SATA or Ethernet must insert a 1 after a certain number of consecutive 0s to maintain electrical balance. Don't remember which ones; don't quote me that it's SATA and Ethernet.
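
For the HTML case the standard library even has a helper; a tiny sketch, mostly to show that escaping, like the email encoding above, is reversible:

    import html

    comment = "CR & LF < NUL"
    escaped = html.escape(comment)             # 'CR &amp; LF &lt; NUL'
    print(escaped)
    print(html.unescape(escaped) == comment)   # True - nothing lost, just encoded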

Protocols that literally insert a bit are HDLC / PPP / CAN; they insert a 0 after a run of consecutive 1s (five of them, in HDLC's case).
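
A toy version of that stuffing, assuming HDLC's rule of a 0 after five consecutive 1s and bits held in a plain Python list:

    def stuff(bits):
        """Insert a 0 after every run of five consecutive 1s (HDLC-style)."""
        out, run = [], 0
        for b in bits:
            out.append(b)
            run = run + 1 if b == 1 else 0
            if run == 5:
                out.append(0)   # stuffed bit, not part of the payload
                run = 0
        return out

    def unstuff(bits):
        """Drop the 0 that follows any run of five consecutive 1s."""
        out, run, skip = [], 0, False
        for b in bits:
            if skip:            # this is the stuffed 0: discard it
                skip = False
                run = 0
                continue
            out.append(b)
            run = run + 1 if b == 1 else 0
            if run == 5:
                skip = True
                run = 0
        return out

    payload = [1, 1, 1, 1, 1, 1, 0, 1]
    print(stuff(payload))                      # [1, 1, 1, 1, 1, 0, 1, 0, 1]
    assert unstuff(stuff(payload)) == payload  # round trip is lossless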

Just wait until you learn what mess UTF-8 will turn your characters into. ;)