> I look forward to comments telling me how wrong I am.
Okay, I'll try to explain.
The job of a parser is to perform the opposite of stringification as precisely as possible. That string source is not always a human typing on a keyboard.
If your stringifier doesn't output scientific notation, then you reject scientific notation because it's invalid data, plain and simple. It's completely irrelevant that to a human it still looks like it could be some number. It's like picking up the wrong baby at the hospital. It might be lovely and all, but it wasn't what you were supposed to get, and that matters. It's a sign that someone screwed up, potentially badly, and your job is to hunt down where the bug was and how to address it, not just roll with whatever came your way and win some competition on how long you can ignore problems.
If you want to parse natural language then sure, it makes sense. That's what flags and different functions are good for. It doesn't mean that has to be the default. As far as programming languages go, the default assumption is that your Parse() is parsing whatever your ToString() produced on the same type.
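A tiny C sketch of that contract, with snprintf/strtol standing in for ToString/Parse (just an illustration, not anyone's actual implementation):

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        long x = 1000000000L;
        char buf[32];
        snprintf(buf, sizeof buf, "%ld", x);   /* "ToString": emits "1000000000", never "1E9" */
        long y = strtol(buf, NULL, 10);        /* "Parse": strict inverse of the line above */
        assert(x == y);                        /* round-trip holds for every long */
        return 0;
    }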
"The job of a parser is to perform the opposite of stringification as precisely as possible."
Is it though? String-to-float parsers are quite flexible in the range of inputs they allow. "1000", "1E3", "10E2", "0.1E4", "1000.0", "00000001000.00000000000000" all return the same float value. https://dotnetfiddle.net/K08vk5
(See also sibling comment by "someone".)
If your question is "Is this string the canonical representation of some number?" then a parser function on its own is the wrong tool.
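To see that flexibility concretely without opening the fiddle (which is C#), here's the same set of inputs through C's strtod:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const char *inputs[] = {
            "1000", "1E3", "10E2", "0.1E4",
            "1000.0", "00000001000.00000000000000"
        };
        for (int i = 0; i < 6; i++)
            printf("%-28s -> %g\n", inputs[i], strtod(inputs[i], NULL));
        return 0;   /* every line prints 1000 */
    }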
Yeah, but floats can typically be formatted with E notation, so it makes sense to parse floats with E notation. I’m not aware of any integer formatting function that will emit integer strings in E notation. The leading zeros and unary sign mentioned by a sibling comment are typically available options for integer formatting, so they make sense to parse, by the GP’s reasoning. I assume float parsing is more forgiving because of how often floats are written in E notation; float parsers also typically have to handle things like NaN and inf.
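A quick C illustration of that asymmetry, using the standard printf conversions:

    #include <stdio.h>

    int main(void) {
        printf("%E\n", 1000.0);   /* "1.000000E+03": float formatters do emit E notation */
        printf("%d\n", 1000);     /* "1000": no standard integer conversion emits "1E3" */
        return 0;
    }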
By that logic, I think most toString/fromString pairs are broken. Many “string to int” functions accept some strings that their counterpart “int to string” never produces, for example “-0”, “0000”, and “01”.
(Chances are ‘fromString’ accepts strings with very long prefixes of zeroes)
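C's strtol shows the same thing:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        /* printf("%d", ...) never produces any of these, yet strtol takes them all */
        const char *odd[] = { "-0", "0000", "01", "+7", "0000000000000042" };
        for (int i = 0; i < 5; i++)
            printf("%-18s -> %ld\n", odd[i], strtol(odd[i], NULL, 10));
        return 0;
    }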
Having said that, I do think accepting “1E9” may go too far. If you accept that, why would you disallow “12E3”, “1.23E4”, “12300E-2”?
If you’re going to widen what the function accepts, I would add support for hexadecimal (“0x34”) and binary (“0b1001”) before adding this.
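For reference, classic strtol already gets partway there; as far as I know, only C23 adds a 0b prefix to it:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        printf("%ld\n", strtol("0x34", NULL, 0));   /* 52: base 0 auto-detects the 0x prefix */
        printf("%ld\n", strtol("1001", NULL, 2));   /* 9: pre-C23 there is no 0b prefix, so
                                                       pass base 2 and drop the prefix yourself */
        printf("%ld\n", strtol("010", NULL, 0));    /* 8, not 10: the octal trap with base 0 */
        return 0;
    }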
Yes, the prefix of zeros (and a plus sign) are exactly what I'd expect to enable via flags, not as a default. Especially because zero prefix isn't even always insignificant - it denotes octal in some contexts.
I think you're mixing up parsers and deserialisers here.
Why limit to a single digit integer for the mantissa? I might just as well want to input 243E9 to get 243 billion.
Keeping it simple. Once you've got the mantissa and exponent out, the range check reduces to checking the exponent.
For a 32-bit signed integer the limit is about 2E9. That means any exponent from 0 to 8 is fine; an exponent of 9 is fine only if the mantissa is 1 or 2. This only works with a single-digit mantissa.
A robust range check can still be done with more mantissa digits, but it gets more complicated, and string-to-integer functions are very conservatively written pieces of code designed to be quick.
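A sketch of that check in C (hypothetical function name; INT32_MAX is 2147483647, just over 2E9):

    #include <stdbool.h>

    /* Does m * 10^e fit in a 32-bit signed int, for a single-digit
       mantissa m (1-9) and non-negative exponent e? */
    static bool fits_int32(int m, int e) {
        if (e <= 8) return true;    /* worst case 9E8 = 900000000, fits */
        if (e == 9) return m <= 2;  /* 1E9 and 2E9 fit, 3E9 does not */
        return false;               /* e >= 10 always overflows */
    }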
“E” is also a plausible typo for numbers though because it’s next to the number row on the keyboard.
> While we’re on the subject of hex numbers, I may be following this up with a proposal that “H” should mean “times 16 to the power of” in a similar style, but that’ll be for another day.
I like this idea but I think it should be "HE" for hexadecimal exponent.
I’m not sure why this is a “proposal” for other string to int parsers rather than a function the author wrote themselves. It seems rather trivial to implement on top of something like strtol (or whatever your language’s equivalent is).
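For what it's worth, a rough sketch over strtol (hypothetical name, crude error handling, a sketch rather than production code):

    #include <errno.h>
    #include <limits.h>
    #include <stdlib.h>

    /* Parse "1E9"-style integers: digits, optionally followed by
       E/e and a non-negative decimal exponent. On bad input or
       overflow, sets errno and returns 0. */
    long parse_int_e(const char *s) {
        char *end;
        errno = 0;
        long m = strtol(s, &end, 10);
        if (end == s || errno) { if (!errno) errno = EINVAL; return 0; }
        if (*end == 'E' || *end == 'e') {
            char *p = end + 1;
            long exp = strtol(p, &end, 10);
            if (end == p || errno || exp < 0) { errno = EINVAL; return 0; }
            while (exp-- > 0) {
                if (m > LONG_MAX / 10 || m < LONG_MIN / 10) { errno = ERANGE; return 0; }
                m *= 10;
            }
        }
        if (*end != '\0') { errno = EINVAL; return 0; }
        return m;   /* parse_int_e("1E9") == 1000000000 */
    }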
You could say that about almost all of most languages' standard libraries.
Imagine you had a standard-library string-to-integer parser that didn't know about minus signs. Sure, you could write your own function that wrapped the parser to allow for negative numbers, but wouldn't it be better if the standard-library one did it?
I take your general point, with the caveat that having no negatives leaves half of all values for a given integer type unreachable, whereas lack of scientific notation support does not.
I was operating under the unfounded assumption that the blog post existed instead of code to do the thing for your particular use case, rather than in addition to it. That isn’t entirely fair, given we have had no prior interactions and I have not investigated your work at all.
This is the kind of thing Services on macOS used to be for - type 1e9 into a text box, hit command option control alt meta escape shift B, and it converts it to 1000000000.