A long time ago I worked with someone who read 1 byte at a time from a socket because they insisted data was cached so the kernel was going to batch it magically somehow. It took me days to convince them to measure it.
I used to make it a general rule to start all my optimisation of any network code by running strace and looking for excessive reads and writes, because you'd be shocked how many programs did stuff like that if they didn't know the length of a string, or read the length first, instead of reading into a buffer.
I had to convince people with benchmarks regularly that, yes, you could write the handful of lines to do proper user-space buffering and trivially run rings around any code that did extra context switches, because a lot of people didn't realise the cost difference between system calls and calling their own functions.
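For what it's worth, here is a minimal sketch of the kind of user-space buffering I mean, assuming a POSIX system. The names (bufreader, bufreader_getc) and the 64 KiB buffer size are made up for illustration and are not from any particular library. The caller still consumes one byte at a time, but a system call only happens when the buffer runs dry.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper type: a small user-space read buffer over a file descriptor. */
struct bufreader {
    int fd;
    size_t pos;                 /* index of the next unread byte in buf */
    size_t len;                 /* number of valid bytes in buf */
    unsigned char buf[65536];   /* buffer size is an arbitrary choice */
};

void bufreader_init(struct bufreader *r, int fd)
{
    assert(r != NULL);
    r->fd = fd;
    r->pos = 0;
    r->len = 0;
}

/* Returns the next byte (0..255), -1 on end of file, -2 on a read error.
 * read() is only called when the buffer is empty, so consuming N bytes
 * costs roughly N / sizeof(buf) system calls instead of N. */
int bufreader_getc(struct bufreader *r)
{
    if (r->pos == r->len) {
        ssize_t n;
        do {
            n = read(r->fd, r->buf, sizeof r->buf);
        } while (n < 0 && errno == EINTR);   /* retry if interrupted by a signal */
        if (n == 0)
            return -1;
        if (n < 0)
            return -2;
        r->pos = 0;
        r->len = (size_t)n;
    }
    return r->buf[r->pos++];
}
```

Running the before and after versions under strace should make the difference in read() counts obvious, which was usually what convinced people.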
This included, by the way, the MySQL client library, which at one point would do small reads for length fields all the time instead of larger non-blocking reads into a buffer.
That's different: you're talking about the application code, like OP.
But I think the parent comment's point is that the issue is in the implementation of fread itself in the standard library. It's perfectly reasonable for an application to pass it 1, 65536 (i.e. one byte, up to 65536 times) and expect it not to issue 65536 separate OS calls.
Is it? I get what you're saying, but asking for 1 byte 65536 times, is indeed different than asking for 65536 bytes, 1 time. There may be reasons, such as when you pull off the end of a buffer, it shifts. And the buffer size is 1 byte. Or 10. Or whatever.
No, I'm not saying that's why. I'm simply saying there is a difference between asking for 1 byte or 65k bytes of something. Even dd runs the same under Linux.
dd bs=10k count=1 is faster than bs=1 count=10k
I remember trying to recover some data from a spinning disk, and trying to slowly creep up on the data. So I wanted 1 byte per, I wanted it to nibble, until it hit whatever the errored part was. If I just grabbed the lot, it'd error out from the whole read.
I glanced at https://github.com/busterb/libc-openbsd/blob/master/stdio/fr... and https://chromium.googlesource.com/chromiumos/third_party/gli....
The latter (as usual when comparing OpenBSD and Linux) is more complex, but both multiply count by size and then go their way.
Also, the API contract allows fread to read fewer bytes than requested. I would expect any implementation to do that.
But maybe somebody interpreted the contract differently than the major OSes do, in the sense that a call isn't allowed to write partial size-sized chunks to user memory and/or advance the file position further than its return value indicates. (That, I think, is something the implementations above can do, and it might be considered a bug.)
> asking for 1 byte 65536 times, is indeed different than asking for 65536 bytes, 1 time.
Yes, it's different. As others have noted, the difference is in what is returned if fewer than 65536 bytes are available to read in the file: total failure vs partial read.
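To make the return-value difference concrete, here is a small sketch using only standard C. The helper name fread_result is made up for illustration. It writes a given number of bytes to a temporary file and reports what fread returns for a given size/count pair.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: puts `avail` bytes into a temp file, then returns
 * the result of fread(buf, size, nmemb, fp) on it.
 * Returns (size_t)-1 if the temp file could not be set up. */
size_t fread_result(size_t avail, size_t size, size_t nmemb)
{
    static char buf[65536];
    assert(avail <= sizeof buf);
    assert(size * nmemb <= sizeof buf);

    FILE *fp = tmpfile();
    if (fp == NULL)
        return (size_t)-1;

    memset(buf, 'x', avail);
    if (fwrite(buf, 1, avail, fp) != avail) {
        fclose(fp);
        return (size_t)-1;
    }
    rewind(fp);

    /* fread returns the number of complete items read, not bytes. */
    size_t got = fread(buf, size, nmemb, fp);
    fclose(fp);
    return got;
}
```

With 100 bytes in the file, fread_result(100, 1, 65536) reports 100 items, while fread_result(100, 65536, 1) reports 0 because not even one complete 65536-byte item was available. Neither result says anything about how many system calls were made underneath.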
There is, unsurprisingly, no requirement that it has an unnecessarily inefficient implementation to meet this behavioral requirement. (The C standard doesn't talk about such things as syscalls but, even if it did, it surely wouldn't require such a thing.)
The irony is that a partial read is actually the default on both Windows and POSIX (i.e. both ReadFile and read() will read up to the number of bytes specified). So a one-syscall implementation of fread would have been easier than multiple calls, and would certainly be standard compliant.
The dd example isn't comparable because dd is much lower level, and you really are specifying how the syscalls should be made.
Also, in some cases you need to be careful what you read/write.
Many examples out there use int/char etc. to show how to use the thing. But if you switch to structs, fwrite can totally burn you if you use sizeof, because the size of a struct can vary between platforms and compilers depending on packing. Then endianness can sometimes mess you up. If you are reading/writing for yourself you can get away with a lot. But if you are trying to interop then you have to be wildly careful what you do.
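One common way around the sizeof/packing/endianness problem is to encode each field by hand into a byte array with a fixed layout and write that instead of the raw struct. Here is a sketch under assumptions: the record type, its two fields, and the 5-byte little-endian layout are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record type, for illustration only. */
struct record {
    uint8_t  kind;
    uint32_t value;   /* compilers commonly insert padding before this field */
};

/* On-disk size chosen by us: 1 byte for kind + 4 bytes for value.
 * It does not depend on sizeof(struct record). */
enum { RECORD_WIRE_SIZE = 5 };

/* Encode field by field, value in little-endian byte order,
 * regardless of the host's padding or endianness. */
void record_encode(const struct record *r, unsigned char out[RECORD_WIRE_SIZE])
{
    assert(r != NULL && out != NULL);
    out[0] = r->kind;
    out[1] = (unsigned char)(r->value & 0xffu);
    out[2] = (unsigned char)((r->value >> 8) & 0xffu);
    out[3] = (unsigned char)((r->value >> 16) & 0xffu);
    out[4] = (unsigned char)((r->value >> 24) & 0xffu);
}

/* Decode the same fixed layout back into the in-memory struct. */
void record_decode(const unsigned char in[RECORD_WIRE_SIZE], struct record *r)
{
    assert(r != NULL && in != NULL);
    r->kind = in[0];
    r->value = (uint32_t)in[1]
             | ((uint32_t)in[2] << 8)
             | ((uint32_t)in[3] << 16)
             | ((uint32_t)in[4] << 24);
}
```

The encoded bytes can then go through fwrite(out, 1, RECORD_WIRE_SIZE, fp), and the file looks the same no matter which compiler or CPU produced it.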
fwrite is another one where people will do one byte at a time (the same goes for the Windows version). Bash out a loop, use sizeof as the bound of the for loop, copy and paste a call that writes just 1 byte, and you can easily end up here. In one program I added a cache in front of the thing so it would always write on disk block boundaries and then come back for more. I started off with just packed struct sizes but the perf was just 'ok'. The file block boundary thing really made it fast. Not all OSes have a readahead/write buffer behind that call, so perf can vary.
It is honestly such an easy mistake to make, as many of the examples/docs do not really show you why/how to use both of those calls in the way needed. You sort of have to stumble into it and work it out.
Once you see it you know. But until then you do not really notice if it is 'working'.
Another possibility for why it needs to be done that way is dealing with error conditions.
I've not looked at the code (or even the man pages) and it is a long time since I touched anything that low level, so this might be completely wrong, but if there is an error before the next 64KiB (including just hitting EOF) then the semantics could be different. Asking for 1x64KiB I would expect to just error as there aren't the requested number of bytes. Asking for 64Ki lots of 1 byte might simply error just the same, or it might at least populate the buffer with what it can read, or if the meaning of 1,65536 is actually “up to 64Ki lots of 1B” then it would populate the buffer as far as possible and return the amount read rather than an error condition.
If the per-byte option is slow but still fast enough, and dealing with the semantics is less faff, then people will go for that because the tiny time loss is worth the larger effort reduction. Of course this assumes the underlying system doesn't change, as with the “making local code to run as on-demand networked code” example higher in the thread, which changes the relative performance characteristics of the two calling methods significantly.
dd is designed to request a certain block size from the kernel. fread is not and should just multiply the two arguments and read that many bytes, just like calloc.
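As a rough sketch of that "multiply and read" shape, assuming POSIX read(): the function name fd_fread is made up, it is unbuffered, and it is not how any particular libc implements fread. It computes size * nmemb once (with the same kind of overflow check calloc needs), asks the kernel for that many bytes, and converts the byte count back to whole items.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical fread-like helper over a file descriptor.
 * Returns the number of complete items of `size` bytes that were read. */
size_t fd_fread(void *ptr, size_t size, size_t nmemb, int fd)
{
    assert(ptr != NULL);
    if (size == 0 || nmemb == 0)
        return 0;
    if (nmemb > SIZE_MAX / size)
        return 0;                    /* size * nmemb would overflow */

    size_t total = size * nmemb;     /* one byte count, as calloc would compute */
    size_t done = 0;
    unsigned char *p = ptr;

    /* Usually a single read() satisfies the request; loop only to
     * handle short reads and signal interruptions. */
    while (done < total) {
        ssize_t n = read(fd, p + done, total - done);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            break;                   /* EOF or error: stop with what we have */
        done += (size_t)n;
    }
    return done / size;              /* a trailing partial item is not counted */
}
```

The number of read() calls here depends on how much data the kernel hands back per call, not on whether the caller passed (1, 65536) or (65536, 1). Only the return value differs.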