If you want to play with software rendering, here's probably the shortest code that will get an ARGB8888 2D array from main memory to the screen efficiently on all platforms using SDL2 in C: https://gist.github.com/CoryBloyd/6725bb78323bb1157ff8d4175d... You'll need to do the translation from a 320x200 8-bit palettized framebuffer to ARGB yourself ;)
If you want to get inspired by what can be done with palettized framebuffers, check out http://www.effectgames.com/demos/canvascycle/ (click Show Options) and the GDC presentation by the artist: https://youtu.be/aMcJ1Jvtef0
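For the palette translation, something like this would do. It's only a sketch: palette_to_argb and pack_argb are names I made up, and I'm assuming a 256-entry palette already stored as ARGB8888 with alpha forced to opaque.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack 8-bit R, G, B into an opaque ARGB8888 value (alpha = 0xFF).
   Hypothetical helper, not part of SDL or the gist. */
static uint32_t pack_argb(uint8_t r, uint8_t g, uint8_t b)
{
    return 0xFF000000u | ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Expand an 8-bit indexed framebuffer into ARGB8888 with one table
   lookup per pixel. Both buffers are assumed tightly packed. */
static void palette_to_argb(const uint8_t *indexed,
                            const uint32_t palette[256],
                            uint32_t *argb,
                            size_t pixel_count)
{
    assert(indexed != NULL && palette != NULL && argb != NULL);
    for (size_t i = 0; i < pixel_count; ++i)
        argb[i] = palette[indexed[i]];
}
```

For a 320x200 screen you'd call it with pixel_count = 320 * 200 once per frame, then hand the ARGB buffer to whatever upload path you're using.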
With that you can fire up https://github.com/mriale/PyDPainter for that classic Deluxe Paint IIe vibe. Or, https://www.aseprite.org/ for something more modern.
At least with SDL3, you don't even need the renderer or the texture anymore: call SDL_GetWindowSurface to get the surface and SDL_UpdateWindowSurface to present. That's the most software-graphics you can get, from my understanding of the library. SDL still does the double-buffering for you.
Thank you for sharing this. There's a handful of very popular Quake forks already, but Planimeter publishes a Quake-VS2026 fork that doesn't introduce changes. The team is working on x64 builds, which requires replacing the old SciTech Multi-platform Graphics Library (x86 only) with SDL3 (or porting scitech-mgl to x64, which I don't think will happen). Last I understood, the software renderer may be dropped.
But maybe a software renderer and SDL_Texture could preserve it?
It's certainly the most rudimentary. A small optimisation to the inner loop would be to pre-calculate the scanline offset before going into the pixel loop:
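Roughly like this, as a sketch. fill_gradient_rows and the gradient it writes are placeholders of mine, not the gist's actual loop, and I'm assuming the rows are tightly packed (width pixels per row, no padding).

```c
#include <stddef.h>
#include <stdint.h>

/* Fill an ARGB8888 buffer with a simple gradient. The row pointer is
   computed once per scanline, so the inner loop only indexes by x. */
static void fill_gradient_rows(uint32_t *pixels, int width, int height)
{
    for (int y = 0; y < height; ++y) {
        uint32_t *row = pixels + (size_t)y * (size_t)width; /* hoisted offset */
        for (int x = 0; x < width; ++x) {
            row[x] = 0xFF000000u
                   | ((uint32_t)(x & 0xFF) << 8)   /* green follows x */
                   | (uint32_t)(y & 0xFF);         /* blue follows y  */
        }
    }
}
```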
I'd be surprised if the compiler didn't make that optimisation on its own.
Possibly, but always check the assembly.
An even faster version, compiler optimisations aside, would be to initialize the pointer at y*screenRect.w and increment it on every iteration to avoid the addressing arithmetic.
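A sketch of that pointer-walking form, for comparison. fill_gradient_ptr is a made-up name, and this only works if the buffer is tightly packed, which isn't guaranteed for every surface (the pitch can be larger than width * 4 bytes).

```c
#include <stdint.h>

/* Same gradient as a plain indexed loop would produce, but written
   through a single pointer that advances one pixel per iteration.
   Assumes no padding between rows. */
static void fill_gradient_ptr(uint32_t *pixels, int width, int height)
{
    uint32_t *p = pixels;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            *p++ = 0xFF000000u
                 | ((uint32_t)(x & 0xFF) << 8)   /* green follows x */
                 | (uint32_t)(y & 0xFF);         /* blue follows y  */
        }
    }
}
```

Whether this actually beats the indexed version is something to confirm in the generated assembly or a profiler rather than assume.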
Certainly check the assembly, but loop invariant code motion and strength reduction are basic optimizations. C compilers tend to be good at optimizing indexing patterns even at -O1.
Take a look: GCC and Clang go further than these suggestions by adding screenRect.w to the pointer each iteration to avoid the multiplication. https://godbolt.org/z/YfroqK7T6
Writing anything but pixels[y*screenRect.w + x] in an attempt to be faster, without checking the assembly first, is obfuscation.
(For what it's worth, you can beat the compiler by using *pixels++. I didn't profile the code to check whether it was actually faster in practice, however.)