A couple of personal experiences from the ancient world:

I wrote code for a microprocessor that did not have a stack as an actual concept. There are still embedded processors out there today like this. To work around it, you stored a pointer at a zero-page location. To jump to a subroutine, you would load that pointer from zero page, advance it, store the program counter (your return address) at the memory location it now pointed to, then execute a simple goto. At the end of your function, to return, you would load the pointer from zero page, load the return address it pointed to, decrement the pointer, store it back to zero page, then goto back to the calling function. Every value you wanted to pass to the subroutine went either in one of your three registers, or into memory pointed to by that zero-page value. The 6502 offers nice instructions for these very operations, doing the exact same thing in microcode, but far more succinctly.
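The dance above can be sketched in a few lines. This is a minimal toy model, not the original code: the memory layout, addresses, and names (`ZP_SP`, `STACK_BASE`) are all invented for illustration.

```python
# A minimal sketch of a software call stack kept through a zero-page
# pointer, for a CPU with no hardware stack. All addresses hypothetical.
MEM = [0] * 256          # tiny flat memory
ZP_SP = 0x10             # zero-page location holding the software stack pointer
STACK_BASE = 0x80        # region used as the software call stack

MEM[ZP_SP] = STACK_BASE

def call(return_addr):
    """Emulate 'jump to subroutine': save the return address, advance pointer."""
    sp = MEM[ZP_SP]
    MEM[sp] = return_addr      # store the program counter at the pointed-to slot
    MEM[ZP_SP] = sp + 1        # store the advanced pointer back to zero page
    # ...a plain GOTO to the subroutine would follow here

def ret():
    """Emulate 'return': back the pointer up, fetch the return address."""
    sp = MEM[ZP_SP] - 1
    MEM[ZP_SP] = sp            # store the decremented pointer back to zero page
    return MEM[sp]             # a plain GOTO to this address resumes the caller

call(0x42)                     # caller at address 0x42 invokes a subroutine
resume = ret()                 # subroutine returns
print(hex(resume))             # → 0x42
```

Nesting just works: each `call` pushes one slot deeper, each `ret` pops one back, which is all a hardware stack does for you anyway.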

Another trick I used was bank weaving. You only had a limited amount of addressable memory, but you could bank-switch ROMs. So you'd write one chunk of code to reside at a fixed memory location, and another chunk at another fixed location in a different ROM bank. Your first chunk would execute down to a point where it switched banks based on a condition, and when the other ROM bank switched in, your alternative code path would be waiting for you. You'd execute that, then switch the ROM bank back to the first one, and since the PC (program counter) had advanced, possibly by several hundred bytes, you'd return to a point in your code, back in the first ROM bank, that was a completely different point in memory, but still the same calling function.
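Here's a hedged sketch of the idea: two ROM banks mapped into the same address window, with execution continuing in whichever bank is currently switched in. The banks, addresses, and "instructions" below are toy stand-ins, not real ROM images.

```python
# Bank weaving, toy version: the same addresses resolve to different
# code depending on which ROM bank is switched in.
BANK0 = {0x100: "common prologue", 0x200: "path A", 0x300: "epilogue"}
BANK1 = {0x200: "path B"}   # alternative code waiting at the same fixed address

class Machine:
    def __init__(self):
        self.bank = 0            # which ROM bank is switched in
        self.log = []
    def fetch(self, addr):
        rom = BANK1 if self.bank == 1 else BANK0
        # addresses the alternate bank doesn't provide fall through to bank 0
        return rom.get(addr, BANK0[addr])
    def run(self, addr):
        self.log.append(self.fetch(addr))

m = Machine()
m.run(0x100)    # execute down to the decision point in bank 0
m.bank = 1      # condition met: switch ROM banks
m.run(0x200)    # the alternative code path was waiting for us
m.bank = 0      # switch back; the PC has advanced past the swapped region
m.run(0x300)    # resume in the same calling function, first bank
print(m.log)    # → ['common prologue', 'path B', 'epilogue']
```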

A few years later I used a similar technique on a large text adventure game, a compiler, and a word processor, where the core of the program was always resident but chunks of code could be loaded in at fixed memory locations from disk. If you ran a spell check, the spell-check functions would load in at a known memory location, various bits were set in RAM to indicate which functions were resident, and the application could then perform the spell check, or run the logic for a part of the text adventure game. Functions would automatically unload to make room for new ones based on the last time they were invoked.
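That last sentence describes what we'd now call LRU eviction of overlays. A rough sketch, with all the names and the slot count invented for illustration:

```python
# Overlay manager, toy version: functions load from disk into a fixed
# number of RAM slots; the least-recently-invoked overlay is unloaded
# to make room. Residency flags are just a dict here.
SLOTS = 2          # how many overlays fit in RAM at once (made up)
resident = {}      # overlay name -> tick of last invocation
tick = 0

def invoke(name):
    """Load the overlay if absent, evicting the least recently used one."""
    global tick
    tick += 1
    if name not in resident and len(resident) >= SLOTS:
        oldest = min(resident, key=resident.get)
        del resident[oldest]       # unload based on last invocation time
    resident[name] = tick          # mark resident and record the use
    return sorted(resident)

invoke("spellcheck")
invoke("mailmerge")
invoke("spellcheck")               # refreshes spellcheck's timestamp
print(invoke("thesaurus"))         # mailmerge is now LRU, so it gets evicted
```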

I wrote code for a Timex processor that had an "execute the instruction at the following address" opcode: effectively, your instruction here would execute that instruction over there, but only for one instruction. It made writing the equivalent of switch/case in assembly interesting, and it was also good for self-modifying code.
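The switch/case trick falls out naturally: index into a table of single instructions and "execute over there" exactly one of them. A toy interpreter, with made-up mnemonics (`XEQ` for the execute-at opcode) and a made-up program:

```python
# "Execute the instruction at that address" as a one-instruction dispatch.
# Opcode names and addresses are invented for illustration.
PROG = {
    0x00: ("XEQ", 0x80),   # execute exactly one instruction at 0x80 + index
    0x80: ("INC",),        # a table of single-instruction handlers
    0x81: ("DEC",),
}

def step(acc, addr, index=0):
    op = PROG[addr]
    if op[0] == "XEQ":
        # borrow one instruction from elsewhere, then carry on here
        return step(acc, op[1] + index)
    if op[0] == "INC":
        return acc + 1
    if op[0] == "DEC":
        return acc - 1

print(step(10, 0x00, index=0))   # dispatches to INC at 0x80 → 11
print(step(10, 0x00, index=1))   # dispatches to DEC at 0x81 → 9
```

For self-modifying code the appeal is the inverse: you can build an instruction in a register or memory cell and execute it without overwriting your actual program.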

Zero page on some old CPUs was faster RAM, a few dozen bytes or even 256 bytes, so if you had a tight loop, you'd copy the code into zero page, perhaps having to move other stuff out of the way first, and execute it there.
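The arithmetic behind the trick is simple: you pay a one-time copy cost to win a per-iteration saving. The cycle counts below are invented round numbers, purely to show the shape of the trade-off, not real timings for any CPU:

```python
# Toy cost model for copying a hot loop into faster zero-page RAM.
# All numbers are hypothetical.
ZP_COST, ABS_COST = 3, 4           # cycles per access: zero page vs normal RAM
LOOP_LEN, ITERATIONS = 8, 1000     # an 8-instruction loop run 1000 times

cost_in_place = LOOP_LEN * ITERATIONS * ABS_COST
copy_cost     = LOOP_LEN * (ABS_COST + ZP_COST)   # one-time read + write per byte
cost_in_zp    = copy_cost + LOOP_LEN * ITERATIONS * ZP_COST

print(cost_in_place, cost_in_zp)   # → 32000 24056
```

The copy only pays off when the loop runs enough times to amortize it, which is why you did this for tight loops and nothing else.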

I wrote a word processor that kept only a tiny fraction of its data in under 3KB of RAM; the rest was stored on big floppy disks. As you scrolled through the text, it would page through the pages (memory pages, not pages of text) on the floppy disk, but the memory pages were not stored contiguously, either in RAM or on disk. The RAM acted more like a cache, and a couple of memory pages were reserved for edit operations and even copy & paste. Copying and pasting entire pages was fast: you only had to adjust a few pointers in RAM and on disk, plus a few hundred bytes for head and tail page operations. So moving large blocks of text around was very fast, and inserting large blocks of text, or even just typing, was very fast, because the buffers were quite small. It was the only word processor I know of that came with a disk defrag operation built in, though the company called it "housekeeping" in the manual, with no explanation to the end user of what it was doing. I learned a lot about "don't mark that as deleted until you are damn sure the operation is completed and you've verified it worked."
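The pointer-adjustment move is just linked-list surgery on pages. A minimal sketch with an invented page structure (the real on-disk format is not described above):

```python
# Pages chained by "next" pointers, so moving a large block of text
# only rewrites a few pointers instead of copying the text itself.
pages = {                        # page id -> (text, next page id)
    1: ("Chapter one. ", 2),
    2: ("A middle page. ", 3),
    3: ("The end.", None),
}

def move_page(pid, after):
    """Move page pid to follow page `after` by rewriting three pointers."""
    # unlink: find pid's predecessor and point it past pid
    prev = next(p for p, (_, nxt) in pages.items() if nxt == pid)
    pages[prev] = (pages[prev][0], pages[pid][1])
    # relink pid after the target page
    pages[pid] = (pages[pid][0], pages[after][1])
    pages[after] = (pages[after][0], pid)

def read_all(start=1):
    out, pid = "", start
    while pid is not None:
        text, pid = pages[pid]
        out += text
    return out

move_page(3, 1)       # "cut" page 3 and "paste" it after page 1
print(read_all())     # → 'Chapter one. The end.A middle page. '
```

The "few hundred bytes for head and tail page operations" in the original is the part this sketch skips: the first and last pages of a moved block still need their partial contents spliced, which is real copying, just a bounded amount of it.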

I wrote 6507 and Z80 emulators on the MIPS R3000 that ran the code in a very pedestrian way, but as the 6507 or Z80 ran through the code, each memory address was added to a list, the opcode found there was translated into an intermediate assembly language, and then I post-processed that intermediate assembly into R3000 code, which gave me a huge performance boost: effectively post-process dynamic recompilation. I had to do other sneaky things with it too, because the original hardware raced the electron beam whereas the target platform just had a big VRAM buffer. I used the same trick to port 6507 code to an early ARM processor for an old handheld too.
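The two-pass idea compresses to: run the guest through a slow interpreter while recording each address executed, then translate just those opcodes into host code. In this sketch the guest "ISA", program, and the use of Python closures as stand-ins for the intermediate assembly and R3000 output are all invented:

```python
# Trace-then-recompile, toy version.
GUEST = {0: ("LDA", 5), 1: ("ADD", 3), 2: ("ADD", 2), 3: ("HLT",)}

def interpret(pc=0, acc=0, trace=None):
    """Pass 1: pedestrian interpreter that records every address executed."""
    while True:
        if trace is not None:
            trace.append(pc)
        op = GUEST[pc]
        if op[0] == "LDA": acc = op[1]
        elif op[0] == "ADD": acc += op[1]
        elif op[0] == "HLT": return acc
        pc += 1

def recompile(trace):
    """Pass 2: translate only the opcodes the trace actually touched."""
    host = []
    for pc in trace:
        op = GUEST[pc]
        if op[0] == "LDA": host.append(lambda a, n=op[1]: n)
        elif op[0] == "ADD": host.append(lambda a, n=op[1]: a + n)
    return host

trace = []
slow = interpret(trace=trace)     # first run: slow, but builds the trace
fast = 0
for f in recompile(trace):        # later runs: straight-line host code
    fast = f(fast)
print(slow, fast)                 # → 10 10
```

The win comes from skipping the fetch/decode loop on every subsequent execution; only code that actually ran gets translated, which also sidesteps data misinterpreted as code.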

There are a lot of other tricks we used in the before-times, such as LFSRs and LCGs to permute game objects in seemingly random patterns, cheating on distance checks by counting the clock cycles between two objects drawn on screen, low-res bitmaps of the screen to speed up collision detection, compiled graphics/sprites, even sprite-blitting routines that were hard-coded to specific pixel offsets.
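For the first of those, here's a quick sketch of a 16-bit Galois LFSR of the kind used to march game objects through a seemingly random but perfectly repeatable sequence. The taps (0xB400) and seed are the textbook maximal-length example, not anything from a specific game:

```python
# 16-bit Galois LFSR: shift right, XOR in the taps when a 1 falls off.
def lfsr16(state):
    bit = state & 1
    state >>= 1
    if bit:
        state ^= 0xB400     # common maximal-length tap mask
    return state

# each object takes the next position in the same deterministic permutation
positions, s = [], 0xACE1
for _ in range(4):
    s = lfsr16(s)
    positions.append(s)
print([hex(p) for p in positions])
```

The appeal on old hardware was that this costs a shift and a conditional XOR per value, never repeats until the full 65535-step cycle, and needs no random-number table in precious ROM.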