Interestingly, gcc-amigaos-gcc 6.5 uses dbra without having to jump through any of those contortions, as long as the optimisation level is set to at least -O1:
_clear_screen:
move.w #28672,3932160
move.w #1,3932164
move.l #3932162,a0
move.w #-13570,d1
move.w #1279,d0
.L2:
move.w d1,(a0)
dbra d0,.L2
rts
One thing that I heard from folks who do development for retro Atari platforms is that the 68k support in GCC has been getting worse as time has gone on, and it's very difficult to get the maintainers to accept patches to improve it, since 68k is not exactly widely used at this point.
Specifically, I heard that the 68k backend keeps getting worse, whilst the front-end keeps getting better. So choosing a GCC version is a case of examining the tradeoffs between getting better AST-level optimisations from a newer version, or more optimised assembly language output from an earlier version.
I imagine GCC 6.5 probably has a backend that makes better use of the 68k chip than the GCC 11.4 that ngdevkit uses (such as knowing when to use dbra) but is probably worse in other ways due to an older and less capable frontend.
I tried this with the old SAS/C Amiga compiler. It put addresses in A0 and then moved value into (A0) on next instruction, so the setup part was a bit more inefficient. And refused to use "dbra" no matter what I tried.
I was once into 68k so I may be rusty, but shouldn't it be move.w d1,(a0)+ (increment the target address after each step)?
The hardware increments an internal pointer after each access. The view to that address is through value in a0.