As someone who does a lot of genetic programming (like, old-school, without AI/LLMs, etc), I can confirm that the fitness function is very difficult to get right, especially if you are trying to evolve programs that have "adversarial fitness" -- you'd need to maintain a hall-of-fame, and that just makes the runs take _much_ longer, because, chances are, your fitness function is the bottle-neck.

So, it is very hands-off, but also very expensive, and it is never clear if optimizing the fitness function is worth it, because the fitness function itself may be insufficiently or incorrectly specified.

However, I do think that people should try, even with just a whiteboard or a notebook, to design a fitness-function, for their problem, as if they were going to try to evolve it, because (1) it forces them to explicate their correctness constraints, and (2) they may discover that the program that they are trying to write _is equivalent_ to the fitness function.

I'll give you an example for point 2. Many years ago, I had to parse a gnarly language, and I chose to do it via Chomsky Grammars (that automatically build a tree based on the grammar-spec). Chomsky Grammars are cool, in that they are basically just a state-machine, but they are incredibly difficult to debug: when they work, they might work incorrectly (malformed tree), and when they fail, they give no reason for failure (because even with a trace, you are trying to figure out which backtrack should not have happened). So, out of desperation, I started to consider using genetic programming to just evolve a correct Chomsky Grammar. It became clear that there are only 2 possible fitness functions (1) a function that tests a hand-picked input against a hand-crafted tree-output (which is vulnerable to over-fitting), and (2) a function that is not (well, is much less) vulnerable to over-fitting, but is effectively a pre-existing, correct grammar that can produce those trees.

If you are in situation 2, then the genetic programming is not necessary, unless you are trying to create an optimized (or obfuscated) parser, and even then the optimization may be overfit to the test-inputs (even if they are generated test-inputs from the grammar itself). If you are in situation 1, then you are better off re-evaluating your approach (I abandoned the Chomsky Grammar notation, and invented one that is much easier to understand and debug, without losing any of the expressiveness -- it also happens to be slower, but fast-and-broken is worthless compared to not-so-fast-and-works-fine).

One place where genetic programming has been consistently awesome, is in parameter-search style problems (e.g. your genome is a long list of floats, representing weights and/or anti-weights, and you need to find out which weights give you more fitness (or less error)). I hear good things about variable-neighborhood-search, but have yet to try it.