TDD helps a lot, but it’s no guarantee: an LLM is smart enough to “fake” the code so that the tests pass.
I’m working on a project, a password manager, where I have full end-to-end test harnesses: the CLI client makes changes, syncs them to the server, and the harness then observes the data in the iOS app running in the simulator. More than once I noticed Codex just hard-coded the expected values from the test harnesses directly into the UI layout of the iOS app to make the test pass…
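One way to make this kind of hard-coding fail is to generate the test data fresh on every run, so there is no fixed expected value to copy. A rough Python sketch of the idea, not the actual harness: `make_entry`, `honest_ui`, `hardcoded_ui` and `check_roundtrip` are hypothetical names, and the "UI" here is just a function that returns what it would display.

```python
import secrets

def make_entry():
    """Build a fresh, unpredictable fixture for each test run (hypothetical helper)."""
    return {
        "title": f"entry-{secrets.token_hex(8)}",
        "password": secrets.token_urlsafe(16),
    }

def honest_ui(synced):
    # Stand-in for a UI that displays what actually arrived from the sync.
    return dict(synced)

def hardcoded_ui(_synced):
    # Stand-in for the "faked" UI: ignores the synced data, returns literals.
    return {"title": "entry-demo", "password": "hunter2"}

def check_roundtrip(render):
    """True only if the rendered data matches this run's random entry."""
    entry = make_entry()
    return render(entry) == entry
```

With random values per run, a literal pasted into the layout can only match by accident, so the faked version fails while the real one passes.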
Similar issues showed up in the crypto layer: the tests were written first, then the code. During review I noticed that the code was made to just pass the tests: the logic checked whether a signature value exists instead of checking whether the cryptographic signature is valid.
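Negative tests are what separate the two behaviours: a presence-only check accepts a tampered signature, a real check does not. A minimal sketch in Python, using HMAC-SHA256 from the standard library as a stand-in for whatever signature scheme the project actually uses (an assumption, as are all the function names here):

```python
import hashlib
import hmac

def sign(key: bytes, message: bytes) -> bytes:
    # HMAC-SHA256 as a stand-in signature; the real scheme is not known here.
    return hmac.new(key, message, hashlib.sha256).digest()

def verify_presence_only(key: bytes, message: bytes, signature: bytes) -> bool:
    # The faked check: passes whenever some signature value is present.
    return bool(signature)

def verify(key: bytes, message: bytes, signature: bytes) -> bool:
    # Recompute the expected value and compare in constant time.
    return hmac.compare_digest(sign(key, message), signature)

def rejects_tampering(check) -> bool:
    """True only if `check` accepts the good case and rejects tampered ones."""
    key, msg = b"k" * 32, b"vault-record"
    good = sign(key, msg)
    flipped = bytes([good[0] ^ 1]) + good[1:]  # flip one bit of the signature
    return (
        check(key, msg, good)
        and not check(key, msg, flipped)            # altered signature
        and not check(key, b"other-record", good)   # altered message
        and not check(b"x" * 32, msg, good)         # wrong key
    )
```

A test suite that only asserts the happy path passes for both `verify` and `verify_presence_only`; only the tampered cases tell them apart.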
An LLM can help with code reviews as well, but it has to be told specifically what to look for. This is with the codex 5.4 model.