Hacker News

I wanted to test the capabilities of the low one, hoping it would be good enough.

I have a quizzes application, and my quizzes only supported flashcards (implemented via table inheritance to provide flexibility for other types of quizzes).

The entire repo is handcrafted; I never used any AI on it (it was more of an excuse to try Elixir and write code by hand).

Since fable 5 got released the moment I was done with some work, I decided to throw it at implementing multiple-choice questions.

After all, it only had to copy the flashcard approach across UI/routing/DB, and create one table for the multiple-choice questions and one for the answers, enforcing that every question had one correct answer. I told it that it had access to sqlite3, Chrome MCP for testing, and mix commands.
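For anyone curious what that constraint can look like at the database level, here is a minimal sketch using Python's built-in sqlite3 module. It is not the actual schema from my repo: the table names, column names, and the `build_db` / `add_question` helpers are all made up for illustration, and the base `quizzes` table is just one assumed way to do table inheritance. Note that a partial unique index only guarantees at most one correct answer per question; guaranteeing at least one would still need a check in application code or a trigger.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative.
SCHEMA = """
CREATE TABLE quizzes (
    id   INTEGER PRIMARY KEY,
    type TEXT NOT NULL CHECK (type IN ('flashcard', 'multiple_choice'))
);
CREATE TABLE multiple_choice_questions (
    quiz_id INTEGER PRIMARY KEY REFERENCES quizzes(id) ON DELETE CASCADE,
    prompt  TEXT NOT NULL
);
CREATE TABLE multiple_choice_answers (
    id          INTEGER PRIMARY KEY,
    question_id INTEGER NOT NULL
                REFERENCES multiple_choice_questions(quiz_id) ON DELETE CASCADE,
    body        TEXT NOT NULL,
    is_correct  INTEGER NOT NULL DEFAULT 0 CHECK (is_correct IN (0, 1))
);
-- Partial unique index: at most one row per question may have is_correct = 1.
CREATE UNIQUE INDEX one_correct_answer_per_question
    ON multiple_choice_answers (question_id)
    WHERE is_correct = 1;
"""


def build_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_question(conn, prompt, answers):
    """Insert a quiz, its question, and its (body, is_correct) answers.

    Everything runs in one transaction, so a constraint violation
    rolls back the quiz and question rows as well.
    """
    with conn:
        quiz_id = conn.execute(
            "INSERT INTO quizzes (type) VALUES ('multiple_choice')"
        ).lastrowid
        conn.execute(
            "INSERT INTO multiple_choice_questions (quiz_id, prompt) VALUES (?, ?)",
            (quiz_id, prompt),
        )
        conn.executemany(
            "INSERT INTO multiple_choice_answers (question_id, body, is_correct) "
            "VALUES (?, ?, ?)",
            [(quiz_id, body, int(ok)) for body, ok in answers],
        )
    return quiz_id
```

With this sketch, inserting a question with two answers marked correct raises `sqlite3.IntegrityError` and nothing is saved.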

I ran the test for low, mid, and high, twice each.

low-1 and low-2 both failed. In low-1 the UI for adding another answer choice was broken. In low-2 it failed with a unique constraint error. They took 4m36 and 3m59.

Both mid-1 and mid-2 succeeded without issues, also implementing the correct UI. They both wanted to use bash at all times. They both wrote tests for the "controller" (or context, as it's called in Elixir). They both tried to use the REPL to test the behaviour of the schemas.

They took 10m and 12m39.

High didn't demonstrate much gain over mid for this kind of task; it was simply too easy. Times were comparable to mid, but interestingly it used much less bash and read way more files. Token usage was almost twice that of the others.

But here's the interesting part: I went back to low and added two bullet points to the prompt: write tests for the controllers, and test the entire flow with Chrome MCP.

It produced the same output as mid or high just by adding two instructions to the prompt.