FTA: In our "Mobile Actions" evaluation, fine-tuning transformed the model’s reliability, boosting accuracy from a 58% baseline to 85%. This confirms that for edge agents, a dedicated, trained specialist is an efficient path to production-grade performance.

I would be wary of having an LLM with 85% accuracy call tools on my system. Isn’t that fairly far from production-grade performance?

I also don’t see how boosting accuracy from 58% to 85% is any indication that it can be boosted further.

There are ways around this. You can push the success rate close to 100% if you use chain of thought and quorum selection. It isn't great, and it slows response times, but if 85% isn't good enough, you just need to flip the coin about 5 times to get nearly(!) guaranteed results.
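For the curious, the arithmetic behind "flip the coin about 5 times", assuming the attempts fail independently and a simple majority vote picks the answer (a back-of-the-envelope sketch, not anyone's actual pipeline):

    from math import comb

    def majority_success(p: float, n: int) -> float:
        """Probability that a strict majority of n independent
        attempts succeed, each with success probability p."""
        k_min = n // 2 + 1  # smallest number of successes that wins the vote
        return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(k_min, n + 1))

    print(majority_success(0.85, 5))  # ~0.973

At 85% per attempt, 5 samples with a majority vote land around 97%: "nearly(!)", but not actually, guaranteed.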

Good insight here. We actually did not include thinking in this model, partly because we saw how incredibly fast it was to output an answer with the minimum number of tokens.

Thinking helps performance scores, but we'll leave it up to users to add additional tokens if they want. Our goal here was the leanest weights and token budget for blazing-fast performance for you all.

Coin flipping works only if the failures are roughly independent. More important is the complexity ceiling above which the attempts fail every time.
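A toy model of that ceiling (the numbers are invented for illustration): say some fraction of tasks sits above the ceiling and every attempt fails on them together, while the rest fail independently. Voting recovers the independent failures but never touches the hard fraction:

    import random

    def voted_accuracy(p_easy: float, frac_hard: float, n: int,
                       trials: int = 100_000) -> float:
        """Majority-vote accuracy when frac_hard of tasks fail on every
        attempt (perfectly correlated failures) and the rest succeed
        independently with probability p_easy."""
        wins = 0
        for _ in range(trials):
            if random.random() < frac_hard:
                continue  # above the ceiling: the vote is unanimous and wrong
            successes = sum(random.random() < p_easy for _ in range(n))
            wins += successes > n // 2
        return wins / trials

    # Single-attempt accuracy is 0.9 * 0.94 ≈ 0.85, as in the eval above,
    # yet voting plateaus just under 90% however many samples you take.
    print(voted_accuracy(p_easy=0.94, frac_hard=0.10, n=5))

So extra samples buy back the independent misses and then flatline.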

So my solution to non-binary failure states is:

1. Generate a potential solution

2. If the solution is complex, chunk it up into logical parts

3. Vote on each chunk (across several sampled solutions) and select those with more than k votes

By doing this you can filter out outliers (not always desirable) and pull the signal out of the noise.
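A minimal sketch of steps 1-3, assuming the generator is sampled several times and that chunks from different samples can be compared by plain equality (generate and chunk are hypothetical stand-ins for your own generator and splitter):

    from collections import Counter
    from typing import Callable

    def chunk_vote(generate: Callable[[], str],
                   chunk: Callable[[str], list[str]],
                   n_samples: int = 5, k: int = 2) -> list[str]:
        """Sample n_samples solutions, split each into logical parts,
        and keep only the parts with strictly more than k votes."""
        votes: Counter[str] = Counter()
        for _ in range(n_samples):
            solution = generate()              # 1. generate a potential solution
            for part in set(chunk(solution)):  # 2. chunk it into logical parts
                votes[part] += 1               # one vote per sample per chunk
        # 3. select the chunks with more than k votes; outliers fall away
        return [part for part, n_votes in votes.items() if n_votes > k]

The catch is the comparison step: independently sampled solutions are rarely byte-identical, so in practice you'd normalize the chunks first (strip whitespace, parse to an AST, or cluster by similarity) before counting votes.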