Experimented with this a lot while building agent tooling.
The short answer: yes, description wording measurably matters, and it's more sensitive than most people expect.
A few things we've found that move the needle:
1. Concrete use-case beats generic label. 'Send a transactional email after payment confirmation' outperforms 'Send email' by a wide margin across models — the agent doesn't have to infer when to use it.
2. Negative constraints help. Explicitly saying 'Don't use this for bulk sends or newsletters' actually reduces hallucinated misuse, because the model has a clear mental model of the boundary.
3. Schema field names are silent prompts. A field named 'recipient_email' gets filled correctly far more often than 're' or even 'to'. The model is pattern-matching on names it's seen in similar contexts.
4. Example I/O beats description prose. If you include one example input and what the output looks like, agents are significantly more likely to call the tool correctly on the first attempt.
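Putting the four points together, here's a minimal sketch of what such a tool definition could look like. The tool name, field names, and example values are all illustrative, and the schema shape (an OpenAI-style function spec) is an assumption, not a requirement:

```python
# Illustrative tool spec combining all four points: concrete use case,
# explicit negative constraint, descriptive field names, and a worked
# input/output example embedded in the description.
send_transactional_email = {
    "name": "send_transactional_email",
    "description": (
        "Send a single transactional email after a payment is confirmed "
        "(receipts, order confirmations). "
        "Don't use this for bulk sends or newsletters.\n"
        'Example input: {"recipient_email": "jo@example.com", '
        '"subject": "Order #1042 confirmed", "body": "..."}\n'
        'Example output: {"status": "queued", "message_id": "msg_abc"}'
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "recipient_email": {
                "type": "string",
                "description": "Email address of the single recipient.",
            },
            "subject": {"type": "string", "description": "Subject line."},
            "body": {"type": "string", "description": "Plain-text body."},
        },
        "required": ["recipient_email", "subject", "body"],
    },
}
```

The same information works in other schema dialects; the point is that the description carries the when/when-not and one worked example, while the field names do the rest.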
For cross-model consistency: Claude tends to do better with structured JSON schema + examples. GPT-4 responds well to prose descriptions with analogies. Gemini seems to weight field name patterns heavily.
The broader pattern: treat your tool description like a micro few-shot prompt, not an API doc. Optimize for the model's context, not for human readers.
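One way to make "measurably matters" concrete is to run the same task prompts against each description variant and score first-attempt tool-call correctness. A minimal harness sketch, where `call_model` is a hypothetical function you supply that returns the tool call the agent attempted (nothing here is a real SDK API):

```python
from typing import Callable

def first_attempt_accuracy(
    call_model: Callable[[str, dict], dict],  # hypothetical: (task, tool_spec) -> attempted call
    tool_spec: dict,
    cases: list[tuple[str, dict]],  # pairs of (task prompt, expected arguments)
) -> float:
    """Fraction of tasks where the agent's first tool call names the right
    tool and passes exactly the expected arguments."""
    if not cases:
        return 0.0
    correct = 0
    for task, expected_args in cases:
        attempt = call_model(task, tool_spec)
        if (attempt.get("name") == tool_spec["name"]
                and attempt.get("arguments") == expected_args):
            correct += 1
    return correct / len(cases)
```

Run it once per description variant (and per model, if you care about cross-model consistency) and compare the scores; that turns wording tweaks into a number you can track.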