I run a multi-agent orchestration system where each terminal has access to skill templates. The orchestrator (T0) picks which skill to assign based on the task — so I've spent months tuning how skill descriptions affect agent behavior. What I found: the description is the entire selection surface. The agent doesn't read your code, doesn't check your tests, doesn't browse your docs. It reads the description and decides in one pass.

Three things that actually moved the needle:

Negative boundaries work better than positive claims. "Generates reports from structured receipts. Does NOT execute code, modify files, or make API calls" gets called correctly way more often than "A powerful report generation tool."

Trigger words matter more than you'd think. I maintain explicit trigger lists per skill — specific phrases that should activate it. Without those, the agent pattern-matches on vibes and gets it wrong ~30% of the time. With explicit triggers, that drops to under 5%.

Schema is the real interface. Clean parameter names with sensible defaults beat elaborate descriptions. If your tool takes `query: string` vs `search_query_input_text: string`, the first one gets called more reliably across models.
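A rough illustration of the two schema shapes, plus a toy lint check for overlong parameter names. The schemas follow the common JSON-Schema tool-definition style, and the 12-character cutoff is made up for the example, not a real rule.

```python
# Illustrative tool schemas: the first is the shape that gets called
# reliably, the second is the anti-pattern. Both are hypothetical.
GOOD_SCHEMA = {
    "name": "search",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search terms"},
            "limit": {"type": "integer", "default": 10},
        },
        "required": ["query"],
    },
}

BAD_SCHEMA = {
    "name": "search",
    "parameters": {
        "type": "object",
        "properties": {
            "search_query_input_text": {"type": "string"},
            "maximum_number_of_results_to_return": {"type": "integer"},
        },
        "required": ["search_query_input_text"],
    },
}

def lint_params(schema: dict, max_len: int = 12) -> list[str]:
    """Flag parameter names too verbose to pattern-match reliably."""
    props = schema["parameters"]["properties"]
    return [name for name in props if len(name) > max_len]
```

Running the lint over a toolkit before deployment is a cheap way to catch the `search_query_input_text` pattern early.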

But here's the thing the "agent economy" framing gets wrong: you don't want fully autonomous tool selection. An agent choosing freely between 50 tools is like giving a junior developer admin access to everything — it'll work sometimes and break spectacularly other times. What works better is constraining the agent's scope upfront. Give it 3-5 relevant skills for the task, not your entire toolkit. Or build workflow skills that chain multiple tools in a fixed sequence — the agent handles the content, the workflow handles the routing.
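Here's a toy sketch of the workflow-skill pattern, where the sequence is fixed in code and the agent only supplies content at each step. The step functions are placeholders, not real tools.

```python
# Minimal "workflow skill": routing is hard-coded, the agent never
# chooses the next tool. Step functions are illustrative stand-ins.
from typing import Callable

def extract(text: str) -> dict:
    return {"raw": text}

def transform(data: dict) -> dict:
    return {**data, "clean": data["raw"].strip().lower()}

def report(data: dict) -> str:
    return f"report: {data['clean']}"

PIPELINE: list[Callable] = [extract, transform, report]

def run_workflow(text: str):
    """Run the fixed sequence; each step feeds the next."""
    result = text
    for step in PIPELINE:
        result = step(result)
    return result
```

The agent fills in the content at each stage; it never sees a menu of 50 tools, so there's nothing for it to misroute.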

The uncomfortable truth: you're not optimizing for "discovery" in the human sense. There's no brand loyalty, no trust built over time. Every single invocation is a cold start where the model reads your description and decides. That's actually freeing — it means the best-described tool wins, regardless of who built it.

One thing I’ve noticed is that performance often degrades as my context grows. So how are you battling your agents being exposed to too many descriptions? I see how this works in curated agents where you’re tending it like a garden, but not when we’re looking for organic discovery of how to accomplish a task. It feels like order matters a lot there.

Context bloat is a real problem — and yes, order matters more than most people realize. Descriptions near the top of the tool list get preferentially selected, especially in long contexts where attention degrades.

Two things I do to fight this:

First, skill scoping per task. Instead of exposing all 20+ skills to every agent, each terminal only sees the 3-5 skills relevant to its current dispatch. The orchestrator decides which skills to load before the agent even starts. Less noise, better selection accuracy.
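In sketch form, with invented task tags and skill names, the orchestrator's scoping step looks something like this:

```python
# Hypothetical mapping from dispatch type to a small skill subset.
# The orchestrator resolves this before the worker agent starts.
SKILL_SETS = {
    "data_analysis": ["load_csv", "aggregate", "plot"],
    "code_review": ["read_diff", "lint", "comment"],
}
FALLBACK = ["search_docs"]

def scope_skills(task_tag: str, cap: int = 5) -> list[str]:
    """Expose at most `cap` skills to the worker, never the full toolkit."""
    return SKILL_SETS.get(task_tag, FALLBACK)[:cap]
```

The worker's context then only ever contains 3-5 descriptions, which keeps the selection surface small regardless of how many skills exist globally.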

Second, context rotation before context rot sets in. When an agent's context fills up, the system automatically writes a structured handover, clears the window, and resumes in a fresh context. This is critical because a degraded context doesn't just pick worse tools — it starts ignoring instructions entirely. A fresh context with a good handover outperforms a bloated one every time.
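One possible handover shape, sketched as JSON. The exact fields here are my own choice based on what I capture (task state, files, progress); it's not a fixed schema.

```python
import json

def write_handover(task: str, files: list[str], progress: str, next_step: str) -> str:
    """Serialize a structured handover the fresh context can resume from."""
    handover = {
        "task": task,
        "files_touched": files,
        "progress": progress,
        "next_step": next_step,
    }
    return json.dumps(handover, indent=2)
```

The only hard requirement is that the fresh context can act on it without re-reading the old transcript, so "next_step" matters more than a verbose history.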

I'm actually testing automatic refresh at 60-70% usage right now — not waiting until the window is nearly full, but rotating early to prevent context rot before it starts. Early results suggest that's the sweet spot: late enough that you've done meaningful work, early enough that the handover quality is still high.

The organic discovery problem you're describing is essentially unsolvable with a flat tool list. The more tools you add, the worse selection gets — it's not linear degradation, it's closer to exponential once you pass ~15-20 tools in context. The only path I've found is hierarchical: a routing layer that narrows the set before the agent sees it.

Update: I ended up building this into a full closed-loop pipeline. A PreToolUse hook detects context pressure at 65%, the agent writes a structured handover (task state, files, progress), tmux clears the session, and a rotator script injects the continuation into the fresh session.

The key insight from testing: rotating at 60-70% — before quality degrades — matters more than the rotation mechanism itself. At 80%+ auto-compact kicks in and races with any cleanup you try to do.
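The threshold logic itself is trivial; something like this, where the numbers match the 65%/80% figures above and the function name is mine:

```python
# Toy decision function for when to rotate. 0.65 triggers a handover;
# past 0.80 we assume auto-compact may race any cleanup we attempt.
def rotation_decision(used_tokens: int, window: int,
                      rotate_at: float = 0.65, danger_at: float = 0.80) -> str:
    usage = used_tokens / window
    if usage >= danger_at:
        return "too_late"   # auto-compact may race the handover
    if usage >= rotate_at:
        return "rotate"     # write handover, clear session, resume fresh
    return "continue"
```

The hard part isn't this check, it's making the handover good enough that "rotate" is cheap; once it is, rotating early is strictly better than waiting.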

Wrote it up as a Show HN if anyone's curious: https://news.ycombinator.com/item?id=47152204