Hi,
I built Axiomeer, an open-source marketplace protocol for AI agents. The idea: instead of hardcoding tool integrations into every agent, agents shop a catalog at runtime, and the marketplace ranks, executes, validates, and audits everything.
How it works:

- Providers publish products (APIs, datasets, model endpoints) via 10-line JSON manifests
- Agents describe what they need in natural language or structured tags
- The router scores all options by capability match (70%), latency (20%), and cost (10%), with hard constraint filters
- The top pick is executed, its output is validated (citations required? timestamps?), and evidence quality is assessed deterministically
- If the evidence is mock/fake/low-quality, the agent abstains rather than hallucinating
- Every execution is logged as an immutable receipt
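As a rough sketch of the ranking step, the weighted scoring might look like this (field names and normalization are my illustration here, not Axiomeer's actual schema):

```python
# Hypothetical sketch of the router's weighted scoring with hard
# constraint filters; field names are illustrative.
W_CAPABILITY, W_LATENCY, W_COST = 0.70, 0.20, 0.10

def score(product, query_tags, max_latency_ms, max_cost):
    # Hard constraint filters: drop products that violate limits outright.
    if product["latency_ms"] > max_latency_ms or product["cost"] > max_cost:
        return None
    # Capability match: fraction of requested tags the product covers.
    cap = len(query_tags & set(product["tags"])) / len(query_tags)
    # Normalize latency and cost so lower values score higher (0..1).
    lat = 1 - product["latency_ms"] / max_latency_ms
    cost = 1 - product["cost"] / max_cost
    return W_CAPABILITY * cap + W_LATENCY * lat + W_COST * cost

def rank(products, query_tags, max_latency_ms=2000, max_cost=1.0):
    scored = [(score(p, query_tags, max_latency_ms, max_cost), p)
              for p in products]
    return sorted([(s, p) for s, p in scored if s is not None],
                  key=lambda sp: sp[0], reverse=True)
```

With capability weighted at 70%, a slower but better-matched provider beats a fast partial match, which is the behavior described above.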
The trust layer is the part I think is missing from existing approaches. MCP standardizes how you connect to a tool server. Axiomeer operates one layer up: which tool, from which provider, and can you trust what came back?
Stack: Python, FastAPI, SQLAlchemy, Ollama (local LLM, no API keys). v1 ships with weather providers (Open-Meteo + mocks). The architecture supports any HTTP endpoint that returns structured JSON.
Looking for contributors to add real providers across domains (finance, search, docs, code execution). Each provider is ~30 lines + a manifest.
Thanks for checking this out! 20 autonomous agents interacting with each other sounds intense; that's exactly the kind of multi-agent coordination problem I'm trying to make easier.
On the weights (70/20/10 for capability/latency/cost):
Honestly, those were empirically tuned from my own usage patterns. Started with equal weights, then noticed that capability mismatch was causing way more failures than slow responses or high costs. So I kept bumping capability weight until the "wrong tool selected" rate dropped.
You're spot on about task-type sensitivity though. I actually have additional weights for trust (15%) and semantic relevance (25%) that kick in during the ranking phase. But dynamic weight adjustment per task type is on the roadmap.
The idea would be something like:
- "real-time" or "live" in query → boost latency weight to 40%
- "cheap" or "budget" in query → boost cost weight to 30%
- "accurate" or "reliable" in query → boost trust weight to 25%
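In code, that roadmap idea could be as simple as keyword-triggered overrides followed by renormalization (a sketch of the unshipped feature, not current behavior):

```python
# Illustrative per-query weight adjustment; keyword lists and boost
# values mirror the examples above but are not shipped behavior.
def adjust_weights(query, base=None):
    w = dict(base or {"capability": 0.70, "latency": 0.20,
                      "cost": 0.10, "trust": 0.00})
    q = query.lower()
    if "real-time" in q or "live" in q:
        w["latency"] = 0.40
    if "cheap" in q or "budget" in q:
        w["cost"] = 0.30
    if "accurate" in q or "reliable" in q:
        w["trust"] = 0.25
    # Renormalize so the weights still sum to 1 after boosts.
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}
```

Renormalizing keeps the scores comparable across queries even when several boosts fire at once.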
Haven't shipped it yet because I wanted to validate the static weights first. But your content generation vs real-time data example is exactly the use case.
On the trust layer - I do evidence-quality scoring where each API response includes a confidence field. APIs that return citations or source URLs get a trust boost. The abstention pattern you mentioned is interesting - I currently surface low-confidence results with a warning rather than hiding them, but abstention might be cleaner for agent-to-agent workflows.
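The evidence-scoring plus abstention pattern can be sketched roughly like this (field names, boost size, and threshold are hypothetical, chosen just to illustrate the flow):

```python
# Hypothetical evidence-quality assessment; "confidence", "citations",
# and "mock" are illustrative field names, not Axiomeer's real schema.
def assess_evidence(response):
    score = response.get("confidence", 0.5)
    # Responses carrying citations or source URLs earn a trust boost.
    if response.get("citations") or response.get("source_urls"):
        score = min(1.0, score + 0.2)
    # Deterministic red flags: mock markers or a missing payload zero it out.
    if response.get("mock") or not response.get("data"):
        score = 0.0
    return score

def decide(response, threshold=0.4):
    score = assess_evidence(response)
    if score < threshold:
        return ("abstain", score)  # refuse rather than hallucinate
    return ("answer", score)
```

Surfacing low-confidence results with a warning would just mean returning a third state between "answer" and "abstain" instead of a hard cutoff.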
Would love to hear more about how you handle trust scoring in BoTTube. Always looking for battle-tested patterns.
Axiomeer v2 is live. Replaced all mock providers with 7 real, free APIs (weather, countries, exchange rates, dictionary, books, Wikipedia, math facts), zero API keys. The pipeline now routes to the best provider, validates evidence, and generates grounded answers with no hallucination (tested on real and fake queries using llama2:7b). 83 tests passing (74 unit, 9 integration). Test results are in Test Images/v2-results.