I built an open source LLM agent evaluation tool that works with any framework
dev.to
Every team building AI agents hits the same wall. You ship a LangChain agent. It works great in demos. Then it goes to production and quietly starts hallucinating, calling the wrong tools, or giving answers that have nothing to do with what it retrieved. You don't find out until a user complains.

The root cause is simple: there's no standard way to evaluate agent quality before and after every deploy. Every framework has its own story. LangChain has LangSmith, but it's a paid SaaS and