Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps

Comments

codelion 21 February 2025
The DAG feature for subjective metrics sounds really promising. I've been struggling with the same "good email" problem. Most of the existing benchmarks are too rigid for nuanced evaluations like that. Looking forward to seeing how that part of DeepEval evolves.
llm_trw 20 February 2025
>This brings us to our current limitations. Right now, DeepEval’s primary evaluation method is LLM-as-a-judge. We use techniques such as GEval and question-answer generation to improve reliability, but these methods can still be inconsistent. Even with high-quality datasets curated by domain experts, our evaluation metrics remain the biggest blocker to our goal.

Have you done any work on dynamic data generation?

I've found that even taking a public benchmark and remixing the order of questions had a deep impact on model performance - ranging from catastrophic for tiny models to problematic for larger models once you get past their effective internal working memory.
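For concreteness, here's the rough shape of the remixing I mean; the benchmark format and the ask_model callable are placeholders, not any real API:

    # Score the same model on several shuffled orderings of the same questions
    # and look at the spread. Prior Q/A pairs stay in the prompt so that
    # ordering actually matters (mimics a long, accumulating context).
    import random
    from statistics import mean, stdev
    from typing import Callable

    def score_ordering(questions: list[dict], ask_model: Callable[[str], str]) -> float:
        transcript = ""
        correct = 0
        for q in questions:
            prompt = transcript + f"Q: {q['question']}\nA:"
            answer = ask_model(prompt)
            correct += int(answer.strip().lower() == q["answer"].strip().lower())
            transcript = prompt + " " + answer + "\n"
        return correct / len(questions)

    def ordering_sensitivity(questions, ask_model, n_shuffles=5, seed=0):
        rng = random.Random(seed)
        scores = []
        for _ in range(n_shuffles):
            shuffled = questions[:]
            rng.shuffle(shuffled)
            scores.append(score_ordering(shuffled, ask_model))
        return mean(scores), stdev(scores)  # large stdev => ordering-sensitive model

Large variance across shuffles is exactly the effect I'm describing.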

fullstackchris 20 February 2025
Congrats guys! Back in the spring of last year I did an initial spike at work, investigating tools that could evaluate the accuracy of responses to our RAG queries. We used your services (tests and test dashboard) as a little demo.
nisten 20 February 2025
This looks nice and flashy for an investor presentation, but practically I just need the thing to work off of an API or, if it's all local, to at least have vLLM support so it doesn't take 10 hours to run a bench.

The extra-long documentation and abstractions are, for me personally, exactly what I DON'T want in a benchmarking repo. I.e., what transformers version is this, will it support TGI v3, will it automatically remove thinking traces via a flag in the code or run command, will it run the latest models that need a custom transformers version, etc.

And if it's not a locally runnable product, it should at least have a publicly accessible leaderboard to submit OSS models to, or something.

Just my opinion. I don't like it. It looks like way too much docs and code slop for what should just be a 3 line command.

pantsforbirds 21 February 2025
Does DeepEval allow you to set up custom metrics without an LLM-as-a-judge base?

If my result is a JSON output and I want to weight the keys by importance, can I write a Python function/class that calculates and averages those weighted scores as a DeepEval metric?
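Concretely, I'm picturing something like the sketch below. I'm going from memory of DeepEval's BaseMetric interface for custom metrics, so the exact method names may be off:

    import json

    from deepeval.metrics import BaseMetric
    from deepeval.test_case import LLMTestCase

    class WeightedJSONMetric(BaseMetric):
        """Deterministic metric: no LLM judge, just a weighted key-by-key comparison."""

        def __init__(self, weights: dict[str, float], threshold: float = 0.7):
            self.weights = weights
            self.threshold = threshold

        def measure(self, test_case: LLMTestCase) -> float:
            actual = json.loads(test_case.actual_output)
            expected = json.loads(test_case.expected_output)
            matched = sum(
                w for key, w in self.weights.items()
                if actual.get(key) == expected.get(key)
            )
            self.score = matched / sum(self.weights.values())
            self.success = self.score >= self.threshold
            return self.score

        async def a_measure(self, test_case: LLMTestCase) -> float:
            return self.measure(test_case)

        def is_successful(self) -> bool:
            return self.success

        @property
        def __name__(self):
            return "Weighted JSON Match"

If that kind of plain-Python metric is supported, it would cover most of what I need.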

I do have some annoyances with DSPy, but I think their approach to defining evals is decent.

tracyhenry 20 February 2025
This looks great. I would love to know more about what makes Confident AI/DeepEval special compared to the tons of other LLM eval tools out there.
stereobit 21 February 2025
DAG sounds interesting. Might help me to solve my biggest challenge with evals right now, which is testing subjective metrics e.g. “is this a good email”
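For context, the decomposition I have in mind looks roughly like this toy sketch (purely illustrative, not DeepEval's actual DAG API): break the vague question into narrow checks with dependencies, where each check is far easier to judge (by code or by a single LLM call) than "is this a good email" as a whole.

    from dataclasses import dataclass, field
    from typing import Callable

    @dataclass
    class Check:
        name: str
        fn: Callable[[str], bool]              # narrow, close-to-objective judgement
        depends_on: list[str] = field(default_factory=list)
        weight: float = 1.0

    def evaluate_email(email: str, checks: list[Check]) -> float:
        # Checks are assumed to be listed in topological order of the DAG.
        results: dict[str, bool] = {}
        for c in checks:
            if all(results.get(dep, False) for dep in c.depends_on):
                results[c.name] = c.fn(email)
            else:
                results[c.name] = False        # a parent check failed, so fail the child
        total = sum(c.weight for c in checks)
        return sum(c.weight for c in checks if results[c.name]) / total

    checks = [
        Check("has_greeting", lambda e: e.lower().startswith(("hi", "hello", "dear"))),
        Check("has_clear_ask", lambda e: "?" in e or "please" in e.lower()),
        Check("concise", lambda e: len(e.split()) < 200, depends_on=["has_clear_ask"]),
    ]
    print(evaluate_email("Hi Sam, could you please review the attached draft?", checks))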
TeeWEE 20 February 2025
Was also looking at Langfuse.ai or braintrust.dev

Can anybody with experience give me a tip on the best way to evaluate, manage prompts, and trace calls?

jchiu220 20 February 2025
This is an awesome tool! Been using it since day 1 and will keep using it. Would recommend it to anyone looking for an LLM eval tool.
avipeltz 20 February 2025
This is sick, all-star founders making big moves ;)