
Summary
Evaluating LLMs and LLM-based applications is mandatory in any meaningful development lifecycle. In this post we look at the most widely used evaluation frameworks in the industry.
The main difference between the software development lifecycle we have followed for the last couple of decades and software development that relies on large language models is that the latter includes a large portion of non-deterministic behaviour.
This means we do not know exactly what explanation an LLM will come up with, which decision an LLM judge will make, or how it will summarise an answer.
Therefore, to ensure the quality and significantly increase the usefulness of the LLM-based applications we build, we need well-tuned and automated evaluation mechanisms.
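To make the point about non-determinism concrete, here is a minimal sketch of an automated check in plain Python. It does not use any of the frameworks listed below. The names `fake_llm_summary`, `mentions_evaluation` and `pass_rate` are hypothetical, and the "model" is a random stand-in for a real LLM call. Because the same prompt can produce different outputs, the sketch scores a pass rate over many runs instead of asserting on one exact string.

```python
import random
from typing import Callable


def fake_llm_summary(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call: the same prompt may yield different text."""
    candidates = [
        "Evaluation frameworks help test LLM applications.",
        "LLM apps need automated evaluation.",
        "It is sunny today.",  # an off-topic answer the check should catch
    ]
    return rng.choice(candidates)


def mentions_evaluation(output: str) -> bool:
    """A deliberately simple check: does the output talk about evaluation at all?"""
    return "evaluat" in output.lower()


def pass_rate(
    generate: Callable[[str, random.Random], str],
    check: Callable[[str], bool],
    prompt: str,
    runs: int,
    seed: int,
) -> float:
    """Call the generator `runs` times and return the fraction of outputs that pass."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    rng = random.Random(seed)  # seeded so the sketch itself is reproducible
    passed = sum(check(generate(prompt, rng)) for _ in range(runs))
    return passed / runs


if __name__ == "__main__":
    rate = pass_rate(fake_llm_summary, mentions_evaluation, "Summarise this post", runs=50, seed=0)
    print(f"pass rate: {rate:.2f}")
```

In a real pipeline the stand-in would be replaced by an actual model call, and the substring check by a proper metric or an LLM judge. An automated run could then fail when the pass rate drops below an agreed threshold.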
This post lists the most popular LLM evaluation frameworks used in industry.
LLM Evaluation Frameworks
Ragas
Ragas is an open source framework that aims to provide a systematic approach to evaluating LLM applications and tools.
It takes an experiments-first approach to LLM application testing, with a focus on a recommended set of metrics.
Promptfoo
Promptfoo is an open source command-line tool and library for testing and evaluating prompts, models, and LLM applications. Test cases and assertions are declared in a configuration file, so the same checks can run locally and as part of an automated pipeline. It is developed in the open, with contributions from the community.
Deepeval
Evals
EvidentlyAI
EvidentlyAI is also open source.
Notes
We are publishing detailed tutorials that show each of the above frameworks in action, with automation as the main focus.