Lately, open-source large language models have been proliferating like the bunny rabbits in my yard, and worse than that is the specter that you might have to find a good reason to pick one over another.

In this video, I recommend Hugging Face's LLM leaderboard as a place to get started. It ranks these models according to a number of metrics, with the widely discussed Falcon presently leading in average score. I will drill into these metrics a bit: just what do they measure, and how? How might relative performance differ in the applications we are pursuing at my company? One point I would like to keep re-emphasizing is that the breadth and flexibility that might have attracted you to OpenAI's ChatGPT are likely to be not only among the more difficult qualities to reproduce in-house, but also among the more difficult qualities to understand via a standardized test.

Some basic conversation about LLM performance metrics

Alexander Mueller