Keeping Up with the AI Race: Bookmark This Leaderboard Page to Track the Best AI Models by Quality, Speed and Price

Soon after the ground-breaking launch of OpenAI’s ChatGPT in 2022, a race began over who can build the best Large Language Model (LLM).
Besides OpenAI, the main drivers in that race are Google, Meta and Amazon, all of which have quite a track record in AI.
In addition to these incumbents, new participants like Anthropic, Mistral AI and Perplexity have joined the race and are trying to win with unique LLM solutions, for example open-source models or models that put special emphasis on trust and safety.
With the quick iteration cycles that we are experiencing right now, it’s hard to identify who’s leading this race.
So how can you keep an overview?
The easy answer: by checking out LLM benchmarking sites, which test and rank models on multiple characteristics and regularly publish their leaderboards.
There is a variety of such sites, each with a slightly different focus.
To keep your effort small, you should pick the right one.
The Choice of LLM Benchmarking Sites and Leaderboards
Soon after the rise of ChatGPT and competing AI solutions, the first organizations, such as Stanford and LMSYS Org, began to benchmark LLMs and populate their leaderboards.
To do that, they created frameworks for fair benchmarking and compare the most recent AI models released by large tech companies and startups.
The focus is clearly on the models assumed to be the best of the best, i.e., on the companies developing at the frontier of LLM technology.
Andrew Ng, the renowned AI expert and educator, calls out the following as the most important benchmarking services:
- ArtificialAnalysis
- LMSYS Chatbot Arena
- Stanford HELM
- Hugging Face Open LLM Leaderboard

But which one of those should you use?
Most of these benchmarking services and leaderboards focus exclusively on quality aspects.
But speed and price are becoming increasingly important for holistic decisions about model selection. As the quality advantages get thinner over time, this becomes even more true. For some use cases, models might become “good enough” at a certain point, which shifts the focus to speed and price. After all, there needs to be a focus on Return on Investment (ROI) to make this technology work long-term.
To understand the benchmarking services better, here is a quick peek into each of the four services mentioned above, including a glimpse at the most recent versions of their leaderboards.
ArtificialAnalysis
How does it work?
At ArtificialAnalysis, models are evaluated on quality, speed and price. Quality evaluations include the overall Artificial Analysis Quality Index, which itself uses normalized values of other scores (such as the Chatbot Arena Elo score, MMLU and MT-Bench). The quality assessment also looks at standalone evaluations: MMLU (reasoning & knowledge), Scientific Reasoning & Knowledge (GPQA), Quantitative Reasoning (MATH), Coding (HumanEval) and Communication (LMSYS Chatbot Arena Elo score). Speed and Price follow individual testing methods developed by ArtificialAnalysis.
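To make the idea of such a composite index concrete, here is a minimal Python sketch that normalizes a few benchmark scores to a 0–1 range and averages them. The score values, ranges and equal weighting are invented for illustration; this is not ArtificialAnalysis’s actual formula.

```python
# Hypothetical sketch of a composite quality index: normalize each benchmark
# score to 0-1 and average. Ranges and weights are illustrative, not the real formula.

def min_max_normalize(value: float, lo: float, hi: float) -> float:
    """Scale a raw score into the 0-1 range given an assumed min/max."""
    return (value - lo) / (hi - lo)

def quality_index(scores: dict[str, float], ranges: dict[str, tuple[float, float]]) -> float:
    """Equal-weight average of normalized benchmark scores."""
    normalized = [min_max_normalize(scores[name], *ranges[name]) for name in scores]
    return sum(normalized) / len(normalized)

# Made-up example values for one model
scores = {"chatbot_arena_elo": 1250, "mmlu": 0.86, "mt_bench": 8.9}
ranges = {"chatbot_arena_elo": (800, 1400), "mmlu": (0.0, 1.0), "mt_bench": (0.0, 10.0)}

print(f"Composite quality index: {quality_index(scores, ranges):.3f}")
```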
What exactly does it measure?
- Quality: Accuracy scores
- Price: USD per 1M tokens (blended 3:1 ratio of input to output tokens)
- Speed: Latency/TTFT (Time to first token), Output speed (output token per second after first token received), Total Response Time for 100 Tokens
Note: Tokens are pieces (more precisely: subword units) of words. For simplification, you can assume that tokens are roughly like words. Consequently, the Price metric looks at USD per roughly 1 million words generated and the Speed metrics look at the time spent generating words; see the sketch below.
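As a rough illustration of the Price and Speed definitions above, here is a small Python sketch. The per-token prices and timings are invented numbers; only the 3:1 blend and the “total response time for 100 tokens” formula follow the descriptions above.

```python
# Illustrative calculation of ArtificialAnalysis-style Price and Speed metrics.
# All input numbers are made up; only the formulas follow the definitions above.

def blended_price_per_1m(input_price: float, output_price: float) -> float:
    """Blend input/output prices (USD per 1M tokens) at a 3:1 input:output ratio."""
    return (3 * input_price + 1 * output_price) / 4

def total_response_time(ttft_s: float, tokens_per_s: float, n_tokens: int = 100) -> float:
    """Latency to first token plus generation time for n_tokens output tokens."""
    return ttft_s + n_tokens / tokens_per_s

price = blended_price_per_1m(input_price=0.075, output_price=0.30)  # hypothetical USD per 1M tokens
time_100 = total_response_time(ttft_s=0.4, tokens_per_s=207)        # 207 tok/s as cited below

print(f"Blended price: {price:.3f} USD per 1M tokens")
print(f"Total response time for 100 tokens: {time_100:.2f} s")
```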
Where is the focus?
Quality, Price, Speed — ArtificialAnalysis focuses on all three dimensions
What does the current Leaderboard show?
ArtificialAnalysis is very much up to date.
One day after the release of OpenAI’s new models o1-preview and o1-mini on Sep 12, 2024, the evaluation scores were already included in the rankings.
Let’s investigate them…
For the overall quality ranking, OpenAI’s o1-preview model now ranks first, o1-mini second and, at a substantial distance, their previous model GPT-4o third. The spots after OpenAI’s models are taken by Anthropic’s Claude 3.5 Sonnet, Mistral AI’s Large 2 model and the best open-source model in the ranking, Meta’s Llama 3.1 405B.

For Speed, Google’s Gemini 1.5 Flash ranks first, which is Google’s more lightweight model designed for efficiency. It can provide 207 output tokens per second, a bit more than Meta’s Llama 3.1 8B, and much more than Anthropic’s Claude 3 Haiku and OpenAI’s GPT-4o mini — which are also smaller variants of leading frontier models.

On Price, Google’s Gemini 1.5 Flash is also leading, with costs of only 0.1 USD per 1M tokens generated. It is followed by Meta’s Llama 3.1 8B, OpenAI’s GPT-4o mini and Anthropic’s Claude 3 Haiku. As with Speed, the smaller models also perform better on Price.

LMSYS Chatbot Arena Leaderboard
How does it work?
The LMSYS Chatbot Arena follows a user-based approach: a user prompt is fed into two anonymous models at the same time, and each generates a response. The user then evaluates which response is better. The votes are summarized into an Elo score for each model, a rating method typically used to rank chess players that serves this purpose quite well.
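To show how pairwise votes turn into a ranking, here is a minimal sketch of the classic Elo update applied to two models after a single user vote. The K-factor and starting ratings are illustrative, and the Arena’s actual rating computation may differ; this only conveys the basic idea.

```python
# Minimal sketch of an Elo update after one head-to-head vote between two models.
# K-factor and starting ratings are illustrative; the leaderboard's exact method may differ.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    """Return updated ratings after a single pairwise comparison."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b

# A user prefers model A's response over model B's
a, b = elo_update(1300, 1280, a_won=True)
print(f"Model A: {a:.1f}, Model B: {b:.1f}")
```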
What exactly does it measure?
- Quality: an Elo rating per model, derived from pairwise human preference votes
Where is the focus?
Quality
What does the current Leaderboard show?
The LMSYS Chatbot Arena ranks OpenAI’s GPT-4o the highest with an Elo Score of 1316, followed by Google’s Gemini 1.5 Pro and xAI’s Grok 2.
Note that, likely due to the user-based approach, OpenAI’s newest models o1-preview and o1-mini are not included into the ranking as of this post.

Stanford HELM Leaderboard
How does it work?
HELM (Holistic Evaluation of Language Models) stems from Stanford’s Center for Research on Foundation Models. Its performance evaluation is scenario-based, using different kinds of tasks, for example question answering or summarization. Each model is evaluated with a standardized prompt and completion method. The HELM Lite benchmark, which is now used as the more lightweight and manageable version compared to the full evaluation, measures model performance across ten core scenarios: NarrativeQA, NaturalQuestions (open-book), NaturalQuestions (closed-book), OpenbookQA, MMLU (Massive Multitask Language Understanding), MATH, GSM8K (Grade School Math), LegalBench, MedQA, WMT 2014.
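The following Python sketch illustrates the scenario idea: every model receives the same standardized prompt per task instance, and its completion is scored. The scenarios, prompt template and exact-match scorer here are toy placeholders, not HELM’s actual pipeline.

```python
# Toy illustration of scenario-based evaluation with a standardized prompt template.
# Scenarios and the scoring rule are placeholders, not HELM's actual implementation.
from typing import Callable

def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

def evaluate_model(generate: Callable[[str], str],
                   scenarios: dict[str, list[tuple[str, str]]]) -> dict[str, float]:
    """Run a model (a prompt -> completion function) over each scenario and average the scores."""
    results = {}
    for name, instances in scenarios.items():
        scores = []
        for question, reference in instances:
            prompt = f"Question: {question}\nAnswer:"  # standardized prompt template
            scores.append(exact_match(generate(prompt), reference))
        results[name] = sum(scores) / len(scores)
    return results

# Tiny fake scenarios and a dummy "model" for demonstration
scenarios = {
    "closed_book_qa": [("What is the capital of France?", "Paris")],
    "grade_school_math": [("What is 12 * 7?", "84")],
}
dummy_model = lambda prompt: "Paris" if "France" in prompt else "84"
print(evaluate_model(dummy_model, scenarios))
```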
What exactly does it measure?
- Quality: Accuracy (F1 scores, other accuracy scores)
- Compliance: Calibration (Expected Calibration Error, Selective Classification; see the sketch after this list), Robustness (invariance and equivariance), Fairness (counterfactual fairness, performance disparities), Bias (demographic representations, stereotypical associations), Toxicity (using the Perspective API for a toxicity score)
- Efficiency: Efficiency scores (for example, inference runtime)
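As an example of one calibration metric named above, here is a short sketch of Expected Calibration Error (ECE): predictions are grouped into confidence bins and the average gap between confidence and accuracy is reported. The bin count and the model outputs are made up for illustration.

```python
# Sketch of Expected Calibration Error (ECE): bin predictions by confidence and
# average the |accuracy - confidence| gap, weighted by bin size. Data is made up.

def expected_calibration_error(confidences: list[float], correct: list[bool], n_bins: int = 10) -> float:
    total = len(confidences)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = i / n_bins, (i + 1) / n_bins
        in_bin = [(c, ok) for c, ok in zip(confidences, correct) if lo < c <= hi]
        if not in_bin:
            continue
        avg_conf = sum(c for c, _ in in_bin) / len(in_bin)
        accuracy = sum(ok for _, ok in in_bin) / len(in_bin)
        ece += (len(in_bin) / total) * abs(accuracy - avg_conf)
    return ece

# Illustrative model outputs: predicted confidence and whether the answer was correct
confs = [0.95, 0.80, 0.65, 0.90, 0.55]
hits  = [True, True, False, True, False]
print(f"ECE: {expected_calibration_error(confs, hits):.3f}")
```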
Where is the focus?
Quality (major focus), Compliance (major focus), Efficiency (minor focus)
What does the current Leaderboard show?
The HELM Lite leaderboard shows GPT-4o on top of its accuracy ranking. At some distance, Anthropic’s Claude 3.5 Sonnet ranks second, with OpenAI’s predecessor model GPT-4 ranking third.
As with the LMSYS ranking, OpenAI’s newest models are not yet included in the HELM ranking as of this post.

Hugging Face Open LLM Leaderboard
How does it work?
For the Hugging Face Open LLM Leaderboard, models are evaluated on benchmark datasets using EleutherAI’s Language Model Evaluation Harness framework. It covers tasks such as text classification, question answering and summarization, and measures performance across six core benchmarks: IFEval, BBH (Big Bench Hard), MATH, GPQA (Graduate-Level Google-Proof Q&A Benchmark), MuSR (Multistep Soft Reasoning) and MMLU-PRO (Massive Multitask Language Understanding, Professional).
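If you want to run such an evaluation yourself, the harness exposes a Python entry point roughly along the lines of the sketch below. The model id and the task identifiers are my assumptions and may differ between harness versions, so check the current lm-evaluation-harness documentation before relying on them.

```python
# Hedged sketch of running benchmarks with EleutherAI's lm-evaluation-harness.
# Task names and model id are assumptions; verify against your installed harness version.
# pip install lm-eval
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                              # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2-72B-Instruct",         # hypothetical model choice
    tasks=["leaderboard_ifeval", "leaderboard_mmlu_pro"],    # assumed leaderboard task names
    batch_size=1,
)
print(results["results"])
```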
What exactly does it measure?
Accuracy scores specific to the six included benchmarks
Where is the focus?
Quality
What does the current Leaderboard show?
The Hugging Face Open LLM Leaderboard ranks hosted and often individually fine-tuned LLMs.
The models at the top of this ranking deviate substantially from the ones that top the other leaderboards.
Here, the only recognizable model from a major LLM player is Qwen2-72B-Instruct, listed in eighth place. Beyond that, the ranking does not provide clear insights into the forefront of LLM quality; it shows the top performers among models that are hosted on Hugging Face and, most likely, individually fine-tuned.

Summary and Recommendation
By analyzing the different leaderboards, we can draw conclusions on current LLM leadership in different categories:
- Quality Leader: OpenAI o1-preview
- Speed Leader: Google Gemini 1.5 Flash
- Price Leader: Google Gemini 1.5 Flash
- Open-Source Leader: Meta Llama 3.1 405B and 8B
- European Leader: Mistral AI’s Mistral Large 2
My Recommended Leaderboard Site
ArtificialAnalysis clearly gave the most comprehensive answers when analyzing the LLM model race and determining which is the best current model per category.
They not only cover all three dimensions (Quality, Speed and Price), but also have a rapid updating process, as seen after the release of o1-preview and o1-mini. Besides the LLM leaderboard, there are extensive benchmarks of various API providers, including fast inference providers like Groq, Cerebras and SambaNova.
Stanford HELM still provides a super-clean, scientific alternative for quality assessment, which I will occasionally check out.
Yet, ArtificialAnalysis is my main source for model comparison and benchmarking, and I highly recommend bookmarking ArtificialAnalysis.ai and paying it regular visits.
Which leaderboards do you prefer and what are your favorite models for your use cases?