How To Choose the Right LLM for Your Use Case - Coding, Agents, RAG, and Search
o3 and Gemini 2.5 Pro are the most capable models, but they aren't always the right choice for your use case
With new AI models arriving nearly every week and their often-confusing naming conventions, choosing the right one can feel overwhelming. To simplify this process, let's clearly match today's top-performing models to your specific usage scenario. Whether you're using GitHub Copilot, Cursor, ChatGPT Plus, or building specialized applications, this guide provides concise, practical recommendations.
Intro
OpenAI's recently released o3 model is the most capable overall, with Gemini 2.5 Pro close behind. o3 is significantly more expensive and typically slower than Gemini 2.5 Pro. Both are reasoning models and, as such, will be slower and more expensive than typical non-reasoning models. Depending on your application, other models should also be considered. Treat this information and public benchmarks as general guidelines to help narrow your initial exploration of models and establish good defaults. Ultimately, the model you choose should be based on your specific application and evaluations tailored to that application.
If you're using ChatGPT Plus
If you subscribe to ChatGPT Plus for $20 per month, you get 100 uses of o3 per week. That benefit alone often makes the subscription worthwhile. o3 is the best model overall and excels as a search and answer engine within the ChatGPT interface. Whereas the deep research query allowances on most services are limited and the reports take a long time to generate, o3 typically searches for an appropriate duration and delivers answers of a reasonable length. It's particularly effective for learning about topics, especially in domains with published scientific papers.
o3: Best when you need superior search, coding capability, or reasoning.
o4-mini-high: Also a reasoning model, though not quite as capable as o3. A good fallback for everyday use to avoid hitting your 100-use limit for o3.
GPT-4.5: Very capable, but its usage limit is low. Its training data cutoff is October 2023, and it is not as proficient at search as o3.
ChatGPT-4o: No longer recommended due to frequent changes and sycophantic behavior. Previous versions rated highly on LMSYS across the board, but the model is updated rapidly, has no versioned option in the API, and ships with very little information about its characteristics, which makes it hard to assess.
If you're using GitHub Copilot
Based on the coding benchmarks and my experience, here are the recommended models:
Gemini 2.5 Pro: The second highest-ranked model in our coding benchmarks (Weighted Score: 0.914). It demonstrates strong performance across benchmarks, including SWE-Bench (63.8%) for bug fixing and Aider Polyglot (72.9%) for multilingual code edits. Offers robust reasoning capabilities.
Claude 3.7 Sonnet: A strong alternative, ranking third in coding benchmarks (Weighted Score: 0.876), performing well on SWE-Bench (62.3%) and Aider Polyglot (64.9%). Its "Extended Thinking" mode is noted as a feature that can aid in debugging complex issues.
o4-mini: Not yet available in GitHub Copilot. Performance is anticipated to be competitive with Gemini 2.5 Pro and Claude 3.7 Sonnet.
If you're using Cursor
The same primary recommendations as GitHub Copilot (Gemini 2.5 Pro, Claude 3.7 Sonnet) apply. Be mindful that Cursor imposes a limit of 500 premium model uses per month across these models. Once you reach this limit, requests may be processed more slowly, or you can switch to a non-premium model.
DeepSeek V3 (0324): A reasonable alternative if you need faster responses after exhausting your premium model quota. While its SWE-Bench (42.0%) and Aider Polyglot (55.1%) scores are lower than the primary recommendations, it performs respectably, particularly in human-rated quality assessments (LMSYS Arena: 1353).
Privacy Note: If you have concerns about using DeepSeek models, note that Cursor uses Fireworks as the inference provider. With privacy mode enabled in Cursor, neither Cursor nor its providers store your data.
When paying for tokens / building LLM applications
The standard advice is to start with the best model you can afford and only transition to cheaper models once the concept has been validated using more capable ones. However, many applications do not require the absolute most powerful model. If you are working on side projects, perhaps some solely for personal use, using a model cheap enough that cost isn't a major concern can be advantageous. Models like GPT-4.1-mini, Gemini 2.5 Flash (with thinking tokens off), or Gemini 2.0 Flash allow experimentation without significant cost anxiety. In all but the most extreme usage scenarios, it's difficult to spend more than $10 per month on these models. My general recommendation is to begin with these cost-effective options and, if capabilities fall short or budget allows, move up the expense ladder with targeted testing.
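As a minimal sketch, getting started with one of these budget models takes only a few lines. This assumes the official openai Python package with an OPENAI_API_KEY set in the environment; the prompt is just a placeholder.

```python
# Minimal sketch: calling a budget model for a side project.
# Assumes the official `openai` package and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1-mini",  # cost-effective default; swap in another model as needed
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the trade-offs of reasoning models in two sentences."},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)
```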
If you're building a RAG application (Retrieval-Augmented Generation)
Choosing a model for RAG involves balancing context window size, reasoning over that context window, instruction following, speed, and cost.
Gemini 2.5 Flash: With thinking turned off, it is extremely cheap. Even with thinking tokens on, it is still cheaper than GPT-4.1 in most cases.
GPT-4.1: Best non-reasoning model from OpenAI available through the API. Very capable and has good long context performance.
GPT-4.1-mini: A much cheaper alternative to GPT-4.1.
Gemini 2.5 Pro: Despite being a reasoning model, it is often cheaper on a per-task basis than GPT-4.1 and Claude 3.7 Sonnet. For long outputs, it can remain competitive on latency with non-thinking models because its output speed is roughly 3x that of GPT-4.1 and Claude 3.7 Sonnet.
Claude 3.7 Sonnet: A quality model, but more costly and slower than GPT-4.1, making it generally not recommended here.
Gemini 2.0 Flash: The most cost-effective option with decent long-context performance. A good starting point if budget is the primary constraint.
Recommendation: Start by evaluating Gemini 2.5 Flash (no thinking) and GPT-4.1-mini based on your budget and context needs. If results aren't sufficient or you need maximum context, test GPT-4.1. If complex reasoning over documents is key, test Gemini 2.5 Pro or Claude 3.7 Sonnet.
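As a rough sketch of the kind of pipeline these recommendations target, here is a minimal RAG loop using Gemini 2.5 Flash with thinking disabled. It assumes the google-genai Python package with an API key configured; the toy keyword retriever and sample documents stand in for a real vector store, and the thinking-budget field should be checked against the current SDK docs.

```python
# Minimal RAG sketch: retrieve, stuff context into the prompt, call Gemini 2.5 Flash
# with thinking disabled. Assumes the `google-genai` package and an API key in the
# environment; the toy keyword retriever is a placeholder for a real vector store.
from google import genai
from google.genai import types

client = genai.Client()

DOCS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am to 5pm Pacific, Monday through Friday.",
    "Enterprise plans include a dedicated account manager.",
]

def retrieve_top_k(query: str, k: int = 2) -> list[str]:
    """Toy retriever: rank docs by keyword overlap. Replace with your vector store."""
    words = set(query.lower().split())
    return sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))[:k]

def answer(query: str) -> str:
    context = "\n\n".join(retrieve_top_k(query))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
        # A thinking budget of 0 disables thinking tokens on 2.5 Flash at the time of
        # writing; confirm the field names against the current google-genai docs.
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_budget=0)
        ),
    )
    return response.text

print(answer("How long do I have to return an item?"))
```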
If you're building an LLM-based workflow or agent and latency is not a critical concern
When building a workflow or an agent, multiple steps are involved, and even a small per-step error rate can compound into an unreliable system. Because of that, smarter models are the suggested starting point. Once the workflow works with smarter models, try to reduce cost by using cheaper models for part or all of it.
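To make that compounding concrete, here is a quick back-of-the-envelope illustration; the per-step success rates are made up for illustration, not measured results.

```python
# Illustrative only: per-step reliability compounds quickly across a workflow.
for steps in (3, 5, 10):
    for per_step_success in (0.99, 0.95, 0.90):
        end_to_end = per_step_success ** steps
        print(f"{steps:>2} steps at {per_step_success:.0%} per step "
              f"-> {end_to_end:.0%} end-to-end success")
```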
Gemini 2.5 Pro: High quality and cost-effective for workflows where latency is not the primary concern.
Gemini 2.5 Flash: Adjustable thinking token budget and the ability to turn the thinking tokens off, make this attractive across a range of uses.
GPT-4.1: Many people are limited to OpenAI models. If you are, this is the primary recommendation.
OpenAI o4-mini (high): Outperforms GPT-4.1, at a higher cost.
OpenAI o3: The most capable according to benchmarks but very expensive. Reserve its use for situations where its unique capabilities are absolutely required or when budget is not a constraint.
DeepSeek V3 (0324): Cheaper than GPT-4.1, with comparable results on some benchmarks, primarily LMSYS and coding. It is typically much slower than other models and underperforms many models, including cheaper ones, on the MultiChallenge benchmark. Only recommended for short-context work and code generation.
Claude 3.7 Sonnet (Thinking): Gemini 2.5 Pro is typically more capable at a lower cost. However, Claude 3.7 significantly outperforms Gemini 2.5 Pro on the MASK benchmark, which attempts to measure model honesty; that result likely correlates with qualities that matter in some applications. It may also be better at handling conflicting information and following system instructions.
Cost-efficient high-throughput applications
GPT-4.1-mini
Gemini 2.5 Flash (with thinking tokens turned off)
Gemini 2.0 Flash
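For high-throughput workloads, most of the practical gain comes from issuing many requests concurrently against one of these cheap models while staying under rate limits. A minimal sketch, assuming the openai package's async client and a placeholder list of prompts:

```python
# Minimal concurrency sketch for high-throughput, low-cost workloads.
# Assumes the `openai` package's AsyncOpenAI client; the prompts are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def complete(prompt: str, semaphore: asyncio.Semaphore) -> str:
    async with semaphore:  # cap in-flight requests to stay under rate limits
        response = await client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

async def main() -> None:
    semaphore = asyncio.Semaphore(20)
    prompts = [f"Classify the sentiment of review #{i}: ..." for i in range(100)]
    results = await asyncio.gather(*(complete(p, semaphore) for p in prompts))
    print(len(results), "completions")

asyncio.run(main())
```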
Models to Avoid
GPT-4o and GPT-4o-mini: The GPT-4.1 series models generally replace these, offering lower cost, higher capability, longer context windows, and faster performance for API usage.
ChatGPT-4o (within ChatGPT): This version is still reasonable to use within the ChatGPT interface itself. However, note the confusing naming. The ChatGPT-4o model available in the chat interface is consistently updated and shows good performance on the LMSYS leaderboard but lacks dated API versions. It's unclear precisely which underlying model version it represents at any given time, and it changes rapidly. Use with caution if stability and version consistency are critical.
Gemini 1.5 Models: With the release and strong performance of the Gemini 2.x series (Pro, Flash), the older 1.5 models are generally superseded.
Evaluate on your own workloads when possible
Remember, the optimal model is the one that aligns most closely with your specific project requirements and constraints. Start your evaluation by creating a small set of question-answer pairs for your specific use, and keep them up to date as your needs change. Having even a minimal question set helps immensely when choosing an initial model and when deciding whether to adopt new models as they are released.
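A minimal sketch of such an evaluation loop is shown below. It assumes the openai Python package; the question-answer pairs are placeholders, and the substring grading is deliberately naive; swap in an LLM judge or domain-specific checks for anything beyond a first pass.

```python
# Minimal evaluation sketch: run a hand-written QA set against candidate models.
# Assumes the `openai` package; the QA pairs and substring grading are placeholders.
from openai import OpenAI

client = OpenAI()

QA_PAIRS = [
    {"question": "What year was the transistor invented?", "expected": "1947"},
    {"question": "What does RAG stand for?", "expected": "retrieval-augmented generation"},
]

def evaluate(model: str) -> float:
    correct = 0
    for pair in QA_PAIRS:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": pair["question"]}],
        )
        answer = response.choices[0].message.content.lower()
        correct += pair["expected"].lower() in answer  # naive grading; refine per use case
    return correct / len(QA_PAIRS)

for model in ("gpt-4.1-mini", "gpt-4.1"):
    print(model, evaluate(model))
```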
Recommended benchmark resources
Regularly validate your model choices using trusted benchmarks.
Additional resources
Cursor models - shows which models are available and which are premium: https://docs.cursor.com/settings/models.
Cost, Accuracy, and Latency per Task
Visualization using the Aider Polyglot benchmark data
Benchmark results
Coding
1. Coding Benchmarks Overview
SWE-bench Verified – A rigorously curated subset of the original SWE-bench designed to more faithfully assess autonomous bug fixing in real codebases. It consists of 500 GitHub issues drawn from 12 popular Python projects, each paired with the exact test suite that closed the issue. Every task has been manually reviewed and confirmed by software engineers to be solvable, ensuring that success rates reflect genuine patching ability rather than noise or unsolvable edge cases. During evaluation, an AI agent receives the pre-fix repository and issue description, generates multi-file patches, and is measured by whether those patches pass the full regression suite—mirroring the complexity of coordinating cross-module changes in production maintenance.
Aider Polyglot (Pass Rate 2) – An end-to-end benchmark testing an AI’s capacity to translate natural-language coding requests into multi-file solutions that actually compile and pass unit tests across six widely used languages (C++, Go, Java, JavaScript, Python, and Rust). Built on 225 of the hardest Exercism exercises, it splits evaluation into two phases (architect and editor), reporting Pass Rate 2 as the fraction of exercises whose full test suite passes after the model’s final edits. This metric captures both the model’s reasoning about how to solve a problem and its ability to produce syntactically correct, executable code that integrates seamlessly into existing files.
LMSYS Arena (Style-Controlled Coding Subset) – A specialized slice of the Chatbot Arena coding leaderboard where human judges compare two models’ code side-by-side on authentic development prompts.
2. Raw Benchmark Results
| Model | SWE (%) | Aider (%) | Arena |
|:-----------------|--------:|----------:|------:|
| o3 | 72.0% | 79.6% | 1389 |
| Gemini 2.5 Pro | 63.8% | 72.9% | 1360 |
| Claude 3.7 | 62.3% | 64.9% | 1340 |
| o1 (Dec ’24) | 48.9% | 61.7% | 1324 |
| DeepSeek V3 | 42.0% | 55.1% | 1353 |
| Gemini Flash | — | 47.1% | 1330 |
| 4o (latest) | 33.2% | 45.3% | 1369 |
Notes:
LMSYS Arena scores are from the coding + style control subset.
DeepSeek V3, the only open-weight model listed, scored surprisingly well in Arena despite its lower SWE-Bench score.
3. Weighting and Normalization
Final Chosen Weights:
| Benchmark | Weight |
| :--------------------------- | -----: |
| SWE-Bench Verified | 55.0% |
| Aider Polyglot (Pass Rate 2) | 22.5% |
| LMSYS Arena (coding) | 22.5% |
(SWE-Bench is emphasized most heavily because it best measures real software engineering tasks.)
Normalization:
Each metric is normalized by dividing it by the top-performing model's result in that category:
SWE_norm = SWE / 72.0
Polyglot_norm = Polyglot / 79.6
Arena_norm = Arena / 1389
Weighted Score Calculation:
Weighted Score = (0.55 × SWE_norm) + (0.225 × Polyglot_norm) + (0.225 × Arena_norm)
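As a quick sanity check, the normalized and weighted scores in the next section can be reproduced from the raw results above with a few lines of Python (Gemini 2.5 Flash is omitted here because its SWE-Bench score is missing):

```python
# Reproduce the weighted coding scores from the raw benchmark results above.
RAW = {  # model: (SWE-Bench Verified %, Aider Polyglot %, LMSYS Arena score)
    "o3":             (72.0, 79.6, 1389),
    "Gemini 2.5 Pro": (63.8, 72.9, 1360),
    "Claude 3.7":     (62.3, 64.9, 1340),
    "o1 (Dec '24)":   (48.9, 61.7, 1324),
    "DeepSeek V3":    (42.0, 55.1, 1353),
    "4o (latest)":    (33.2, 45.3, 1369),
}
WEIGHTS = (0.55, 0.225, 0.225)
# Normalize each metric against the best result in that category.
BESTS = tuple(max(scores[i] for scores in RAW.values()) for i in range(3))

for model, scores in RAW.items():
    weighted = sum(w * s / best for w, s, best in zip(WEIGHTS, scores, BESTS))
    print(f"{model:16s} {weighted:.3f}")
```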
4. Normalized Scores and Final Weighted Scores
| Model | SWE | Poly | Arena | Weighted |
|:-----------------|------:|-------:|-------:|----------:|
| o3 | 1.000 | 1.000 | 1.000 | 1.000 |
| Gemini 2.5 Pro | 0.886 | 0.916 | 0.979 | 0.914 |
| Claude 3.7 | 0.865 | 0.815 | 0.965 | 0.876 |
| o1 (Dec ’24) | 0.679 | 0.775 | 0.953 | 0.762 |
| DeepSeek V3 | 0.583 | 0.692 | 0.974 | 0.696 |
| 4o (latest) | 0.461 | 0.569 | 0.986 | 0.603 |
| Gemini Flash | 0.000 | 0.591 | 0.957 | 0.348 |
5. Final Ranking (Best to Worst)
| Rank | Model | Weighted Score |
| ---: | :------------------------ | --------------:|
| 1 | o3 (high) | 1.000 |
| 2 | Gemini 2.5 Pro Preview | 0.914 |
| 3 | Claude 3.7 Sonnet (32k) | 0.876 |
| 4 | o1-2024-12-17 (high) | 0.762 |
| 5 | DeepSeek V3 (0324) | 0.696 |
| 6 | ChatGPT-4o-latest | 0.603 |
| 7 | Gemini 2.5 Flash Preview | 0.348 |
6. Interpretation
o3 (high) remains the clear coding leader based on this weighted combination of SWE-Bench, Aider Polyglot, and LMSYS Arena scores, achieving the maximum possible score of 1.000.
Gemini 2.5 Pro Preview is extremely competitive and holds the second position (0.914), performing strongly across all three benchmarks.
Claude 3.7 Sonnet ranks a close third (0.876), showing solid performance particularly in SWE-Bench and Aider Polyglot.
o1-2024-12-17 follows in fourth place (0.762), with decent scores across the board.
DeepSeek V3 (0324) ranks fifth (0.696), performing better in human-rated quality (Arena) than in the automated correctness benchmarks (SWE-Bench, Aider).
MultiChallenge Benchmark Results - from Scale AI
MultiChallenge is a benchmark designed to assess LLMs’ performance in realistic multi-turn dialogues by measuring four interdependent capabilities: instruction retention (maintaining directives from the first turn throughout the conversation), inference memory (recalling and integrating relevant details scattered across earlier turns), reliable versioned editing (supporting iterative revisions of materials through back-and-forth exchanges), and self-coherence (staying consistent with the model’s own prior responses without uncritical agreement). Each task combines accurate instruction-following, dynamic context management, and robust in-context reasoning to mirror the challenges of natural human–AI interactions.