Google's AI models are evolving at a rapid pace. From the basic Gemini 1.5 Flash (free for all) to the more advanced Gemini 1.5 Pro with Deep Research (paid) and everything in between, Gemini's models are iterating faster than the competition - and users - can keep up. With improved reasoning, creativity, image generation, multimodality, precision, and speed, Google has raised the bar across the board to the extent that Gemini is currently the leading family of AI models on several fronts.
But how do Gemini's different models (of which there are four in operation) stack up against each other? While Gemini 2.0 is available via AI Overviews in Google Search everywhere, and to everyone through the Gemini chatbot (app and web), are the more advanced models worth the subscription cost? And if so, where do they excel?
To answer all this and more, we put all four Gemini models - 1.5 Flash, 2.0 Flash, 1.5 Pro, and 1.5 Pro with Deep Research (in that order) - through a rigorous prompt-based test. Follow along and find out how each of these models fared.
Math Test
There's nothing like a tricky math problem to gauge the reasoning and logic of an AI model.
Prompt: If 1=3, 2=3, 3=5, 4=4, and 5=4, then what is 6?
Answer: It's a common puzzle that can stump even highly intelligent people if they don't already know the trick, which is to count the letters in each number's name. So 'one' is 3, 'two' is 3, 'three' is 5... and 'six' is 3. Let's see if the models get it.
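If you want to sanity-check the pattern yourself, a few lines of Python will do it (just a quick illustration of the letter-counting logic, not anything the models produced):

```python
# The puzzle maps each number to the length of its English name,
# so 6 ("six") maps to 3.
words = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five", 6: "six"}
for number, word in words.items():
    print(f"{number} = {len(word)}")  # 1=3, 2=3, 3=5, 4=4, 5=4, 6=3
```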
Not only did both the free Gemini models (1.5 and 2.0) give the wrong answer, but they also gave very little explanation as to how they arrived at their answers.
Both the paid models (1.5 Pro and 1.5 Pro with Deep Research) gave the right answer. 1.5 Pro was quick at recognizing the pattern and highlighted its reasoning in a simple bullet-point format.
The 1.5 Pro with Deep Research model set up a plan of action, researched, found websites with similar questions, and arrived at an answer only after studying the arithmetic and geometric sequences within the pattern. Granted, using the Deep Research model here is overkill. But it's good to see the AI model going the extra mile (even if it takes that much more time to return an answer).
Summarization Test
Now let's see how the AI models fare when providing a simple text summary. We provided each Gemini model with a 30-page research paper on the stylistic analysis of James Joyce's A Portrait of the Artist as a Young Man.
Note: Since the free models can't read PDFs, we copy-pasted the entire text into Gemini so that all four models received the same text.
Here's what the results looked like for the Gemini 1.5 and 2.0:
Note: We couldn't provide the chat link for Gemini 2.0 Flash Experimental since it currently doesn't allow the creation of public links.
Here's what the results looked like for the Gemini 1.5 Pro and 1.5 Pro with Deep Research:
While all four models did exceptionally well at summarizing the 30-page document within 500 words, the results were a mixed bag. Although things may vary from one chat to the next, 1.5 Flash was the worst of the lot, with very generic, surface-level points (that could just as well be gleaned from the subheadings of the paper itself). Gemini 2.0 gave a better summary with a more nuanced understanding that highlighted all the salient points, though it didn't categorize things into sections as well as 1.5 Flash did.
Then we have the paid models - 1.5 Pro and 1.5 Pro with Deep Research. While both did a fairly decent job of summarizing, 1.5 Pro made the best use of its words, compressing the long-winded document without leaving out anything relevant. However, it did so without any clear markers or headings to divide the summary. On the other hand, the Deep Research model had all the headings but little substance within them.
The fine balancing act of providing all the important points, categorized under the relevant sections, within the set word limit is not easy. But the Gemini 2.0 and 1.5 Pro models managed it best.
The 'End with a word' Test
Here's another test of an AI model's capability: how well it understands the prompt and how it goes about applying itself to the task. And it's simple enough. Just ask it to write 10 sentences that each end with a particular word, like 'camera' or 'apple'.
Prompt: Give 10 sentences that end with the word 'Camera'
It's laughable how wrong the AI models can be with as simple a task as this. Although none of the Gemini versions got it completely right, it's interesting to see the free 1.5 Flash model provide the most sentences (6) that complied with the request. The 2.0 Flash Experimental model gave exactly 0 sentences ending with the word 'Camera'.
We could try regenerating the response, but that would defeat the purpose of the test.
Then there are the so-called advanced models. 1.5 Pro had 2 sentences ending with the word 'Camera', and 1.5 Pro with Deep Research (for all the time it takes to research and go in-depth) only gave 3 sentences with the word 'Camera' at the end. It also came up with 13 sentences when the request was for 10, which only hurts its percentage.
For all the talk about AGI on the horizon, even generative AI currently finds it hard to tackle relatively simple requests. Could there be a simpler task? Hardly.
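Scoring responses like these is easy to automate. Here's a rough sketch of the kind of check we're describing (the sentence splitting is deliberately naive, and the helper name is our own):

```python
import re

def count_ending_with(text: str, word: str) -> int:
    """Count sentences whose last word (ignoring punctuation and case) matches `word`."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    count = 0
    for sentence in sentences:
        tokens = re.findall(r"[A-Za-z']+", sentence)
        if tokens and tokens[-1].lower() == word.lower():
            count += 1
    return count

sample = "She smiled at the camera. He dropped his phone. I bought a new camera."
print(count_ending_with(sample, "camera"))  # 2
```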
Common Sense Test
Common sense tests can be tricky for AI models to tackle. But they've gotten better over time, and one expects them to get right what is usually simple for humans. Here's what we chose for our common sense test:
Prompt 1: Which is heavier: 1kg of iron or 1kg of feathers?
Thankfully, all four Gemini AI models got it right. The difference only lay in how they went about providing the answer. The free 1.5 Flash model broke down the answer into easy-to-digest bullet points. 2.0 Flash Experimental just gave a block of text with no sources or links.
Similarly, the paid 1.5 Pro model was quick to figure out the trick question and even backed up its answer with the relevant sources. But the real star was 1.5 Pro with Deep Research. True to its name, it gave a highly researched answer with a primer on 'understanding weight', a well-explained answer, and a conclusion to boot.
Creativity Test
A creativity test is the best way to find out how well a model can bring together disparate elements to create something holistic, meaningful, and aesthetically pleasing. Though there are several such creativity tests one can employ, a simple 'write a short story about...' test usually does the job.
Prompt: Write a short story about Yamraj in the style of Shakespeare in 100 words.
It's really a fun test to see how each of the models goes about the writing and puts its own spin on things.
Funnily enough, the biggest creative difference was between the two Flash models - 1.5 and 2.0. While 1.5 Flash gave us a couple of verses, 2.0 Flash Experimental went the way of prose. Both used only about 65-70 of the 100 words, and it shows: their stories feel incomplete. While neither is exactly Shakespeare, the attempt by 1.5 Flash reads like a teenager trying hard to sound profound (with unnecessary elements thrown in, like 'A maiden fair...'). 2.0 Flash Experimental was slightly better and actually talked about the subject (as prompted) throughout the writeup.
On the other hand, both the paid models went the way of verse. But what's utterly surprising is that they both started in a very similar fashion. From the idea to the word choice, things sound suspiciously similar between the two. The stories do diverge, though, and the POV is also slightly different. But Gemini apparently can have some trouble getting creative (or being creative in a different way every time). Like its unpaid counterparts, 1.5 Pro was also quite happy with a 79-word story. The Deep Research model, on the other hand, went overboard with a total of 127 words. By this point, it's safe to assume that Deep Research is just never going to stay within its limits and will take any chance it gets to show off the depth of its research.
One interesting thing to note is that three out of the four models started with the word 'Hark'. I'm pretty sure Shakespeare himself used the word sparingly. Perhaps that's just how Gemini understands the Bard (ha!).
Multimodal Generative Test
Multimodal tests are designed to check how well the AI models use different modalities to convey a single cohesive message. Usually tested with visual and textual modes, they can either say the same thing or complement each other.
Prompt: Write a short children's story about sportsmanship and add images wherever appropriate.
All Gemini models but one simply failed at this test. While they wrote the story without any issue, they had trouble coming up with images. The two free Flash models just fell apart trying to give us anything to visualize (even with multiple attempts).
On the other hand, Gemini 1.5 Pro was the only model that gave us something to visualize. Since we didn't tell Gemini exactly what we wanted, it's good to at least see the model generate something. Unfortunately, Deep Research isn't designed for image creation. But even after multiple prompts, it downright refused to even write the story.
Translation Test
Translation tests are relatively easy for GenAI. However, the differences between the models tell us more about how well Gemini is improving over time. To that end, we gave each model a 365-word text written in Hindi (the first section of the famous story Grih Daah by the renowned Indian author Premchand). Here's how the different Gemini models fared.
Although the translations were quite on point, Gemini 1.5 Flash was the worst of the lot. It missed the first name of the main character and didn't include important bits like the name of the initiation ceremony in the story. It also didn't provide line breaks for character dialogues (as the source text had done). On the other hand, Gemini 2.0 Flash Experimental didn't make any such mistakes.
Note: We couldn't provide the chat link for Gemini 2.0 Flash Experimental since it doesn't allow the creation of public links yet.
As for the paid versions, the translation by Gemini 1.5 Pro was in a similar vein to that of the free 2.0 Flash Experimental model. Other than a few syntactical differences, there's not much to tell them apart.
Unfortunately, 1.5 Pro with Deep Research is only available in English. So it is already ruled out for this test.
Coding Test
Like translation and logic-based tests, AI models tend to do well on coding-related tests. But there's always room for improvement from one model to the next. For our test, instead of asking for custom code, we used a classic optimization problem: the Travelling Salesman Problem. Here's how each model did at satisfying the prompt:
Prompt: Provide the Python code for the Travelling Salesman Problem.
The Gemini 1.5 Flash model gave a primer on understanding the Travelling Salesman Problem and a few approaches to solving it before it gave the copyable Python code. On the other hand, the 2.0 Flash Experimental model started with the code and added a few notes at the end.
Gemini 1.5 Pro is by far the best of the lot. Not only did it give the correct code, but it also went out of its way to provide explanations for the code. And while 1.5 Pro with Deep Research provided additional info, explanations, examples, and a conclusion, it felt like a little too much for this test. But if you like the extra info, the Deep Research model may be the one for you.
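For context (and not a reproduction of any model's output), a minimal brute-force take on the problem looks something like the sketch below. It checks every permutation of cities, which is only practical for small inputs; the models' answers typically offer smarter approaches on top of this:

```python
from itertools import permutations

def tsp_brute_force(distances):
    """Return the shortest round trip over all cities, given a distance matrix."""
    cities = range(1, len(distances))  # fix city 0 as the start to avoid duplicate tours
    best_cost, best_route = float("inf"), None
    for perm in permutations(cities):
        route = (0, *perm, 0)
        cost = sum(distances[a][b] for a, b in zip(route, route[1:]))
        if cost < best_cost:
            best_cost, best_route = cost, route
    return best_cost, best_route

# Example: 4 cities with a symmetric distance matrix
distances = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]
print(tsp_brute_force(distances))  # (80, (0, 1, 3, 2, 0))
```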
Needle in a Haystack Test
Needle in a haystack is another popular way of testing AI models. The test itself is quite simple: you inundate the AI with data and ask it to retrieve a single piece of information (usually a single sentence). If the AI can fetch it without any issue, it's a pass. So let's see how the four models fare.
We gave Gemini the short story 'White Nights' by Fyodor Dostoevsky to read, and in between added a simple line: "Mr. Jackson's son ate brown bread." Then we asked the following:
Prompt: Go through the text and tell me which bread did Mr. Jackson's son eat.
Interestingly, none of the four models could find the bit about Mr. Jackson in the story. They seemed to have been carried away by the main story and couldn't find the proverbial needle.
Even the Pro models couldn't retrieve the required information. Instead, they parroted the story and the names of its characters.
This is one test that we thought could've separated the wheat from the chaff, to use another similar phrase. But that wasn't to be the case.
Guess the Movie
This is a seemingly simple test wherein the AI is asked to guess a movie based on nothing but a still. But the AI can sometimes be clueless, depending on the image.
Prompt: Which movie is this from?
1.5 Flash got the movie right but it wrongly identified the actor in the still. On the other hand, 2.0 Flash Experimental couldn't identify the movie the first two times. Only on the third try did it get it right, though it played it safe and didn't go out of its way to say anything more about the still.
A similar thing happened with the 1.5 Pro model as well. The first two times, it didn't identify the movie, but it eventually got it on the third try. And once it did, it went ahead and offered additional reasons for why it thought it was that movie. And since the Deep Research model doesn't take images, there's nothing to be said about it.
Image Generation
Gemini uses the Imagen 3 model for image generation. But it's usually hit or miss, depending on how detailed the prompt is. Let's see how the models that are capable of generating images go about the test.
Prompt: Create an image showing a blue whale flying around a gothic clocktower with dark skies. Make it in the style of Edvard Munch
Here are the results for Gemini 1.5 Flash model and 2.0 Flash Experimental model.
And here's what Gemini 1.5 Pro created.
Although all three images technically work, only the paid 1.5 Pro model got the Edvard Munch style right, especially with the background and the architecture of the buildings surrounding the clocktower.
There are elements of the Munch style in the image created by the 1.5 Flash model as well, but there's very little to suggest that style in the one from 2.0 Flash Experimental.
The (un)surprising winner
Across all the different tests, the paid Gemini 1.5 Pro model came out the strongest. Though far from perfect, it managed to complete just about every task and came out well on top wherever it faced competition from the other models. A close second was the free 2.0 Flash Experimental model. Although it's a recent release and is still labeled 'experimental', it handles most jobs well, from summarization to creativity to translation and everything in between. It's available everywhere (on web and in the app) and also powers the AI Overviews in Google Search.
The 1.5 Pro with Deep Research is another one to look out for. But because it isn't geared for image generation, cannot translate, and doesn't allow file uploads, it's not yet up to the mark. The 1.5 Flash model, on the other hand, got a lot of things right. On its own, it's an exceptional AI model. But it cannot compete with its advanced cousins, which are clearly going to take the reins soon enough.
Conclusion
It's not easy comparing models within the same family. But after a long hard tussle, it's easy to recommend the 1.5 Pro model (for the paid version) and 2.0 Flash Experimental model (for the free version). Of course, if you require additional research and don't mind the missing multimodal and file upload options, 1.5 Pro with Deep Research will be your go-to.
We hope you got something out of our analysis of the four models and now know which model to prefer for different tasks.