AI chatbot models are improving at a rate of knots. No sooner does one company release its latest model than its competitors come up with their own. And even though OpenAI's ChatGPT had a head start in the AI race, the rest of the field has caught up swiftly.

At the moment, the two most popular AI chatbots sporting heavyweight AI models of their own are ChatGPT and Gemini. While the two AI chatbots have different models for different purposes, the AI models that are most comparable are ChatGPT's 4o model and Gemini's 1.5 Pro.

Since both of these are paid models, it's worth knowing which of the two does it better, so you can decide for yourself which model and chatbot is best suited to your use case.

To that end, we've put ChatGPT's 4o model and Gemini's 1.5 Pro through the wringer to see which one comes out on top. Let's get started.

Math Test

Let's start with a tricky math problem to check how the two AI models reason through a problem:

Prompt: If 1=3, 2=3, 3=5, 4=4, and 5=4, then what is 6?

The trick to answering this question is to count the letters in each number's name. So, if 'one' is 3 and 'three' is 5, then 'six' is 3.
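
If you want to sanity-check the pattern yourself, a few lines of Python make it obvious (the word list below is ours, purely for illustration):

```python
# Verify the letter-count pattern behind the puzzle.
words = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five", 6: "six"}

for number, word in words.items():
    print(f"{number} = {len(word)}")
# Prints 1 = 3, 2 = 3, 3 = 5, 4 = 4, 5 = 4 and, crucially, 6 = 3.
```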

We expected both models to get it right - which they did! But what's more important is their explanations. Here's what they looked like:

Both ChatGPT and Gemini provided a simple explanation for their answers, and on that front there's very little to tell them apart. ChatGPT did offer the slightly more sophisticated answer, mentioning that the pattern is "not numeric but rather linguistic". That's a subjective edge, though, and it doesn't take anything away from Gemini's answer.

Winner: Tie

Summarization Test

For this test, we gave each model a long 27-page research paper to analyze and summarize in less than 100 words. The test here is to see which elements the models include and which ones they leave out (since it's not easy to whittle that much content down to a hundred words).

Here's what the results look like for ChatGPT 4o and Gemini 1.5 Pro:

ChatGPT's ability to synthesize is truly exceptional. It used all 100 words to come up with a concise summary that touched on all the major points of the research paper. However, the summary was a single block of text with no references to specific statements in the paper. Perhaps it would have benefited from a few bullet points and a mention of the source.

Gemini, on the other hand, held its own quite well with a similar approach to the summary. Though it didn't provide much of a breakdown either, and didn't use the entire word quota (its summary was only 83 words), it did support its claims with references to the text - which is a huge plus. With so little separating the two otherwise, extras like that go a long way.

Winner: Gemini

The 'End with a word' Test

This is perhaps the simplest of tests, yet one that most models get wrong. The task is easy enough: construct 10 sentences that end with a particular word. The word we're choosing is "ball".

Prompt: Give 10 sentences that end with the word 'Ball'.

Surprisingly, ChatGPT only managed 3 sentences that ended with the word 'ball', while Gemini outperformed it with 6.

Analyzing the sentences, it seems the models don't count the phrase or clause that follows the target word against themselves. But that's not what was prompted: when we say 'write sentences that end with the word ball', we expect sentences that actually have 'ball' as the last word - a criterion that's easy to pin down, as the sketch below shows.
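
To make that criterion concrete, here's a minimal Python check (the example sentences are hypothetical, not the models' actual outputs):

```python
import string

def ends_with_word(sentence: str, word: str) -> bool:
    """Return True only if the sentence's final word, ignoring
    trailing punctuation and case, matches the target word."""
    last = sentence.split()[-1].strip(string.punctuation)
    return last.lower() == word.lower()

# Hypothetical examples, not the models' actual outputs.
print(ends_with_word("He kicked the ball.", "ball"))               # True
print(ends_with_word("The ball rolled into the street.", "ball"))  # False
```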

Since neither of the two AI models gave us all 10 sentences with 'ball' as the last word, it's best to go with the model that got the most right.

Winner: Gemini

Common Sense Test

Common sense tests can be quite fun, not least because AI models often get them wrong. But we don't expect paid models such as GPT-4o and Gemini 1.5 Pro to fail such tests. Here's what we asked the two models.

Prompt: If a blue ball falls into the red sea, which color is it now?

The answer is pretty straightforward. And here's how the two models fared:

As expected, both models provided the right answer. However, while Gemini was happy to wrap up its answer in a couple of statements, ChatGPT went out of its way to give additional details.

Again, this is similar to the first test. The additional explanation may or may not be needed, depending on what you prefer. But as far as getting the answer right goes, they're both winners.

Winner: Tie

Creativity Test

When you're in a pinch, having an AI chatbot take care of creative tasks can be a lifesaver. So let's see how ChatGPT and Gemini get on with a simple short story prompt that has a few stylistic twists thrown in.

Prompt: Write a short story about Santa in the style of a drunken Chaucer in 100 words.

We'll let you be the judge of these creations.

Now, choosing the better of the two can be quite subjective. But it's worth noting that Gemini has a habit of opening most verse-style creative writing with the word 'Hark'. Even in our comparison of the Gemini models, we saw how 'Hark' is perhaps Gemini's go-to word when imitating an older style of writing. Unfortunately for Gemini, which can otherwise be quite creative, it's ChatGPT that wins this round.

Winner: ChatGPT

Image Generation Test

Image generation has nothing to do with text and highlights a very different facet of these AI models. Let's see how the two models go about generating images for the following prompt:

Prompt: Create an image showing a black cat staring out the window at the fields of barley in the evening yellow light. Make it in the style of Vincent Van Gogh.

While ChatGPT was 4-5 seconds faster than Gemini, we think Gemini's image came out better overall. ChatGPT might have overdone it with the wiggly strokes. But it's really close, and we're just taking a subjective call. What do you think?

Note that ChatGPT also lets you edit parts of the image after it's generated (while Gemini doesn't have any such feature).

Since both models got the basics right and captured Vincent van Gogh's post-impressionist style, we can't fault either outright. In the absence of any major issues, it really comes down to which one you prefer. But if pushed to pick one, we'd go with Gemini.

Winner: Gemini

Multimodal Generation Test

Creative tests in a single modality are relatively easy for AI, which is why there's little to tell them apart. But combining modalities can easily separate the wheat from the chaff.

Prompt: Write a short children's story about sportsmanship and add 3 images wherever appropriate.

ChatGPT gave us a nice and simple story with a moral dilemma at its heart, as well as descriptions of the images, which it then created once we gave it the go-ahead. Everything was neat, easy to follow, and not overcomplicated.

Gemini, on the other hand, had no issue creating a story. But, in our estimation, it wasn't as compelling or as easy to read as ChatGPT's. Gemini also repeatedly failed to create any images at all, which was perhaps the most unfortunate part. That made the decision an easy one.

Winner: ChatGPT

Translation Test

AI models excel at translating long pieces of text. But the devil is in the details, and that's what you should be on the lookout for when going over the translations. Depending on the model and the text involved, the AI can miss important things.

For context, we've asked the AI models to translate the first two sections of the Hindi short story Grih Daah by Premchand.

Here's the chat link for ChatGPT's translation. ChatGPT's translation of the short story was one of the best we've seen by an AI model thus far. It quite closely followed the author's syntax and didn't miss a beat.

Note: Good translation follows meaning rather than rendering every word and sentence literally. ChatGPT gets the balance right, as you can see if you pull up the source and the translation side by side.

Gemini was a lot harder to work with. It refused to translate the text on the first attempt, and even on the second, it took far longer than ChatGPT. We didn't encounter such issues when comparing the 4 Gemini models, but Gemini is known to be much more inconsistent, especially when compared to ChatGPT. Here's the chat link for Gemini's translation.

Winner: ChatGPT

Coding Test

Coding is one of the things AI models do best. While human coders are still the mainstay, AI greatly helps with menial tasks as well as debugging. But instead of asking the models to write custom code, let's see how they do on a common optimization problem.

Prompt: Provide the Python code for the Travelling Salesman Problem.

Unsurprisingly, running the prompt in ChatGPT automatically activated Canvas mode for coding. All the code was generated in Canvas, where it could be run to check the output and to debug, and the information provided in the main chat was a sufficient explanation of the Travelling Salesman Problem. Here's the link for ChatGPT's output.

Gemini had a few tricks up its sleeve too. While explaining the problem, it linked to credible sources (which adds to the trust factor) and broke the solution down into steps. The code itself could be easily copied, and we found no issues with it. Here's the link for Gemini's code output.

However, nothing beats ChatGPT's Canvas feature when it comes to coding. Although Gemini does its best to win this round, it just doesn't have a comparable feature that doubles as a coding interface where you can add and remove lines of code, translate the code into other programming languages, get suggestions, and debug.
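
For reference, here's a minimal sketch of our own showing what a baseline answer to this prompt looks like - a brute-force solution with made-up city coordinates, not a reproduction of either model's actual output:

```python
import math
from itertools import permutations

# Hypothetical city coordinates, purely for illustration.
cities = {
    "A": (0, 0),
    "B": (1, 5),
    "C": (5, 2),
    "D": (6, 6),
    "E": (8, 3),
}

def distance(p, q):
    """Euclidean distance between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def tour_length(order):
    """Total length of a closed tour visiting the cities in the given order."""
    return sum(
        distance(cities[order[i]], cities[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )

def brute_force_tsp():
    """Exact answer by checking every permutation - fine for a handful of cities."""
    names = list(cities)
    start, rest = names[0], names[1:]
    best = min(((start,) + perm for perm in permutations(rest)), key=tour_length)
    return best, tour_length(best)

route, length = brute_force_tsp()
print("Best route:", " -> ".join(route), "->", route[0])
print("Tour length:", round(length, 2))
```

Real-world TSP code usually swaps the brute-force search for a heuristic (such as nearest neighbour) once the number of cities grows, which is a useful thing to look for when judging either model's answer.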

Winner: ChatGPT

Needle in a Haystack Test

This is a common test administered to AI models to check how well they can track and find a given piece of information. The idea is simple - ask the AI to find a piece of info (such as a sentence) amidst other data.

We gave both AI models the first part of Pushkin's short story The Captain's Daughter (over 3300 words) and added a simple line - "Mr. Joe's son ate brown bread" - that had nothing to do with the story. Then we asked the following:

Prompt: Go through the text and tell me which bread did Mr. Joe's son eat.

ChatGPT quickly identified the needle and gave us the correct answer without any delay.

Unfortunately, Gemini failed at the same task and couldn't find the needle in the haystack of Pushkin's story. This only goes to show that Gemini's 1.5 Pro model isn't doing a good job of taking every element into consideration, and is either mixing up bits of the data or getting overwhelmed by it.

Winner: ChatGPT

Guess the Movie Test

This is a fun test that checks the AI model's ability to read an image (often a still from a relatively popular movie) and match it to the film. Let's see how the two models do with the following image:

Both ChatGPT and Gemini got the movie right. But while ChatGPT correctly identified the subjects in the image (Colin Farrell next to a donkey), Gemini thought the donkey was Colm Doherty, which is a hilarious take on the movie (for those who know the plot).

Winner: ChatGPT

Winner

With 6 wins and 2 ties against Gemini's 3 wins, ChatGPT's 4o model is the clear winner. But Gemini's 1.5 Pro put up quite a fight. It shone where we least expected it to - in summarization, image generation, and the 'end with a word' test - while tying with ChatGPT 4o on the common sense and math tests.

However, where it matters most - coding, translation, creativity, tracking information, and even image analysis - ChatGPT's 4o beats Gemini 1.5 Pro to the punch. ChatGPT 4o is also more reliable and consistent with its answers, making it a trustworthy AI partner. Gemini blows hot and cold for the most part; if you try enough times, prompt more carefully, and regenerate until you get what you're after, you can get it to work too. But if you're anything like us, you'll want to go with the winner.