Google AI's new Gemini 2.0 model supports multimodality, meaning we can build both text and image-based AI applications. Moving beyond text-only intelligence is a big step forward for AI, and is particularly exciting for AI engineers looking to build more feature-rich software.
In this article, we'll explore some of Gemini's multimodal capabilities by building an AI agent that can describe underwater scenes and identify fish and coral species.
Setup Instructions
The code in this article has been tested both locally with Python 3.12.7 and in Google Colab with Python 3.10.12.
To run locally, please refer to the setup instructions with uv here. To run in Google Colab, simply run all cells in the provided notebook.
Loading Images
We're going to test Gemini against a few underwater images. The content of these images is fairly challenging: they're from a relatively uncommon environment, and the image quality is okay but certainly not great. That makes them a good test of how Gemini might perform on real-world data.
To begin, we will load our images from the ./images directory.
import os
from pathlib import Path
import requests

# check if the images directory exists
if not os.path.exists("./images"):
    os.mkdir("./images")

png_paths = [str(x) for x in Path("./images").glob("*.png")]

# check if we have expected images, otherwise download
if len(png_paths) >= 4:
    print("Images already downloaded")
else:
    print("Downloading images...")
    # download images from the web
    files = ["clown-fish.png", "dotted-fish.png", "many-fish.png", "fish-home.png"]
    for file in files:
        url = f"https://github.com/aurelio-labs/cookbook/blob/main/gen-ai/google-ai/gemini-2/images/{file}?raw=true"
        response = requests.get(url, stream=True)
        # write the image to disk in 1 KB blocks
        with open(f"./images/{file}", "wb") as f:
            for block in response.iter_content(1024):
                if not block:
                    break
                f.write(block)
    png_paths = [str(x) for x in Path("./images").glob("*.png")]

print(png_paths)
Images already downloaded
['images/clown-fish.png', 'images/dotted-fish.png', 'images/many-fish.png', 'images/fish-home.png']
Let's see each of these images:
import matplotlib.pyplot as plt
from PIL import Image

# we use matplotlib to arrange the images in a grid
fig, axs = plt.subplots(2, 2, figsize=(14, 8))
for ax, path in zip(axs.flat, png_paths):
    img = Image.open(path)
    ax.imshow(img)
    ax.axis('off')
    ax.set_title(path)
plt.tight_layout()
We'll use Gemini to describe these images, detect the various fish and corals, and see how precisely Gemini can identify the various objects.
Describing Images
Let's start simple by asking Gemini to describe what it finds in each image.
from io import BytesIO

with BytesIO(open(png_paths[0], "rb").read()) as img_bytes:
    original = Image.open(img_bytes)
    # note: resizing is optional, but it helps with performance
    # scale to 1024px wide, using this image's own size to keep its aspect ratio
    image = original.resize(
        (1024, int(1024 * original.size[1] / original.size[0])),
        Image.Resampling.LANCZOS
    )
image
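The resize above keeps the aspect ratio by scaling the height by the same factor as the width. As a minimal sketch of that arithmetic on its own (the helper name scaled_size is hypothetical and not part of the original notebook), it could be written as a small pure function:

```python
def scaled_size(width: int, height: int, target_width: int = 1024) -> tuple[int, int]:
    """Return (target_width, new_height), preserving the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # scale the height by the same factor applied to the width
    return target_width, int(target_width * height / width)


# a 2048x1536 image scaled down to 1024px wide
print(scaled_size(2048, 1536))
```

The returned tuple can be passed directly as the first argument to Image.resize.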
We set up our config. Within it we need:
- The system_instruction telling the LLM what to do with the image; here, to describe it and identify the fish and coral species it contains.
- Our safety_settings, which we will keep relatively loose to avoid overly sensitive guardrails against our inputs.
- The temperature, which we set for more/less creative output.
from google.genai import types

system_instruction = (
    "Describe what you see in this image, identify any fish or coral species "
    "in the image and tell us how many of each you can see."
)

safety_settings = [
    types.SafetySetting(
        category="HARM_CATEGORY_DANGEROUS_CONTENT",
        threshold="BLOCK_ONLY_HIGH",
    ),
]

config = types.GenerateContentConfig(
    system_instruction=system_instruction,
    temperature=0.1,
    safety_settings=safety_settings,
)
Before generating anything we need to initialize our client, and for this we will need a Google API key. To get a key, you can set up an account in Google AI Studio.
After you have your account and API key, we initialize our google.genai client:
import os
from getpass import getpass
from google import genai
# pass your API key here
os.environ["GOOGLE_API_KEY"] = os.getenv("GOOGLE_API_KEY") or getpass(
    "Enter Google API Key: "
)
# initialize our client
client = genai.Client()
Now let's see what we get.
from IPython.display import Markdown
model_id = "gemini-2.0-flash-exp"
# run our query against the clownfish image
response = client.models.generate_content(
    model=model_id,
    contents=[
        "Tell me what is here",
        image
    ],
    config=config
)
# Check output
response.text
Certainly!
**Overall Scene:**
The image shows an underwater scene, likely a coral reef. The water is clear enough to
see the various marine life and coral formations. The lighting suggests it's daytime,
with natural light filtering through the water.
**Fish Species:**
1. **Clownfish (Amphiprion sp.):** There are two clownfish visible in the image. They
are characterized by their orange bodies with white stripes and black markings. They
are nestled within the anemone. Based on the black markings, these are likely Clark's
Clownfish (Amphiprion clarkii).
2. **Wrasse:** There is a small, slender fish with a blue stripe along its body, which
is likely a wrasse. It is swimming in the background.
**Coral Species:**
1. **Anemone:** The large, tentacled structures in the foreground are anemones. These
are not corals but are often found in coral reef environments. The clownfish are living
within the anemone.
2. **Hard Coral:** There are various types of hard corals visible in the background.
These include branching corals, plate corals, and some massive corals. The specific
species are difficult to identify without a closer view, but they contribute to the
overall structure of the reef.
**Counts:**
* **Clownfish:** 2
* **Wrasse:** 1
* **Anemone:** 1 (large cluster)
* **Hard Coral:** Multiple, various types
If you have any other questions or images you'd like me to analyze, feel free to ask!
That looks pretty good. Let's make this more interesting by asking Gemini to draw bounding boxes around the fish in the image. We will need to modify the system_instruction to explain how Gemini should do this.
system_instruction = (
    "Return bounding boxes as a JSON array with labels. Never "
    "return masks or code fencing. Limit to 25 objects. "
    "If an object is present multiple times, label them according "
    "to their scientific and popular name."
)  # modifying this prompt much seems to damage performance

config = types.GenerateContentConfig(
    system_instruction=system_instruction,
    temperature=0.1,
    safety_settings=safety_settings,
)
If we generate now, we will receive a string of JSON objects containing everything we need to programmatically plot the bounding boxes.
response = client.models.generate_content(
    model=model_id,
    contents=[
        "Highlight the different fish in the image",
        image
    ],
    config=config
)
Markdown(response.text)
[
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
{"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
{"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
{"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
{"box_2d": [519, 549, 68
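The output as printed above stops mid-object, so passing text like this to json.loads as-is would raise an error. As a minimal sketch (not part of the original notebook), the complete objects in a cut-off array could still be recovered with the standard library's json.JSONDecoder.raw_decode. The helper name salvage_objects is hypothetical, and the sample string is a shortened stand-in for the real response.

```python
import json


def salvage_objects(text: str) -> list[dict]:
    """Collect every complete top-level JSON object found in `text`."""
    decoder = json.JSONDecoder()
    objects = []
    idx = text.find("{")
    while idx != -1:
        try:
            # raw_decode parses one JSON value starting at idx and
            # returns it with the index where parsing stopped
            obj, end = decoder.raw_decode(text, idx)
        except json.JSONDecodeError:
            # the remaining text is an incomplete object, stop here
            break
        objects.append(obj)
        idx = text.find("{", end)
    return objects


truncated = (
    '[\n'
    ' {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},\n'
    ' {"box_2d": [519, 549, 68'
)
print(salvage_objects(truncated))
```

This only works around the symptom; reducing the repetition itself is still the better fix.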
Okay, we got a lot of repetition. We can fix that by increasing the frequency_penalty in our config.
system_instruction = (
    "Return bounding boxes as a JSON array with labels. Never "
    "return masks or code fencing. Limit to 25 objects. "
    "If an object is present multiple times, label them according "
    "to their scientific and popular name."
)  # modifying this prompt much seems to damage performance

config = types.GenerateContentConfig(
    system_instruction=system_instruction,
    temperature=0.05,
    safety_settings=safety_settings,
    frequency_penalty=1.0,  # reduce repetition
)

response = client.models.generate_content(
    model=model_id,
    contents=[
        "Highlight the different fish in the image",
        image
    ],
    config=config
)
response.text # let's see the labels
[
{"box_2d": [104, 458, 139, 486], "label": "fish"},
{"box_2d": [279, 41, 318, 76], "label": "fish"},
{"box_2d": [439, 87, 465, 130], "label": "fish"},
{"box_2d": [106, 398, 150, 437], "label": "fish"},
{"box_2d": [279, 159, 320, 187], "label": "fish"},
{"box_2d": [518, 549, 679, 703], "label": "Amphiprion clarkii, Clark's anemonefish"},
{"box_2d": [497, 418, 631, 468], "label": "Amphiprion ocellaris, Ocellaris clownfish"},
{"box_2d": [106, 437, 135, 458], "label": "fish"},
{"box_2d": [106, 437, 135, 458], "label": "fish"},
{"box_2d": [279, 41, 318, 76], "label": "fish"},
{"box_2d": [439, 87, 465, 130], "label": "fish"}
]
Interesting, let's try plotting this and seeing what we get. The first thing we need to do is extract the JSON from our response. We do this with a regex that matches the Markdown json code fence around the JSON array.
import re
import json
json_pattern = re.compile(r'```json\n(.*?)```', re.DOTALL)
json_output = json_pattern.search(response.text).group(1)
# convert our json string to a list of dicts
bounding_boxes = json.loads(json_output)
bounding_boxes
[
{'box_2d': [104, 458, 139, 486], 'label': 'fish'},
{'box_2d': [279, 41, 318, 76], 'label': 'fish'},
{'box_2d': [439, 87, 465, 130], 'label': 'fish'},
{'box_2d': [106, 398, 150, 437], 'label': 'fish'},
{'box_2d': [279, 159, 320, 187], 'label': 'fish'},
{'box_2d': [518, 549, 679, 703],
'label': "Amphiprion clarkii, Clark's anemonefish"},
{'box_2d': [497, 418, 631, 468],
'label': 'Amphiprion ocellaris, Ocellaris clownfish'},
{'box_2d': [106, 437, 135, 458], 'label': 'fish'},
{'box_2d': [106, 437, 135, 458], 'label': 'fish'},
{'box_2d': [279, 41, 318, 76], 'label': 'fish'},
{'box_2d': [439, 87, 465, 130], 'label': 'fish'}
]
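Even with the frequency penalty, the parsed list above still contains a few exact repeats (the same box_2d and label appearing more than once). As a minimal sketch, and not something the original notebook does, these could be dropped before plotting. The helper name dedupe_boxes is hypothetical, and the sample list is a subset of the output above.

```python
def dedupe_boxes(boxes: list[dict]) -> list[dict]:
    """Drop exact duplicate boxes, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for box in boxes:
        # lists are unhashable, so use a tuple of the coords plus the label
        key = (tuple(box["box_2d"]), box.get("label"))
        if key not in seen:
            seen.add(key)
            unique.append(box)
    return unique


sample = [
    {"box_2d": [279, 41, 318, 76], "label": "fish"},
    {"box_2d": [106, 437, 135, 458], "label": "fish"},
    {"box_2d": [106, 437, 135, 458], "label": "fish"},
    {"box_2d": [279, 41, 318, 76], "label": "fish"},
]
print(dedupe_boxes(sample))
```

This only removes exact matches; boxes that overlap heavily but differ by a few units would need an overlap-based check instead.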
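We'll wrap this in a parse_json function to make it easier to use.
def parse_json(llm_output: str) -> list[dict]:
    # use the fenced JSON if present, otherwise assume the whole output is JSON
    match = json_pattern.search(llm_output)
    json_output = match.group(1) if match else llm_output
    return json.loads(json_output)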
Finally, we create a plot_bounding_boxes function to plot the bounding boxes.
from PIL import ImageDraw, ImageColor

colors = [colorname for (colorname, colorcode) in ImageColor.colormap.items()]

def plot_bounding_boxes(image: Image.Image, llm_output: str) -> Image.Image:
    # avoid modifying the original image
    img = image.copy()
    # we need the image size to convert normalized coords to absolute below
    width, height = img.size
    # init drawing object
    draw = ImageDraw.Draw(img)
    # parse out the bounding boxes JSON from markdown
    bounding_boxes = parse_json(llm_output=llm_output)

    # iterate over LLM defined bounding boxes
    for i, bounding_box in enumerate(bounding_boxes):
        # set diff color for each box
        color = colors[i % len(colors)]

        # from normalized (0-1000, ordered y1, x1, y2, x2) to absolute coords
        abs_y1 = int(bounding_box["box_2d"][0]/1000 * height)
        abs_x1 = int(bounding_box["box_2d"][1]/1000 * width)
        abs_y2 = int(bounding_box["box_2d"][2]/1000 * height)
        abs_x2 = int(bounding_box["box_2d"][3]/1000 * width)

        # coords might be going right to left, swap if so
        if abs_x1 > abs_x2:
            abs_x1, abs_x2 = abs_x2, abs_x1
        if abs_y1 > abs_y2:
            abs_y1, abs_y2 = abs_y2, abs_y1

        # draw the bounding boxes on our Draw object
        draw.rectangle(
            ((abs_x1, abs_y1), (abs_x2, abs_y2)), outline=color, width=2
        )
        # draw text labels
        if "label" in bounding_box:
            draw.text((abs_x1 + 2, abs_y1 - 14), bounding_box["label"], fill=color)
    return img

plot_bounding_boxes(image, response.text)
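The key step in the plotting function is the coordinate conversion: as the code above treats them, box_2d values are ordered [y1, x1, y2, x2] and normalized to a 0-1000 range. A minimal standalone sketch of that conversion follows, with a hypothetical helper name (to_pixel_box) and an assumed 1024x768 image size for illustration.

```python
def to_pixel_box(box_2d: list[int], width: int, height: int) -> tuple[int, int, int, int]:
    """Convert a normalized [y1, x1, y2, x2] box to pixel (left, top, right, bottom)."""
    y1, x1, y2, x2 = box_2d
    # x values scale with the image width, y values with the height
    left, right = sorted((int(x1 / 1000 * width), int(x2 / 1000 * width)))
    top, bottom = sorted((int(y1 / 1000 * height), int(y2 / 1000 * height)))
    return left, top, right, bottom


# the Clark's anemonefish box from the output above, on an assumed 1024x768 image
print(to_pixel_box([518, 549, 679, 703], width=1024, height=768))
```

Sorting each pair covers the case where the model returns a box with its corners in reverse order, which is what the swap in plot_bounding_boxes handles.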
Gemini is doing well, but struggling to precisely label the various types of fish. Let's try asking Gemini for specific types of fish and corals in the image.
response = client.models.generate_content(
    model=model_id,
    contents=[
        "Highlight the different corals in the image",
        image
    ],
    config=config
)
response.text
[
{"box_2d": [189, 172, 305, 316], "label": "Acropora coral"},
{"box_2d": [409, 398, 625, 470], "label": "Heteractis magnifica"},
{"box_2d": [189, 307, 345, 468], "label": "Acropora coral"},
{"box_2d": [175, 460, 319, 589], "label": "Acropora coral"},
{"box_2d": [305, 468, 479, 611], "label": "Acropora coral"},
{"box_2d": [305, 608, 479, 748], "label": "Acropora coral"},
{"box_2d": [163, 590, 305, 719], "label": "Acropora coral"},
{"box_2d": [468, 549, 687, 705], "label": "Heteractis magnifica"},
{"box_2d": [468, 713, 609, 845], "label": "Acropora coral"},
{"box_2d": [163, 709, 305, 842], "label": "Acropora coral"},
{"box_2d": [163, 840, 305, 972], "label": "Acropora coral"},
{"box_2d": [305, 741, 479, 884], "label": "Acropora coral"},
{"box_2d": [163, 939, 305, 1000], "label": "Acropora coral"},
{"box_2d": [305, 879, 479, 1000], "label": "Acropora coral"},
{"box_2d": [468, 839, 609, 972], "label": "Acropora coral"},
{"box_2d": [468, 937, 609, 1000], "label": "Acropora coral"},
{"box_2d": [609, 175, 743, 328], "label": "Acropora coral"},
{"box_2d": [609, 318, 743, 471], "label": "Acropora coral"},
{"box_2d": [609, 458, 743, 611], "label": "Acropora coral"},
{"box_2d": [508, 590, 687, 705], "label":"Heteractis magnifica"}
]
Let's view the image and bounding boxes:
plot_bounding_boxes(image, response.text)
Most of these labels are incorrect, but interestingly there are two Heteractis magnifica (i.e. magnificent sea anemone) correctly identified. Both of these bounding boxes surround both the clownfish and the anemone itself. Given that clownfish tend to live amongst anemones, it is likely that Gemini knows that an anemone appearing with a clownfish is likely to be a Heteractis magnifica, which makes the labelling task much easier.
Let's try asking Gemini to label the clownfish in the image.
response = client.models.generate_content(
    model=model_id,
    contents=[
        "Highlight the different clownfish in the image",
        image
    ],
    config=config
)
plot_bounding_boxes(image, response.text)
We can see if Gemini can identify the cleaner wrasse in the image.
response = client.models.generate_content(
    model=model_id,
    contents=[
        "Highlight any cleaner wrasse in this image",
        image
    ],
    config=config
)
plot_bounding_boxes(image, response.text)
Surprisingly, Gemini accurately labels the wrasse (Labroides dimidiatus) to the left despite the limited resolution of the image. Let's try some more images:
with BytesIO(open(png_paths[1], "rb").read()) as img_bytes:
    original = Image.open(img_bytes)
    # note: resizing is optional, but it helps with performance
    # use this image's own size to keep its aspect ratio
    image = original.resize(
        (1024, int(1024 * original.size[1] / original.size[0])),
        Image.Resampling.LANCZOS
    )

response = client.models.generate_content(
    model=model_id,
    contents=[
        "Highlight any fish in this image",
        image
    ],
    config=config
)
plot_bounding_boxes(image, response.text)
Here Gemini manages to identify the large fish in the middle of the image as a Diagramma pictum, Sweetlips. Sweetlips is correct for the genus, but Diagramma pictum is incorrect. Nonetheless, this is very close and a great start. Gemini also highlights several other fish in the background.
response = client.models.generate_content(
    model=model_id,
    contents=[
        "What is the big fish in the middle of the image? Please highlight it.",
        image
    ],
    config=config
)
plot_bounding_boxes(image, response.text)
By specifying that we want to focus on the central fish, Gemini does so and labels it as a sweetlips. This time, however, Gemini does not highlight the other fish in the background.
Let's try another image:
with BytesIO(open(png_paths[2], "rb").read()) as img_bytes:
    original = Image.open(img_bytes)
    # note: resizing is optional, but it helps with performance
    # use this image's own size to keep its aspect ratio
    image = original.resize(
        (1024, int(1024 * original.size[1] / original.size[0])),
        Image.Resampling.LANCZOS
    )

response = client.models.generate_content(
    model=model_id,
    contents=[
        "Highlight any fish in this image",
        image
    ],
    config=config
)
plot_bounding_boxes(image, response.text)
This is interesting. There are many fish in the image and Gemini catches the majority of them. However, Gemini doesn't label them with any level of precision. Nonetheless, Gemini did label the two Naso lituratus (i.e. unicornfish).
Let's try asking Gemini to label the corals in the image.
with BytesIO(open(png_paths[2], "rb").read()) as img_bytes:
    original = Image.open(img_bytes)
    # note: resizing is optional, but it helps with performance
    # use this image's own size to keep its aspect ratio
    image = original.resize(
        (1024, int(1024 * original.size[1] / original.size[0])),
        Image.Resampling.LANCZOS
    )

response = client.models.generate_content(
    model=model_id,
    contents=[
        "Highlight the corals in this image",
        image
    ],
    config=config
)
plot_bounding_boxes(image, response.text)
It's hard to read the labels from the image, so we can print them directly as before:
response.text
[
{"box_2d": [158, 530, 476, 709], "label": "Brain Coral (Diploria labyrinthiformis)"},
{"box_2d": [469, 679, 628, 815], "label": "Brain Coral (Diploria labyrinthiformis)"},
{"box_2d": [604, 397, 715, 493], "label": "Staghorn Coral (Acropora cervicornis)"},
{"box_2d": [680, 453, 810, 574], "label": "Staghorn Coral (Acropora cervicornis)"},
{"box_2d": [690, 574, 839, 682], "label": "Staghorn Coral (Acropora cervicornis)"},
{"box_2d": [170, 690, 354, 815], "label": "Brain Coral (Diploria labyrinthiformis)"},
{"box_2d": [369, 470, 502, 563], "label": "Brain Coral (Diploria labyrinthiformis)"},
{"box_2d": [481, 809, 677, 994], "label": "Brain Coral (Diploria labyrinthiformis)"},
{"box_2d": [315, 235, 469, 378], "label": "Brain Coral (Diploria labyrinthiformis)"},
{"box_2d": [405, 361, 528, 470], "label": "Brain Coral (Diploria labyrinthiformis)"},
{"box_2d": [675, 10, 849, 139], "label": "Staghorn Coral (Acropora cervicornis)"},
{"box_2d": [847, 105, 998, 306], "label": "Staghorn Coral (Acropora cervicornis)"},
{"box_2d": [764, 305, 918, 453], "label":"Staghorn Coral (Acropora cervicornis)"},
{"box_2d": [875, 413, 998, 560], "label":"Staghorn Coral (Acropora cervicornis)"},
{"box_2d": [195, 37, 408, 206], "label":"Brain Coral (Diploria labyrinthiformis)"},
{"box_2d": [569, 137, 748, 315], "label":"Brain Coral (Diploria labyrinthiformis)"},
{"box_2d": [690, 684, 810, 732], "label":"Brain Coral (Diploria labyrinthiformis)"},
{"box_2d": [675, 753, 748, 780], "label":"Brain Coral (Diploria labyrinthiformis)"},
{"box_2d": [195, 306, 391, 469], "label":"Brain Coral (Diploria labyrinthiformis)"},
{"box_2d": [481, 530, 654, 648], "label":"Brain Coral (Diploria labyrinthiformis)"},
{"box_2d": [748, 139, 825, 190], "label":"Brain Coral (Diploria labyrinthiformis)"},
{"box_2d": [748, 341, 796, 415], "label":"Brain Coral (Diploria labyrinthiformis)"},
{"box_2d": [476, 78, 518, 115], "label":"Brain Coral (Diploria labyrinthiformis)"},
{"box_2d": [267, 450, 343, 476], "label":"Brain Coral (Diploria labyrinthiformis)"}
]
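With this many boxes, a tally per label is easier to read than the raw list. As a minimal sketch (the helper name count_labels is hypothetical, and the sample is a small subset of the output above), the parsed boxes could be summarized with collections.Counter:

```python
from collections import Counter


def count_labels(boxes: list[dict]) -> Counter:
    """Count how many bounding boxes were returned for each label."""
    # boxes without a label are grouped under a placeholder key
    return Counter(box.get("label", "unlabelled") for box in boxes)


sample = [
    {"box_2d": [158, 530, 476, 709], "label": "Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [469, 679, 628, 815], "label": "Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [604, 397, 715, 493], "label": "Staghorn Coral (Acropora cervicornis)"},
]
for label, count in count_labels(sample).most_common():
    print(f"{count} x {label}")
```

In the notebook, the same function could be called on the list returned by parse_json(response.text).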
Here we see a lot of brain and staghorn coral labels. For the most part they seem to be labelled incorrectly, so Gemini still seems to be struggling with corals. Let's try one final image:
with BytesIO(open(png_paths[3], "rb").read()) as img_bytes:
    original = Image.open(img_bytes)
    # note: resizing is optional, but it helps with performance
    # use this image's own size to keep its aspect ratio
    image = original.resize(
        (1024, int(1024 * original.size[1] / original.size[0])),
        Image.Resampling.LANCZOS
    )

response = client.models.generate_content(
    model=model_id,
    contents=[
        "Where is the fish hiding in this image?",
        image
    ],
    config=config
)
plot_bounding_boxes(image, response.text)
Surprisingly, Gemini does a good job of highlighting a few almost hidden fish. The central fish is accurately labeled as a damselfish. Finally, let's ask Gemini to tell us what we're looking at with this final image.
system_instruction = (
    "Describe what you see in this image, identify any fish or coral species "
    "in the image and tell us how many of each you can see."
)

config = types.GenerateContentConfig(
    system_instruction=system_instruction,
    temperature=0.1,
    safety_settings=safety_settings,
)

response = client.models.generate_content(
    model=model_id,
    contents=[
        "Explain what this image contains, what is happening, and what is the location?",
        image
    ],
    config=config
)
Markdown(response.text)
Certainly!
The image shows an underwater scene featuring a large, cylindrical object with a hole in the center. The object appears to be made of metal and is covered in marine growth, giving it a textured, orange-brown appearance. There are several small fish swimming around the object and in the surrounding water. The water is a clear, turquoise color.
Based on the appearance of the object and the surrounding environment, it is likely that this is a part of a shipwreck. The cylindrical object could be a gun barrel or some other structural component of the ship. The location is underwater, likely in a tropical or subtropical region given the clear water and the presence of marine life.
I can see 1 fish inside the hole and many more swimming around the object. I cannot identify the species of fish or coral in the image.
Gemini does a good job of describing the scene. Let's challenge Gemini to tell us the exact location of the shipwreck.
response = client.models.generate_content(
    model=model_id,
    contents=[
        "What is your best guess as to the exact location of this shipwreck?",
        image
    ],
    config=config
)
Markdown(response.text)
Certainly!
In the image, I see a section of a shipwreck underwater. The main focus is a large,
circular opening, possibly a gun port or a pipe, that is heavily encrusted with marine
growth, giving it a rough, orange-brown texture. The surrounding structure appears to
be part of the ship's hull or deck, with visible metal beams and panels. The water is a
clear, turquoise color, and there are numerous small fish swimming around the structure.
I can identify the following:
* **Fish:** There are many small, silvery fish, possibly baitfish, and a few larger,
darker fish. I can count at least 30 small fish and 3 larger fish.
* **Coral:** I do not see any coral in this image. The orange-brown growth on the
shipwreck appears to be encrusting organisms like sponges or algae, not coral.
I cannot determine the exact location of the shipwreck from the image alone. I do not
have access to external databases or the ability to make inferences about the location
based on the visual information.
Unsurprisingly, Gemini does not manage to identify the exact location of the shipwreck. We'll keep this specific image and question as a test for Gemini's future iterations.
That's it for our intro to multimodal text and image generation with Gemini 2.0 Flash. The model is already impressive, but there are certainly places where Gemini can improve. Nonetheless, the multimodal capabilities are more than enough for us to build some strong multimodal AI applications.