Turning Digital Images into Words

Artificial intelligence has come a long way. Computers can now perform many tasks that once required a human being, from automating routine work to logical reasoning.

But we all know that. So, maybe you’re thinking, what’s new in this article?

Well, Google has developed a new machine learning system that can automatically and accurately write captions for photos. In case you don’t know, captions are short descriptions of the contents of a picture.

This is a big leap in the field of artificial intelligence and neural networks. The innovation could make it easier to search for images on Google, help visually impaired people understand image content, and provide alternative text for images when Internet connections are slow.

The researchers behind this system are Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan. They call their captioning system Neural Image Caption (NIC).

NIC combines techniques from two fields: computer vision, which allows machines to see the world, and natural language processing, which tries to make human language meaningful to computers. The researchers' goal was to train the system to produce natural-sounding captions based on the objects it recognizes in the images.

People can describe a digital image at a single glance, but can a computer? Recent research in the field has greatly improved object detection, classification, and labeling. But accurately describing a complex scene requires a deeper representation of what’s going on in the scene: capturing how the various objects relate to one another and translating it all into natural language.

A test was performed to check whether a machine can do these things and describe a digital image the way a human being would.

Let’s take a digital image like this one.

Human caption: "A group of men playing frisbee in the park."

Computer caption: "A group of young people playing a game of frisbee."

Amazing! A computer describing a digital image this well was unheard of until now. Take a moment and think about it as a computer science student. How is the program written? What parameters does it use? How does the program know whether a human being is present in the digital image? These questions just excite me.

Now, let’s take another digital image.

Human caption: "Three different types of pizza on top of a stove."

Computer caption: "Two pizzas sitting on top of a stove top oven."

This one is a little confusing, even to the human eye. But after careful observation, we can see that there are definitely three pizzas and not just two. This is where the computer has made a slight error in examining the digital image.

This is what I’m talking about. How did the computer accurately describe the first digital image yet fail to correctly describe the second one?
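One rough way to put a number on how close a computer caption is to a human one is to count how many of its words also appear in the human caption. The Python sketch below does that for the two caption pairs quoted above. The helper names `tokenize` and `unigram_precision` are made up for this illustration, and this simple word-overlap score is only an assumption about how one might compare captions, not a description of how the NIC researchers evaluated their system.

```python
import re
from collections import Counter


def tokenize(caption: str) -> list[str]:
    """Lowercase a caption and split it into plain word tokens."""
    return re.findall(r"[a-z0-9]+", caption.lower())


def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also occur in the reference.

    Each reference word can only be matched as many times as it
    appears there, so repeating a word does not inflate the score.
    """
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(n, ref_counts[word]) for word, n in cand_counts.items())
    return matched / total


# The caption pairs quoted in this article.
frisbee_human = "A group of men playing frisbee in the park."
frisbee_machine = "A group of young people playing a game of frisbee."
pizza_human = "Three different types of pizza on top of a stove."
pizza_machine = "Two pizzas sitting on top of a stove top oven."

frisbee_score = unigram_precision(frisbee_machine, frisbee_human)
pizza_score = unigram_precision(pizza_machine, pizza_human)

print(f"frisbee: {frisbee_score:.2f}")
print(f"pizza:   {pizza_score:.2f}")
```

Both pairs come out at 0.50: half the words in each computer caption also appear in the matching human caption. So a plain word count cannot tell the good frisbee caption from the pizza caption that miscounts, which hints at why judging captions is itself a hard problem.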

If you’re thinking that this line of research is fairly new, it turns out to be decades old. In 1966, some MIT undergrads spent the summer linking a camera to a computer and trying to get the computer to describe what it saw. That was the start of the research.

Captioning software could be very useful to us in the future. It may lead to other research and inventions that we haven’t even imagined yet. Let’s hope that this research will lead to a program that can describe a video. Wouldn’t that be cool?