An Improved Image Captioning Using Emotions
Nabagata Saha, Y. V. Akhila and P. Radha Krishna
Received in final form on March 25, 2021
Abstract
Image captioning remains a challenging task: generating captions that
closely resemble how humans would caption a particular image. The state of
the art lies in factual captioning, which describes a given image in terms of
the inanimate objects it contains. However, captioning images of humans
using their facial expressions remains a largely unexplored area. This paper
proposes a novel method that realizes this task. The emotion recognized on
the human subject present in the image is concatenated with the image
features and fed to an image captioning model, so that the generated caption
is more relevant and human-like. A deep learning model recognizes the
emotion, and an encoder-decoder network captions the image. A multilevel
VGG19 network is used for Facial Emotion Recognition to extract facial
features, and Inception V3 (the encoder) is used to extract the visual
features. These features are fed to an attention-tuned Gated Recurrent Unit
(the decoder) to produce the caption word by word. The presented approach
provides more realistic image captions, which can in turn be used to
generate natural-sounding video summaries.
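To make the described pipeline concrete, the following is a minimal sketch of the feature-concatenation and attention-GRU decoding steps in TensorFlow/Keras. The layer sizes, the names fuse_emotion and AttentionGRUDecoder, and the Bahdanau-style form of the attention are illustrative assumptions, not the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

# All sizes below are illustrative assumptions; the abstract does not give them.
VOCAB_SIZE = 8000   # caption vocabulary size (assumed)
EMBED_DIM = 256     # word-embedding size (assumed)
UNITS = 512         # GRU / attention hidden size (assumed)

def fuse_emotion(visual_feats, emotion_vec):
    """Tile the FER emotion vector over every spatial region of the
    Inception V3 feature map and concatenate it feature-wise, mirroring
    the feature-concatenation step described in the abstract."""
    num_regions = tf.shape(visual_feats)[1]
    tiled = tf.tile(tf.expand_dims(emotion_vec, 1), [1, num_regions, 1])
    return tf.concat([visual_feats, tiled], axis=-1)

class AttentionGRUDecoder(tf.keras.Model):
    """Word-by-word caption decoder: attention over the fused
    (visual + emotion) features feeding a single GRU step."""
    def __init__(self):
        super().__init__()
        self.embedding = layers.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.gru = layers.GRU(UNITS, return_sequences=True, return_state=True)
        self.fc = layers.Dense(VOCAB_SIZE)
        self.W1 = layers.Dense(UNITS)
        self.W2 = layers.Dense(UNITS)
        self.V = layers.Dense(1)

    def call(self, word_id, features, hidden):
        # features: (batch, regions, feat_dim); hidden: (batch, UNITS)
        score = self.V(tf.nn.tanh(self.W1(features) +
                                  self.W2(tf.expand_dims(hidden, 1))))
        weights = tf.nn.softmax(score, axis=1)                # attention over regions
        context = tf.reduce_sum(weights * features, axis=1)   # (batch, feat_dim)
        x = self.embedding(word_id)                           # (batch, 1, EMBED_DIM)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x, initial_state=hidden)
        return self.fc(tf.squeeze(output, 1)), state          # next-word logits
```

In use, one would obtain an emotion probability vector from the VGG19-based FER branch and grid features from Inception V3, fuse them with fuse_emotion, and loop the decoder one time step at a time starting from a start token, feeding each predicted word back in.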
Keywords
Emotion Discovery, Natural Language Processing, Computer Vision, Facial Expression Recognition, Image Captioning, Encoder-Decoder Network, LSTM Model, VGG Net Model, Feature Concatenation.
Cite This Article
Nabagata Saha, Y. V. Akhila, and P. Radha Krishna, An Improved Image Captioning Using Emotions, J. Innovation Sciences and Sustainable Technologies, 1(2)(2021), 91-118. https://doie.org/10.0608/JISST.2022944590