In this work, we address the Visual Relationship Detection Track competition launched by Kaggle. The aim of the competition is to test whether computers can detect the relationships between objects present in images. Not only is this an active, state-of-the-art research area, but it is also a very challenging task compared with existing computer vision tasks. It combines two prominent tasks: object detection and image caption generation. Although deep learning models produce highly accurate results on each of these tasks individually, they still struggle to reach acceptable accuracy on the visual relationship detection task. In this paper, we take a different approach to the problem and compare our results with the state-of-the-art baseline result. We explore two attention-based caption generation models and modify them to solve the visual relationship detection task.
Paper
Visualization of model weights in TensorBoard
Model
Comments