Image Caption Generation Using CNN and LSTM

International Journal of Innovative Research in Computer and Communication Engineering

ISSN Approved Journal | Impact factor: 8.771 | ESTD: 2013 | Follows UGC CARE Journal Norms and Guidelines

| Monthly, Peer-Reviewed, Refereed, Scholarly, Multidisciplinary and Open Access Journal | High Impact Factor 8.771 (Calculated by Google Scholar and Semantic Scholar) | AI-Powered Research Tool | Indexing in all Major Databases & Metadata, Citation Generator | Digital Object Identifier (DOI) |

TITLE	Image Caption Generation Using CNN and LSTM
ABSTRACT	Image captioning is increasingly important in computing because machines need to interpret pictures much as humans do. Earlier, captions were written by hand, which was slow and impractical for collections of hundreds of thousands of images. Older automated methods relied on simple templates and hand-crafted rules, which produced basic, repetitive sentences. This project addresses automatic caption generation, in which a machine examines an image and writes an appropriate sentence describing it. The method uses a Convolutional Neural Network (CNN) to extract image features such as objects, colors, and shapes. A Long Short-Term Memory (LSTM) network then generates the caption one word at a time. A pre-trained vision-language model from Hugging Face is also integrated to make the captions more natural and accurate. A web application built with the Flask framework lets a user upload an image and receive a caption immediately. The system was tested on different types of images, and the generated captions matched the actual content of the images. The system responds quickly, and its interface is easy to use.
AUTHOR	DEEKSHITH K N (MCA Student, Dept. of MCA, Jawaharlal Nehru New College of Engineering, Shivamogga, Karnataka, India); DR. RAGHAVENDRA S P (Assistant Professor, Dept. of MCA, Jawaharlal Nehru New College of Engineering, Shivamogga, Karnataka, India)
VOLUME	184
DOI	10.15680/IJIRCCE.2026.1405107
PDF	pdf/107_Image Caption Generation Using CNN and LSTM.pdf
KEYWORDS
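The abstract describes the LSTM as generating words one by one until the sentence is complete. A minimal sketch of that decoding loop is shown below, assuming greedy selection (always taking the highest-scoring next word), which the abstract does not specify. The names `greedy_caption`, `toy_scores`, and `TOY_TABLE` are hypothetical, and the lookup table is only a stand-in for a trained CNN and LSTM; it is not the authors' model.

```python
from typing import Callable, Dict, List

START, END = "<start>", "<end>"


def greedy_caption(
    next_word_scores: Callable[[List[float], List[str]], Dict[str, float]],
    image_features: List[float],
    max_len: int = 20,
) -> str:
    """Build a caption one word at a time, always taking the top-scoring word."""
    words = [START]
    for _ in range(max_len):
        # In a real system this call would run the LSTM on the CNN features
        # and the words generated so far.
        scores = next_word_scores(image_features, words)
        best = max(scores, key=scores.get)
        if best == END:
            break
        words.append(best)
    # Drop the start token before joining the words into a sentence.
    return " ".join(words[1:])


# Hypothetical stand-in for a trained decoder: scores depend only on the
# previous word. A real decoder would also use the image features.
TOY_TABLE: Dict[str, Dict[str, float]] = {
    "<start>": {"a": 0.9, "dog": 0.1},
    "a": {"dog": 0.8, "<end>": 0.2},
    "dog": {"runs": 0.7, "<end>": 0.3},
    "runs": {"<end>": 1.0},
}


def toy_scores(image_features: List[float], words: List[str]) -> Dict[str, float]:
    # Assumption for illustration: the features are ignored entirely.
    return TOY_TABLE[words[-1]]


caption = greedy_caption(toy_scores, [0.0, 0.0, 0.0], max_len=10)
print(caption)  # a dog runs
```

The `max_len` cap keeps the loop from running indefinitely if the end token is never selected. Other strategies, such as beam search, could replace the `max` step without changing the rest of the loop.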

About Us

The primary objective of IJIRCCE is to serve as an international scholarly platform that enables researchers, innovators, students, and research scholars to disseminate their research findings and technological advancements to a global academic audience.
