International Journal of Innovative Research in Computer and Communication Engineering

ISSN Approved Journal | Impact factor: 8.771 | ESTD: 2013 | Follows UGC CARE Journal Norms and Guidelines



TITLE Image Caption Generation Using CNN and LSTM
ABSTRACT Image captioning is increasingly important in computing because machines need to interpret pictures much as humans do. Previously, captions were written by hand, which is slow and impractical for hundreds of thousands of images. Older automatic methods relied on simple templates and hand-crafted rules, which produced basic, repetitive sentences. This project addresses automatic caption generation, in which a machine views an image and writes an appropriate sentence describing it. A Convolutional Neural Network (CNN) extracts image features such as objects, colors, and shapes. A Long Short-Term Memory (LSTM) network then generates the caption one word at a time until the sentence is complete. A pre-trained vision-language model from Hugging Face is also integrated to make the captions more natural and accurate. The system is deployed as a Flask web application in which a user uploads an image and receives a caption immediately. Testing on different types of images showed that the generated captions match the actual image content. The system responds quickly, and the interface is easy to use.
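The word-by-word generation step described in the abstract can be illustrated with a minimal greedy decoding sketch. This is not the authors' implementation: the function names `greedy_caption` and `toy_scores`, the `<start>`/`<end>` tokens, and the lookup table are all hypothetical, and the table stands in for a trained CNN encoder and LSTM decoder so that the example runs without any model weights.

```python
from typing import Callable, Dict, List, Sequence

START, END = "<start>", "<end>"  # assumed sentence-boundary tokens


def greedy_caption(
    image_features: Sequence[float],
    next_word_scores: Callable[[Sequence[float], List[str]], Dict[str, float]],
    max_len: int = 20,
) -> str:
    """Build a caption one word at a time, always taking the top-scoring word."""
    words: List[str] = [START]
    for _ in range(max_len):
        # In a real system this call would run the LSTM decoder on the
        # CNN features plus the words generated so far.
        scores = next_word_scores(image_features, words)
        best = max(scores, key=scores.get)
        if best == END:
            break
        words.append(best)
    return " ".join(words[1:])  # drop the start token


# Hypothetical stand-in for a trained model: scores depend only on the
# previous word, not on the image.
TOY_TABLE: Dict[str, Dict[str, float]] = {
    "<start>": {"a": 0.9, "dog": 0.1},
    "a": {"dog": 0.7, "ball": 0.3},
    "dog": {"runs": 0.8, "<end>": 0.2},
    "runs": {"<end>": 0.95, "a": 0.05},
}


def toy_scores(image_features: Sequence[float], words: List[str]) -> Dict[str, float]:
    # Ignores image_features; a real decoder would condition on them.
    return TOY_TABLE[words[-1]]


print(greedy_caption([0.0, 1.0], toy_scores))  # prints "a dog runs"
```

Greedy selection is the simplest decoding choice; the abstract does not say whether the system uses it or an alternative such as beam search.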
AUTHOR DEEKSHITH K N (MCA Student, Dept. of MCA, Jawaharlal Nehru New College of Engineering, Shivamogga, Karnataka, India); DR. RAGHAVENDRA S P (Assistant Professor, Dept. of MCA, Jawaharlal Nehru New College of Engineering, Shivamogga, Karnataka, India)
VOLUME 184
DOI 10.15680/IJIRCCE.2026.1405107
PDF pdf/107_Image Caption Generation Using CNN and LSTM.pdf
KEYWORDS
Copyright © IJIRCCE 2020. All rights reserved.