International Journal of Innovative Research in Computer and Communication Engineering

ISSN Approved Journal | Impact factor: 8.771 | ESTD: 2013 | Follows UGC CARE Journal Norms and Guidelines

| Monthly, Peer-Reviewed, Refereed, Scholarly, Multidisciplinary and Open Access Journal | High Impact Factor 8.771 (Calculated by Google Scholar and Semantic Scholar) | AI-Powered Research Tool | Indexing in all Major Databases & Metadata, Citation Generator | Digital Object Identifier (DOI) |


TITLE Implementation on Multimodal Human-Computer Interaction Systems using Voice, Vision and Gesture Recognition
ABSTRACT This paper presents the design and implementation of a Multimodal Virtual Assistant System that integrates voice recognition, computer vision, and gesture recognition to enable natural human-computer interaction. Unlike traditional systems that rely on a single input method, this system combines multiple modalities to improve accuracy, usability, and accessibility. The system is implemented in Python using technologies such as speech recognition, YOLO-based object detection, MediaPipe for gesture tracking, and Tkinter for the graphical user interface. The assistant can perform tasks such as voice command execution, real-time object detection, gesture-based control, and application automation. The implementation focuses on real-time performance, a modular architecture, and efficient multimodal fusion. This project demonstrates how combining multiple AI techniques can create an intelligent and interactive assistant system.
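The abstract describes a modular architecture in which independent input modules (voice, vision, gesture) feed a common fusion layer. The paper's own code is not reproduced here; the following is a minimal illustrative sketch of one way such late fusion could be wired up in Python. All class, method, and command names (`MultimodalDispatcher`, `ModalEvent`, `open_browser`, etc.) are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class ModalEvent:
    """A normalized event emitted by one input module."""
    modality: str    # e.g. "voice", "gesture", "vision"
    command: str     # normalized command name
    confidence: float  # recognizer confidence in [0, 1]


class MultimodalDispatcher:
    """Routes events from independent modality modules to registered actions.

    Each recognizer (speech, gesture tracker, object detector) runs on its
    own and pushes ModalEvents here; the dispatcher applies a confidence
    threshold and looks up the handler for that (modality, command) pair.
    """

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.handlers: Dict[Tuple[str, str], Callable[[ModalEvent], str]] = {}

    def register(self, modality: str, command: str,
                 handler: Callable[[ModalEvent], str]) -> None:
        self.handlers[(modality, command)] = handler

    def dispatch(self, event: ModalEvent) -> str:
        # Low-confidence recognitions are dropped rather than executed.
        if event.confidence < self.threshold:
            return "ignored"
        handler = self.handlers.get((event.modality, event.command))
        return handler(event) if handler else "unknown"


dispatcher = MultimodalDispatcher()
dispatcher.register("voice", "open_browser", lambda e: "launching browser")
dispatcher.register("gesture", "volume_up", lambda e: "volume increased")

print(dispatcher.dispatch(ModalEvent("voice", "open_browser", 0.92)))
```

A design like this keeps each modality swappable: the speech module and the MediaPipe-based gesture module only need to emit the same event type, which matches the modular, fusion-oriented architecture the abstract claims.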
AUTHOR ANKITA YADAV, ANUSHRI RAUT, HRIDAY PANCHMUKH, RAJBEER SACHAR, AMRITA SHIRODE Department of Artificial Intelligence & Machine Learning, AISSMS's Polytechnic, Pune, Maharashtra, India
VOLUME 182
DOI 10.15680/IJIRCCE.2026.1403066
PDF pdf/66_Implementation on Multimodal Human-Computer Interaction Systems using Voice, Vision and Gesture Recognition.pdf
KEYWORDS
References 1. N. Mohamed, M. B. Mustafa, and N. Jomhari, “A Review of the Hand Gesture Recognition System: Current Progress and Future Directions,” IEEE Access, vol. 9, pp. 152785–152806, 2021.
2. H. M. Yishak and L. Li, “Advanced Face Detection with YOLOv8: Implementation and Integration into AI Modules,” Open Access Library Journal, vol. 11, pp. e112474, 2024. Available: https://doi.org/10.4236/oalib.
3. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788.
4. D. Bhonde, K. Mongse, L. Naikwar, N. Dwivedi, and O. Mahulkar, “Gesture and Voice-Based Personal Computer Control System,” International Journal on Advanced Electrical and Computer Engineering, vol. 14, no. 1, 2025.
5. T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal Machine Learning: A Survey and Taxonomy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, 2019.
6. S. Oviatt, “Multimodal Interfaces,” The Human–Computer Interaction Handbook, 2nd ed., CRC Press, pp. 413–432, 2012.
Copyright © IJIRCCE 2020. All rights reserved.