VoiceCraft AI: A Bilingual Speech-to-Text and Text-to-Speech Engine for English & Kannada

International Journal of Innovative Research in Computer and Communication Engineering

ISSN Approved Journal | Impact factor: 8.771 | ESTD: 2013 | Follows UGC CARE Journal Norms and Guidelines


TITLE	VoiceCraft AI: A Bilingual Speech-to-Text and Text-to-Speech Engine for English & Kannada
ABSTRACT	Building accurate bilingual speech processing systems that handle morphologically complex Indian languages alongside English remains a challenge in human-computer interaction. This paper introduces VoiceCraft AI, a bilingual Speech-to-Text (STT) and Text-to-Speech (TTS) system that combines custom deep learning architectures with real-time dynamic language routing. Unlike conventional speech applications that rely on third-party cloud APIs and struggle with regional nuances, VoiceCraft AI runs fully locally: it uses a custom Conformer-CTC model for STT and a stochastic VITS2 latent generator with a HiFi-GAN vocoder for TTS. A FastAPI asynchronous backend using native PyTorch CUDA bindings on an NVIDIA DGX hardware cluster provides context switching and low-latency inference. Memory management mechanisms (dynamic VRAM model purging, automated text chunking, and continuous audio peak normalization) improve resilience against CUDA Out-Of-Memory (OOM) crashes and waveform distortion. Experimental evaluation reports a Word Error Rate (WER) of 5.2% for English and 10.2% for Kannada STT, and a TTS Mean Opinion Score (MOS) of 4.3, positioning VoiceCraft AI as a bilingual voice processing platform for localized institutional hardware.
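The abstract names automated text chunking, audio peak normalization, and WER without giving details. The following is a minimal Python sketch of what such steps might look like, not the authors' implementation. The function names (`chunk_text`, `peak_normalize`, `word_error_rate`), the 200-character chunk limit, the 0.95 target peak, and the sentence-boundary rule are all assumptions made for illustration. VRAM purging is left out because it depends on the GPU runtime.

```python
import re
from typing import List, Sequence


def chunk_text(text: str, max_chars: int = 200) -> List[str]:
    """Group sentences into chunks no longer than max_chars (hypothetical limit).

    A single sentence longer than max_chars is kept whole, not cut mid-word.
    """
    # Assumed boundary rule: split on whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the full chunk
            current = sentence       # start a new one
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def peak_normalize(samples: Sequence[float], target_peak: float = 0.95) -> List[float]:
    """Scale a waveform so its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row dynamic programme for word-level edit distance.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)
```

Chunking bounds the input length seen by the synthesizer in any one pass, which is one plausible way to limit peak memory use. Normalizing after generation keeps chunks produced separately at a consistent level and below the clipping threshold. Under the WER definition above, the reported 5.2% corresponds to roughly 5 word-level errors per 100 reference words.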
AUTHOR	SUDEEP SAGAR, DR. LATHA B.M, MANJULA P, SPANDANA M.K, VASHISTHA C.V, RITHIN P. VALI. Affiliations: UG Students, Department of Computer Science and Engineering, Jain Institute of Technology, Davangere, Karnataka, India; Head of Department, Department of Computer Science and Engineering, Jain Institute of Technology, Davangere, Karnataka, India; Assistant Professor, Department of Computer Science and Engineering, Jain Institute of Technology, Davangere, Karnataka, India
VOLUME	184
DOI	10.15680/IJIRCCE.2026.1405074
PDF	pdf/74_VoiceCraft AI A Bilingual Speech-to-Text and Text-to-Speech Engine for English & Kannada.pdf
KEYWORDS
