International Journal of Innovative Research in Computer and Communication Engineering
ISSN Approved Journal | Impact factor: 8.771 | ESTD: 2013 | Follows UGC CARE Journal Norms and Guidelines
| Monthly, Peer-Reviewed, Refereed, Scholarly, Multidisciplinary and Open Access Journal | Digital Object Identifier (DOI) |
| TITLE | AI-Powered Data Quality Assessment: Detecting Semantic Anomalies and Business Rule Violations that Statistical Methods Cannot Identify |
|---|---|
| ABSTRACT | Modern enterprise data ecosystems generate billions of records daily across healthcare, financial services, e-commerce, and manufacturing - all subject to complex quality requirements that extend far beyond what statistical anomaly detection can assess. Statistical approaches excel at identifying numerical outliers, missing values, and format violations, but are fundamentally incapable of understanding that a "patient deceased in 2018 who was admitted for surgery in 2025" is anomalous, or that an "invoice with a 150% discount applied to a zero-cost item" violates core business semantics. This paper presents a comprehensive AI-Powered Data Quality (AI-DQ) framework that deploys large language models (LLMs), fine-tuned transformers, graph neural networks, and retrieval-augmented generation (RAG) pipelines to identify semantic anomalies, business rule violations, cross-entity inconsistencies, temporal logic errors, and linguistic data defects that statistical methods miss entirely. Evaluated across 4.2 million records spanning five industry domains, the proposed framework achieves a 92.1% overall anomaly detection rate (vs. 54.8% for statistical baselines), reduces false positives by 82%, and generates natural-language explanations for every flagged record. A healthcare implementation case study demonstrates a $2.4 million annual cost reduction through detection accuracy improvements and analyst-hour savings. Our results confirm that semantic intelligence - not statistical power - is the critical gap in current enterprise data quality infrastructure. |
| AUTHOR | Venkata Vijay Satyanarayana Murthy Neelam, Lead Software Engineer, Atlanta, Georgia, USA |
| VOLUME | 182 |
| DOI | 10.15680/IJIRCCE.2026.1403063 |
| PDF | pdf/63_AI-Powered Data Quality Assessment.pdf |
| KEYWORDS | |
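The abstract's two motivating anomalies - an admission dated after a patient's recorded death, and a discount exceeding 100% on a zero-cost item - are examples of semantic business rules rather than statistical outliers. The paper's framework detects such violations with LLMs and RAG pipelines; the sketch below only illustrates the deterministic form those two rules take, so a reader can see why value-distribution statistics alone cannot flag them. All field names, record shapes, and check functions here are illustrative assumptions, not the paper's implementation.

```python
from datetime import date

def check_temporal_logic(record):
    """Flag admissions dated after the patient's recorded death.

    Both dates can be well-formed and in-range individually, so a
    statistical or format check would pass this record.
    """
    deceased = record.get("deceased_date")   # hypothetical field name
    admitted = record.get("admission_date")  # hypothetical field name
    if deceased and admitted and admitted > deceased:
        return f"admission_date {admitted} is after deceased_date {deceased}"
    return None

def check_discount_semantics(record):
    """Flag discounts over 100%, or any discount on a zero-cost item."""
    discount = record.get("discount_pct", 0)  # hypothetical field name
    cost = record.get("unit_cost", 0)         # hypothetical field name
    if discount > 100:
        return f"discount of {discount}% exceeds 100%"
    if discount > 0 and cost == 0:
        return f"{discount}% discount applied to a zero-cost item"
    return None

def semantic_audit(records, checks):
    """Run every check against every record; return (id, reason) pairs."""
    findings = []
    for rec in records:
        for check in checks:
            reason = check(rec)
            if reason:
                findings.append((rec["id"], reason))
    return findings

# Toy records mirroring the abstract's examples.
records = [
    {"id": "PAT-001", "deceased_date": date(2018, 6, 1),
     "admission_date": date(2025, 3, 14)},
    {"id": "INV-042", "discount_pct": 150, "unit_cost": 0.0},
    {"id": "INV-043", "discount_pct": 10, "unit_cost": 25.0},  # clean record
]

for rec_id, reason in semantic_audit(
        records, [check_temporal_logic, check_discount_semantics]):
    print(f"{rec_id}: {reason}")
```

Hand-coding rules like these does not scale to the thousands of implicit constraints in an enterprise schema, which is the gap the paper's LLM-based approach targets: the model infers the constraint from context instead of requiring an engineer to enumerate it.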