Evaluating the computational reliability of ChatGPT in calculating intercoder reliability in content analysis: Evidence from simulated data

Authors

Manuel Goyanes

DOI:

https://doi.org/10.3145/infonomy.26.006

Keywords:

Content analysis, Intercoder reliability, Cohen’s Kappa, Artificial intelligence, ChatGPT, Percentage agreement

Abstract

The increasing integration of large language models (LLMs) into research workflows has raised important questions about their reliability in performing statistical analyses. While prior studies have explored the use of LLMs in text classification and qualitative coding, little is known about their accuracy in computing the core statistical metrics used in content analysis. This study addresses that gap by systematically evaluating how accurately ChatGPT builds contingency tables and calculates percentage agreement and Cohen’s Kappa. Using a series of controlled simulations, we varied key parameters: sample size, number of categories, distribution balance, and level of coding error. The outputs generated by ChatGPT 5.3 Instant were benchmarked against ground-truth results computed by the author using standard statistical procedures. Findings indicate that ChatGPT achieves high accuracy only under simple conditions, particularly with small samples, binary variables, and balanced distributions, and that its performance declines as analytical complexity increases. In moderately complex scenarios, the model shows partial accuracy, often reproducing contingency tables correctly but introducing deviations in the statistics derived from them. In more complex settings, especially with unbalanced distributions or multiple categories, ChatGPT produces systematically biased results, typically overestimating agreement. In large-scale datasets, the model fails to generate outputs at all because of operational limitations. Overall, the results reveal a lack of consistent reliability across realistic analytical scenarios. In short, using ChatGPT to compute these statistical metrics is not recommended except in very simple cases involving small samples, and even then only under strict supervision and with validation in established statistical software.
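To make the three metrics concrete, the following is a minimal Python sketch of how percentage agreement, a contingency table, and Cohen’s Kappa can be computed deterministically from two coders' decisions. It is an illustration, not the author's benchmarking procedure: the function names, the `simulate_coders` helper, and its parameters (`weights`, `error_rate`, `seed`) are hypothetical and only mirror the simulation factors named in the abstract.

```python
import random
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Share of units on which both coders chose the same category."""
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)


def contingency_table(coder_a, coder_b, categories):
    """Cross-tabulation: rows are coder A's categories, columns are coder B's."""
    counts = Counter(zip(coder_a, coder_b))
    return [[counts[(r, c)] for c in categories] for r in categories]


def cohens_kappa(coder_a, coder_b, categories):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(coder_a)
    table = contingency_table(coder_a, coder_b, categories)
    # Observed agreement: the diagonal of the table.
    p_o = sum(table[i][i] for i in range(len(categories))) / n
    # Expected chance agreement: products of matching row and column totals.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    if p_e == 1.0:
        return float("nan")  # undefined when both coders use a single category
    return (p_o - p_e) / (1 - p_e)


def simulate_coders(n, categories, weights, error_rate, seed=0):
    """Hypothetical simulation: coder B copies coder A except for random errors."""
    rng = random.Random(seed)
    coder_a = rng.choices(categories, weights=weights, k=n)
    coder_b = [
        rng.choice([c for c in categories if c != a])
        if rng.random() < error_rate
        else a
        for a in coder_a
    ]
    return coder_a, coder_b


if __name__ == "__main__":
    # Assumed scenario: three categories, unbalanced distribution, 15% coding error.
    cats = ["A", "B", "C"]
    a, b = simulate_coders(n=300, categories=cats, weights=[0.7, 0.2, 0.1], error_rate=0.15)
    print(contingency_table(a, b, cats))
    print(round(percent_agreement(a, b), 3), round(cohens_kappa(a, b, cats), 3))
```

Because the expected agreement term depends on the marginal totals, unbalanced distributions lower Kappa relative to raw percentage agreement, which is one reason results from a language model should be checked against a deterministic calculation of this kind.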

Author Biography

Manuel Goyanes, Universidad Carlos III de Madrid

Published

2026-04-01

How to Cite

Goyanes, M. (2026). Evaluating the computational reliability of ChatGPT in calculating intercoder reliability in content analysis: Evidence from simulated data. Infonomy, 4(2). https://doi.org/10.3145/infonomy.26.006
