Algorithms for Scientific Documents: Past and Present

Authors

Jesús M. Álvarez-Llorente; Vicente P. Guerrero-Bote; Félix De-Moya-Anegón

DOI:

https://doi.org/10.3145/infonomy.25.026

Keywords:

Classification algorithms, Document-level classifications, Classifications, Science classification, Scientific databases, Scientometrics, Citation, Classification schemes, ASJC, Scopus, Web of Science

Abstract

This study offers a comprehensive overview of document-level classification algorithms in scientific research, which have been proposed as an alternative to the journal-based categorizations employed by major bibliographic databases such as Web of Science and Scopus. Journal-based schemes often introduce significant inaccuracies in both information retrieval and research evaluation, because they assign each article the categories of its journal rather than categories that reflect its actual content. First, we provide a historical review of the main approaches developed since the emergence of scientific databases, highlighting their contributions as well as their limitations. Automatic clustering techniques and community detection algorithms have been important advances in the organization of scientific knowledge, yet they cannot serve as a practical substitute for journal-based classifications. Other approaches, such as those relying on neural networks or text mining, face scalability issues that prevent their application to the scientific literature as a whole. The most recent and promising strategies are built on simple algorithms that start from existing journal categorizations and reclassify articles into the same thematic hierarchies used by bibliographic databases, relying primarily on the analysis of straightforward citation and reference patterns.
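The reference-based idea described at the end of the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes one simple rule, in which each reference casts one vote split evenly among the categories of the journal that published it, and the paper receives the normalized vote shares as fractional category weights. The function name, the data layout, and the toy journals and categories are all hypothetical.

```python
from collections import Counter


def classify_by_references(paper_id, references, journal_categories, paper_journal):
    """Return fractional category weights for one paper (illustrative rule only).

    references:          paper id -> list of cited paper ids (hypothetical layout)
    journal_categories:  journal id -> list of journal-level categories
    paper_journal:       paper id -> journal id
    """
    votes = Counter()
    for ref in references.get(paper_id, []):
        cats = journal_categories.get(paper_journal.get(ref), [])
        for cat in cats:
            # One vote per reference, split evenly among its journal's categories.
            votes[cat] += 1.0 / len(cats)

    if not votes:
        # No usable references: fall back to the paper's own journal categories.
        cats = journal_categories.get(paper_journal.get(paper_id), [])
        return {cat: 1.0 / len(cats) for cat in cats}

    total = sum(votes.values())
    return {cat: weight / total for cat, weight in votes.items()}


# Toy data (invented for illustration): a paper in a multidisciplinary journal
# that cites two physics papers, one chemistry paper, and one paper from a
# journal assigned to both categories.
paper_journal = {
    "p1": "J_multi",
    "r1": "J_phys",
    "r2": "J_phys",
    "r3": "J_chem",
    "r4": "J_physchem",
}
journal_categories = {
    "J_multi": ["Multidisciplinary"],
    "J_phys": ["Physics"],
    "J_chem": ["Chemistry"],
    "J_physchem": ["Physics", "Chemistry"],
}
references = {"p1": ["r1", "r2", "r3", "r4"]}

print(classify_by_references("p1", references, journal_categories, paper_journal))
# prints {'Physics': 0.625, 'Chemistry': 0.375}
```

In this toy case the paper leaves the "Multidisciplinary" journal category and is assigned to Physics and Chemistry in proportion to what it cites, while staying inside the same category scheme that the journal classification uses.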

Author Biographies

Jesús M. Álvarez-Llorente, Universidad de Extremadura

Vicente P. Guerrero-Bote, Universidad de Extremadura

Félix De-Moya-Anegón, SCImago Research Group

Published

2025-09-14

How to Cite

Álvarez-Llorente, J. M., Guerrero-Bote, V. P., & De-Moya-Anegón, F. (2025). Algorithms for Scientific Documents: Past and Present. Infonomy, 3(4). https://doi.org/10.3145/infonomy.25.026

Section

Research