Python's Beautiful Soup for web scraping as a method for automated data extraction from websites
DOI:
https://doi.org/10.3145/infonomy.25.014
Keywords:
Web scraping, Python, Beautiful Soup, Pandas, DataFrames, Selenium, Data mining, Data extraction
Abstract
Beautiful Soup is a Python library for extracting, analysing and editing data from HTML documents. After introducing various concepts and technologies related to data scraping, this guide gives a step-by-step explanation of how to set up an environment compatible with this technology and includes multiple examples of automated data extraction from web pages. In addition to Beautiful Soup, other Python modules are integrated: pandas (Panel Data) for handling data and processing CSV files, and requests for managing HTTP requests. Finally, more advanced solutions are introduced to circumvent common protection mechanisms.
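As a rough illustration of the workflow the abstract describes (parsing HTML with Beautiful Soup and tabulating the result with pandas), the following is a minimal sketch rather than code taken from the guide itself. The HTML snippet, the CSS classes `article` and `year`, and the function name `extract_articles` are all hypothetical; in a real run the HTML would typically be fetched first, for instance with `requests.get(url, timeout=10).text`.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML standing in for a downloaded page (assumption: the real
# markup of any given site will differ and must be inspected first).
SAMPLE_HTML = """
<html><body>
  <ul id="articles">
    <li class="article"><a href="/a/1">First title</a><span class="year">2023</span></li>
    <li class="article"><a href="/a/2">Second title</a><span class="year">2025</span></li>
  </ul>
</body></html>
"""


def extract_articles(html: str) -> pd.DataFrame:
    """Parse the HTML and return one DataFrame row per list item."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # select() takes a CSS selector; "li.article" is specific to the sample.
    for item in soup.select("li.article"):
        link = item.find("a")
        year = item.find("span", class_="year")
        rows.append(
            {
                "title": link.get_text(strip=True),
                "url": link["href"],
                "year": int(year.get_text(strip=True)),
            }
        )
    return pd.DataFrame(rows, columns=["title", "url", "year"])


if __name__ == "__main__":
    df = extract_articles(SAMPLE_HTML)
    # to_csv() without a path returns the CSV text instead of writing a file.
    print(df.to_csv(index=False))
```

Passing a file path to `to_csv` instead would write the extracted rows to disk as a CSV file, which is the kind of processing the abstract attributes to pandas.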
License
Copyright (c) 2025 Rubén Alcaraz-Martínez

This work is licensed under a Creative Commons Attribution 4.0 International License.