Python's Beautiful Soup for web scraping as a method for automated data extraction from websites

Authors

Rubén Alcaraz-Martínez

DOI:

https://doi.org/10.3145/infonomy.25.014

Keywords:

Web scraping, Python, Beautiful Soup, Pandas, DataFrames, Selenium, Data mining, Data extraction

Abstract

Beautiful Soup is a Python library for extracting, analysing and editing data from HTML documents. After introducing various concepts and technologies related to data scraping, this guide explains step by step how to set up an environment compatible with this technology and includes multiple examples of automated data extraction from web pages. In addition to Beautiful Soup, other Python modules are integrated: pandas (Panel Data) for handling data and processing CSV files, and requests for managing HTTP requests. More advanced solutions are also introduced to circumvent common protection mechanisms.

Author Biography

Rubén Alcaraz-Martínez, University of Barcelona

Published

2025-04-30

How to Cite

Alcaraz-Martínez, R. (2025). Python’s Beautiful Soup for web scraping as a method for automated data extraction from websites. Infonomy, 3(2). https://doi.org/10.3145/infonomy.25.014
