This course is aimed at students who are new to web scraping but already have an introductory understanding of Python, such as can be gained from our "Python for Absolute Beginners" courses, parts 1 and 2. We expect you to be familiar with all the concepts described under "Python for Absolute Beginners" here: https://kubdatalab.github.io/python
This course is about web scraping, also known as web harvesting or data extraction. A basic understanding of Python programming is recommended.
The course provides insight into how you can use Python to collect data from the web. We start by discussing HTML and examining HTML elements and attributes. Then, we try working with HTML in Python, and you are introduced to two libraries: Requests and BeautifulSoup. We attempt to locate data within the HTML structure using the methods .find and .find_all, and we read/extract data from the structure. We conclude with a mini-project that involves harvesting text data from a Wikipedia page.
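To give a feel for what locating data with .find and .find_all looks like, here is a minimal sketch. It is not taken from the course material: the HTML snippet, its tag contents, and all variable names are invented for illustration, and the HTML is written inline rather than downloaded so that the example runs without a network connection. It assumes the BeautifulSoup package (bs4) is installed.

```python
from bs4 import BeautifulSoup

# A small hand-written HTML document standing in for a downloaded page.
# With Requests, a string like this would typically come from
# requests.get(url).text for a URL of your choice.
html = """
<html>
  <body>
    <h1 id="title">Web scraping</h1>
    <p class="intro">Web scraping collects data from web pages.</p>
    <p>It is also called <a href="/wiki/Web_harvesting">web harvesting</a>
       or <a href="/wiki/Data_extraction">data extraction</a>.</p>
  </body>
</html>
"""

# Parse the HTML with Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")

# .find returns the first matching element (or None if nothing matches).
heading = soup.find("h1")
heading_text = heading.get_text()   # text between the tags
heading_id = heading["id"]          # value of the id attribute

# Elements can also be matched on attributes such as class.
intro = soup.find("p", class_="intro").get_text()

# .find_all returns a list of every matching element.
links = soup.find_all("a")
link_targets = [a["href"] for a in links]
link_texts = [a.get_text() for a in links]

print(heading_text, heading_id)
print(intro)
print(link_targets)
print(link_texts)
```

The same pattern (parse, find an element, read its text or attributes) carries over to real pages, although real HTML is usually larger and less tidy than this snippet.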
Regardless of your academic background, the course will be relevant if you are interested in collecting material for your assignments or if you are simply interested in more advanced use of Python.
The course is based on material available here: Harvest data from the web
Before the course, please have Python installed on your computer, as well as either Jupyter Notebook or Jupyter Lab. The easiest way is to download and install the Anaconda package, as it provides everything at once. However, if you prefer not to do this, here is a guide on how to install Python first and then Jupyter.
Related LibGuide: Datalab by Christian Knudsen
lakj@kb.dk