This course is aimed at students who are new to Regular Expressions but already have an introductory understanding of Python, such as can be gained from our “Python for Absolute Beginners” courses - pt. 1 and 2. We expect you to be familiar with all the concepts described under "Python for Absolute Beginners" here: https://kubdatalab.github.io/python
In the course, you can gain insight into what Regular Expressions (Regex) are and get ideas on how to use them. A basic understanding of Python programming is recommended.
We start by downloading a book digitized by the Royal Library, which provides an opportunity to focus on issues with Optical Character Recognition (OCR) scanned documents. Once the book is downloaded, we begin to clean and process the text data by removing unwanted characters, normalizing the text, and preparing it for further analysis.
We explore advanced Regex techniques, such as metacharacters; pipes, square brackets, and question marks, etc. We use these techniques to write patterns that can be used to extract information such as specific word endings and grammatical forms. We also use these techniques to build a concordance tool that can be used for "fuzzy search," which accounts for variations related to OCR issues. For example, word variations like "danish" or "danili."
If you use OCR digitized source material in your studies, the course is ideal because you can directly apply the course content to your study. If this is not the case, you can gain practical, hands-on experience with Regex techniques, patterns, as well as tips for cleaning and tokenization.
The course is based on material available here: Regex and digitized books
Before the course, please have Python installed on your computer, as well as either Jupyter Notebook or Jupyter Lab. The easiest way is to download and install the Anaconda package, as it provides everything at once. However, if you prefer not to do this, here is a guide on how to install Python first and then Jupyter.
Related LibGuide: Datalab by Christian Knudsen
lakj@kb.dk
To use this platform, the system writes one or more cookies in your browser. These cookies are not shared with any third parties. In addition, your IP address and browser information is stored in server logs and used to generate anonymized usage statistics. Your institution uses these statistics to gauge the use of library content, and the information is not shared with any third parties.