Text Processing with Regular Expressions and Python (In-Person)
In this course, you learn what Regular Expressions (Regex) are and get ideas on how to use them. A basic understanding of Python programming is recommended.
We start by downloading a book digitized by the Royal Library, which gives us an opportunity to focus on the issues that arise in documents scanned with Optical Character Recognition (OCR). Once the book is downloaded, we clean and process the text data by removing unwanted characters, normalizing the text, and preparing it for further analysis.
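As a rough illustration of what such a cleaning step might look like, here is a minimal sketch using Python's built-in `re` module. The function name `clean_text`, the sample sentence, and the choice of which characters count as "unwanted" are assumptions made for this example, not the course's actual material.

```python
import re

def clean_text(raw: str) -> str:
    """Hypothetical cleaning helper for OCR text (illustrative only)."""
    # Rejoin words that were hyphenated across a line break: "Dan-\nish" -> "Danish"
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Replace characters that are not letters, digits, whitespace,
    # or basic punctuation with a space (assumed definition of "unwanted")
    text = re.sub(r"[^\w\s.,;:!?'-]", " ", text)
    # Collapse runs of whitespace, including newlines, into a single space
    text = re.sub(r"\s+", " ", text)
    # Normalize case and trim the ends
    return text.strip().lower()

# Made-up sample with typical OCR noise
sample = "The Dan-\nish   king* ruled | the ~land."
print(clean_text(sample))  # the danish king ruled the land.
```

Because `\w` is Unicode-aware for Python strings, letters such as æ, ø, and å survive this cleaning step.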
We explore advanced Regex techniques based on metacharacters such as pipes, square brackets, and question marks. We use these techniques to write patterns that extract information such as specific word endings and grammatical forms. We also use them to build a concordance tool for "fuzzy search," which accounts for spelling variations caused by OCR issues, for example word variants such as "danish" and "danili."
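The ideas above can be sketched in a few lines of Python. This is only an illustrative example: the helper `concordance`, the pattern `fuzzy_danish`, the variant "danifh", and the sample sentence are all made up here and are not taken from the course material.

```python
import re

def concordance(text: str, pattern: str, width: int = 20) -> list[str]:
    """Hypothetical keyword-in-context helper: one line per match."""
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the matches line up in a column
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# The pipe lists alternative endings, so one pattern covers several
# assumed OCR readings of the same word
fuzzy_danish = r"\bdani(?:sh|li|fh)\b"

text = "The danish fleet sailed. A danili merchant followed the Danifh envoy."

hits = [m.group(0) for m in re.finditer(fuzzy_danish, text, flags=re.IGNORECASE)]
print(hits)  # ['danish', 'danili', 'Danifh']

# A specific word ending: every word that ends in "ed"
print(re.findall(r"\b\w+ed\b", text))  # ['sailed', 'followed']

for line in concordance(text, fuzzy_danish):
    print(line)
```

Listing the variants explicitly keeps the pattern easy to read; a looser pattern would find more OCR errors but also more false matches.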
If you use OCR-digitized source material in your studies, the course is ideal because you can apply its content directly to your own work. If not, you still gain practical, hands-on experience with Regex techniques and patterns, as well as tips for cleaning and tokenization.
The course is based on material available here: Regex and digitized books
Before the course, please have Python installed on your computer, as well as either Jupyter Notebook or JupyterLab. The easiest way is to download and install the Anaconda distribution, as it provides everything at once. However, if you prefer not to do this, here is a guide on how to install Python first and then Jupyter.
Related LibGuide: KUB Datalab by Christian Knudsen