Optical character recognition (OCR) allows you to extract text from an image. In this workshop we will explore how you can use OCR to build a corpus of texts for your research. Focusing on printed sources, primarily in English and secondarily in other languages important for early Christian Studies (Greek, Latin, Syriac, and Coptic), we will survey some of the available OCR applications and then demonstrate hands-on work with Tesseract.
The automatic transcription of documents using OCR usually requires cleaning the data it produces. One way to streamline this cleanup is to use regular expressions (regex). Regular expressions let users run advanced search-and-replace operations that can save significant amounts of time when working with large files or across multiple files. This workshop will introduce the basics of regex, focusing on functions that are particularly useful for OCR data cleanup.
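As a small illustration of the kind of cleanup regex makes possible, here is a minimal sketch in Python. The function name `clean_ocr_text`, the three rules, and the sample text are assumptions chosen for this example, not the workshop's own material; real OCR output will usually need rules tailored to the source.

```python
import re

def clean_ocr_text(text):
    """Apply a few illustrative (hypothetical) cleanup rules to raw OCR output."""
    # Rejoin words hyphenated across a line break: "transcrip-\ntion" becomes "transcription".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn a single line break inside a paragraph into a space; blank lines are kept.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces or tabs into one space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()

# Invented sample standing in for OCR output.
sample = "The automatic transcrip-\ntion of  documents\nis rarely clean.\n\nA second paragraph."
cleaned = clean_ocr_text(sample)
print(cleaned)
```

Note that the first rule would also join genuinely hyphenated compounds that happen to break at a line end, so results should be spot-checked against the page images.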
In this workshop, we will learn the basics of Python, a user-friendly, general-purpose programming language useful for digital humanities projects. Working in Google Colab notebooks, we will write some simple code and learn how to manipulate strings and use loops and variables. We will also illustrate how Python is useful in the humanities. This workshop will prepare participants to do Natural Language Processing.
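To give a sense of what variables, string manipulation, and loops look like in practice, here is a minimal sketch that could be pasted into a notebook cell. The title string and variable names are invented for illustration.

```python
# A variable holding a messy string (the title is an invented example).
title = "  the Life of Antony "

# String methods: remove surrounding whitespace, then capitalize each word.
cleaned_title = title.strip().title()

# Split the string into a list of words.
words = cleaned_title.split()

# A loop that records the length of each word.
lengths = []
for word in words:
    lengths.append(len(word))

print(cleaned_title)
print(words)
print(lengths)
```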
This workshop will introduce participants to the basics of Natural Language Processing (NLP) using a scripting language called Python. Topics covered will include an overview of NLP, basic techniques such as tokenization and lemmatization, Python libraries for implementing NLP, and analyzing texts with NLP. Workshop participants will be given the opportunity to get their hands dirty with this exciting approach to textual analysis.
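Tokenization, the first of the techniques named above, can be sketched with nothing but the Python standard library. The `tokenize` function and the sample sentence are assumptions made for this illustration; lemmatization is not shown because it depends on language-specific data that dedicated NLP libraries supply.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple, hypothetical tokenizer)."""
    # [^\W\d_]+ matches runs of letters in any script, so Greek or Coptic text also works.
    return re.findall(r"[^\W\d_]+", text.lower())

# Invented sample sentence.
sample = "In the beginning was the Word, and the Word was with God."
tokens = tokenize(sample)

# Count how often each token occurs.
counts = Counter(tokens)
print(tokens)
print(counts["the"], counts["word"])
```

A tokenizer this simple ignores punctuation, numerals, and elided forms, which is often acceptable for word-frequency counts but not for closer linguistic analysis.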