Appendix 1: Documentation of Corpus Preparation
Here, I process a collection of documents, namely articles from the Journal of Biblical Literature (JBL), into a format that will be most useful for building a topic model. When the process is complete, there will be a dictionary which maps the articles (documents) to the appropriate metadata, and two corpora: the first a general corpus in which each journal article is represented by a list of informative words (minus pre-defined stop words), and the second a corpus in which each journal article is represented by a list of the most informative nouns only.
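To make the target output concrete, a single entry of that metadata dictionary will look roughly like this (the key names come from the processing code below; the values are illustrative, not drawn from the actual corpus):
example_entry = {
    'doc_0': {
        'title': 'review of a history of ancient israel',  # hypothetical title
        'article_id': '10.2307/0000000',                    # hypothetical identifier
        'author': 'smith, jane',
        'pub_year': '1995',
    }
}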
There are several Python libraries that I use to process the JBL articles into a corpus ready to be modeled. The most important of these are:
spaCy, a natural language processing library which allows me to prepare the text in meaningful ways, such as removing stop words, part-of-speech tagging, and lemmatizing
gensim, a topic modeling library which does the computational work of building the topic model.
import os
import re
import collections
import xml.etree.ElementTree as ET
import spacy
from gensim import corpora
import json
The output of a topic model will only be as good as its input. It is therefore important to select the most informative words, or features, from the corpus. This allows the topic model to focus on the "signal" by eliminating as much "noise" as possible. Toward that end, I take a few steps: I remove stop words (spaCy's built-in English stop words plus the custom, Roman numeral, and German stop words defined below), I normalize abbreviations of biblical book names to their full forms, and I lemmatize the remaining words.
Preprocessing also includes tokenizing each document, that is to say, transforming each document into a list of discrete items; in this case, each item is a word. In the functions I define below for this purpose, I create for each document a list of general tokens (minus stop words) and a list of noun-only tokens. This allows me to build two different versions of the topic model, which I compare to see which of the models is more informative.
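As a quick illustration of the difference between the two token lists, consider a made-up sentence; the values below are hand-written to show the idea, not produced by the functions defined later:
sentence = 'Paul wrote an early letter to the church'
general_tokens = ['paul', 'write', 'early', 'letter', 'church']  # all informative lemmas, stop words removed
noun_tokens = ['paul', 'letter', 'church']                       # lemmas of the nouns only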
books_abbrs = [('gen', 'genesis'),('exod', 'exodus'),('ex', 'exodus'),('lev', 'leviticus'),('num', 'numbers'),
('deut', 'deuteronomy'),('josh', 'joshua'),('judg', 'judges'), ('jud', 'judges'),('sam', 'samuel'),('kgs', 'kings'),
('chr', 'chronicles'),('neh', 'nehemiah'),('esth', 'esther'),('ps', 'psalms'),('pss', 'psalms'),
('prov', 'proverbs'),('eccl', 'ecclesiastes'),('qoh', 'qoheleth'), ('isa', 'isaiah'),
('jer', 'jeremiah'),('lam', 'lamentations'),('ezek', 'ezekiel'),('hos', 'hosea'),('obad', 'obadiah'),
('mic', 'micah'),('nah', 'nahum'),('hab', 'habakkuk'),('zeph', 'zephaniah'),('hag', 'haggai'),
('zech', 'zechariah'),('mal', 'malachi'),('matt', 'matthew'),('mk', 'mark'),('lk', 'luke'),
('jn', 'john'),('rom', 'romans'),('cor', 'corinthians'),('gal', 'galatians'),('eph', 'ephesians'),
('phil', 'philippians'),('col', 'colossians'),('thess', 'thessalonians'),('tim','timothy'),
('phlm', 'philemon'),('heb', 'hebrews'),('jas', 'james'),('pet', 'peter'),('rev', 'revelation'),
('tob', 'tobit'),('jdt', 'judith'), ('wis', 'wisdom of solomon'),('sir', 'sirach'), ('bar', 'baruch'),
('macc', 'maccabees'), ('esd', 'esdras'), ('tg', 'targum')]
custom_stop_words = ['ab', 'al', 'alten', 'america', 'atlanta', 'au', 'av', 'avrov', 'b', 'ba', 'bauer', 'berlin',
'boston', 'brill', 'brown', 'c', 'cad', 'cambridge', 'cf', 'ch', 'chap', 'chapter', 'charles',
'chicago', 'chs', 'cit', 'cite', 'claremont', 'college', 'craig', 'cum', 'd', 'dans', 'dennis',
'diese', 'dissertation', 'dm', 'dtr', 'ed', 'eds', 'eerdmans', 'ek', 'elisabeth', 'en', 'et',
'ev', 'ez', 'f', 'far', 'ff', 'fiir', 'g', 'gar', 'george', 'geschichte', 'gott', 'gottes',
'grand', 'h', 'ha', 'hall', 'hartford', 'hat', 'haven', 'henry', 'ia', 'ibid', 'io',
'isbn', 'iv', 'ivye', 'ix', 'jeremias', 'jesu', 'k', 'ka', 'kai', 'kal', 'kat', 'kee', 'ki', 'kim',
'kirche', 'klein', 'knox', 'l', 'la', 'le', 'leiden', 'leipzig', 'les', 'loc', 'louisville', 'm',
'ma', 'madison', 'marie', 'marshall', 'mohr', 'n', 'na', 'neuen', 'ni', 'nu', 'nur', 'o', 'ol',
'om', 'op', 'ov', 'ovadd', 'ovk', 'oxford', 'paper', 'paulus', 'ph', 'philadelphia', 'post',
'pres', 'president', 'press', 'pro', 'prof', 'professor', 'r', 'ra', 'rab', 'rapids', 'refer',
'reviews', 'ro', 'robert', 'robinson', 'rov', 's', 'sa', 'schmidt', 'schriften', 'scott', 'sec',
'section', 'seiner', 'sheffield', 'siebeck', 'stanely', 'studien', 't', 'thee', 'theologie',
'they', 'thing', 'thou', 'thy', 'tiibingen', 'tov', 'tr', 'tv', 'u', 'um', 'univ', 'unto', 'v',
'van', 'verse', 'vol', 'volume', 'vs', 'vss', 'vv', 'w', 'william', 'wunt',
'y', 'yap', 'ye', 'york', 'zeit']
with open('../romannumeral.txt') as f:
rom_nums = f.read()
rom_nums = re.sub('romannumeral', '', rom_nums)
rom_nums = re.sub('lxx', '', rom_nums) # lxx is an abbr. for 'septuagint'
rom_nums = re.split(r'\t\n', rom_nums)
with open('../data/german_stop_words') as f:
german_stop_words = f.readlines()
german_stop_words = [word.strip() for word in german_stop_words]
nlp = spacy.load('en')
stop_words = spacy.en.STOPWORDS
stop_words.update(custom_stop_words)
stop_words.update(rom_nums)
stop_words.update(german_stop_words)
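A quick sanity check (assuming the files above loaded as expected) confirms that words from the different sources all ended up in the combined stop word set:
print(len(stop_words))         # total size of the combined stop word set
print('verse' in stop_words)   # True: comes from the custom list above
print('aber' in stop_words)    # True, assuming it appears in the German stop word file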
def substitute(list_tuples, string):
for tuple_ in list_tuples:
string = re.sub(r'\b' + tuple_[0] + r'\b', tuple_[1], string)
return string
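For example, applying substitute with the abbreviation list defined above expands book abbreviations wherever they occur as whole words (an illustrative call, not part of the original pipeline):
print(substitute(books_abbrs, 'see gen 1 and exod 3'))  # 'see genesis 1 and exodus 3'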
def get_lemmas(doc):
tokens = [token for token in doc]
lemmas = [token.lemma_ for token in tokens if token.is_alpha]
lemmas = [lemma for lemma in lemmas if lemma not in stop_words]
for index, item in enumerate(lemmas):
item = substitute(books_abbrs, item)
lemmas[index] = item
return lemmas
def get_noun_lemmas(doc):
tokens = [token for token in doc]
noun_tokens = [token for token in tokens if token.tag_ == 'NN' or token.tag_ == 'NNP' or token.tag_ == 'NNS']
noun_lemmas = [noun_token.lemma_ for noun_token in noun_tokens if noun_token.is_alpha]
noun_lemmas = [noun_lemma for noun_lemma in noun_lemmas if noun_lemma not in stop_words]
for index, item in enumerate(noun_lemmas):
item = substitute(books_abbrs, item)
noun_lemmas[index] = item
return noun_lemmas
def process_text(text):
doc = nlp(text)
lemmas = get_lemmas(doc)
noun_lemmas = get_noun_lemmas(doc)
return lemmas, noun_lemmas
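A short usage sketch of process_text; the exact lemmas depend on the spaCy model and the stop word list, so the output in the comments is only indicative:
sample_text = 'The prophet Isaiah spoke to the people of Jerusalem.'
sample_lemmas, sample_nouns = process_text(sample_text)
print(sample_lemmas)  # e.g. ['prophet', 'isaiah', 'speak', 'people', 'jerusalem']
print(sample_nouns)   # e.g. ['prophet', 'isaiah', 'people', 'jerusalem']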
JSTOR's Data for Research provided both metadata and full-text articles for JBL in the form of XML files and TXT files, respectively. However, many of these articles are not useful for the topic model and are not processed: articles not written in English, as well as front matter, back matter, annual indexes, and volume information. Articles whose processed text turns out to be empty are likewise skipped.
After leaving those articles aside, I extract the metadata from the xml files, map the metadata to the relevant article, and then store this mapping as a dictionary for later reference. Then, I extract the full text for each article and run it through the preprocessing functions I defined above.
%%time
xml_files = sorted(os.listdir('../data/metadata/'))
txt_files = sorted(os.listdir('../data/ocr/'))
mapping_dict = collections.OrderedDict()
general_docs = []
noun_docs = []
i = 0
for xml, txt in zip(sorted(xml_files), sorted(txt_files)):
article_dict = {}
# read xml file
tree = ET.parse('../data/metadata/' + xml)
root = tree.getroot()
# only process english articles
lang = root.find('./front/article-meta/custom-meta-group/custom-meta/meta-value')
if (lang.text == 'eng') or (lang.text == 'en'):
# add title to article dict
title = root.find('./front/article-meta/title-group/article-title')
try:
title = title.text
title = title.lower()
except AttributeError:
book_reviewed = root.find('./front/article-meta/product/source')
title = 'review of ' + book_reviewed.text.lower() # jbl does not title book reviews
unwanted_titles = ['front matter', 'back matter', 'annual index', 'volume information'] # ignore these titles
        if title not in unwanted_titles:
article_dict['title'] = title
# add article_id to article_dict
article_id = root.find('./front/article-meta/article-id')
article_id = article_id.text.lower()
article_dict['article_id'] = article_id
# add author to article_dict
f_name = root.find('./front/article-meta/contrib-group/contrib/string-name/given-names')
l_name = root.find('./front/article-meta/contrib-group/contrib/string-name/surname')
author = root.find('./front/article-meta/contrib-group/contrib/string-name')
            if f_name is not None:
                author = l_name.text + ', ' + f_name.text
            elif author is not None:
                author = author.text
            else:
                author = 'author not listed'
article_dict['author'] = author
# add publish date to article_dict
pub_year = root.find('./front/article-meta/pub-date/year')
article_dict['pub_year'] = pub_year.text
# read txt file
with open('../data/ocr/' + txt, mode='r', encoding='utf8') as f:
text = f.read()
lemmas, nouns = process_text(text)
if len(nouns) > 0: # only want docs which are not empty
general_docs.append(lemmas)
noun_docs.append(nouns)
key = 'doc_' + str(i)
mapping_dict[key] = article_dict
i += 1
else:
continue
if i % 500 == 0: # displaying progress
                print('finished doc ', i)
else:
continue
else:
continue
with open('../data/doc2metadata.json', encoding='utf8', mode='w') as outfile:
json.dump(mapping_dict, outfile)
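Later, the metadata for any document can be looked up by its doc_n key, which is how the topic models are connected back to titles and authors. A brief sketch, assuming the JSON file was written as above:
with open('../data/doc2metadata.json', encoding='utf8', mode='r') as infile:
    doc2metadata = json.load(infile)
print(doc2metadata['doc_0']['title'])  # title of the first document in the corpus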
Initializing a Gensim corpus (which serves as the basis of a topic model) entails two steps: first, creating a dictionary which maps each unique token in the corpus to an integer id; and second, converting each document into a bag-of-words representation, that is, a list of (token id, token count) pairs. A single document in that representation looks like this:
[(1, 1.0), (2, 1.0), (3, 5.0), (4, 1.0), (5, 9.0)]
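The toy example below, which is not drawn from the JBL corpus, shows how these two steps fit together; the exact ids assigned depend on gensim's internals, so the printed output is only indicative:
toy_docs = [['temple', 'priest', 'temple', 'offering'],
            ['priest', 'covenant', 'law']]
toy_dictionary = corpora.Dictionary(toy_docs)                    # step 1: map each unique token to an integer id
toy_corpus = [toy_dictionary.doc2bow(doc) for doc in toy_docs]   # step 2: count tokens per document
print(toy_corpus[0])  # e.g. [(0, 1), (1, 1), (2, 2)] -- (token id, count) pairs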
After the dictionary is created, the researcher must specify two important parameters (passed to filter_extremes below) which further identify the most informative features.
no_below
This parameter filters out words which are too rare to be informative. Its value is the minimum number of documents in which a token must appear. Here I have set the value to 100, indicating that if a token occurs in fewer than 100 documents, it will not be included in the topic model.
no_above
This parameter filters out words which are too frequent to be informative. Its value is the maximum fraction of documents in which a token may appear. Here I have set the value to 0.5, indicating that if a token occurs in more than 50% of the documents, it will not be included in the topic model.
This filtering selects a "goldilocks zone" of informative features. Tokens which are too rare fail to register similarity among documents, because so few documents use them. By contrast, words which are too frequent fail to register difference among documents, because so many documents share them in common.
There is no formula for deciding what counts as too rare or too frequent; it depends on the size of the corpus and its lexical diversity. I tried a number of variations of these parameters for the regular corpus. First, holding no_above constant:
no_below=20 and no_above=0.5, which left 22,283 unique tokens
no_below=50 and no_above=0.5, which left 12,642 unique tokens
no_below=100 and no_above=0.5, which left 7,834 unique tokens
Then, holding no_below constant:
no_below=100 and no_above=0.3, which left 7,617 unique tokens
no_below=100 and no_above=0.4, which left 7,753 unique tokens
no_below=100 and no_above=0.5, which left 7,834 unique tokens
no_below=100 and no_above=0.9, which left 7,901 unique tokens
As you can see, adjusting no_below had a greater effect on the number of unique tokens than did adjusting no_above. The model I tested with no_below=20 and no_above=0.5 contained more "junk topics" than did the other models, so I decided that 22,283 unique tokens was too many features for this corpus. no_below=50 and no_above=0.5 produced fewer junk topics, but no_below=100 and no_above=0.5 did even better.
I followed a similar process for the noun-only corpus. First, holding no_above constant:
no_below=20 and no_above=0.5, which left 15,761 unique tokens
no_below=50 and no_above=0.5, which left 8,264 unique tokens
no_below=100 and no_above=0.5, which left 4,780 unique tokens
Then, holding no_below constant:
no_below=100 and no_above=0.3, which left 4,685 unique tokens
no_below=100 and no_above=0.4, which left 4,758 unique tokens
no_below=100 and no_above=0.5, which left 4,790 unique tokens
no_below=100 and no_above=0.6, which left 4,813 unique tokens
For the noun-only corpus I also settled on no_below=100 and no_above=0.5 (the short sketch below shows how counts like these can be reproduced).
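A sketch of how counts like those above can be reproduced, assuming general_docs (or noun_docs) is still in memory; filter_extremes modifies a dictionary in place, so a fresh dictionary is built for each setting:
for no_below, no_above in [(20, 0.5), (50, 0.5), (100, 0.3), (100, 0.5)]:
    trial_dictionary = corpora.Dictionary(general_docs)
    trial_dictionary.filter_extremes(no_below=no_below, no_above=no_above)
    print(no_below, no_above, len(trial_dictionary))  # number of unique tokens kept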
The general corpus contains lemmatized forms of all the words used in the text (minus the stop words outlined above), regardless of their part of speech.
# create dictionary
general_dictionary = corpora.Dictionary(general_docs)
general_dictionary.filter_extremes(no_below=100, no_above=0.5)
general_dictionary.save('../general_corpus/general_corpus.dict')
# create corpus
corpus = [general_dictionary.doc2bow(doc) for doc in general_docs]
corpora.MmCorpus.serialize('../general_corpus/general_corpus.mm', corpus)
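Once serialized, the dictionary and corpus can be loaded back for model training without re-running the preprocessing, using gensim's standard load methods:
loaded_dictionary = corpora.Dictionary.load('../general_corpus/general_corpus.dict')
loaded_corpus = corpora.MmCorpus('../general_corpus/general_corpus.mm')
print(len(loaded_dictionary), 'unique tokens;', len(loaded_corpus), 'documents')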
The noun-only corpus contains lemmatized forms of only the nouns used in the text (minus the stop words outlined above).
# create dictionary
noun_dictionary = corpora.Dictionary(noun_docs)
noun_dictionary.filter_extremes(no_below=100, no_above=0.5)
noun_dictionary.save('../noun_corpus/noun_corpus.dict')
# create corpus
noun_corpus = [noun_dictionary.doc2bow(doc) for doc in noun_docs]
corpora.MmCorpus.serialize('../noun_corpus/noun_corpus.mm', noun_corpus)
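As a final sanity check, the two corpora and the metadata mapping should describe the same set of documents (a small sketch, assuming everything above ran in a single session):
assert len(corpus) == len(noun_corpus) == len(general_docs) == len(noun_docs)
print(len(corpus), 'documents in each corpus;', len(mapping_dict), 'entries in the metadata mapping')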