Building Topic Models¶

The core of this project is building multiple topic models from the same corpus of documents, the JBL. Three primary variations are tested:

The number of topics assigned to each model
The value assigned to the alpha hyper-parameter
The types of words included in the corpus, whether they should be from all parts of speech or just nouns

As a baseline for comparison, I build the following model:

Model based on a noun-only corpus, 75 topics, alpha = 'symmetric'

To analyze the number of topics, I build the following models:

Model based on a noun-only corpus, 25 topics, alpha = 'symmetric'
Model based on a noun-only corpus, 150 topics, alpha = 'symmetric'

To analyze the value of alpha, I build the following models:

Model based on a noun-only corpus, 75 topics, alpha = 'auto'
Model based on a noun-only corpus, 75 topics, alpha = 0.5

To analyze the types of words included in the corpus I build the following model:

Model based on a regular corpus, 75 topics, alpha = 'symmetric'
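
The six configurations listed above can also be written down as a small table in code, which makes it easy to check that each model differs from the baseline in exactly one respect. This is only a sketch: `ModelSpec`, `MODEL_SPECS`, and `changed_fields` are hypothetical names that do not appear in the notebook, and the labels 'noun' and 'general' are my shorthand for the noun-only and regular corpora.

```python
from typing import NamedTuple, Union


class ModelSpec(NamedTuple):
    """One experimental configuration (hypothetical helper, not from the notebook)."""
    corpus: str                # 'noun' for the noun-only corpus, 'general' for the regular one
    num_topics: int
    alpha: Union[str, float]   # 'symmetric', 'auto', or a fixed value


# Baseline: noun-only corpus, 75 topics, alpha = 'symmetric'
BASELINE = ModelSpec('noun', 75, 'symmetric')

MODEL_SPECS = [
    BASELINE,
    ModelSpec('noun', 25, 'symmetric'),     # vary the number of topics
    ModelSpec('noun', 150, 'symmetric'),
    ModelSpec('noun', 75, 'auto'),          # vary alpha
    ModelSpec('noun', 75, 0.5),
    ModelSpec('general', 75, 'symmetric'),  # vary the word types in the corpus
]


def changed_fields(spec, baseline=BASELINE):
    """Return the names of the fields in which spec differs from the baseline."""
    return [name for name in spec._fields
            if getattr(spec, name) != getattr(baseline, name)]


for spec in MODEL_SPECS[1:]:
    print(spec, '->', changed_fields(spec))
```

Because every non-baseline model changes a single field, any difference between it and the baseline can be attributed to that one variation.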

Set up¶

import numpy as np
from gensim import corpora, models

# load corpus and dictionary for noun-only models
n_path = '../noun_corpus/'
n_dictionary = corpora.Dictionary.load(n_path + 'noun_corpus.dict')
n_corpus = corpora.MmCorpus(n_path + 'noun_corpus.mm')

Model 1: Model based on a noun-only corpus, 75 topics, alpha = 'symmetric'¶

lda_75 = models.LdaModel(n_corpus, id2word=n_dictionary, num_topics=75, passes=100, random_state=42)
lda_75.save(n_path + 'noun_75.model')

Model 2: Model based on a noun-only corpus, 25 topics, alpha = 'symmetric'¶

lda_25 = models.LdaModel(n_corpus, id2word=n_dictionary, num_topics=25, passes=100, random_state=42)
lda_25.save(n_path + 'noun_25.model')

Model 3: Model based on a noun-only corpus, 150 topics, alpha = 'symmetric'¶

lda_150 = models.LdaModel(n_corpus, id2word=n_dictionary, num_topics=150, passes=100, random_state=42)
lda_150.save(n_path + 'noun_150.model')

Model 4: Model based on a noun-only corpus, 75 topics, alpha = 'auto'¶

lda_auto = models.LdaModel(n_corpus, id2word=n_dictionary, num_topics=75, passes=100, alpha='auto', random_state=42)
lda_auto.save(n_path + 'alphas/noun_auto.model')

Model 5: Model based on a noun-only corpus, 75 topics, alpha = 0.5¶

np_05 = np.full(75, 0.5)  # one alpha value of 0.5 per topic, passed to the model as an array
lda_05 = models.LdaModel(n_corpus, id2word=n_dictionary, num_topics=75, passes=100, alpha=np_05, random_state=42)
lda_05.save(n_path + 'alphas/noun_05.model')
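
To see how far the fixed value of 0.5 departs from the baseline, it helps to compare the two priors directly. The sketch below assumes, as I understand the gensim documentation, that `alpha='symmetric'` corresponds to a fixed prior of `1.0 / num_topics` for every topic; the variable names are illustrative and not part of the notebook.

```python
import numpy as np

num_topics = 75

# Assumed equivalent of alpha='symmetric': 1.0 / num_topics for every topic
alpha_symmetric = np.full(num_topics, 1.0 / num_topics)

# The explicit prior used for Model 5: 0.5 for every topic
alpha_half = np.full(num_topics, 0.5)

# Per-topic value and total prior mass for each setting
print(alpha_symmetric[0], alpha_symmetric.sum())
print(alpha_half[0], alpha_half.sum())
```

Under that assumption the per-topic prior for Model 5 is far larger than the baseline's, which in LDA generally encourages each document to spread over more topics.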

Model 6: Model based on a regular corpus, 75 topics, alpha = 'symmetric'¶

# load dictionary and corpus for regular corpus models
r_path = '../general_corpus/'
r_dictionary = corpora.Dictionary.load(r_path + 'general_corpus.dict')
r_corpus = corpora.MmCorpus(r_path + 'general_corpus.mm')

# use a separate name so the noun-only lda_75 from Model 1 is not overwritten
lda_r_75 = models.LdaModel(r_corpus, id2word=r_dictionary, num_topics=75, passes=100, random_state=42)
lda_r_75.save(r_path + 'general_75.model')