Appendix 2: Documentation of Model Training Code
View the Project on GitHub msaxton/topic-model-best-practices
The core of this project is building multiple topic models from the same corpus of documents, the JBL. Three primary variations are tested: the number of topics, the value of alpha, and the types of words included in the corpus.
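As a baseline for comparison, I build a noun-only model with 75 topics and the default alpha.
To analyze the number of topics, I build noun-only models with 25 and 150 topics.
To analyze the value of alpha, I build 75-topic noun-only models with alpha set to 'auto' and with alpha fixed at 0.5 for every topic.
To analyze the types of words included in the corpus, I build a 75-topic model on the general corpus instead of the noun-only corpus.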
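The alpha settings being compared can be made concrete with a small sketch. It assumes gensim's documented behaviour that the default `alpha='symmetric'` corresponds to a prior of 1 / num_topics for every topic; the names `NUM_TOPICS`, `default_alpha`, and `fixed_alpha` are illustrative and not part of the project code. The `'auto'` setting is not shown because it is learned from the data during training rather than fixed in advance.

```python
import numpy as np

NUM_TOPICS = 75  # topic count used for the alpha experiments

# Assumption: gensim's default alpha='symmetric' means 1 / num_topics per topic.
default_alpha = np.full(NUM_TOPICS, 1.0 / NUM_TOPICS)

# The fixed prior tested in this project: 0.5 for every topic.
fixed_alpha = np.full(NUM_TOPICS, 0.5)

# Compare the per-topic value and the total prior mass of each setting.
print("default per-topic:", default_alpha[0], "total:", default_alpha.sum())
print("fixed   per-topic:", fixed_alpha[0], "total:", fixed_alpha.sum())
```

A larger per-topic alpha is generally expected to spread each document over more topics, which is the effect the 0.5 model is meant to probe against the default.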
import os

import numpy as np
from gensim import corpora, models

# load corpus and dictionary for noun-only models
n_path = '../noun_corpus/'
n_dictionary = corpora.Dictionary.load(n_path + 'noun_corpus.dict')
n_corpus = corpora.MmCorpus(n_path + 'noun_corpus.mm')
os.makedirs(n_path + 'alphas', exist_ok=True)  # folder for the alpha models must exist before saving

# baseline: 75 topics, default alpha
lda_75 = models.LdaModel(n_corpus, id2word=n_dictionary, num_topics=75, passes=100, random_state=42)
lda_75.save(n_path + 'noun_75.model')

# number of topics: 25 and 150
lda_25 = models.LdaModel(n_corpus, id2word=n_dictionary, num_topics=25, passes=100, random_state=42)
lda_25.save(n_path + 'noun_25.model')
lda_150 = models.LdaModel(n_corpus, id2word=n_dictionary, num_topics=150, passes=100, random_state=42)
lda_150.save(n_path + 'noun_150.model')

# alpha: learned from the data
lda_auto = models.LdaModel(n_corpus, id2word=n_dictionary, num_topics=75, passes=100, alpha='auto', random_state=42)
lda_auto.save(n_path + 'alphas/noun_auto.model')

# alpha: fixed at 0.5 for every topic
np_05 = np.full(75, 0.5)  # create array of 0.5 values to feed to the model
lda_05 = models.LdaModel(n_corpus, id2word=n_dictionary, num_topics=75, passes=100, alpha=np_05, random_state=42)
lda_05.save(n_path + 'alphas/noun_05.model')

# load dictionary and corpus for regular corpus models
r_path = '../general_corpus/'
r_dictionary = corpora.Dictionary.load(r_path + 'general_corpus.dict')
r_corpus = corpora.MmCorpus(r_path + 'general_corpus.mm')

# types of words: general corpus, 75 topics (separate name so the noun baseline is not overwritten)
lda_general_75 = models.LdaModel(r_corpus, id2word=r_dictionary, num_topics=75, passes=100, random_state=42)
lda_general_75.save(r_path + 'general_75.model')
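The six training calls above differ only in corpus, topic count, and alpha, so one possible way to keep them in sync is a small table of specifications. This is a hedged sketch, not the project's own code: `ModelSpec`, `SPECS`, `lda_kwargs`, and `train_and_save` are hypothetical names introduced for illustration, and the sketch assumes that gensim accepts a plain list of floats for `alpha`, as its documentation describes.

```python
import os
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ModelSpec:
    corpus: str                             # 'noun' or 'general'
    filename: str                           # file name used in the listing above
    num_topics: int = 75
    alpha: Union[str, float] = 'symmetric'  # gensim's default prior


# One entry per model trained in this appendix.
SPECS = [
    ModelSpec('noun', 'noun_75.model'),                         # baseline
    ModelSpec('noun', 'noun_25.model', num_topics=25),          # fewer topics
    ModelSpec('noun', 'noun_150.model', num_topics=150),        # more topics
    ModelSpec('noun', 'alphas/noun_auto.model', alpha='auto'),  # learned alpha
    ModelSpec('noun', 'alphas/noun_05.model', alpha=0.5),       # fixed alpha
    ModelSpec('general', 'general_75.model'),                   # all word types
]


def lda_kwargs(spec):
    """Translate a spec into keyword arguments for gensim's LdaModel."""
    alpha = spec.alpha
    if isinstance(alpha, float):
        # Expand a scalar into one value per topic, like np.full(75, 0.5).
        alpha = [alpha] * spec.num_topics
    return dict(num_topics=spec.num_topics, alpha=alpha,
                passes=100, random_state=42)


def train_and_save(spec, corpus, dictionary, out_dir):
    """Train one model and save it under out_dir (hypothetical helper)."""
    from gensim import models  # lazy import: the table is usable without gensim
    target = os.path.join(out_dir, spec.filename)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    model = models.LdaModel(corpus, id2word=dictionary, **lda_kwargs(spec))
    model.save(target)
    return model


for spec in SPECS:
    print(spec.corpus, spec.filename, spec.num_topics, spec.alpha)
```

With the corpora loaded as in the listing, a call such as `train_and_save(SPECS[0], n_corpus, n_dictionary, n_path)` would reproduce the baseline model; keeping `passes` and `random_state` in one place makes it harder for the variants to drift apart.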