%load_ext version_information
%version_information numpy, pandas, re, bs4, pickle, gensim, spacy, pyLDAvis

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import os
import pickle
import pyLDAvis

hansard_saves_directory = os.path.join('', 'saves/hansard/')
st_saves_directory = os.path.join('', 'saves/st/')

Parliamentary transcript (Hansard) corpus

The corpus consists of 5,227 speeches by Singapore Members of Parliament

Time period: 2005 to 2016

Pre-processing:

Remove procedural speeches using a simple heuristic (a speech is treated as procedural if it has 100 characters or fewer, or 20 words or fewer)
Remove timestamps
Remove column markers
Remove page markers
Remove strings inside parentheses and brackets
Remove stopwords
Remove punctuation
Lemmatise words
Remove certain kinds of tokens (dates, percentages, monetary amounts, etc.) using spaCy
Identify phrases in sentences (e.g. low_wage or retirement_fund) using Gensim's Phraser (two passes)
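
The first and fifth steps above can be sketched with the standard library alone. This is a minimal illustration, not the code used to build the corpus: the function names and the regular expression are hypothetical, and only the thresholds (100 characters, 20 words) come from the list above.

```python
import re

# Thresholds taken from the heuristic described in the list above.
MAX_CHARS = 100
MAX_WORDS = 20

# Hypothetical pattern: text inside (...) or [...], nesting not handled.
BRACKETED = re.compile(r"\([^()]*\)|\[[^\[\]]*\]")


def is_procedural(speech):
    """Return True if the speech is short enough to count as procedural."""
    return len(speech) <= MAX_CHARS or len(speech.split()) <= MAX_WORDS


def strip_bracketed(speech):
    """Remove parenthesised and bracketed spans, then collapse whitespace."""
    without_spans = BRACKETED.sub(" ", speech)
    return re.sub(r"\s+", " ", without_spans).strip()


speeches = [
    "Order. Question time.",
    " ".join(["word"] * 30),
]
# Keep only the speeches that the heuristic does not flag as procedural.
substantive = [strip_bracketed(s) for s in speeches if not is_procedural(s)]
```

Note that the heuristic uses OR, so a long speech made of very few words, or a short speech made of many tiny words, would both be dropped.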

hansard_ldavis_directory = os.path.join(hansard_saves_directory, 'ldavis-hansard-92')

# Open in binary mode: pickle files are not guaranteed to be valid text.
with open(hansard_ldavis_directory, 'rb') as f:
    hansard_ldavis = pickle.load(f)
    
pyLDAvis.display(hansard_ldavis)

The Straits Times corpus

The corpus consists of 3,425 ST articles containing quotes from Singapore Members of Parliament

Time period: 2005 to 2016

Pre-processing:

Remove digits and email addresses
Remove stopwords
Remove punctuation
Lemmatise words
Remove certain kinds of tokens (dates, percentages, monetary amounts, etc.) using spaCy
Identify phrases in sentences (e.g. low_wage or retirement_fund) using Gensim's Phraser (two passes)
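
The first step above (removing digits and email addresses) might look something like the following. The patterns and the function name are assumptions made for illustration; the expressions actually used on the ST articles are not shown in this notebook.

```python
import re

# Hypothetical patterns; deliberately loose rather than RFC-compliant.
EMAIL = re.compile(r"\S+@\S+\.\S+")
DIGITS = re.compile(r"\d+")


def strip_emails_and_digits(text):
    """Drop email addresses and digit runs, then collapse whitespace."""
    # Emails go first so that digits inside an address are removed with it.
    text = EMAIL.sub(" ", text)
    text = DIGITS.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


cleaned = strip_emails_and_digits(
    "Contact reporter1@example.com about the 2016 Budget"
)
```

Removing emails before digits avoids leaving fragments such as `reporter@example.com` behind when an address contains numbers.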

st_ldavis_directory = os.path.join(st_saves_directory, 'ldavis-st-40')

# Open in binary mode: pickle files are not guaranteed to be valid text.
with open(st_ldavis_directory, 'rb') as f:
    st_ldavis = pickle.load(f)
    
pyLDAvis.display(st_ldavis)

Software	Version
Python	2.7.14 64bit [GCC 7.2.0]
IPython	5.4.1
OS	Linux 4.8.0 53 generic x86_64 with debian stretch sid
numpy	1.14.3
pandas	0.22.0
re	2.2.1
bs4	4.6.0
pickle	$Revision: 72223 $
gensim	3.4.0
spacy	2.0.11
pyLDAvis	2.1.2
Mon Jun 18 10:30:19 2018 +08