In [1]:
%load_ext version_information
%version_information numpy, pandas, re, bs4, pickle, gensim, spacy, pyLDAvis
Out[1]:
Software    Version
Python      2.7.14 64bit [GCC 7.2.0]
IPython     5.4.1
OS          Linux 4.8.0 53 generic x86_64 with debian stretch sid
numpy       1.14.3
pandas      0.22.0
re          2.2.1
bs4         4.6.0
pickle      $Revision: 72223 $
gensim      3.4.0
spacy       2.0.11
pyLDAvis    2.1.2
Mon Jun 18 10:30:19 2018 +08
In [2]:
import warnings
# Silence deprecation warnings raised by the libraries used below.
warnings.filterwarnings("ignore", category=DeprecationWarning)

import os
import pickle
import pyLDAvis

# Directories (relative to the notebook) that hold the saved artefacts for each corpus.
hansard_saves_directory = os.path.join('saves', 'hansard')
st_saves_directory = os.path.join('saves', 'st')

Parliamentary transcript (Hansard) corpus

The corpus consists of 5,227 speeches by Singapore Members of Parliament.

Time period: 2005 to 2016

Pre-processing

  1. Remove procedural speeches using a simple heuristic (a speech is treated as procedural if it has 100 characters or fewer, or 20 words or fewer)
  2. Remove timestamps
  3. Remove column markers
  4. Remove page markers
  5. Remove strings inside parentheses and brackets
  6. Remove stopwords
  7. Remove punctuation
  8. Lemmatise words
  9. Remove tokens of certain types (dates, percentages, monetary amounts, etc.) identified with spaCy
  10. Identify phrases in sentences (e.g. low_wage or retirement_fund) using Gensim's Phraser (two passes)
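
A few of the steps above can be sketched with the standard library alone. This is a minimal illustration, not the code used to build the corpus: the function names, the tiny stopword set, and the regular expressions are all assumptions made for the example. It covers steps 1, 5, 6 and 7 only. The marker-removal steps (2 to 4) depend on the exact Hansard layout, which is not shown here, and the lemmatisation, spaCy, and Gensim steps need third-party models, so they are left out.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set; a real run would use a fuller list.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "that", "for"}


def is_procedural(speech, max_chars=100, max_words=20):
    """Step 1: treat a speech as procedural if it is very short."""
    return len(speech) <= max_chars or len(speech.split()) <= max_words


def strip_bracketed(text):
    """Step 5: drop text inside (...) and [...] (non-nested only)."""
    return re.sub(r"\([^()]*\)|\[[^\[\]]*\]", " ", text)


def tokenize(text):
    """Steps 6 and 7: lowercase, replace punctuation with spaces, drop stopwords."""
    punct_class = "[" + re.escape(string.punctuation) + "]"
    no_punct = re.sub(punct_class, " ", text.lower())
    return [tok for tok in no_punct.split() if tok not in STOPWORDS]


def preprocess(speeches):
    """Filter out procedural speeches, then clean and tokenise the rest."""
    kept = [s for s in speeches if not is_procedural(s)]
    return [tokenize(strip_bracketed(s)) for s in kept]
```

Calling `preprocess` on a list of raw speech strings returns one token list per retained speech; those lists are the kind of input that a lemmatiser and phrase detector would then consume.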
In [3]:
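# Despite its name, this variable holds the path to the pickled pyLDAvis data file.
hansard_ldavis_directory = os.path.join(hansard_saves_directory, 'ldavis-hansard-92')

# Pickle files should be opened in binary mode.
with open(hansard_ldavis_directory, 'rb') as f:
    hansard_ldavis = pickle.load(f)

pyLDAvis.display(hansard_ldavis)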
Out[3]: