Text Mining and Analytics

Notes from the Coursera Text Mining and Analytics course.


Analysing Text


Shallow vs Deep Analysis

Text Representation

There are different levels of text representation which are useful for different purposes. We often use multiple representations at once.

  1. String of Characters - most simple form
  2. Sequence of words + Parts Of Speech tags. Words and word type
  3. Syntactic structures. Noun phrase, verb phrase etc.
  4. Entities and relations - what entities are in the text and how do they relate
  5. Logic predicates
  6. Speech acts - is the text asking somebody to do something?

Word Relations

Types of Word Relations

Paradigmatic - words belong to the same class. e.g. “banana” and “apple”

We can mine paradigmatic relations by finding words with high context similarity.

Syntagmatic - words are semantically related and tend to co-occur, e.g. “fly” and “helicopter”

We can mine syntagmatic relations by finding words with high co-occurrences but relatively low individual occurrences.

Word Contexts

Paradigmatic Relation Discovery

Word context can be considered as a pseudo document - a “bag of words”.

Vector Space Model (VSM) of a Bag of Words

Split a pseudo document into words and count them.

Expected Overlap of Words in Context (EOWC)

Compute document similarity as the dot product of the normalised word count vector.

This works, but has faults: it favours matching one frequent term many times over matching several distinct terms, and it treats every word equally, so common words (e.g. “the”) dominate. The BM25 transformation and IDF weighting address these.

`d1=(x_1, ...x_n), x_i=(c(w_i, d1))/|d1|`

`d2=(y_1, ...y_n), y_i=(c(w_i, d2))/|d2|`

Similarity(d1, d2) = `d1 . d2 = sum_(i=1)^n x_i y_i`

`x_i` is the probability that a randomly picked word from d1 is `w_i`

`c(w_i, d1)` is the count of word `w_i` in d1

`|d1|` is the total count of words in d1
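The EOWC similarity above can be sketched directly in Python. The pseudo documents below are invented example contexts, not data from the course:

```python
from collections import Counter

def eowc_similarity(d1, d2):
    """Expected Overlap of Words in Context: the dot product of the
    normalised word-count vectors of two pseudo documents."""
    c1, c2 = Counter(d1), Counter(d2)
    n1, n2 = len(d1), len(d2)
    # x_i = c(w_i, d1)/|d1|,  y_i = c(w_i, d2)/|d2|
    return sum((c1[w] / n1) * (c2[w] / n2) for w in c1.keys() & c2.keys())

# Hypothetical word contexts for "banana" and "apple"
d1 = ["eats", "a", "ripe", "yellow", "banana"]
d2 = ["eats", "a", "red", "juicy", "apple"]
print(eowc_similarity(d1, d2))  # two shared words: 2 * (1/5 * 1/5) = 0.08
```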

Retrieval Heuristics

BM25 Transformation - a term frequency transformation

`y=TF(w,d)`, `x=c(w,d)`

`y=((k+1)x) / (x+k)`

`k in [0, +oo)`
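A sketch of the saturation effect: as the raw count x grows, y approaches but never exceeds k+1, so many repetitions of one word cannot dominate. The value k = 1.2 below is a commonly used default, not one specified in the notes:

```python
def bm25_tf(x, k=1.2):
    """BM25 term-frequency transformation: y = (k+1)x / (x+k).
    y grows sublinearly with x and is bounded above by k+1."""
    return (k + 1) * x / (x + k)

for x in [0, 1, 5, 100]:
    print(x, round(bm25_tf(x), 3))  # saturates towards k+1 = 2.2
```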

IDF term weighting

`IDF(W) = log((M+1)/k)`

`M` is the total number of docs in collection

`k` is the total number of docs containing word W
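The IDF formula is a one-liner; the log base is not specified in the notes, so the natural log is used here, and the collection size is a made-up number:

```python
import math

def idf(M, k):
    """IDF(W) = log((M+1)/k): rare words (small k) get high weight."""
    return math.log((M + 1) / k)

M = 1000                 # hypothetical number of docs in the collection
print(idf(M, 1))         # word in a single document: maximum weight
print(idf(M, M))         # word in every document: weight near zero
```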

Using BM25 for Paradigmatic Relation Mining

We can use BM25 and IDF term weighting to improve on EOWC

For two documents, `d1=(x_1, ...x_n)` and `d2=(y_1, ...y_n)`

`BM25(w_i, d1) = ((k+1)c(w_i, d1))/(c(w_i, d1) + k(1 - b + b |d1| / (av(dl))))`

`b in [0, 1], k in [0, +oo), av(dl) is average document length`

`x_i = (BM25(w_i, d1)) / (sum_(j=1)^n BM25(w_j, d1))`

`y_i = ... as for x_i`

Similarity(d1, d2) = `sum_(i=1)^n IDF(w_i) x_i y_i`
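Putting the pieces together, a minimal sketch of BM25-weighted paradigmatic similarity. The parameter values (k = 1.2, b = 0.75, avdl) and the toy contexts are assumptions for illustration:

```python
import math
from collections import Counter

def bm25_weights(doc, k=1.2, b=0.75, avdl=3.0):
    """Normalised BM25 weights x_i for one pseudo document."""
    counts = Counter(doc)
    len_norm = k * (1 - b + b * len(doc) / avdl)
    raw = {w: (k + 1) * c / (c + len_norm) for w, c in counts.items()}
    total = sum(raw.values())
    return {w: v / total for w, v in raw.items()}

def idf(w, docs):
    """IDF(w) = log((M+1)/k) over a collection of pseudo documents."""
    df = sum(1 for d in docs if w in d)
    return math.log((len(docs) + 1) / df)

def bm25_similarity(d1, d2, docs):
    """Similarity(d1, d2) = sum_i IDF(w_i) x_i y_i."""
    x, y = bm25_weights(d1), bm25_weights(d2)
    return sum(idf(w, docs) * x[w] * y[w] for w in x.keys() & y.keys())

# Hypothetical word contexts
docs = [["eats", "a", "ripe"], ["eats", "a", "red"], ["flies", "to", "mars"]]
print(bm25_similarity(docs[0], docs[1], docs))  # shared context: high
print(bm25_similarity(docs[0], docs[2], docs))  # no shared context: 0
```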

Syntagmatic Relation Discovery


Define a binary random variable `X_w in {0,1}`

`X_w = 1` if `w` is present, `X_w = 0` if `w` is not present


`P(X_w =1) + P(X_w =0) = 1` since a word is either present or not

Entropy measures the randomness of X

`H(X_w) = sum_(v in {0,1})-p(X_w=v)log_2p(X_w=v)`

`= -p(X_w=0)log_2 p(X_w=0) - p(X_w=1)log_2 p(X_w=1)`

Define `0 log_2 0=0` since `log_2 0` is undefined

Conditional Entropy

Conditional entropy is the entropy of `X_(w1)` that remains once we know `X_(w2)`, i.e. whether w2 is present or absent

`H(X_(w1) | X_(w2)) = sum_(u in {0,1})[p(X_(w2) = u)H(X_(w1) | X_(w2) = u)]`

= `sum_(u in {0,1})[p(X_(w2) = u) sum_(v in {0,1})[-p(X_(w1) = v | X_(w2) = u) log_2 p(X_(w1) = v | X_(w2) = u)]]`

This has the following properties: `0 <= H(X_(w1) | X_(w2)) <= H(X_(w1))`; knowing another word can only reduce the randomness of w1. For a fixed w1, a smaller conditional entropy means a stronger association with w2.

This is great, but because the upper bound `H(X_(w1))` varies from word to word, it can’t be used to compare different pairs of words against each other.

Mutual Information

The reduction in entropy of X obtained by knowing Y.

This measure CAN be used to compare different pairs of words against each other.

`I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)`

This has the following properties: `I(X;Y) >= 0`; `I(X;Y) = 0` if and only if X and Y are independent; and it is symmetric, `I(X;Y) = I(Y;X)`. A larger value means a stronger association.

KL-Divergence Form of Mutual Information

`I(X_(w1);X_(w2)) = sum_(u in {0,1}) sum_(v in {0,1}) p(X_(w1) = u, X_(w2) = v) log_2 ((p(X_(w1) = u, X_(w2) = v)) / (p(X_(w1) = u)p(X_(w2) = v)))`

We use this to mine syntagmatic relations between pairs of words.
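A sketch of the KL-divergence form, estimating the probabilities from a 2x2 table of segment counts. The counts below are invented for illustration:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(X_w1; X_w2) in bits from segment counts:
    n11 = both words present, n10 = only w1, n01 = only w2, n00 = neither."""
    n = n11 + n10 + n01 + n00
    joint = {(1, 1): n11 / n, (1, 0): n10 / n, (0, 1): n01 / n, (0, 0): n00 / n}
    p1 = {1: (n11 + n10) / n, 0: (n01 + n00) / n}  # marginal of X_w1
    p2 = {1: (n11 + n01) / n, 0: (n10 + n00) / n}  # marginal of X_w2
    return sum(
        joint[u, v] * math.log2(joint[u, v] / (p1[u] * p2[v]))
        for u in (0, 1) for v in (0, 1) if joint[u, v] > 0
    )

# Hypothetical counts over 100 text segments
print(mutual_information(40, 10, 10, 40))  # co-occurring pair: MI > 0
print(mutual_information(25, 25, 25, 25))  # independent pair: MI = 0.0
```

Because mutual information has a common zero point and scale across word pairs, these scores can be ranked to find the strongest syntagmatic relations.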

Topic Mining

Probabilistic Topical Model

In a Probabilistic Topical Model we treat a topic as a word distribution. We use θ to denote a topic.

We define a vocabulary set V which contains all words:

`V={w1, w2, ...}`

As this is a probability distribution, the sum of all the probabilities must be 1.

`sum_(w in V) p(w|theta_i) = 1`

We denote the probability of document `d_i` covering topic `theta_j` as `π_(ij)`

The coverage of topics for each `d_i` also sums to 1. i.e:

`sum_(j=1)^k π_(ij) = 1`
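The two normalisation constraints can be sketched with toy numbers; the topics, vocabulary and coverage values below are all invented:

```python
# Hypothetical topics theta_j as word distributions over a toy vocabulary V
topics = {
    "sports": {"game": 0.4, "win": 0.35, "price": 0.05, "loan": 0.2},
    "finance": {"game": 0.05, "win": 0.1, "price": 0.45, "loan": 0.4},
}
# Hypothetical coverage pi_ij of one document over the k topics
coverage = {"d1": {"sports": 0.7, "finance": 0.3}}

# Each topic distribution sums to 1 over V; each coverage sums to 1 over topics
for name, theta in topics.items():
    assert abs(sum(theta.values()) - 1.0) < 1e-9
for doc, pis in coverage.items():
    assert abs(sum(pis.values()) - 1.0) < 1e-9
print("both constraints hold")
```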

Topic Mining Game Plan

Input: C - collection of documents, k - number of topics, V - vocabulary set

Output: Set of topics {θ1, … θk}; coverage of topics for each of the N documents in C: {π11, … π1k}, …, {πN1, … πNk}

Unigram Data Model



|           | system - Y          | system - N          |
|-----------|---------------------|---------------------|
| human - Y | TP (true positive)  | FN (false negative) |
| human - N | FP (false positive) | TN (true negative)  |


`Precision = (TP) / (TP + FP)`

When the system says yes, how many are correct?


`Recall = (TP) / (TP + FN)`

Of the categories the document should have, how many did the system find?


`F_beta = ((beta^2 + 1) (P * R)) / (beta^2 P + R)`

`F_1 = 2PR / (P + R)`

Harmonic mean of precision and recall. Better than using arithmetic mean.

beta = 1 -> precision and recall are given equal weight
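These three measures can be sketched from the confusion-matrix counts; the counts in the example are made up:

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall and F_beta from confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
    return p, r, f

# Hypothetical output: the system assigned 10 labels, 8 of them correct,
# and missed 2 labels it should have assigned
p, r, f1 = precision_recall_f(tp=8, fp=2, fn=2)
print(round(p, 3), round(r, 3), round(f1, 3))  # P = R = 0.8, so F1 is also 0.8
```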


We can aggregate the P, R and F1 values in different ways

Macro averaging

In macro averaging you compute P, R and F1 separately for each category (or each document) and then average the results, giving every category equal weight.

Aggregation can be considered either per category or per document; it is beneficial to compare both.

Micro averaging

In micro averaging you pool all results into one table, then compute P, R and F1 in one step from the totals.

This method gives every individual classification decision the same weight, so frequent categories dominate the result.

Macro averaging tends to be more informative, because it also reflects performance on rare categories.
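The difference between the two schemes can be sketched with hypothetical per-category counts, chosen to contrast one large and one small category:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from one table of counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Hypothetical (tp, fp, fn) per category: one large, one small
cats = [(90, 10, 10), (1, 4, 4)]

# Macro: average the per-category F1 scores, so the small category counts equally
macro_f1 = sum(prf(*c)[2] for c in cats) / len(cats)

# Micro: pool the counts first, then compute once, so the large category dominates
tp, fp, fn = map(sum, zip(*cats))
micro_f1 = prf(tp, fp, fn)[2]

print(round(macro_f1, 3), round(micro_f1, 3))  # micro hides the weak category
```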

Opinion Mining

An opinion is a subjective statement about what a person thinks about something.


Types of Opinions


Unigram language models aren’t great for sentiment analysis: e.g. “it’s not good” vs “it’s not as good as” express different sentiment but share most unigrams. It is better to use n-gram models, but watch out for overfitting.
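A minimal sketch of n-gram feature extraction; the tokenised sentence is a made-up example. Bigrams keep negations such as "not good" together as a single feature:

```python
def ngrams(tokens, n):
    """Extract n-grams (as tuples) from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["it", "is", "not", "good"]
print(ngrams(tokens, 2))  # [('it', 'is'), ('is', 'not'), ('not', 'good')]
```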

© Will Robertson