Bayes Rule

$$P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)}$$

In words, it becomes:
$$P(\text{cause} \mid \text{evidence}) = \frac{P(\text{evidence} \mid \text{cause})\,P(\text{cause})}{P(\text{evidence})}$$

Evidence is what we know -- the given feature.
Cause is the class we are trying to classify the item into.

This is useful because P(evidence|cause) tends to be stable, but P(cause|evidence) is less stable, because it depends on the set of possible causes and how common they are right now.

In a "We see a teal bike. What is the chance that it's a veoride?" example, Veoride likes teal and regular bike owners don't. So probability of teal given Veoride is stable, while probability of Veoride given Teal depends on what types of bikes we are looking at.

MAP Estimate

The Maximum a Posteriori (MAP) estimate chooses the type with the highest posterior probability:
$$\hat{C}_{\text{MAP}} = \arg\max_{C} P(C \mid \text{evidence})$$

Ignoring normalization factor
Bayesian estimation often works with the equation
$$P(\text{cause} \mid \text{evidence}) \;\propto\; P(\text{evidence} \mid \text{cause})\,P(\text{cause})$$

The normalization is often ignored because we only need relative probabilities. Within a single comparison, P(evidence) is the same for every candidate cause: whatever item we are classifying, the denominator stays fixed across the candidates, so we only need a quantity proportional to the posterior.
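
Concretely, since the denominator is shared by every candidate cause, dropping it does not change which cause wins the comparison:

$$\arg\max_{C} P(C \mid \text{evidence}) = \arg\max_{C} \frac{P(\text{evidence} \mid C)\,P(C)}{P(\text{evidence})} = \arg\max_{C} P(\text{evidence} \mid C)\,P(C)$$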

MLE Estimate

The Maximum a Posteriori (MAP) estimate reflects the relative frequencies of the underlying causes/types by including the prior P(cause). If we know that all causes/types are equally likely, then P(cause) is the same value for each and can be dropped from the comparison. We have:

Maximum Likelihood Estimate
$$P(\text{cause} \mid \text{evidence}) \;\propto\; P(\text{evidence} \mid \text{cause})$$
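
To see how the two estimates can disagree when the priors are not equal, here is a toy comparison in code; the priors and likelihoods are invented for illustration:

```python
# Toy MAP vs. MLE comparison (all numbers are made up).
priors      = {"veoride": 0.05, "other": 0.95}   # P(cause): Veorides are rare here
likelihoods = {"veoride": 0.90, "other": 0.10}   # P(teal | cause)

# MLE ignores the prior; MAP weights the likelihood by the prior.
mle_choice = max(likelihoods, key=likelihoods.get)
map_choice = max(priors, key=lambda c: likelihoods[c] * priors[c])

print(mle_choice)   # 'veoride' -- best explains the evidence on its own
print(map_choice)   # 'other'   -- the strong prior against Veoride flips the answer
```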

Naive Bayes

Naive Bayes deals with multiple pieces of evidence (A, B, ...) about the same cause/class C. It is based on applying Intro Probability and #Bayes Rule with a strong independence assumption between features.

In larger dimensions, where C causes effects E1, ..., En:
$$P(C \mid E_1, \dots, E_n) \;\propto\; P(E_1 \mid C)\,P(E_2 \mid C)\cdots P(E_n \mid C)\,P(C) = P(C)\prod_{k=1}^{n} P(E_k \mid C)$$
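
This combines #Bayes Rule with the normalization dropped and the "naive" conditional-independence assumption on the likelihood,

$$P(E_1, \dots, E_n \mid C) = \prod_{k=1}^{n} P(E_k \mid C),$$

which is what lets us estimate one likelihood per effect instead of the full joint distribution.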

E.g. classify UK vs. US English given an input document W1, W2, ..., Wn.

We need the following pre-knowledge from training data:
  • the prior P(C) for each class
  • the likelihood P(W|C) for each word type W and each class C

So there are O(n) probabilities to estimate, where n is the number of word types, whereas the full joint distribution is O(2^n).

In the following sections, we look at how to estimate the likelihood of each word appearing in a document of a given class, P(W|C), as well as the underlying problems and their solutions.
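
As a concrete sketch of the classification rule itself, assuming the priors and word likelihoods have already been estimated (the tables below are hypothetical stand-ins):

```python
# Naive Bayes document classification sketch; the probability tables are
# hypothetical stand-ins for values estimated from training data.
priors = {"UK": 0.5, "US": 0.5}
likelihood = {
    "UK": {"the": 0.06, "colour": 0.004, "lorry": 0.002},
    "US": {"the": 0.06, "color": 0.004, "truck": 0.002},
}

def classify(words, priors, likelihood, unk_prob=1e-6):
    """Score each class by P(C) * prod_k P(W_k|C) and return the best class."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for w in words:
            # Unseen words get a tiny placeholder probability just to keep the
            # sketch running; smoothing (later) handles this properly.
            score *= likelihood[c].get(w, unk_prob)
        scores[c] = score
    return max(scores, key=scores.get)

print(classify(["the", "lorry"], priors, likelihood))   # -> 'UK'
```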

Simple Estimation of Probabilities from Data

Given a document W1...Wn and two classes C1 and C2:

First try:
count(W)= number of times W occurs in the documents of class C
n = number of total words in the documents of class C
The naive estimate is $P(W \mid C) = \frac{\text{count}(W)}{n}$
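
A minimal sketch of this first-try estimate on a toy set of documents for one class (the tokens are made up):

```python
from collections import Counter

# Toy documents of a single class C, already tokenized (made-up tokens).
class_docs = [["the", "lorry", "is", "teal"],
              ["the", "colour", "is", "teal"]]

counts = Counter(w for doc in class_docs for w in doc)   # count(W) for each word W
n = sum(counts.values())                                 # total words in class C's documents
p_w_given_c = {w: c / n for w, c in counts.items()}      # P(W|C) = count(W) / n

print(p_w_given_c["teal"])              # 2/8 = 0.25
print(p_w_given_c.get("truck", 0.0))    # unseen word -> 0.0, the problem smoothing fixes
```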

Text Classification


Underflow and Log Transformation

Problem: Underflow

Using the simple estimation, uncommon words have tiny probabilities, and multiplying many word probabilities together quickly produces numbers too small for standard floating-point storage.

Log Transformation

The log transformation gives better precision on small values and loses precision on large values.

That is, our naive Bayes algorithm will be maximizing
$$\log P(C \mid W_1 \dots W_n) \;\propto\; \log P(C) + \sum_{k=1}^{n} \log P(W_k \mid C)$$

E.g. "The" and "my" are very common. We don't care about what their probability differences are. But we do care about the appearance probability of less commonly used words.

Overfitting and Smoothing

Problem: Overfitting

Words that didn't appear in the training data get an estimated probability of zero, and words that were uncommon in the training data get inaccurate estimates. Zeroes destroy the Naive Bayes computation: a single zero factor wipes out the whole product, and its log is negative infinity.

Laplace Smoothing

Smoothing assigns non-zero probabilities to unseen words. Note that it's tricky because the probabilities of all words must still add up to one. In the standard form of Laplace smoothing, we add a constant α to the count of every word type, including a single UNK pseudo-word that stands in for all unseen words:

$$P(W \mid C) = \frac{\text{count}(W) + \alpha}{n + \alpha\,(V + 1)}$$

where V is the number of distinct word types seen in training. If we only added α to the counts of UNK and the seen words without enlarging the denominator, the probabilities would sum to more than 1; the α(V + 1) term in the denominator forces them back into a proper distribution.
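
A sketch of this smoothed estimate on the same toy class documents as before (α = 1 and the tokens are made up):

```python
from collections import Counter

# Laplace-smoothed P(W|C) with a single UNK pseudo-word for unseen words.
alpha = 1.0
class_docs = [["the", "lorry", "is", "teal"],
              ["the", "colour", "is", "teal"]]

counts = Counter(w for doc in class_docs for w in doc)
n = sum(counts.values())      # total words in class C's documents
v = len(counts)               # number of distinct word types seen in training

def p_smoothed(word):
    # The denominator covers the V seen types plus the UNK type, so the
    # smoothed probabilities still sum to one.
    return (counts.get(word, 0) + alpha) / (n + alpha * (v + 1))

print(p_smoothed("teal"))    # (2 + 1) / (8 + 6) = 3/14
print(p_smoothed("truck"))   # unseen word falls into the UNK slot: 1/14, not zero
```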

Performance of Laplace smoothing

Deleted Estimation

Used to directly measure estimation errors. The steps are as follows:

  1. Divide our training data into two halves 1 and 2.
  2. Pick a specific count r in the first half.
  3. Suppose that W1...Wn are words that occur r times in the first half. We can estimate the corrected count for this group of words as the average of their counts in the second half.
    • Specifically, suppose C(Wk) is the count of Wk in the second half of the dataset
    • The corrected count value is $\text{Corr}(r) = \frac{1}{n}\sum_{k=1}^{n} C(W_k)$
  4. Further, make the estimate symmetrical:
    • Assume both halves contain roughly the same number of words, and compute:
    • Corr(r) as above
    • Corr'(r), reversing the roles of the two halves
    • (Corr(r) + Corr'(r)) / 2 as the estimate of the true count for a word with observed count r
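
A small sketch of the procedure on a hypothetical two-half split (the tokens are invented for illustration):

```python
from collections import Counter

# Deleted estimation sketch on a hypothetical corpus split into two halves.
half1 = ["the", "teal", "lorry", "the", "colour"]
half2 = ["the", "the", "teal", "truck", "colour", "colour"]
c1, c2 = Counter(half1), Counter(half2)

def corrected_count(r, first, second):
    # Take the words occurring exactly r times in the first half; their corrected
    # count is the average of those same words' counts in the second half.
    group = [w for w, c in first.items() if c == r]
    return sum(second[w] for w in group) / len(group)

corr_12 = corrected_count(1, c1, c2)      # first half -> second half
corr_21 = corrected_count(1, c2, c1)      # roles of the halves reversed
print((corr_12 + corr_21) / 2)            # symmetrized estimate for r = 1 -> 0.75
```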