## CS410 Assignment #1: Basic probability/statistics and statistical NLP (due Friday, Feb. 12, 2010 in Class)

1. Probabilistic Reasoning and Bayes Rule [30 points]

Consider the problem of detecting email messages that may carry a virus. This problem can be modeled probabilistically by treating each email message as representing an observation of values of the following 4 random variables:
(1) V: whether the message carries a virus (1 for yes, 0 for no);
(2) A: whether the message has an attachment (1 for yes, 0 for no);
(3) L: whether the message is longer than 10 words (1 for yes, 0 for no);
(4) K: whether the sender is known to the receiver (1 for yes, 0 for no).

Given a message, we can observe the values of A, L, and K, and we want to infer its value of V. In terms of probabilistic reasoning, we are interested in evaluating the conditional probability p(V|A, L, K), and we would say that the message carries a virus if p(V=1|A, L,K) > p(V=0|A,L,K). We make a further conditional independence assumption that p(A,L,K|V)=p(A|V)p(L|V)p(K|V) for V=0 and V=1, i.e., we assume that if the status whether a message carries a virus is known (i.e., value of V is known), the values of A, K, and L would be independent. For a more detailed explanation of the concept "conditional independence", please read this page.

A. [10 points] Use the Bayes formula and the following known conditional probabilities to compute the probability that a message M with A=1, L=0, and K=0 carries a virus. I.e., show the computation p(V=1|A=1,L=0,K=0) and p(V=0|A=1,L=0,K=0). Would we conclude that message M carries a virus?

```
==================================================
|| V  ||p(A=1|V)|p(L=1|V) |p(K=1|V) |prior P(V) ||
||====||========|=========|=========|===========||
|| 0  ||  0.1   |  0.9    |   0.9   |    0.9    ||
||----||--------|---------|---------|-----------||
|| 1  ||  0.99  |  0.2    |   0.05  |    0.1    ||
==================================================
```
B. [5 points] In probability theory, can we change all the 8 probabilities in the table above to arbitrary values between 0 and 1? If not, are there any constraints on some values that we must follow?

C. [5 points] Explain how you can change our conclusion regarding whether message M carries a virus by changing the value of only one cell in the table.

D. [5 points] The conditional independence assumption p(A,L,K|V)=p(A|V)p(L|V)p(K|V) allows us to compute the joint conditional probability p(A,L,K|V) based on three individual conditional probabilities p(A|V), p(L|V), and p(K|V). This means that we only need to specify all the values for these three individual conditional probabilities, which would allow us to compute p(A,L,K|V) for any combinations of A, L, K, and V. If we don't make the conditional independence assumption, we would have to specify the probability distribution p(A,L,K|V) in a "brute-force" way, i.e., enumerating its value for all the combinations of the variables. Considering that all the probability values must sum to 1, what is the minimum number of probability values that we would have to specify in order to fully characterize the joint conditional probability distribution p(A,L,K|V)? Why?

E. [5 points] What do you think are the advantages and disadvantages of making the conditional independence assumption p(A,L,K|V)=p(A|V)p(L|V)p(K|V)?

2. Maximum Likelihood Estimation [15 points]

A Poisson distribution is often used to model the word frequency. Specifically, the number of occurrences of a word in a document with fixed length can be assumed to follow a Poisson distribution given by

```		       u^x * exp(-u)
p(wordcount=x | u) =  --------------
x!
```
where "u^x" means u raised to the power of x, and u is the parameter of the Poisson distribution, which happens to be its mean. Now, suppose we observe a sample of counts of a word w, {x_1, ..., x_N}, from N documents with the same length (x_i is the counts of w in one document). We want to estimate the parameter u of the Poisson distribution for word w.

Using the maximum likelihood method, in which we choose a value for u that maximizes the likelihood of our data {x_1,...,x_N}, derive a closed form estimate for u. (Hint: Write down the log-likelihood of {x_1,...,x_N}, which would be a function of u. Set the derivative of this function w.r.t. u to zero, and solve the equation for u. ) You can find the derivatives of elementary functions here.

3. Getting familiar with text [20 points]

Here are two current news articles downloaded from the Web: article1 and article2. Download these two articles and save them on your local disk, and perform some simple statistical analysis of the text without reading the contents of the articles.

Some Unix commands can be very useful for manipulations of text. Here is an example of how you can use commands such as "awk", "uniq", and "sort" to produce word counts. Suppose the text you want to count is named "mysource". You can do the following to generate counts of each word in "mysource":

1. Do "tr '[A-Z]' '[a-z]' < mysource > mysource.lowercase". This will convert all capital letters to lower cases.
2. Do "awk '{for (i=1;i<=NF;i++) print \$i;}' mysource.lowercase >mysource.onewordperline". This will split the words on a line so that each line has only one word.
3. Do "sort mysource.onewordperline | uniq -c > mysource.count". This will first sort all the words and then count the number of occurrences of each word.
4. Do "sort -rn -k1 mysource.count > mysource.countsorted". This will sort the words in descending order of counts so that you can easily see the high frequency words.

These steps can also be combined together as one single command:

"tr '[A-Z]' '[a-z]' < mysource | awk '{for (i=1;i<=NF;i++) print \$i;}' | sort | uniq -c |sort -rn -k1 > mysource.countsorted".

A. [5 points] Without reading the articles, do a simple word count for each article with the commands given above. Visually examine the word distribution for each article. Does the word distribution give you some idea about what each news article might be about?

B. [5 points] Compare the word distributions of these two articles. How are they similar to each other and how are they different?

C. [5 points] Do part of speech tagging and partial parsing on each article. A POS tagger called LBJ Part of Speech Tagger, and a chunker called LBJ Chunker are installed at /home/class/sp10/cs410/ on CSIL Linux machines. You can also download them and view the documentation at this page.

First do POS tagging with the following command. This will assign a POS tag to each word; see the explanation of Penn Treebank TAGS.

"java -classpath /home/class/sp10/cs410/LBJ2Library.jar:/home/class/sp10/cs410/LBJPOS.jar LBJ2.nlp.pos.POSTagPlain mysource.lowercase > mysource.tagged"

Then do chunking with the following command, which will break sentences to phrases which are shown in brackets.

"java -classpath /home/class/sp10/cs410/LBJ2Library.jar:/home/class/sp10/cs410/LBJPOS.jar:/home/class/sp10/cs410/LBJChunk.jar LBJ2.nlp.seg.SegmentTagPlain LBJ2.nlp.chunk.Chunker mysource.lowercase > mysource.chunked"

Examine the results. Did LBJ POS tagger and chunker make any mistakes in tagging and grouping words? If so, give a few examples.

D. [5 points] Use the output of the chunker to generate a list of noun phrases (i.e. phrases tagged as "NP") for each article, and use the "sort" and "uniq" commands to generate counts of noun phrases for each article. Sort the counts so that the phrases with high frequency would be on the top. Are these phrase counts more informative than the single word counts in suggesting what each article might be about? Now, read the two articles. Are their contents similar to what you were expecting?

4. Unigram language model and smoothing [35 points]

In this part of the assignment, you will use the same two news articles to explore simple unigram language models. Use article2 as the "test data", and the other as the "training data" for estimating a unigram language model. Use any programming language that you like. You may find it relatively easy if you use perl or python. In all cases, assume that the size of the underlying vocabulary is 10,000, i.e., there are a total number of 10,000 words that can potentially occur in an article, and the words we see in these two article are only a subset of the whole vocabulary. Thus, any unigram language model should be regarded as a word distribution over all the 10,000 words, though it may sometimes have zero probabilities for many of the 10,000 words. Use the natural logarithm whenever calculation of a logarithm function is involved. (Note that entropy is normally computed with base-2 logarithm, which allows us to interpret an entropy value as the number of bits; here we ask you to use the natural logarithm for convenience since it is directly supported by many programming languages.)

A. [10 points] Compute p(w), the empirical word distribution of your test article (i.e., article2). This is just the relative frequency of each word, i.e., the counts of w divided by the total number of words in that article. Compute the entropy of p(w), and report the value. Note that 0log0=0.

B. [15 points] Use the "add one" smoothing (i.e., Laplace smoothing) method to estimate a smoothed unigram language model, q(w), based on your training article (i.e., article1). Compute the following three quantities.

1. The cross entropy H(p,q) as defined on the lecture slides.
2. The log-likelihood of the test article using the trained unigram language model q(w).
3. The Kullback-Leibler divergence D(p||q) as defined on the lecture slides.

C. [10 points] Use "haiti earthquake" as a query to search the Web. Cut and paste about 50 lines of text from relevant articles about the topic, and save it as another training article. Use this article to train another smoothed unigram language model in exactly the same way as you did with your original training article. Denote this model by q'(w). Compute the same three quantities (i.e., cross-entropy, log-likelihood, and KL-divergence) as you have done for the original training document but now using q' to replace q. Compare the three values with what you obtained with the original training document. Which training document gives us a better model for predicting our test article and why?