Salmon Run: probability

Saturday, March 28, 2020

Python notebooks for Bayesian Analysis Courses on Coursera

I recently completed the Coursera courses Bayesian Statistics: From Concept to Data Analysis and Bayesian Statistics: Techniques and Models, taught by Prof. Herbert Lee and Mathew Heiner of the University of California, Santa Cruz. I did both in audit mode, so "completed" is not totally accurate, since the second course did not allow submission of quiz answers without paying for the course. But the content for both is free and excellent, I learned a lot from them, and I highly recommend them if you are interested in the subject of Bayesian Analysis. Please be aware that the courses are somewhat elementary. They were a good way for someone like me, curious about but not very knowledgeable about Bayesian Analysis, to get to a point where I can hopefully explore the subject on my own. So if you are like me, you will find the courses useful, otherwise probably not.

Both courses use R as the programming language. The first course is more math and less programming, but covers concepts which are essential in the second course. In fact, I started on the second course because I was curious about MCMC (Markov Chain Monte Carlo), but found myself out of my depth within the first week, so I ended up having to do Prof. Lee's course first. And even though it's called From Concept to Data Analysis, this is much more than your high school statistics course (my level before I took the course, give or take). It starts with probability concepts, then goes off into different kinds of distributions and when you should use them, and how to do inference with these distributions and Bayes' theorem, both for discrete and continuous data. At the end of the course, you will know which distributions to use when, and what to look for when trying to draw conclusions from a given distribution. It also covers linear regression, both single and multiple, from a statistician's rather than a machine learning perspective.

The second course is taught by Mathew Heiner, a doctoral student at UCSC. This expands on Prof. Lee's course, starting with simple conjugate models (this is where I realized I was out of my depth the first time around, BTW) and moving on to MCMC models for binary, continuous, and count data, as well as how to compose them into hierarchical models to account for uncertainty in our knowledge. It also covers Metropolis-Hastings and Gibbs sampling methods. The course is very example-driven, using small datasets included in the R platform, to explain each concept. The MCMC library used in the course is rjags, which depends on JAGS (Just Another Gibbs Sampler).
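
To make "simple conjugate models" a little more concrete, here is a minimal sketch (plain Python 3, not taken from the course or from my notebooks) of the Beta-Binomial case: a Beta(alpha, beta) prior on a success probability, updated with binomial data, gives a Beta posterior whose parameters are just the prior parameters plus the observed counts. The function names and the 7-heads-in-10-flips data are made up for illustration.

```python
from fractions import Fraction


def beta_binomial_update(alpha, beta, successes, trials):
    """Posterior Beta parameters after observing `successes` out of `trials`."""
    return alpha + successes, beta + (trials - successes)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return Fraction(alpha, alpha + beta)


# Hypothetical data: a uniform Beta(1, 1) prior, then 7 heads in 10 flips.
post_alpha, post_beta = beta_binomial_update(1, 1, 7, 10)
print(post_alpha, post_beta)             # 8 4
print(beta_mean(post_alpha, post_beta))  # 2/3
```

No sampling is needed here because the posterior has a closed form. MCMC earns its keep in the non-conjugate models that come later in the course.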

Going into the course, I had some understanding (superficial in hindsight) of MCMC, and my main motivation was to learn enough theory to work intelligently with PyMC3, a Python toolkit for Probabilistic Programming. I figured that going through the course to learn the theory and reimplementing the R and JAGS examples in Python and PyMC3 would allow me to learn both faster (kind of how joint learning sometimes works better in Machine Learning models), so that's what I did. The reimplementations are Jupyter notebooks with short text annotations, code, and outputs, so you can read them if you like, but you will probably benefit more from doing the exercises yourself and using my notebooks as cheat sheets for when you are stuck. All notebooks are runnable without any additional data. The examples use datasets built into the R platform, which Vincent Arel-Bundock has been kind enough to package and host at his R-Datasets repository. My notebooks automatically pull the data from his repository if it is not already downloaded.

The notebooks are in my sujitpal/bayesian-stats-examples GitHub repository, with each course in its own subfolder. Direct links to notebooks for each course are provided below for convenience as well. Hopefully you find them useful as you navigate your way around the world of Bayesian analysis and PyMC3.

I am currently exploring this subject a bit more using the book Bayesian Analysis with Python 2/ed by Osvaldo Martin. The book is recommended from the PyMC3 github page, and so far, I find it covers the Coursera course material, and then some, even though it is listed as an introductory book. The PyMC3 Tutorial is also an excellent resource, and I have used it as a reference when reimplementing JAGS models from the course. The book also mentions the Arviz package for exploratory analysis of Bayesian models, which is part of the effort around the move to PyMC4 (see below), and is being led by the author.

Another book I want to mention is the one where I first learned about PyMC3 -- Bayesian Methods for Hackers by Cam Davidson-Pilon. The book is fantastic, only around 250 pages, and contains many code examples and graphs. It resembles a series of very well-written Jupyter notebooks, which is effectively how Davidson-Pilon has also provided an open source version of the book on GitHub. But I found it very dense the first time I read it, probably because I didn't have sufficient background in statistics to follow its mental leaps. A subsequent partial re-read, after completing the courses above, was much easier.

Finally, I wanted to address a concern that I had, and I think many others might have too. PyMC3 depends on Theano for its fast numerical computing backend, and the first time I looked at it, I learned that LISA Labs, the group at the University of Montreal that created and maintained Theano, had decided that it was time to move on from Theano and discontinued support. At around the same time, the PyMC4 project was born, and its objective was to provide a PyMC3-like API on top of the Tensorflow Probability library. At the time, the future of PyMC3 seemed uncertain, and I figured it might be safer to wait until PyMC4 became available, rather than spending time learning PyMC3 and having my newly acquired skills go obsolete soon after. However, 6-10 months later, PyMC4 is still pre-release, and the PyMC3 team has committed to supporting Theano as it relates to PyMC3, so I have more confidence that the effort to learn PyMC3 will not go to waste. The article Theano, Tensorflow, and the future of PyMC3 posted by Chris Fonnesbeck, creator of PyMC3, provides more detail around this.

I hope the notebooks are useful. In keeping with Coursera terms of use, I have not published notebooks containing quiz answers, even though their usefulness solely for quiz answers is doubtful. This is because the Python/PyMC3 models sometimes produce slightly different results from their R/JAGS counterparts described in the course, probably because of numerical precision and algorithm differences. On the other hand, the models on which the quiz questions are based are sometimes interesting because they illustrate concepts mentioned in the classes, so being able to publish them would probably have been helpful.

Saturday, February 28, 2015

Modeling the Evolution of Skin Color

Sometime back, someone asked this question on Quora: if all humans descended from a single population of ancestors in Africa, how did different skin colors come about? Pippi Maria Groving provided an answer (the first one) that I found intuitively very appealing, although she does mention (and later answers also indicate) that this may be a bit of an oversimplification.

Essentially she states that there are 3 important genes that govern skin color in humans, say A, B and C. Each human has a pair of each, and each pair may be a combination of dominant and recessive versions of these genes, denoted by uppercase and lowercase respectively. Thus, given this model, one can have a combination of 6 possible genes (in pairs) for skin color: A, a, B, b, C, c. The dominant genes produce pigments that darken the skin, so the more dominant genes one has, the darker the skin color. She also provides an 8x8 Punnett Square exhaustively listing out all combinations of genes (called genotypes) and their resulting skin color (phenotypes). Binning the phenotypes into 7 distinct skin colors results in a theoretical ratio of 1:6:15:20:15:6:1 from very dark to very light.
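
The 1:6:15:20:15:6:1 ratio can be reproduced by brute force. The sketch below (Python 3, not part of the original simulation code) assumes both parents are heterozygous for all three genes (AaBbCc), which is what an 8x8 square implies: each parent can contribute one of 2 x 2 x 2 = 8 gametes, and every one of the 64 pairings is binned by its number of dominant genes.

```python
from collections import Counter
from itertools import product

# Assumption: both parents are AaBbCc, so each gamete carries one version
# of each gene and there are 8 distinct gametes per parent.
gametes = ["".join(g) for g in product("Aa", "Bb", "Cc")]


def num_dominant(genotype):
    """Count the dominant (uppercase) genes in a genotype string."""
    return sum(1 for c in genotype if c.isupper())


# Enumerate all 8 x 8 = 64 cells of the Punnett square.
counts = Counter(num_dominant(left + right)
                 for left, right in product(gametes, gametes))

# Order from very dark (6 dominant genes) to very light (0 dominant genes).
ratio = [counts[k] for k in range(6, -1, -1)]
print(ratio)  # [1, 6, 15, 20, 15, 6, 1]
```

These are the binomial coefficients C(6, k), which is what you would expect when each of 6 gene slots is independently dominant or recessive with equal probability.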

I am currently also doing an edX course (offered by MIT) on Quantitative Biology, so I figured it would be interesting to try to model this as a simulation to see if my experimental results matched up to the theoretical ratios. Since skin color is an adaptation based on the weather, with darker skin providing protection from ultraviolet (UV) light in hot and sunny climates, and lighter skin able to make more vitamin D with limited UV in colder and less sunny climates, I ran another simulation to model that situation. This post is a result of that work.

The functions used for this simulation are provided below. I describe the relevant functions as they are called from the main code below.

from operator import itemgetter
from random import random
import math
import matplotlib.pyplot as plt
import nltk
import numpy as np

def person():
    alleles = []
    for allele in ['a','b','c']:
        pairs = []
        for pair in range(2):
            pairs.append(allele if random() <= 0.5 else allele.upper())
        alleles.append("".join(sorted(pairs)))
    return alleles

def shuffle_and_choose(counts):
    shuffled = [x[0] for x in sorted(enumerate([random() for i in 
                range(len(counts))]), key=itemgetter(1))]
    return counts[shuffled[0]]

def compute_mating_likelihood(left, right):
    left_dominant = get_num_dominant(left)
    right_dominant = get_num_dominant(right)
    diff = abs(left_dominant - right_dominant)
    return math.exp(-diff)

def mate(left, right):
    mated_alleles = []
    for i in range(3):
        child_pairs = []
        for lp in left[i]:
            for rp in right[i]:
                child_pairs.append("".join(sorted([lp, rp])))
        mated_alleles.append(shuffle_and_choose(child_pairs))
    return mated_alleles

def get_num_dominant(allele):
    return len([c for c in "".join(allele) if c == c.upper()])    
        
def produce_next_generation(curr_gen, region_filter=None):
    next_gen = []
    # floor division keeps the slice indices integral on Python 2 and 3
    males = curr_gen[:len(curr_gen) // 2]
    females = curr_gen[len(curr_gen) // 2:]
    i = 0
    while i < len(curr_gen):
        mptr = int(random() * len(males))
        fptr = int(random() * len(females))
        offspring = mate(males[mptr], females[fptr])
        if region_filter is not None:
            num_dominant = get_num_dominant(offspring)
            if not num_dominant in region_filter:
                if random() > 0.1:
                    continue
        next_gen.append(offspring)
        i = i + 1
    return next_gen
        
SKIN_COLORS = {
  6: 0x111111, 5: 0x6B0000, 4: 0x7B3812, 3: 0xAB671D, 
  2: 0xE0AD87, 1: 0xFDDACA, 0: 0xFEF2DF 
}

def get_color_distrib(curr_gen):
    color_dist = {k:0 for k in SKIN_COLORS.keys()}
    for alleles in curr_gen:
        num_dominant = get_num_dominant(alleles)
        color_dist[num_dominant] = color_dist[num_dominant] + 1
    dist_values = []
    for k in sorted(list(color_dist.keys())):
        dist_values.append(color_dist[k])
    return np.array(dist_values)

def plot_population_chart(color_pop, gen_title):
    xs = [str(hex(SKIN_COLORS[x])).replace("0x", "#") for x in range(7)]
    plt.bar(range(len(xs)), color_pop, color=xs)
    plt.xlabel("Skin Colors")
    plt.ylabel("Frequency")
    plt.xticks([])
    plt.title("Skin Color Distribution: %s" % (gen_title))
    plt.show()

def plot_population_drift(drift_data, title):
    generations = range(drift_data.shape[1])
    xs = range(drift_data.shape[0])
    colors = [str(hex(SKIN_COLORS[x])).replace("0x", "#") for x in xs]
    plt.stackplot(generations, drift_data, baseline="zero", colors=colors)
    plt.xlabel("Generations")
    plt.ylabel("Frequency")
    plt.title("Phenotype Drift:%s" % (title))
    plt.show()
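
As a quick sanity check on the per-gene logic in mate() above, here is a small sketch (Python 3; the helper name child_genotypes is mine, not from the original code) that enumerates the four equally likely child pairs for a single gene. For two heterozygous parents it recovers the familiar 1:2:1 split.

```python
from collections import Counter
from itertools import product


def child_genotypes(left_pair, right_pair):
    """All four equally likely child pairs for one gene, built the same way
    as the nested loops in mate(): one letter from each parent, sorted."""
    return ["".join(sorted(pair)) for pair in product(left_pair, right_pair)]


outcomes = Counter(child_genotypes("Aa", "Aa"))
print(outcomes["AA"], outcomes["Aa"], outcomes["aa"])  # 1 2 1
```

Since shuffle_and_choose() simply picks one element of this list uniformly at random, random.choice(child_pairs) would be an equivalent and shorter way to write that step.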

The first part models a situation where the distribution of skin color is truly random. I generate a population of 2,000 individuals, each with a random set of 3 gene pairs (alleles). Then I split the set in half resulting in a pair of sets of 1,000 individuals each. A random member of each set is mated to a random member of the other to produce an offspring - to ensure randomness in the offspring, I exhaustively compute all possibilities for each allele and randomly pick one to create each corresponding offspring allele. This is repeated for 100 generations and the resulting population distribution binned by the 7 skin color phenotypes.

num_generations = 100

drift_data = np.zeros((len(SKIN_COLORS), num_generations + 1))
curr_gen = [person() for i in range(2000)]
drift_data[:, 0] = get_color_distrib(curr_gen)
plot_population_chart(drift_data[:, 0], "Initial")

for i in range(num_generations):
    next_gen = produce_next_generation(curr_gen)
    drift_data[:, i+1] = get_color_distrib(next_gen)
    curr_gen = next_gen
    
plot_population_chart(drift_data[:, num_generations], "Final")    

print(drift_data[:, num_generations])

As can be seen, the distribution of the skin color phenotype in the 100th generation looks remarkably similar to that of the first generation. The colors are identical to those used in Pippi's answer (thanks to this online Color Picker Tool).

The chart below shows the drift in the phenotype distribution across generations. Once again, the long term trend seems quite flat and unchanging.

The counts of the number of individuals in the different skin color categories in the final generation for me were 39:182:467:637:473:169:33, which is remarkably similar to the theoretical ratio, as shown below. Note: some numbers were rounded to make them line up for easier visual comparison.

>>> from __future__ import division
>>> import numpy as np
>>> theoretical = np.array([1, 6, 15, 20, 15, 6, 1])
>>> observed = np.array([39, 182, 467, 637, 473, 169, 33])
>>> theoretical / np.sum(theoretical)
array([ 0.0156, 0.0938, 0.2344, 0.3125, 0.2344, 0.0938, 0.0156])
>>> observed / np.sum(observed)
array([ 0.0195, 0.091, 0.2335, 0.3185, 0.2365, 0.0845, 0.0165])
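
"Remarkably similar" can be made a bit more quantitative with a chi-square goodness-of-fit statistic. This is an addition of mine rather than part of the original experiment, written as a minimal Python 3 sketch using only the two arrays above.

```python
theoretical = [1, 6, 15, 20, 15, 6, 1]
observed = [39, 182, 467, 637, 473, 169, 33]

total = sum(observed)
# Expected counts if the population followed the theoretical ratio exactly.
expected = [total * t / sum(theoretical) for t in theoretical]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))  # 4.28
```

With 7 categories there are 6 degrees of freedom, for which the usual 5% critical value is about 12.59, so a statistic of 4.28 gives no reason to think the simulated population deviates from the theoretical ratio.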

I then tried to simulate the situation where identically random sets of people moved to different geographical regions (5 in my case) with different levels of sunlight. In each region, natural selection would ensure that people of certain skin colors survived. I chose skin color "trigrams" for each region, i.e., runs of three adjacent skin color categories, numbered by the count of dominant genes from 0 (lightest) to 6 (darkest). Thus, for region 1 (cold and dark) I chose the categories (0,1,2), for region 2 the categories (1,2,3), and so on. Offspring produced in each generation whose genotypes resolved to one of the "approved" skin colors for the region would survive unconditionally, while others would survive with a 10% chance.

regions = [x for x in nltk.trigrams(range(7))]
for i in range(len(regions)):
    drift_data = np.zeros((len(SKIN_COLORS), num_generations + 1))
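    curr_gen = [person() for x in range(2000)]
    # record the starting distribution, as in the first experiment
    drift_data[:, 0] = get_color_distrib(curr_gen)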
    for j in range(num_generations):
        next_gen = produce_next_generation(curr_gen, region_filter=set(regions[i]))
        drift_data[:, j+1] = get_color_distrib(next_gen)
        curr_gen = next_gen
    plot_population_drift(drift_data, "Dispersion, Region %d" % (i+1))
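
For readers who do not have NLTK installed, the region windows and the survival rule can be sketched in plain Python 3 as below. The function name survives and its parameters are mine, introduced only to spell out the rule buried inside produce_next_generation().

```python
import random

NUM_COLORS = 7

# Sliding windows of three adjacent categories; this yields the same tuples
# as nltk.trigrams(range(7)), without the NLTK dependency.
categories = list(range(NUM_COLORS))
regions = list(zip(categories, categories[1:], categories[2:]))
print(regions)
# [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)]


def survives(num_dominant, region, rng=random, outsider_survival=0.1):
    """Favored phenotypes always survive; all others survive with
    probability outsider_survival (10% in the simulation above)."""
    return num_dominant in region or rng.random() <= outsider_survival
```

Lowering outsider_survival should make each region converge on its favored colors faster, while a value of 1.0 turns selection off and reduces to the first experiment.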

This produces the following population drift charts.

As can be seen, each region seems to have a preferred skin color that begins to dominate after a while. So the model, grossly oversimplified as it is, seems to agree with the facts.

I started on this because I was curious if I could build something that approximated reality using randomness (i.e. random.random() or flipping a coin). I had lots of fun with it, and I hope you enjoyed reading it too. The code for this is available on my project on GitHub here.

Saturday, April 20, 2013

Language Model to detect Medical Sentences using NLTK

I've been thinking of ways of singling out medical sentences in a body of mixed text for special processing, and one of the approaches I thought of was to train a trigram (backoff) language model using some medical text, then use the model to detect if a sentence is medical or non-medical. The joint probability of the words appearing in the model should be higher for medical sentences than for non-medical ones.

I initially looked at NLTK's NgramModel, but could not make it work because the Lidstone probability distribution I was passing to it as an estimator expected a minimum number of bins to be configured. Unfortunately, I could not reproduce the error with small amounts of data (so that I could submit a bug report). I also found that the NgramModel can't be pickled (because of a probability distribution function object in it), which made it even less interesting.

In any case, you can find this (non-working) code in my GitHub here. It crashes with a "ValueError - A Lidstone probability distribution must have at least one bin" error message during the testing phase. Unfortunately I can't share the data for licensing reasons. But if you have a reasonably large set of XML files (I had about 3,500) to feed the code, it should hopefully fail at around the same place. [Update: I found a publicly available XML sample and I have asked about this on the nltk-users mailing list - you can follow the discussion here, if you'd like.]

However, it turns out that a trigram language model is quite simple to build, especially using NLTK's building blocks. My language model attempts to first report trigram probabilities, falling back to corresponding bigram and unigram probabilities, and finally reporting a Laplace smoothed estimate if the unigram probability is also 0. Probabilities at lower n-grams are discounted by a (heuristically chosen) value alpha, and the final result normalized by the number of words in the sentence (to remove the effect of long sentences). Because this is a proof of concept to test the validity of the idea more than anything else, I decided to skip the calculation of alpha.

Here is the code for the home grown language model described above (also available in my GitHub). The train() function reads in a bunch of medical XML files and parses out the sentences. These sentences are then used to instantiate the LangModel class, which is then pickled. The test() function then unpickles the model and uses it to compute the log probabilities of sentence trigrams, finally normalizing their sum by the length of the sentence.

from __future__ import division

import math
import os.path

import cPickle
import glob
import nltk
from nltk.corpus.reader import XMLCorpusReader

class LangModel:
  def __init__(self, order, alpha, sentences):
    self.order = order
    self.alpha = alpha
    if order > 1:
      self.backoff = LangModel(order - 1, alpha, sentences)
      self.lexicon = None
    else:
      self.backoff = None
      self.n = 0
    self.ngramFD = nltk.FreqDist()
    lexicon = set()
    for sentence in sentences:
      words = nltk.word_tokenize(sentence)
      wordNGrams = nltk.ngrams(words, order)
      for wordNGram in wordNGrams:
        self.ngramFD.inc(wordNGram)
        if order == 1:
          lexicon.add(wordNGram)
          self.n += 1
    self.v = len(lexicon)

  def logprob(self, ngram):
    return math.log(self.prob(ngram))
  
  def prob(self, ngram):
    if self.backoff != None:
      freq = self.ngramFD[ngram]
      backoffFreq = self.backoff.ngramFD[ngram[1:]]
      if freq == 0:
        return self.alpha * self.backoff.prob(ngram[1:])
      else:
        return freq / backoffFreq
    else:
      # laplace smoothing to handle unknown unigrams
      return ((self.ngramFD[ngram] + 1) / (self.n + self.v))

def train():
  if os.path.isfile("lm.bin"):
    return
  files = glob.glob("data/*.xml")
  sentences = []
  i = 0
  for file in files:
    if i > 0 and i % 500 == 0:
      print("%d/%d files loaded, #-sentences: %d" %
        (i, len(files), len(sentences)))
    dir, file = file.split("/")
    reader = XMLCorpusReader(dir, file)
    sentences.extend(nltk.sent_tokenize(" ".join(reader.words())))
    i += 1
  lm = LangModel(3, 0.4, sentences)
  cPickle.dump(lm, open("lm.bin", "wb"))

def test():
  lm1 = cPickle.load(open("lm.bin", 'rb'))
  testFile = open("sentences.test", 'rb')
  for line in testFile:
    sentence = line.strip()
    print "SENTENCE:", sentence,
    words = nltk.word_tokenize(sentence)
    wordTrigrams = nltk.trigrams(words)
    slogprob = 0
    for wordTrigram in wordTrigrams:
      logprob = lm1.logprob(wordTrigram)
      slogprob += logprob
    print "(", slogprob / len(words), ")"

def main():
  train()
  test()

if __name__ == "__main__":
  main()

And here are the language model's predictions for a set of test sentences I pulled off the Internet (mainly Wikipedia).

In biology, immunity is the state of having sufficient biological defences to avoid infection, disease, or other unwanted biological invasion. (-6.53506411778)
Naturally acquired immunity occurs through contact with a disease causing agent, when the contact was not deliberate, whereas artificially acquired immunity develops only through deliberate actions such as vaccination. (-7.90563670519)
Immunity from prosecution occurs when a prosecutor grants immunity, usually to a witness in exchange for testimony or production of other evidence. (-8.40420096533)
Transactional immunity (colloquially known as "blanket" or "total" immunity) completely protects the witness from future prosecution for crimes related to his or her testimony. (-8.60917860675)
Hearing loss is being partly or totally unable to hear sound in one or both ears. (-1.61661138183)
Conductive hearing loss (CHL) occurs because of a mechanical problem in the outer or middle ear. (-1.98718543565)
Sensorineural hearing loss (SNHL) occurs when the tiny hair cells (nerve endings) that detect sound in the ear are injured, diseased, do not work correctly, or have died. (-2.5566194904)
This type of hearing loss often cannot be reversed. (-2.72710898378)
In law, a hearing is a proceeding before a court or other decision-making body or officer, such as a government agency. (-5.87112753897)
Within some criminal justice systems, a preliminary hearing (evidentiary hearing) is a proceeding, after a criminal complaint has been filed by the prosecutor, to determine whether there is enough evidence to require a trial. (-7.44050739024)

As you can see, sentences that are obviously medical tend to have a higher normalized log probability (the value at the end of the sentence) than sentences that are not. Sentences #1 and #2 are right on the border, with normalized log probabilities comparable to those of the non-medical sentences. Depending on the results of more tests, this model may or may not be good enough. Alternatively, it may be more effective to reframe the problem as one where we have to classify a sentence as belonging to one of multiple genres, and each genre has a language model.
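
The multi-genre idea could look something like the sketch below. This is a self-contained Python 3 toy, not the code I ran: it uses collections.Counter instead of NLTK, splits on whitespace instead of tokenizing properly, and conditions each n-gram on its preceding words (ngram[:-1]), which differs from the listing above. The class name BackoffModel, the classify() helper and the two tiny "corpora" are all made up for illustration.

```python
import math
from collections import Counter


class BackoffModel:
    """Tiny trigram model: discounted backoff, Laplace-smoothed unigrams."""

    def __init__(self, sentences, order=3, alpha=0.4):
        self.alpha = alpha
        # counts[k] holds the k-gram counts for k = 1..order
        self.counts = {k: Counter() for k in range(1, order + 1)}
        for sentence in sentences:
            words = sentence.lower().split()
            for k in range(1, order + 1):
                for i in range(len(words) - k + 1):
                    self.counts[k][tuple(words[i:i + k])] += 1
        self.n = sum(self.counts[1].values())  # total number of tokens
        self.v = len(self.counts[1])           # vocabulary size

    def prob(self, ngram):
        k = len(ngram)
        if k == 1:
            # Laplace smoothing handles unknown unigrams
            return (self.counts[1][ngram] + 1) / (self.n + self.v)
        freq = self.counts[k][ngram]
        if freq == 0:
            # unseen n-gram: back off to the shorter n-gram, discounted
            return self.alpha * self.prob(ngram[1:])
        return freq / self.counts[k - 1][ngram[:-1]]

    def score(self, sentence):
        """Sum of trigram log probabilities, normalized by word count."""
        words = sentence.lower().split()
        trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
        if not trigrams:
            return float("-inf")
        return sum(math.log(self.prob(t)) for t in trigrams) / len(words)


def classify(sentence, models):
    """Return the genre whose model gives the highest normalized score."""
    return max(models, key=lambda genre: models[genre].score(sentence))


# Hypothetical two-sentence corpora, only to make the sketch runnable.
models = {
    "medical": BackoffModel(["hearing loss occurs in the inner ear",
                             "the ear can detect sound"]),
    "legal": BackoffModel(["a hearing is a proceeding before a court",
                           "the prosecutor grants immunity to a witness"]),
}
print(classify("hearing loss occurs in the ear", models))  # medical
print(classify("a proceeding before a court", models))     # legal
```

Comparing scores across models sidesteps the need to pick an absolute threshold on the normalized log probability, which is the awkward part of the single-model approach.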

Anyway, that's all I have for today. Hope you found it interesting.

Saturday, October 17, 2009

Binomial Probability Distribution with Commons-Math

Sometime back, I wrote about a Hadoop Phrase Extractor that went through a bunch of electronic books from the Gutenberg project and extracted possible phrases by parsing out 3-5 word grams. The extractor works fairly well for phrases of 3 words or more with a simple count threshold. However, for 2 word phrases, a count threshold (i.e. only consider word bigrams which occur more than a specific cutoff value) did not work so well for me.

In his book, Building Search Applications - Lucene, LingPipe and Gate, Dr Manu Konchady touches on using Binomial Distributions to compute the likelihood of a word bigram being a phrase based on its number of occurrences and the number of occurrences of its component words in a document corpus. I decided to explore that a bit further, and I try to explain my understanding of the approach in this post.

Binomial Distribution - theory

Although I learned about Binomial Distributions in high school, I never really understood them beyond being able to solve simple toy problems, and in any case, it's been a while, so I decided to brush up on the theory. The AP* Statistics Tutorial: Binomial Distribution page from StatTrek explains it in great detail, so you may also find this page helpful if you are in a similar situation.

Essentially, for a word bigram, we treat each occurrence of the second word as a trial, and estimate the likelihood of observing the actual number of bigram occurrences (the second word preceded by the first) among those trials, using the observed probability of the second word in the corpus as the probability of success on each trial. This definition reduces our phrase detection problem to a Binomial Distribution problem.
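
For reference, the binomial probability of exactly x successes in n trials with per-trial success probability p is C(n, x) * p^x * (1 - p)^(n - x). The sketch below (Python 3, standard library only; the function names are mine) computes it in log space, which matters once n gets large, and checks it against three textbook problems of the kind found on the StatTrek page.

```python
import math


def log_binomial_pmf(x, n, p):
    """Log of P(X = x) for X ~ Binomial(n, p); log-gamma avoids overflow."""
    log_coeff = (math.lgamma(n + 1) - math.lgamma(x + 1)
                 - math.lgamma(n - x + 1))
    return log_coeff + x * math.log(p) + (n - x) * math.log(1 - p)


def binomial_cdf(x, n, p):
    """P(X <= x), summing the individual probabilities from 0 to x."""
    return sum(math.exp(log_binomial_pmf(i, n, p)) for i in range(x + 1))


# Exactly 2 fours in 5 tosses of a die.
print(round(math.exp(log_binomial_pmf(2, 5, 1 / 6)), 3))  # 0.161
# 45 or fewer heads in 100 tosses of a coin.
print(round(binomial_cdf(45, 100, 0.5), 3))               # 0.184
# At most 2 of 5 applicants accepted when each has a 0.3 chance.
print(round(binomial_cdf(2, 5, 0.3), 4))                  # 0.8369
```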

Binomial Distribution Implementation

The commons-math library does not have an implementation for this distribution (this is not true, as I found out later, please see the update below for details), but it provides a Distribution Framework and guidelines for creating your own custom distributions, which I used to create my own subclass. The code is shown below:

// Source: src/main/java/net/sf/jtmt/concurrent/hadoop/phraseextractor/BinomialDistribution.java
package net.sf.jtmt.concurrent.hadoop.phraseextractor;

import org.apache.commons.math.MathException;
import org.apache.commons.math.distribution.AbstractIntegerDistribution;
import org.apache.commons.math.distribution.DiscreteDistribution;
import org.apache.commons.math.util.MathUtils;

/**
 * Provides various probability density functions for a Binomial
 * Distribution. Definitions and formulae for Binomial Distribution
 * can be found here:
 * {@link http://stattrek.com/Lesson2/Binomial.aspx?Tutorial=AP}
 */
public class BinomialDistribution extends AbstractIntegerDistribution 
    implements DiscreteDistribution {

  private static final long serialVersionUID = -1858690105951636184L;

  private int n;    // number of trials
  private double p; // probability of success on an individual trial
  
  /**
   * Construct a Binomial Distribution experiment instance.
   * @param n the number of trials.
   * @param p the probability of success on an individual trial.
   */
  public BinomialDistribution(int n, double p) {
    super();
    this.n = n;
    this.p = p;
  }

  /**
   * Using logs is useful for very large values of n.
   * @param x the number of successes.
   * @return the log of the probability.
   */  
  public double logProbability(int x) {
    return MathUtils.binomialCoefficientLog(n, x) + 
      (x * Math.log(p)) + ((n - x) * Math.log(1 - p));
  }
  
  /**
   * Computes the probability that the experiment results in 
   * exactly x successes. The computation is done with logarithms
   * internally and converted back to the probability value to
   * prevent overflow.
   * @param x the number of successes, should be an integer.
   * @return probability the probability of exactly x successes.
   */
  @Override
  public double probability(int x) {
    return Math.exp(MathUtils.binomialCoefficientLog(n, x) + 
      (x * Math.log(p)) + ((n - x) * Math.log(1 - p)));
  }

  /**
   * Computes the probability that the experiment results in at 
   * most x (0 to x) successes.
   * @param x the number of successes, should be an integer.
   * @return the probability of at most x successes.
   */
  @Override
  public double cumulativeProbability(int x) throws MathException {
    double cumulativeProbability = 0.0D;
    for (int i = 0; i <= x; i++) {
      cumulativeProbability += probability(i);
    }
    return cumulativeProbability;
  }

  @Override
  protected int getDomainLowerBound(double p) {
    return 0;
  }

  @Override
  protected int getDomainUpperBound(double p) {
    return n; // a binomial variable ranges over 0..n
  }
}

To test the component out, I used the problems and their solutions described on the StatTrek page. Here is my JUnit test. The class is not that hard to use, but this also shows the usage of the class.

// Source: src/test/java/net/sf/jtmt/concurrent/hadoop/phraseextractor/BinomialDistributionTest.java
package net.sf.jtmt.concurrent.hadoop.phraseextractor;

import java.math.RoundingMode;

import org.apache.commons.math.util.MathUtils;
import org.junit.Assert;
import org.junit.Test;

/**
 * Tests for Binomial Distribution. The tests solves some problems
 * found in the StatTrek AP Statistics page and compares the results
 * with that provided in the page. Link for the StatTrek page is:
 * {@link http://stattrek.com/Lesson2/Binomial.aspx?Tutorial=AP}
 */
public class BinomialDistributionTest {

  /**
   * Problem: Suppose a die is tossed 5 times. What is the probability
   * of getting exactly 2 fours?
   * @throws Exception if thrown.
   */
  @Test
  public void test1() throws Exception {
    double p = 1.0D / 6.0D;
    BinomialDistribution bd = new BinomialDistribution(5, p);
    double probability = bd.probability(2);
    System.out.println(">> b(x = 2; 5, " +  p + ") = " + probability);
    Assert.assertEquals(0.161, MathUtils.round(probability, 3, 
      RoundingMode.HALF_UP.ordinal()));
  }

  /**
   * Problem: What is the probability of obtaining 45 or fewer heads in
   * 100 tosses of a coin.
   * @throws Exception if thrown.
   */
  @Test
  public void test2() throws Exception {
    BinomialDistribution bd = new BinomialDistribution(100, 0.5D);
    double cumulativeProbability = bd.cumulativeProbability(45);
    System.out.println(">> b(x <= 45; 100, 0.5) = " +
      cumulativeProbability);
    Assert.assertEquals(0.184, MathUtils.round(cumulativeProbability, 3, 
      RoundingMode.HALF_UP.ordinal()));
  }
  
  /**
   * Problem: The probability that a student is accepted into a prestigious
   * college is 0.3. If 5 students from the same school apply, what is the
   * probability that at most 2 are accepted?
   * @throws Exception
   */
  @Test
  public void test3() throws Exception {
    BinomialDistribution bd = new BinomialDistribution(5, 0.3D);
    double cumulativeProbability = bd.cumulativeProbability(2);
    System.out.println(">> b(x <= 2; 5, 0.3) = " +
      cumulativeProbability);
    Assert.assertEquals(0.8369, MathUtils.round(cumulativeProbability, 4,
      RoundingMode.HALF_UP.ordinal()));
  }
}

Application in Phrase Identification

I want to use this in the context of a Hadoop job that will extract the word unigrams and bigrams and store their counts. For that, I wanted to first write some code that will, given some parameters I know I can get from a previous stage of my job, compute the likelihood value for the phrase using the problem definition above. Here is a unit test working from the problem definition.

  @Test
  public void test4() throws Exception {
    // known parameters
    int nNew = 1044;      // number of occurrences of first word
    int nYork = 597;      // number of occurrences of second word
    int nNewYork = 588;   // number of bigram occurrences
    int nWords = 1900000; // total words in document corpus
    // observed probability of second word
    double pYork = (double) nYork / (double) nWords;
    BinomialDistribution bd = new BinomialDistribution(nYork, pYork);
    // our probabilities in data mining situations are usually too small
    // so we should use the log probabilities. The logEstimatedProbability
    // is the log likelihood of "new" preceding the word "york" in the 
    // case where "new" and "york" are assumed to have no relation.
    double logEstimatedProbability = bd.logProbability(nNewYork);
    // compare with the log of the observed probability of the bigram
    double logActualProbability = 
      Math.log((double) nNewYork / (double) nWords);
    // determine whether this is a phrase by a simple test
    boolean likely = (logActualProbability - logEstimatedProbability > 0);
    System.out.println("likely? " + likely);
    Assert.assertTrue(likely);
  }
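
The same check is easy to play with outside of Java. Below is a minimal Python 3 sketch of the test above wrapped in a function (the name is_likely_phrase is mine), using the same counts for "new york". The second call uses a made-up count of a single bigram occurrence to show a case the test rejects.

```python
import math


def log_binomial_pmf(x, n, p):
    """Log of P(X = x) for X ~ Binomial(n, p)."""
    log_coeff = (math.lgamma(n + 1) - math.lgamma(x + 1)
                 - math.lgamma(n - x + 1))
    return log_coeff + x * math.log(p) + (n - x) * math.log(1 - p)


def is_likely_phrase(n_second, n_bigram, n_words):
    """Mirror of test4: compare the observed log probability of the bigram
    with the binomial log probability under the no-relation assumption."""
    p_second = n_second / n_words
    log_estimated = log_binomial_pmf(n_bigram, n_second, p_second)
    log_actual = math.log(n_bigram / n_words)
    return log_actual - log_estimated > 0


# Counts from the unit test above: "york" 597 times, "new york" 588 times.
print(is_likely_phrase(597, 588, 1900000))  # True
# Hypothetical: the same second word, but the bigram seen only once.
print(is_likely_phrase(597, 1, 1900000))    # False
```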

The method outlined in the book is significantly more complex and uses a null independence hypothesis and a corresponding dependence hypothesis. Likelihood ratios are computed for the cases where New precedes York and where New does not precede York for both hypotheses. I traced through the method but could not figure out why the algorithm needs to be so complex, so I decided to do it the way I described above. If you find flaws in this approach, please let me know.

Update 2009-10-23: commons-math provides a Binomial Distribution implementation (called BinomialDistributionImpl), and has done so at least since version 1.0. I am not sure how I missed it, but unfortunately I did. The DistributionFactory class referenced here seems to have disappeared from commons-math 2.0 (which I am using), so I guess you can instantiate the implementation with a BinomialDistribution bd = new BinomialDistributionImpl(n, p) call. However, the current implementation does not use the BinomialCoefficient.binomialCoefficientLog() call, which makes it unsuitable for my usage (see the logProbability() method in my implementation). One possibility is to subclass BinomialDistributionImpl and add it in there. Another is to simply create a method that takes the n, p and x parameters and just returns the logProbability, which is probably the path I will be taking.
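The "method that takes n, p and x" idea can be sketched roughly as follows. This is only an illustration, not code from commons-math or from my earlier post: the class name BinomialLog and both method names are made up for this example, and the log of the binomial coefficient is computed as a plain sum of logs so that nothing overflows for the corpus-sized counts used here.

```java
// Sketch only: BinomialLog and its methods are hypothetical names,
// not part of commons-math.
final class BinomialLog {

  private BinomialLog() {}

  /** log of C(n, x), accumulated as a sum of logs to avoid overflow. */
  static double logBinomialCoefficient(int n, int x) {
    if (x < 0 || x > n) {
      throw new IllegalArgumentException("need 0 <= x <= n");
    }
    int k = Math.min(x, n - x); // C(n, x) == C(n, n - x)
    double sum = 0.0;
    for (int i = 1; i <= k; i++) {
      // C(n, k) = product over i = 1..k of (n - k + i) / i
      sum += Math.log(n - k + i) - Math.log(i);
    }
    return sum;
  }

  /** log P(X = x) for X ~ Binomial(n, p), with 0 < p < 1. */
  static double logProbability(int n, double p, int x) {
    if (p <= 0.0 || p >= 1.0) {
      throw new IllegalArgumentException("need 0 < p < 1");
    }
    return logBinomialCoefficient(n, x)
      + x * Math.log(p)
      + (n - x) * Math.log1p(-p); // log(1 - p), accurate for tiny p
  }
}
```

With the parameters from the unit test above, the call would be BinomialLog.logProbability(nYork, pYork, nNewYork), which plays the role of bd.logProbability(nNewYork).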

Friday, November 14, 2008

IR Math in Java : Rule based POS Tagger

In my previous post, I described an HMM based Part of Speech tagger. This post describes a rule based POS tagger loosely based on ideas underlying the design of the Brill Tagger, as described in Chapter 4 of the TMAP book.

The rule based tagger provides a single method tagSentence(), which takes a sentence as input, and returns the same sentence tagged with the appropriate part of speech. I was too lazy to add a convenience method to return the POS for a given word in a sentence, since it's really simple, and I have already done this in the code in my previous post.

Wordnet is used to find the part of speech for each word in the sentence. We use MIT Java Wordnet Interface (JWI) to access the Wordnet database. Wordnet can only recognize the following four parts of speech - NOUN, VERB, ADJECTIVE and ADVERB. Therefore, our POS tagging is restricted to these four and OTHER. We enhance our Pos enum class from our previous post with methods to convert Wordnet POS to and from our Pos enum, as shown below:

// Source: src/main/java/com/mycompany/myapp/postaggers/Pos.java
package com.mycompany.myapp.postaggers;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.collections15.BidiMap;
import org.apache.commons.collections15.bidimap.DualHashBidiMap;
import org.apache.commons.lang.StringUtils;

import edu.mit.jwi.item.POS;

public enum Pos {

  NOUN, VERB, ADJECTIVE, ADVERB, OTHER;

  private static Map<String,Pos> bmap = null;
  private static BidiMap<Pos,POS> wmap = null;
  private static final String translationFile = 
    "src/main/resources/brown_tags.txt";
  
  public static Pos fromBrownTag(String btag) throws Exception {
    // .. omitted for brevity, see previous post for body
  }

  public static Pos fromWordnetPOS(POS pos) {
    if (wmap == null) {
      wmap = buildPosBidiMap();
    }
    return wmap.getKey(pos);
  }
  
  public static POS toWordnetPos(Pos pos) {
    if (wmap == null) {
      wmap = buildPosBidiMap();
    }
    return wmap.get(pos);
  }

  private static BidiMap<Pos,POS> buildPosBidiMap() {
    wmap = new DualHashBidiMap<Pos,POS>();
    wmap.put(Pos.NOUN, POS.NOUN);
    wmap.put(Pos.VERB, POS.VERB);
    wmap.put(Pos.ADJECTIVE, POS.ADJECTIVE);
    wmap.put(Pos.ADVERB, POS.ADVERB);
    wmap.put(Pos.OTHER, null);
    return wmap;
  }
}

When Wordnet is asked what the POS of a particular word is, it can return one of the following three results, each of which is handled differently as described below:

No POS found for word
Single Unique POS found for word
Multiple POS found for word

No POS found for Word

In this case, Wordnet may not know about the word being checked, or it could be a proper noun. We use a combination of word pattern rules to try and guess the POS for this word. If none of the patterns match, then the POS is considered to be OTHER.

First, we check to see if the first letter is uppercase. If it is, we assume that the word is a proper noun, and therefore tag it with Pos.NOUN.

If not, we check if the word ends with one of the known suffixes that exist in our suffix to POS mappings, longest suffix first. The POS corresponding to the first matched suffix is used to tag the word.

If not, the word is tagged as Pos.OTHER.
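The three steps above can be sketched in isolation as follows. This is a minimal standalone illustration, not the tagger itself: the class name UnknownWordGuesser is hypothetical, the Pos enum is redeclared locally so the snippet compiles on its own, and the suffix table is passed in by the caller rather than read from the mapping file.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: UnknownWordGuesser is a hypothetical name for illustration.
final class UnknownWordGuesser {

  enum Pos { NOUN, VERB, ADJECTIVE, ADVERB, OTHER }

  // Sorted so that longer suffixes are visited first; ties are broken
  // alphabetically so that distinct suffixes never collapse into one key.
  private final Map<String, Pos> suffixes = new TreeMap<String, Pos>(
      (String a, String b) -> a.length() != b.length()
          ? Integer.compare(b.length(), a.length())
          : a.compareTo(b));

  UnknownWordGuesser(Map<String, Pos> suffixToPos) {
    suffixes.putAll(suffixToPos);
  }

  Pos guess(String word) {
    if (word.isEmpty()) {
      return Pos.OTHER;
    }
    // Rule 1: a leading uppercase letter is taken to mean a proper noun.
    if (Character.isUpperCase(word.charAt(0))) {
      return Pos.NOUN;
    }
    // Rule 2: first (that is, longest) matching suffix wins.
    String lower = word.toLowerCase(Locale.ROOT);
    for (Map.Entry<String, Pos> entry : suffixes.entrySet()) {
      if (lower.endsWith(entry.getKey())) {
        return entry.getValue();
      }
    }
    // Rule 3: nothing matched.
    return Pos.OTHER;
  }
}
```

Visiting the longest suffix first matters whenever one suffix is itself the tail of another, since the more specific rule should get the first chance to match.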

Single Unique POS found for Word

In this case, there is no confusion -- Wordnet tells us that there is a single POS found for the word, so we tag the word with this POS, and continue on with our life...err, I mean, the next word.

Multiple POS found for Word

When Wordnet reports multiple POS possibilities for a word, it means that the word can be used as different POS depending on where it is used in the sentence - in other words, the context determines the POS.

The tagger considers the context as the word trigram surrounding the word (i.e. before/current/after). Two rules are fired, the word-backward rule and the word-forward rule, unless the word happens to occur at the beginning or end of the sentence, in which case only one rule is fired. The objective of these rules is to find the most likely POS for the word based on the POS of its anterior and posterior neighbors.

The probabilities used to compute the likelihood come from the transition probabilities (A values) that were computed from the Brown Corpus in my previous post.

Each rule looks up the transition probability of each candidate POS pair (before/current for the backward rule, current/after for the forward rule) and credits it to the word's candidate POS. The probabilities from the two rules are then added, and the candidate POS with the highest total is used to tag the word.

In addition, we could have used the initial state probabilities (Π) from our last post to similarly resolve ambiguous POS for the first word in the sentence, but we did not do this.
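The two neighbor rules can be sketched on their own as follows. This is a simplified illustration under a few assumptions: the class name NeighborRuleSketch is hypothetical, the Pos enum is redeclared locally, the matrix values are copied from the transition probability file listed later in this post, and the rows are assumed to be in the same order as the column header (NOUN, VERB, ADJECTIVE, ADVERB, OTHER).

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Sketch only: NeighborRuleSketch is a hypothetical name for illustration.
final class NeighborRuleSketch {

  enum Pos { NOUN, VERB, ADJECTIVE, ADVERB, OTHER }

  // TP[from][to]; row order assumed to match the enum order above.
  static final double[][] TP = {
    {0.155, 0.156, 0.019, 0.025, 0.645},
    {0.095, 0.195, 0.168, 0.094, 0.449},
    {0.639, 0.024, 0.148, 0.005, 0.183},
    {0.052, 0.228, 0.111, 0.041, 0.569},
    {0.206, 0.199, 0.205, 0.039, 0.351}
  };

  /**
   * Picks the candidate POS with the highest summed transition
   * probability. Pass an empty list for a missing neighbor.
   */
  static Pos disambiguate(List<Pos> candidates, List<Pos> prev,
      List<Pos> next) {
    Map<Pos, Double> scores = new EnumMap<Pos, Double>(Pos.class);
    for (Pos cand : candidates) {
      for (Pos p : prev) {   // backward rule: neighbor -> word
        scores.merge(cand, TP[p.ordinal()][cand.ordinal()], Double::sum);
      }
      for (Pos n : next) {   // forward rule: word -> neighbor
        scores.merge(cand, TP[cand.ordinal()][n.ordinal()], Double::sum);
      }
    }
    Pos best = Pos.OTHER;    // returned when no rule could fire
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Map.Entry<Pos, Double> e : scores.entrySet()) {
      if (e.getValue() > bestScore) {
        best = e.getKey();
        bestScore = e.getValue();
      }
    }
    return best;
  }
}
```

For example, a word that can be a NOUN or a VERB and follows an ADJECTIVE scores 0.639 as a NOUN against 0.024 as a VERB from the backward rule alone, so it is tagged as a NOUN.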

Tagger code

The code for the tagger is shown below:

// Source: src/main/java/com/mycompany/myapp/postaggers/RuleBasedTagger.java
package com.mycompany.myapp.postaggers;

import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import com.mycompany.myapp.clustering.ByValueComparator;
import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;
import com.mycompany.myapp.tokenizers.WordTokenizer;

import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;

public class RuleBasedTagger {

  private final Log log = LogFactory.getLog(getClass());
  
  private class Context {
    public String prev;
    public List<Pos> prevPos;
    public String curr;
    public List<Pos> nextPos;
    public String next;
    public String toString() {
      return StringUtils.join(new String[] {prev,curr,next}, "/");
    }
  };
  
  private IDictionary wordnetDictionary;
  private Map<String,Pos> suffixPosMap;
  private double[][] tp;
  
  public void setWordnetDictLocation(String wordnetDictLocation) 
      throws Exception {
    this.wordnetDictionary = new Dictionary(
      new URL("file", null, wordnetDictLocation));
    this.wordnetDictionary.open();
  }

  public void setSuffixMappingLocation(String suffixMappingLocation) 
      throws Exception {
    String line;
    this.suffixPosMap = new TreeMap<String,Pos>(
      new Comparator<String>() {
        public int compare(String s1, String s2) {
          int l1 = s1.length();
          int l2 = s2.length();
          if (l1 == l2) {
            return s1.compareTo(s2);
          } else {
            return (l2 > l1 ? 1 : -1);
          }
        }
      }
    );
    BufferedReader reader = new BufferedReader(
      new FileReader(suffixMappingLocation));
    while ((line = reader.readLine()) != null) {
      if (StringUtils.isEmpty(line) || line.startsWith("#")) {
        continue;
      }
      String[] suffixPosPair = StringUtils.split(line, "\t");
      suffixPosMap.put(suffixPosPair[0], Pos.valueOf(suffixPosPair[1]));
    }
    reader.close();
  }

  public void setTransitionProbabilityDatafile(
      String transitionProbabilityDatafile) throws Exception {
    int numPos = Pos.values().length;
    tp = new double[numPos][numPos];
    BufferedReader reader = new BufferedReader(
      new FileReader(transitionProbabilityDatafile));
    int i = 0; // row
    String line;
    while ((line = reader.readLine()) != null) {
      if (StringUtils.isEmpty(line) || line.startsWith("#")) {
        continue;
      }
      String[] parts = StringUtils.split(line, "\t");
      for (int j = 0; j < parts.length; j++) {
        tp[i][j] = Double.valueOf(parts[j]);
      }
      i++;
    }
    reader.close();
  }

  public String tagSentence(String sentence) throws Exception {
    StringBuilder taggedSentenceBuilder = new StringBuilder();
    WordTokenizer wordTokenizer = new WordTokenizer();
    wordTokenizer.setText(sentence);
    List<Token> tokens = new ArrayList<Token>();
    Token token = null;
    while ((token = wordTokenizer.nextToken()) != null) {
      tokens.add(token);
    }
    // extract the words from the tokens
    List<String> words = new ArrayList<String>();
    for (Token tok : tokens) {
      if (tok.getType() == TokenType.WORD) {
        words.add(tok.getValue());
      }
    }
    // for each word, find the pos
    int position = 0;
    for (String word : words) {
      Pos partOfSpeech = getPartOfSpeech(words, word, position);
      taggedSentenceBuilder.append(word).
        append("/").
        append(partOfSpeech.name()).
        append(" ");
      position++;
    }
    return taggedSentenceBuilder.toString();
  }

  private Pos getPartOfSpeech(List<String> wordList, String word, 
      int position) {
    List<Pos> partsOfSpeech = getPosFromWordnet(word);
    int numPos = partsOfSpeech.size();
    if (numPos == 0) {
      // unknown Pos, apply word rules to figure out Pos
      if (startsWithUppercase(word)) {
        return Pos.NOUN;
      }
      Pos pos = getPosBasedOnSuffixRules(word);
      if (pos != null) {
        return pos;
      } else {
        return Pos.OTHER;
      }
    } else if (numPos == 1) {
      // unique Pos, return
      return partsOfSpeech.get(0);
    } else {
      // ambiguous Pos, apply disambiguation rules
      Context context = getContext(wordList, position);
      Map<Pos,Double> posProbs = new HashMap<Pos,Double>();
      if (context.prev != null) {
        // backward neighbor rule
        accumulatePosProbabilities(posProbs, word, partsOfSpeech, 
          context.prev, context.prevPos, false);
      }
      if (context.next != null) {
        // forward neighbor rule
        accumulatePosProbabilities(posProbs, word, partsOfSpeech, 
          context.next, context.nextPos, true);
      }
      if (posProbs.size() == 0) {
        return Pos.OTHER;
      } else {
        ByValueComparator<Pos,Double> bvc = 
          new ByValueComparator<Pos,Double>(posProbs);
        List<Pos> poslist = new ArrayList<Pos>();
        poslist.addAll(posProbs.keySet());
        Collections.sort(poslist, Collections.reverseOrder(bvc));
        return poslist.get(0);
      }
    }
  }

  private List<Pos> getPosFromWordnet(String word) {
    List<Pos> poslist = new ArrayList<Pos>();
    for (Pos pos : Pos.values()) {
      try {
        IIndexWord indexWord = 
          wordnetDictionary.getIndexWord(word, Pos.toWordnetPos(pos));
        if (indexWord != null) {
          poslist.add(pos);
        }
      } catch (NullPointerException e) {
        // JWI throws this if it cannot find the word in its dictionary
        // so we just dont add anything to the poslist.
        continue;
      }
    }
    return poslist;
  }

  private boolean startsWithUppercase(String word) {
    // true when the first character of the word is an uppercase letter
    return Character.isUpperCase(word.charAt(0));
  }

  private Pos getPosBasedOnSuffixRules(String word) {
    for (String suffix : suffixPosMap.keySet()) {
      if (StringUtils.lowerCase(word).endsWith(suffix)) {
        return suffixPosMap.get(suffix);
      }
    }
    return null;
  }

  private Context getContext(List<String> words, int wordPosition) {
    Context context = new Context();
    if ((wordPosition - 1) >= 0) {
      context.prev = words.get(wordPosition - 1);
      context.prevPos = getPosFromWordnet(context.prev);
    }
    context.curr = words.get(wordPosition);
    if ((wordPosition + 1) < words.size()) {
      context.next = words.get(wordPosition + 1);
      context.nextPos = getPosFromWordnet(context.next);
    }
    return context;
  }
  
  private void accumulatePosProbabilities(
      Map<Pos,Double> posProbabilities,
      String word, List<Pos> wordPosList, String neighbor, 
      List<Pos> neighborPosList, boolean isForwardRule) {
    if (isForwardRule) {
      for (Pos wordPos : wordPosList) {
        for (Pos neighborPos : neighborPosList) {
          double prob = 
            tp[wordPos.ordinal()][neighborPos.ordinal()];
          updatePosProbabilities(posProbabilities, wordPos, prob);
        }
      }
    } else {
      for (Pos neighborPos : neighborPosList) {
        for (Pos wordPos : wordPosList) {
          double prob = 
            tp[neighborPos.ordinal()][wordPos.ordinal()];
          updatePosProbabilities(posProbabilities, wordPos, prob);
        }
      }
    }
  }

  private void updatePosProbabilities(
      Map<Pos,Double> posProbabilities,
      Pos wordPos, double prob) {
    Double origProb = posProbabilities.get(wordPos);
    if (origProb == null) {
      posProbabilities.put(wordPos, prob);
    } else {
      posProbabilities.put(wordPos, prob + origProb);
    }
  }
}

Test Code and Data Files

The test code for this is really simple. All we do is to instantiate the RuleBasedTagger and set into it the location of the Wordnet dictionary, the suffix mapping data file and the transition probabilities (A values) from the HMM data file that we built in our previous post. The files are described in more detail below. Once instantiated and set up, we feed it a set of sentences, and get back POS-tagged sentences. Here is the code for the JUnit test.

// Source: src/test/java/com/mycompany/myapp/postaggers/RuleBasedTaggerTest.java
package com.mycompany.myapp.postaggers;

import org.junit.Test;

public class RuleBasedTaggerTest {

  private String[] INPUT_TEXTS = {
    "The growing popularity of Linux in Asia, Europe, and the US is " +
    "a major concern for Microsoft.",
    "Jaguar will sell its new XJ-6 model in the US for a small fortune.",
    "The union is in a sad state.",
    "Please do not state the obvious.",
    "I am looking forward to the state of the union address.",
    "I have a bad cold today.",
    "The cold war was over long ago."
  };

  @Test
  public void testTagSentence() throws Exception {
    for (String sentence : INPUT_TEXTS) {
      RuleBasedTagger tagger = new RuleBasedTagger();
      tagger.setWordnetDictLocation("/opt/wordnet-3.0/dict");
      tagger.setSuffixMappingLocation("src/main/resources/pos_suffixes.txt");
      tagger.setTransitionProbabilityDatafile(
        "src/main/resources/pos_trans_prob.txt");
      String taggedSentence = tagger.tagSentence(sentence);
      System.out.println("Original: " + sentence);
      System.out.println("Tagged:   " + taggedSentence);
    }
  }
}

I am using Wordnet-3.0 data files which can be downloaded from here.

The suffix to POS mapping file was created manually from data available here and here. The suffix and POS are tab separated. A partial listing follows:

# Source: src/main/resources/pos_suffixes.txt
# POS Suffixes
#SUFFIX POS
dom     NOUN
ity     NOUN
ment    NOUN
sion    NOUN
tion    NOUN
ness    NOUN
ance    NOUN
ence    NOUN
er      NOUN
...

The transition probabilities (A values) are taken from the HMM text file, which was generated from the Brown Corpus as described in my previous post. The file is a tab separated file of observed probabilities for transitioning from one POS to another. The numbers should add up to 1 across each line. Here is what the file looks like:

# Source: src/main/resources/pos_trans_prob.txt
# NOUN  VERB    ADJECTIVE  ADVERB  OTHER
0.155   0.156   0.019      0.025   0.645
0.095   0.195   0.168      0.094   0.449
0.639   0.024   0.148      0.005   0.183
0.052   0.228   0.111      0.041   0.569
0.206   0.199   0.205      0.039   0.351
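A quick sanity check on this file is to confirm that every row sums to 1. The sketch below is one way to do that and is not part of the tagger: the class name TransitionMatrixCheck is hypothetical, and it splits on any whitespace rather than strictly on tabs so that it also works on the listing as printed here. Since the values are rounded to three decimals, the check needs a tolerance a little above 0.001.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: TransitionMatrixCheck is a hypothetical helper.
final class TransitionMatrixCheck {

  private TransitionMatrixCheck() {}

  /** Parses whitespace-separated rows, skipping blank and '#' lines. */
  static double[][] parse(String text) {
    List<double[]> rows = new ArrayList<double[]>();
    for (String line : text.split("\n")) {
      String trimmed = line.trim();
      if (trimmed.isEmpty() || trimmed.startsWith("#")) {
        continue;
      }
      String[] parts = trimmed.split("\\s+");
      double[] row = new double[parts.length];
      for (int j = 0; j < parts.length; j++) {
        row[j] = Double.parseDouble(parts[j]);
      }
      rows.add(row);
    }
    return rows.toArray(new double[0][]);
  }

  /** True if every row sums to 1 within the given tolerance. */
  static boolean rowsSumToOne(double[][] matrix, double tolerance) {
    for (double[] row : matrix) {
      double sum = 0.0;
      for (double value : row) {
        sum += value;
      }
      if (Math.abs(sum - 1.0) > tolerance) {
        return false;
      }
    }
    return true;
  }
}
```

Calling rowsSumToOne(parse(fileContents), 0.002) on the listing above should return true, while a much tighter tolerance would flag the rows whose rounded values add up to 0.999 or 1.001.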

Results

The results of running the test are shown below (edited for readability by adding newlines to break up the original and tagged sentences so words don't break across lines). As you can see, the tagging is fairly accurate.

Original: The growing popularity of Linux in Asia, Europe, and the US is 
          a major concern for Microsoft.
Tagged:   The/OTHER growing/ADJECTIVE popularity/NOUN of/OTHER Linux/NOUN 
          in/ADJECTIVE Asia/NOUN Europe/NOUN and/OTHER the/OTHER US/NOUN 
          is/OTHER a/NOUN major/ADJECTIVE concern/NOUN for/NOUN 
          Microsoft/OTHER 

Original: Jaguar will sell its new XJ-6 model in the US for a small fortune.
Tagged:   Jaguar/NOUN will/NOUN sell/VERB its/OTHER new/OTHER XJ-6/OTHER
          model/ADJECTIVE in/NOUN the/OTHER US/NOUN for/NOUN a/NOUN 
          small/ADJECTIVE fortune/NOUN 

Original: The union is in a sad state.
Tagged:   The/OTHER union/OTHER is/OTHER in/ADJECTIVE a/NOUN sad/ADJECTIVE
          state/NOUN 

Original: Please do not state the obvious.
Tagged:   Please/VERB do/VERB not/ADVERB state/VERB the/OTHER 
          obvious/ADJECTIVE 

Original: I am looking forward to the state of the union address.
Tagged:   I/ADJECTIVE am/NOUN looking/ADJECTIVE forward/NOUN to/OTHER 
          the/OTHER state/OTHER of/OTHER the/OTHER union/ADJECTIVE 
          address/NOUN 

Original: I have a bad cold today.
Tagged:   I/ADJECTIVE have/NOUN a/NOUN bad/ADJECTIVE cold/NOUN today/NOUN 

Original: The cold war was over long ago.
Tagged:   The/OTHER cold/ADJECTIVE war/NOUN was/OTHER over/ADVERB 
          long/VERB ago/ADJECTIVE

Conclusion

Unlike the HMM approach, where all the information was built into the model at the outset, the rule based approach takes advantage of the fact that we can use a tagged dictionary (Wordnet) to tell the POS for a word as we encounter it, and that most words resolve to a single POS. For words that are not recognized by Wordnet, we use simple word pattern matching to deduce the POS. For words that resolve to multiple POS, we use collocation probability rules to disambiguate them.

And now, for something completely different...

On a completely different note, regular (i.e. more than first-time) readers may have noticed some new widgets on my blog. This week, I added a tag cloud widget, spiffed up the "Blogs I read" widget with a new one from Blogger which displays the favicon of the blog and reports on the last time it was updated, added social bookmarking links at the bottom of the post, and added URLs for Atom and RSS Syndication feeds for my blog. Hope you like them, and no, this is not some sinister black-hat SEO bid to pollute/enrich the World Wide Web with my blog contents. For my reasons, read on...

I've wanted to get a tag cloud ever since I wrote about how to build one with Python. I think it is an awesome way to provide a snapshot of the entire site contents at a glance. However, Blogger does not give you a Tag Cloud Widget, although it does give you a Label widget on which this one is based, and I never had the time to mess around with the HTML Template until last week.

For the social bookmarking links, I noticed that one of you had submitted one of my posts to Digg. Whoever it was, thank you, especially since you had to do this manually. For those of you who like my posts enough to submit them to your favorite social bookmarking site, there is now a bank of links under my byline and tags which you can click and which will pre-populate the title and URL so you don't have to cut and paste.

The syndication URLs just seemed to be a good idea because I've written quite a few times about ROME and RSS, and it's funny that I didn't offer a syndication feed URL myself before this. So now I offer two, one for RSS and another for Atom - take your pick :-).

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.
