Statistical Models of Word Frequency and Other Count Data


1 Statistical Models of Word Frequency and Other Count Data Martin Jansche

2 Motivation Item counts are commonly used in NLP as independent variables in many applications: information retrieval, topic detection and tracking, text categorization, among many others. Generative models are in widespread use. Such models make predictions about the distribution of word counts, word and document lengths, etc. Parametric models are equally widespread. Their assumptions need to be checked against data.

3 What this talk is about Accurately modeling discrete properties of text documents, such as length, word frequencies, etc. Focus on word frequency in documents. Claim 1: In addition to overdispersion, variation of word frequency across documents is largely due to zero-inflation. Claim 2: Modeling zero-inflation is often preferable to modeling overdispersion.

4 What this talk is not about Estimating the probabilities of unseen words. Instead, focus on words that occur zero times in most documents (true of most words!), but do occur a few times in a small number of documents. State-of-the-art text classification. Text classification using an independent feature model is used merely for illustration, since it is simple and benefits from richer models for individual features.

5 Parametric models Encode all properties of a distribution in (typically) very few parameters. Easy to incorporate prior information about plausible values for parameters. Can work with very small amounts of data. Can work with sparse data. Often closed form expressions are available for moments, probabilities, percentiles, etc.

6 Linguistic count data Focus on modeling document length and word frequency in documents. Sample sizes are often small: most words are extremely rare and most documents are fairly short. Overdispersion: natural variation not well captured by simple models with very few parameters. Zero-inflation: most words occur zero times in a given document; not captured by standard models.

7 Claim 1 Overdispersed models can capture increased variance of token frequency [Mosteller and Wallace 1964, 1984; Church and Gale 1995]. Zero-inflation accounts for variation not captured by overdispersed models. Need to develop a zero-inflated extension of a robust, overdispersed model of token frequency. Zero-inflation can be observed in M&W's data.

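8 Poisson family models Start with the Poisson distribution with rate $\lambda > 0$:

$$\mathrm{Poisson}(\lambda)(x) = \frac{\lambda^x}{x!} \exp(-\lambda)$$

A natural generalization of the Poisson is the Negative Binomial distribution, with an additional parameter $\kappa > 0$ that controls non-Poissonness:

$$\mathrm{NegBin}(\lambda, \kappa)(x) = \frac{\lambda^x}{x!} \, \frac{\kappa^\kappa}{(\lambda + \kappa)^\kappa} \, \frac{\Gamma(\kappa + x)}{\Gamma(\kappa)\,(\lambda + \kappa)^x}$$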
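
The two probability mass functions on this slide can be evaluated directly. The following is a minimal sketch in Python, working in log space with `math.lgamma`; the function names are illustrative and not taken from the talk.

```python
import math

def poisson_pmf(x, lam):
    """Poisson(lam) probability of the count x, computed in log space."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

def negbin_pmf(x, lam, kappa):
    """NegBin(lam, kappa) probability of the count x (mean lam, extra parameter kappa)."""
    log_p = (x * math.log(lam) - math.lgamma(x + 1)             # lam^x / x!
             + kappa * (math.log(kappa) - math.log(lam + kappa))  # (kappa / (lam + kappa))^kappa
             + math.lgamma(kappa + x) - math.lgamma(kappa)        # Gamma(kappa + x) / Gamma(kappa)
             - x * math.log(lam + kappa))                         # 1 / (lam + kappa)^x
    return math.exp(log_p)

if __name__ == "__main__":
    # Parameter values borrowed from the "were" [Madison] example later in the talk.
    for x in range(4):
        print(x, poisson_pmf(x, 0.45), negbin_pmf(x, 0.45, 1.17))
```

For very large κ the Negative Binomial probabilities approach the Poisson ones, which is one way to read κ as a measure of non-Poissonness.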

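9
Quantity | Poisson(λ) | NegBin(λ, κ)
Mean $\mu$ | $\lambda$ | $\lambda$
Variance $\sigma^2$ | $\lambda$ | $\lambda\,(1 + \lambda/\kappa)$
Skewness $\gamma_1$ | $\frac{1}{\sqrt{\lambda}}$ | $\frac{1}{\sqrt{\lambda}} \cdot \frac{\kappa + 2\lambda}{\sqrt{\kappa\,(\lambda + \kappa)}}$
Kurtosis $\gamma_2$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda} \cdot \frac{\kappa}{\lambda + \kappa} + \frac{6}{\kappa}$
Mode | $\lfloor \lambda \rfloor$ | $\lfloor \lambda\,(1 - 1/\kappa) \rfloor$ if $\kappa \ge 1$; $0$ if $0 < \kappa < 1$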

10 Example: Subject lines of spam [Figure: frequency (number of messages) against length of subject line (number of characters); observed counts with fitted Poisson(38.35) and NegBin(38.35, 5.52).]

11 Detailed examples Mosteller and Wallace's [1964, 1984] data, taken from The Federalist papers. Essays by Alexander Hamilton and James Madison (and John Jay) on the shape of the proposed US constitution. M&W sampled approx. 250 contiguous passages of equal length for each of the two main authors.

12 Some words follow the Poisson [Figure: frequency (number of passages) against occurrences of "any" [Hamilton]; observed counts with fitted Poisson(0.67).]

13 Some words follow the Neg. Binomial [Figure: frequency (number of passages) against occurrences of "were" [Madison]; observed counts with fitted Poisson(0.45) and NegBin(0.45, 1.17).]

14 And some words are special For example, "his" (Hamilton and Madison pooled) in Mosteller and Wallace's data. The method of maximum likelihood leads to NegBin(0.54, 0.15). Here's what that model has to say: [Table: observed (obsrvd) and expected (expctd) number of passages.]

15 Alternatively, we could have estimated the parameters based on: (a) the number of documents with zero occurrences of "his"; and (b) the number of documents with one occurrence of "his". Not surprisingly, the resulting model, NegBin(0.76, 0.11), is worse: [Table: observed (obsrvd) and expected (expctd) number of passages.]

16 [Figure: frequency (number of passages) against occurrences of "his" [Hamilton, Madison]; observed counts with NegBin(0.54, 0.15), NegBin(0.76, 0.11), and 0.34 NegBin(1.56, 0.89).]

17 Adaptation, burstiness and all that Church [2000]: "The first mention of a word obviously depends on frequency, but surprisingly, the second does not. Adaptation [the degree to which the probability of a word encountered in recent context is increased] depends more on lexical content than frequency[.]" Church, concerned mostly with empirical exploration, used nonparametric methods. How can his findings be incorporated into a parametric setting?

18 A modest proposal Whether a given word appears at all in a document is one thing. How often it appears, if it does, is another thing. Not all words are appropriate in a given context (taboo words, technical jargon, proper names). A writer's/speaker's active vocabulary is limited and idiosyncratic ("(tom/pot)atos" / "(tom/pot)atoes"). We insist on capturing non-zero occurrences with parametric models, but treat zeroes specially.

19 [Figure: frequency (number of passages) against occurrences of "his" [Hamilton, Madison]; observed counts with NegBin(0.54, 0.15), NegBin(0.76, 0.11), and 0.34 NegBin(1.56, 0.89).]

20 A concrete modest proposal Two-component mixture: first component is a degenerate distribution at zero (or possibly a geometric distribution starting at zero); second component a standard distribution F with parameter vector θ, e.g. from the Poisson or Binomial family.

$$\mathrm{ZIF}(z, \theta)(x) = z\,(x \equiv 0) + (1 - z)\,F(\theta)(x)$$

where $(x \equiv 0)$ is 1 if $x = 0$ and 0 otherwise, and $0 \le z \le 1$ ($z < 0$ may be allowable).

21 Properties of ZIF If F(θ) has mean $\mu$ and variance $\sigma^2$, then ZIF(z, θ) has mean $(1 - z)\,\mu$ and variance $(1 - z)\,(\sigma^2 + z\,\mu^2)$. Furthermore, ZIF(z, θ) has the same modes as F(θ) plus potentially an additional mode at zero.
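
The mean and variance stated on this slide can be checked numerically. The sketch below takes a Poisson as the base distribution F and compares moments computed from the mixture's probabilities with the closed forms; the values of `z` and `lam` are illustrative assumptions, not figures from the talk.

```python
import math

def poisson_pmf(x, lam):
    """Poisson(lam) probability of the count x."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

def zif_pmf(x, z, base_pmf):
    """Mixture with weight z on a point mass at zero and weight 1 - z on base_pmf."""
    return z * (1.0 if x == 0 else 0.0) + (1.0 - z) * base_pmf(x)

def moments(pmf, upper=200):
    """Mean and variance of a count distribution, truncating the sums at `upper`."""
    mean = sum(x * pmf(x) for x in range(upper))
    var = sum((x - mean) ** 2 * pmf(x) for x in range(upper))
    return mean, var

z, lam = 0.6, 1.5          # illustrative values only
mu, sigma2 = lam, lam      # mean and variance of the Poisson base distribution
mean, var = moments(lambda x: zif_pmf(x, z, lambda k: poisson_pmf(k, lam)))

print(mean, (1.0 - z) * mu)                    # numerical mean vs. closed form
print(var, (1.0 - z) * (sigma2 + z * mu ** 2))  # numerical variance vs. closed form
```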

22 Zero-inflated distributions Straightforward interpretation of generative process: pretend there is a z-biased coin; flip coin; on heads, generate 0; on tails, generate according to F. If parameter vector θ of F can be estimated straightforwardly, use EM to estimate z and θ. Otherwise use multidimensional maximization algorithms.
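
As an illustration of the coin-flip process and of EM estimation, here is a minimal sketch for the zero-inflated Poisson case, chosen because the Poisson rate has a closed-form update. The function names, starting values, and iteration count are assumptions made for this example, not details from the talk.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small lam."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def sample_zip(z, lam, n, rng):
    """Flip a z-biased coin: heads gives 0, tails gives a Poisson(lam) draw."""
    return [0 if rng.random() < z else sample_poisson(lam, rng) for _ in range(n)]

def fit_zip_em(xs, iters=200):
    """EM for ZIPoisson(z, lam); returns the estimates (z, lam)."""
    n, total = len(xs), sum(xs)
    n_zero = sum(1 for x in xs if x == 0)
    z, lam = 0.5, max(total / n, 1e-6)   # crude starting point
    for _ in range(iters):
        # E-step: probability that an observed zero came from the point mass.
        w = z / (z + (1.0 - z) * math.exp(-lam))
        heads = n_zero * w               # expected number of "heads"
        # M-step: closed-form updates given the expected coin outcomes.
        z = heads / n
        lam = total / (n - heads)
    return z, lam

if __name__ == "__main__":
    rng = random.Random(0)
    data = sample_zip(0.6, 2.0, 20000, rng)
    print(fit_zip_em(data))
```

Only the zeros are ambiguous about which component produced them, so the E-step reduces to a single posterior weight shared by all zero counts.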

23 ZINB model for "his" Recall that a NegBin model can already account for the fact that most of the probability mass is concentrated at zero. Can a zero-inflated NegBin (ZINB) model do better? Note that the maximum likelihood models for the distribution of "his" in M&W's data say very different things, even though the net effects may be superficially similar.

24 The NegBin model claims that "his" occurs much less than once on average (0.54 expected occurrences) and that it has large variance. The ZINB model claims that "his" occurs in only a third of all passages, but within those its expected number of occurrences is 1.56 and its variance is less than that predicted by the NegBin model.
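
The comparison on this slide can be reproduced from the moment formulas given earlier in the talk. The sketch below assumes that the 0.34 shown next to NegBin(1.56, 0.89) in the figure legend is the weight of the NegBin component, so that z = 1 - 0.34; that reading is an assumption, suggested by the "a third of all passages" remark.

```python
def negbin_moments(lam, kappa):
    """Mean and variance of NegBin(lam, kappa)."""
    return lam, lam * (1.0 + lam / kappa)

def zero_inflated_moments(z, mu, sigma2):
    """Mean and variance of ZIF(z, theta) when F(theta) has mean mu, variance sigma2."""
    return (1.0 - z) * mu, (1.0 - z) * (sigma2 + z * mu ** 2)

# Plain NegBin fit reported in the talk.
nb_mean, nb_var = negbin_moments(0.54, 0.15)

# ZINB fit; ASSUMPTION: 0.34 is the weight of the NegBin component.
z = 1.0 - 0.34
mu, sigma2 = negbin_moments(1.56, 0.89)
zinb_mean, zinb_var = zero_inflated_moments(z, mu, sigma2)

print("NegBin mean, variance:", nb_mean, nb_var)
print("ZINB   mean, variance:", zinb_mean, zinb_var)
```

Under that assumption the two models imply nearly the same mean (0.54 versus about 0.53), while the implied variance drops from about 2.48 to about 2.01, in line with the slide's claim.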

25 [Figure: frequency (number of passages) against occurrences of "his" [Hamilton, Madison]; observed counts with NegBin(0.54, 0.15), NegBin(0.76, 0.11), and 0.34 NegBin(1.56, 0.89).]

26 [Table: observed (obsrvd) counts and expected (expctd) counts under NegBin and ZINB, with rows for χ², q-value, and log L(θ̂).]

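27 Comparison of Poisson models
$x \sim \mathrm{Poisson}(\lambda)$: $\mu = \lambda$, $\sigma^2 = \lambda = \mu$
$x \sim \mathrm{NegBin}(\lambda, \kappa)$: $\mu = \lambda$, $\sigma^2 = \lambda\,(1 + \lambda/\kappa)$
$x \sim \mathrm{ZIPoisson}(z, \lambda)$: $\mu = (1 - z)\,\lambda$, $\sigma^2 = \mu\,(1 + z\,\lambda)$
$x \sim \mathrm{ZINegBin}(z, \lambda, \kappa)$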

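28 Comparison of Binomial models
$x \mid n \sim \mathrm{Binom}(p)$: $\mu = n\,p$, $\sigma^2 = n\,p\,(1 - p) = \mu\,q$
$x \mid n \sim \mathrm{BetaBin}(p, \gamma)$: $\mu = n\,p$, $\sigma^2 = \mu\,q\,(1 + (n - 1)\,\gamma)$
$x \mid n \sim \mathrm{ZIBinom}(z, p)$: $\mu = (1 - z)\,n\,p$, $\sigma^2 = \mu\,(q + z\,n\,p)$
$x \mid n \sim \mathrm{ZIBetaBin}(z, p, \gamma)$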
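
The variance formulas for the overdispersed and zero-inflated binomial models can be checked against moments computed from the probabilities. In the sketch below, q = 1 - p, and the mapping from (p, γ) to the usual Beta parameters, α = p(1 - γ)/γ and β = (1 - p)(1 - γ)/γ, is an assumption chosen so that γ = 1/(α + β + 1); the slides do not spell out the parametrization. All numeric values are illustrative.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)

def betabin_pmf(x, n, p, gamma):
    """Beta-binomial with mean n*p; ASSUMED mapping gamma = 1 / (alpha + beta + 1)."""
    alpha = p * (1.0 - gamma) / gamma
    beta = (1.0 - p) * (1.0 - gamma) / gamma
    return math.comb(n, x) * math.exp(log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta))

def zibinom_pmf(x, n, z, p):
    """Zero-inflated binomial: weight z on a point mass at zero."""
    return z * (1.0 if x == 0 else 0.0) + (1.0 - z) * binom_pmf(x, n, p)

def moments(pmf, n):
    """Mean and variance of a distribution on 0..n."""
    mean = sum(x * pmf(x) for x in range(n + 1))
    var = sum((x - mean) ** 2 * pmf(x) for x in range(n + 1))
    return mean, var

n, p, gamma, z = 50, 0.04, 0.1, 0.7   # illustrative values only
q = 1.0 - p
bb_mean, bb_var = moments(lambda x: betabin_pmf(x, n, p, gamma), n)
zi_mean, zi_var = moments(lambda x: zibinom_pmf(x, n, z, p), n)

print(bb_var, bb_mean * q * (1.0 + (n - 1) * gamma))   # BetaBin variance vs. formula
print(zi_var, zi_mean * (q + z * n * p))               # ZIBinom variance vs. formula
```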

29 The Naive Bayes classifier We would like to have a distribution over a random variable $C$ (class labels) conditional on independent variables $X_1, \ldots, X_k$ and parameters $\theta$:

$$P(C \mid X_1, \ldots, X_k; \theta) \propto P(C, X_1, \ldots, X_k \mid \theta)$$

Assume a graphical model where the only edges are from $C$ to $X_i$ for $i = 1, \ldots, k$. In other words:

$$P(C, X_1, \ldots, X_k \mid \theta) = P(C \mid \theta) \prod_{i=1}^{k} P(X_i \mid C; \theta)$$

30 For document classification, the independent variables $X_i$ range over counts. In addition, we can condition on the document length $L$. For example:

$$P(X_i = x \mid L = n, C = j; \theta) = \binom{n}{x} (\theta_{ij})^{x} (1 - \theta_{ij})^{n - x}$$

Training consists of finding a point estimate of $\theta$. Classification is done by selecting the most probable class, conditional on the values of the independent variables and the estimated $\hat\theta$.
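
A binomial Naive Bayes classifier along these lines might look as follows. This is a sketch under stated assumptions: the smoothed point estimate of θ, the use of the summed counts as the document length n (that is, the vocabulary covers the whole document), and the toy data are all choices made for this example, not details from the talk.

```python
import math

def log_binom(x, n, theta):
    """Log of the binomial probability of x successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
            + x * math.log(theta) + (n - x) * math.log(1.0 - theta))

def train(docs, labels, smooth=1.0):
    """Point estimates per class: (log prior, list of theta_ij). Smoothing is an assumption."""
    model = {}
    for c in sorted(set(labels)):
        rows = [d for d, y in zip(docs, labels) if y == c]
        total_len = sum(sum(d) for d in rows)
        theta = [(sum(d[i] for d in rows) + smooth) / (total_len + 2.0 * smooth)
                 for i in range(len(docs[0]))]
        model[c] = (math.log(len(rows) / len(docs)), theta)
    return model

def classify(model, doc):
    """Select the class with the highest joint log probability."""
    n = sum(doc)   # ASSUMPTION: document length is the sum of the counted words
    def score(c):
        log_prior, theta = model[c]
        return log_prior + sum(log_binom(x, n, t) for x, t in zip(doc, theta))
    return max(model, key=score)

if __name__ == "__main__":
    # Toy word-count vectors over a three-word vocabulary (made up for illustration).
    docs = [[3, 0, 1], [4, 1, 0], [0, 3, 2], [1, 4, 1]]
    labels = ["a", "a", "b", "b"]
    model = train(docs, labels)
    print(classify(model, [5, 0, 1]), classify(model, [0, 4, 1]))
```

Swapping `log_binom` for the log probability of another per-feature model (Bernoulli, beta-binomial, zero-inflated binomial) changes the feature model without touching the classifier.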

31 Effects on classification performance McCallum and Nigam [1998] compared multivariate Bernoulli and multinomial models. We compare (joint independent) Bernoulli, binomial, beta-binomial, and zero-inflated binomial models. The Bernoulli model can be interpreted as binning (a nonparametric histogram method) into two dominant classes: zero and nonzero. The zero-inflated binomial should be able to combine the advantages of the Bernoulli model with the detail of the binomial model.

32
naive | standard
Poisson 1 | Negative Binomial 2
Binomial 1 | Beta-Binomial 2
Multinomial k | Dirichlet-Multinomial k+1
McCallum and Nigam recommended Bernoulli for small vocabulary sizes; we recommend ZIBinomial.

33 Newsgroups data set 20 Newsgroups data set, stratified so that all classes are equally likely a priori, therefore 5% baseline accuracy.

34 Newsgroups [Figure: classification accuracy (percent) against vocabulary size (number of word types), for the Bernoulli, Binomial, ZIBinom, and BetaBin models.]

35 [Table: Binom versus ZIB, McNemar test.]

36 WebKB data set Web pages from CS departments, classified as faculty, student, course, and project pages documents total.

37 WebKB [Figure: classification accuracy (percent) against vocabulary size (number of word types), for the Bernoulli, Binomial, BetaBin, and ZIBinom models.]

38 Claim 2 Zero-inflated models perform no worse than overdispersed models. Standard zero-inflated models are easier to work with, since EM can be used for parameter estimation. Modeling zero-inflation is preferable to modeling overdispersion, at least for Naive Bayes document classification.

39 Longer documents "Tom" in Project Gutenberg books (15k to 25k words). No surprises initially: [Figure.] But the tail is very long: [Figure.]

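40 Document lengths Document length in newsgroup data is non-negative, heavily skewed to the right, and seems to be unimodal (unlike newswire). Approximated well by log-logistic density:

$$\mathrm{LogLogistic}(\mu, \sigma, \delta)(x) = \frac{\frac{\delta}{\sigma} \left(\frac{x - \mu}{\sigma}\right)^{\delta - 1}}{\left[1 + \left(\frac{x - \mu}{\sigma}\right)^{\delta}\right]^{2}}$$

CDF easy to invert (unlike log-normal), $p$th

41 percentile point is:

$$\mu + \sigma \left(\frac{p}{1 - p}\right)^{1/\delta}$$

Leave $\mu$ fixed, estimate remaining two parameters from tertile points:

$$\hat\sigma = \sqrt{(t_1 - \mu)\,(t_2 - \mu)}, \qquad \hat\delta = \frac{2 \log 2}{\log(t_2 - \mu) - \log(t_1 - \mu)}$$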
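
The tertile-based estimator can be sketched as follows. Setting p = 1/3 and p = 2/3 in the percentile formula gives (t1 - µ)(t2 - µ) = σ² and (t2 - µ)/(t1 - µ) = 4^(1/δ), which yields the two estimates. Function names and the parameter values in the round trip are illustrative assumptions.

```python
import math

def loglogistic_pdf(x, mu, sigma, delta):
    """Density of the shifted log-logistic; zero at or below the shift mu."""
    if x <= mu:
        return 0.0
    u = (x - mu) / sigma
    return (delta / sigma) * u ** (delta - 1.0) / (1.0 + u ** delta) ** 2

def loglogistic_cdf(x, mu, sigma, delta):
    """Cumulative distribution function of the shifted log-logistic."""
    if x <= mu:
        return 0.0
    return 1.0 / (1.0 + ((x - mu) / sigma) ** (-delta))

def loglogistic_quantile(p, mu, sigma, delta):
    """The p-th percentile point, obtained by inverting the CDF."""
    return mu + sigma * (p / (1.0 - p)) ** (1.0 / delta)

def fit_from_tertiles(t1, t2, mu=0.0):
    """Estimate (sigma, delta) from the lower and upper tertile points, with mu held fixed."""
    sigma = math.sqrt((t1 - mu) * (t2 - mu))
    delta = 2.0 * math.log(2.0) / (math.log(t2 - mu) - math.log(t1 - mu))
    return sigma, delta

if __name__ == "__main__":
    # Round trip with made-up parameters: compute tertiles, then recover sigma and delta.
    t1 = loglogistic_quantile(1.0 / 3.0, 0.0, 120.0, 2.5)
    t2 = loglogistic_quantile(2.0 / 3.0, 0.0, 120.0, 2.5)
    print(fit_from_tertiles(t1, t2))
```

With real data, `t1` and `t2` would be the empirical tertiles of the observed document lengths.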

42 [Figure: frequency (number of documents) against document length (number of words); observed counts with fitted LogLogistic.]

43 Language modeling Traditionally done using Markov chains. For example, a bigram model over {a, b}: [Diagram: state-transition graph with states and arcs labeled a and b.]

44 Word length Markov chains are poor models of word length. As a model of word length, a bigram model degenerates to a (shifted) geometric distribution: [Diagram: state-transition graph with arc probabilities q, p, 1-p, 1-q.]

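45 Pascal distribution The geometric distribution can be generalized to the Pascal distribution, which is a special case of the Negative Binomial with $\kappa$ an integer.

$$\mathrm{NegBin}(\lambda, \kappa)(x) = \frac{\lambda^x}{x!} \, \frac{\kappa^\kappa}{(\lambda + \kappa)^\kappa} \, \frac{\Gamma(\kappa + x)}{\Gamma(\kappa)\,(\lambda + \kappa)^x} = \binom{x + \kappa - 1}{x} \left(\frac{\lambda}{\lambda + \kappa}\right)^{x} \left(\frac{\kappa}{\lambda + \kappa}\right)^{\kappa}$$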

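46 Reparametrize as follows:

$$\mathrm{Pascal}(p, \kappa)(x) = \binom{x + \kappa - 1}{x} \, p^{x} \, (1 - p)^{\kappa}$$

For example, when $\kappa = 4$: [Diagram: chain of four states, each with a self-loop of probability p and an exit arc of probability 1-p.]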
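
The reparametrization corresponds to p = λ/(λ + κ), so that 1 - p = κ/(λ + κ). The sketch below checks numerically that the Pascal form agrees with the Negative Binomial form for integer κ, and that κ = 1 gives a geometric distribution; the parameter values are illustrative.

```python
import math

def negbin_pmf(x, lam, kappa):
    """NegBin(lam, kappa) probability of the count x, in the Gamma-function form."""
    log_p = (x * math.log(lam) - math.lgamma(x + 1)
             + kappa * (math.log(kappa) - math.log(lam + kappa))
             + math.lgamma(kappa + x) - math.lgamma(kappa)
             - x * math.log(lam + kappa))
    return math.exp(log_p)

def pascal_pmf(x, p, kappa):
    """Pascal(p, kappa) probability of the count x; kappa must be a positive integer."""
    return math.comb(x + kappa - 1, x) * p ** x * (1.0 - p) ** kappa

if __name__ == "__main__":
    lam, kappa = 2.0, 4            # illustrative values only
    p = lam / (lam + kappa)        # the reparametrization
    for x in range(5):
        print(x, negbin_pmf(x, lam, kappa), pascal_pmf(x, p, kappa))
```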

47 Word length in the NETtalk data Modeled by a slight variant of a Pascal model: [Diagram: state-transition graph with arc probabilities m, 1-m, p1, 1-p1, p2, 1-p2, p3, 1-p3, followed by a chain of states each with arcs p and 1-p.] With $\kappa$ fixed (in this case $\kappa = 5$), we can use EM to estimate the remaining parameters.

48 [Figure: frequency (number of words) against word length (number of letters); observed counts with fitted geometric and Pascal models.]

49 Conclusions Especially for parametric models, need to check goodness of fit. Overdispersion and zero-inflation are common in count data encountered in NLP. This affects our choice of models. For example, exponential family models (misleadingly known as maximum entropy models) have no provisions for overdispersion. Use with caution.
