Information Retrieval

1 Information Retrieval: Ranked Retrieval & the Vector Space Model
Gintarė Grigonytė
Department of Linguistics and Philology, Uppsala University
Slides based on IIR material
Gintarė Grigonytė 1/57

2 Outline for today
- Ranked retrieval
- Efficient scoring & ranking
- IR system architecture

3 Boolean Retrieval Model
Advantages:
- Easy for the system
- Users get transparency: it is easy to understand why a document was or was not retrieved
- Users get control: it is easy to determine whether the query is too specific (few results) or too broad (many results)
Disadvantages:
- The burden is on the user to formulate a good Boolean query

4 Recap on Ranked Retrieval
- Boolean retrieval (= exact matches) is not enough!
- Rank the retrieved documents according to relevance
- Vector-space model:
  - term weighting (tf-idf)
  - documents as vectors (in space)
  - queries as vectors (in the same space)
  - ranking according to vector similarity (proximity)
- How important is good ranking?

5 [Figures; slides from Dan Russell, Google]

6 To sum up: Good ranking is very important!

7 Recall: Boolean model (document = set of words)
[Term-document matrix; cell values not shown]
- Documents (columns): Anthony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth, ...
- Terms (rows): ANTHONY, BRUTUS, CAESAR, CALPURNIA, CLEOPATRA, MERCY, WORSER
Each document is represented by a binary vector $\in \{0, 1\}^{|V|}$.

8 Recall: Bag of Words Model
[Term-document matrix; cell values not shown]
- Documents (columns): Anthony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth, ...
- Terms (rows): ANTHONY, BRUTUS, CAESAR, CALPURNIA, CLEOPATRA, MERCY, WORSER
Each document is represented by a count vector $\in \mathbb{N}^{|V|}$ (term frequency).

9 Term frequency weight
- Relevance as a (non-linear) function of term frequency
- The log frequency weight of term t in d is defined as follows:

$$ w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases} $$

10 Inverse Document Frequency weight
- Some terms are more relevant than others
- The document frequency $\mathrm{df}_t$ is defined as the number of documents that t occurs in
- We define the idf weight of term t as follows:

$$ \mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t} $$

- idf is a measure of the informativeness of the term

term      | df_t      | idf_t
calpurnia | 1         | 6
animal    | …         | …
sunday    | …         | …
fly       | 10,000    | 2
under     | 100,000   | 1
the       | 1,000,000 | 0
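
The idf rows that survive in the table above can be checked with a few lines of Python. This is only a sketch: the collection size `N = 1,000,000` is inferred from the row for "the" (df = 1,000,000 gives idf 0), and the function name `idf` is chosen here for illustration.

```python
import math

# Assumption: N is inferred from the table (df("the") = N gives idf = 0).
N = 1_000_000

def idf(df: int, n_docs: int = N) -> float:
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / df)

# Only the rows whose df values are present in the slide text.
table_df = {"calpurnia": 1, "fly": 10_000, "under": 100_000, "the": 1_000_000}

for term, df in table_df.items():
    print(f"{term:10s} df={df:>9,d} idf={idf(df):.0f}")
```

Rare terms such as "calpurnia" get a high weight, while a term that occurs in every document contributes nothing.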

11 Collection frequency vs. document frequency
- We could also use collection frequency (which counts multiple occurrences of a term within a document)

Word      | Collection frequency | Document frequency
INSURANCE | …                    | …
TRY       | …                    | …

- Which one is better?

12 tf-idf weighting
- Product of term frequency (tf) and inverse document frequency (idf):

$$ w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \cdot \log_{10} \frac{N}{\mathrm{df}_t} $$

- Best-known weighting scheme in IR
- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection
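
A minimal sketch of the tf-idf weight, assuming base-10 logarithms as on the two preceding slides. The function name `tf_idf` and the example numbers are illustrative only.

```python
import math

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """tf-idf weight: (1 + log10 tf) * log10(N / df), or 0 if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# Hypothetical numbers: a term occurring 10 times in a document,
# present in 10,000 of 1,000,000 documents.
print(tf_idf(tf=10, df=10_000, n_docs=1_000_000))
```

The weight grows with the term's frequency in the document and with its rarity in the collection, and is zero for a term that appears in every document.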

13 Binary → count → weight matrix
[Term-document matrix; cell values not shown]
- Documents (columns): Anthony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth, ...
- Terms (rows): ANTHONY, BRUTUS, CAESAR, CALPURNIA, CLEOPATRA, MERCY, WORSER
Each document is now represented by a real-valued vector of tf-idf weights $\in \mathbb{R}^{|V|}$.
How do we rank documents according to relevance?

14 Best-match retrieval models
- So far, we've discussed exact-match models
- Today, we start discussing best-match models
- Best-match models predict the degree to which a document is relevant to a query
- Ideally, this would be expressed as RELEVANT(q,d)
- In practice, it is expressed as SIMILAR(q,d)
- How can we compute the similarity between q and d?

15 Vector Space Model
- A vector space is defined by a set of linearly independent basis vectors
- The basis vectors correspond to the dimensions or directions of the vector space (2D, 3D, ...)
- A vector is a point in a vector space and has length (from the origin to the point) and direction
- The basis vectors are linearly independent because knowing a vector's value on one dimension doesn't say anything about its value along another dimension

16 Vector Space Model
- The vector space model ranks documents based on the vector-space similarity between the query vector and the document vector
- There are many ways to compute the similarity between two vectors
- One way is to compute the inner product

17 Vector Space Model
[Figure; source: J. Arguello]

18 Vector Space Model
- What is more relevant to a query?
  - A 50-word document which contains 5 of the query terms?
  - A 100-word document which contains 5 of the query terms?
- The inner product doesn't account for the fact that documents have widely varying lengths
- All things being equal, longer documents are more likely to contain the query terms
- So, the inner product favors long documents

19 Documents and queries as vectors
Measure proximity between query and documents!

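20 Cosine similarity between query and document

$$ \cos(\vec{q}, \vec{d}) = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}} $$

- $q_i$ is the tf-idf weight of term i in the query
- $d_i$ is the tf-idf weight of term i in the document
- $|\vec{q}|$ and $|\vec{d}|$ are the lengths of $\vec{q}$ and $\vec{d}$
- $\vec{q}/|\vec{q}|$ and $\vec{d}/|\vec{d}|$ are length-1 vectors (= normalized)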
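
The formula can be sketched in plain Python. The vectors below are made-up tf-idf weights, chosen so that `long_doc` points in the same direction as `short_doc` but is twice as long. The inner product doubles for the longer document, while the cosine stays the same.

```python
import math

def inner(q, d):
    """Inner (dot) product of two equal-length weight vectors."""
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_q = math.sqrt(inner(q, q))
    norm_d = math.sqrt(inner(d, d))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return inner(q, d) / (norm_q * norm_d)

# Hypothetical weights over a three-term vocabulary.
query = [1.0, 1.0, 0.0]
short_doc = [1.0, 2.0, 0.0]
long_doc = [2.0, 4.0, 0.0]  # same direction as short_doc, twice the length

print(inner(query, short_doc), inner(query, long_doc))
print(cosine(query, short_doc), cosine(query, long_doc))
```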

21 Cosine similarity between query and document
[Figure]

22 Predicted and true probability of relevance
A problem with cosine normalization:
[Figure; source: Lillian Lee]

23 Pivot normalization
Linear adjustment to compensate for true relevance
[Figure; source: Lillian Lee]

24-31 [Figure-only slides; source: J. Arguello]

32 Components of tf-idf weighting

Term frequency:
- n (natural): $\mathrm{tf}_{t,d}$
- l (logarithm): $1 + \log(\mathrm{tf}_{t,d})$
- a (augmented): $0.5 + \frac{0.5 \cdot \mathrm{tf}_{t,d}}{\max_t(\mathrm{tf}_{t,d})}$
- b (boolean): 1 if $\mathrm{tf}_{t,d} > 0$, 0 otherwise
- L (log ave): $\frac{1 + \log(\mathrm{tf}_{t,d})}{1 + \log(\mathrm{ave}_{t \in d}(\mathrm{tf}_{t,d}))}$

Document frequency:
- n (no): 1
- t (idf): $\log \frac{N}{\mathrm{df}_t}$
- p (prob idf): $\max\{0, \log \frac{N - \mathrm{df}_t}{\mathrm{df}_t}\}$

Normalization:
- n (none): 1
- c (cosine): $\frac{1}{\sqrt{w_1^2 + w_2^2 + \dots + w_M^2}}$
- u (pivoted unique): $\alpha u_d + (1 - \alpha)\,\mathrm{pivot}$
- b (byte size): $1/\mathrm{CharLength}^{\alpha}$, $\alpha < 1$

Different weightings can be used for queries and documents, written qqq.ddd. Example: ltn.lnc
- query: logarithmic tf, idf, no normalization
- document: logarithmic tf, no df weighting, cosine normalization

33 tf-idf Example: ltn.lnc
Query: "best car insurance". Document: "car insurance auto insurance".

          | query                                 | document                            |
word      | tf-raw | tf-wght | df | idf | weight | tf-raw | tf-wght | weight | n'lized | product
auto      | 0      | …       | …  | …   | …      | 1      | …       | …      | …       | …
best      | 1      | …       | …  | …   | …      | 0      | …       | …      | …       | …
car       | 1      | …       | …  | …   | …      | 1      | …       | …      | …       | …
insurance | 1      | …       | …  | …   | …      | 2      | …       | …      | …       | …

Key to columns:
- tf-raw: raw (unweighted) term frequency
- tf-wght: logarithmically weighted term frequency
- df: document frequency
- idf: inverse document frequency
- weight: the final weight of the term in the query or document
- n'lized: document weights after cosine normalization
- product: the product of final query weight and final document weight

Final similarity score between query and document: $\sum_i w_{q_i} \cdot w_{d_i} = 3.08$
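
The ltn.lnc computation can be traced in code. The raw term frequencies come from the slide, but the document frequencies and the collection size below are assumptions (they follow the version of this example in the IIR textbook), since those columns are not legible in the slide text. Treat this as a sketch of the procedure rather than a reproduction of the original table.

```python
import math

# Assumptions, not taken from the slide text: collection size and df values.
N = 1_000_000
df = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}

# Raw term frequencies as given on the slide.
query_tf = {"auto": 0, "best": 1, "car": 1, "insurance": 1}
doc_tf = {"auto": 1, "best": 0, "car": 1, "insurance": 2}

def log_tf(tf):
    """Logarithmic tf weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Query side (ltn): log tf times idf, no normalization.
q_w = {t: log_tf(query_tf[t]) * math.log10(N / df[t]) for t in df}

# Document side (lnc): log tf, no idf, cosine normalization.
d_raw = {t: log_tf(doc_tf[t]) for t in df}
d_len = math.sqrt(sum(w * w for w in d_raw.values()))
d_w = {t: w / d_len for t, w in d_raw.items()}

score = sum(q_w[t] * d_w[t] for t in df)
print(f"score = {score:.3f}")
```

Under these assumptions the unrounded score comes out slightly below the slide's 3.08, which is consistent with the slide rounding intermediate values to two decimals before multiplying.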

46 Second part today
- Efficient scoring and ranking
- A complete IR system: components, architecture, implementation issues

47 Efficient scoring
- The query stays the same: we need only (relative) scores for each document
- No matching keywords: no score
- Use the inverted index (instead of comparing every d with q):
  - run through the postings for each term
  - accumulate scores per document found in the posting lists
  - rank the J documents with positive scores
  - J is much less than N (documents in the collection)
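
The accumulation step can be sketched as term-at-a-time scoring. The index layout (`term -> list of (doc_id, tf)`), the function name, and the toy data are assumptions made for illustration, and length normalization is left out to keep the sketch short.

```python
import math
from collections import defaultdict

def score_documents(query_terms, index, n_docs):
    """Accumulate tf-idf scores per document by walking each term's postings.

    index maps term -> list of (doc_id, tf). Only documents that share at
    least one term with the query ever receive an accumulator.
    """
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term, [])
        if not postings:
            continue  # term not in the collection: contributes nothing
        idf = math.log10(n_docs / len(postings))
        for doc_id, tf in postings:
            scores[doc_id] += (1 + math.log10(tf)) * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy index.
toy_index = {
    "brutus": [(1, 2), (7, 3)],
    "caesar": [(1, 1), (5, 1)],
}
ranking = score_documents(["brutus", "caesar", "calpurnia"], toy_index, n_docs=10)
print(ranking)
```

Documents 2, 3, 4 and so on never appear in `ranking`, which is the point: work is proportional to the postings touched, not to N.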

49 Term frequencies in the inverted index
Each posting is a pair (docID, term frequency):
BRUTUS    → 1,2 | 7,3 | 83,1 | 87,2 | ...
CAESAR    → 1,1 | 5,1 | 13,1 | 17,1 | ...
CALPURNIA → 7,1 | 8,2 | 40,1 | 97,3
- Term frequencies are easier to store than log-weights
- Compute term weights on the fly (store idf in the dictionary)
- Intersect as usual (iterative calculation of document scores)
- We don't need a complete ranking: top-k retrieval
- Efficient selection of the top k documents. How?

50 Binary min heap for selecting top k
- Binary min heap = binary tree in which each node's value is less than the values of its children
- Always keep the top k documents seen so far
- Takes O(J log k) operations to construct
- Read off the k winners in O(k log k) steps
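
A minimal sketch of top-k selection using Python's standard `heapq` module, which implements a binary min heap on a list. The function name `top_k` and the example scores are illustrative.

```python
import heapq

def top_k(scores, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first.

    scores maps doc_id -> score. The heap never holds more than k entries,
    and its root is always the weakest of the current top k.
    """
    heap = []  # entries are (score, doc_id)
    for doc_id, score in scores.items():
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # New document beats the weakest winner: replace the root.
            heapq.heapreplace(heap, (score, doc_id))
    return [(doc_id, score) for score, doc_id in sorted(heap, reverse=True)]

print(top_k({1: 0.3, 2: 0.9, 3: 0.1, 4: 0.7}, k=2))
```

Each of the J scored documents costs at most one O(log k) heap operation, and the final sort of k entries corresponds to reading off the winners.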

51 Inexact top k retrieval
- Bottleneck: cosine computation (J can still be big)
- Instead of retrieving the exact top k, retrieve k documents with scores close to those of the top k
- Use search heuristics
- Is this a bad thing to do?

52 Generic approach
- Find a set A of contenders with k < |A| ≪ N
- Return the top k documents in A
- Selecting A = pruning non-contenders:
  - index elimination
  - champion lists
  - static quality scores
  - impact ordering
  - cluster pruning

54 Index elimination
- Only consider high-idf query terms
  - query: "catcher in the rye"
  - accumulate scores for "catcher" and "rye" only
- Only consider docs containing many query terms
  - multi-term queries: compute scores only for docs that contain at least a fixed proportion of the query terms (e.g., 3 out of 4)
  - soft conjunction (early Google)
  - easy to implement in postings traversal
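
Both heuristics can be sketched together. The idf values, the thresholds `min_idf` and `min_fraction`, and the index layout (`term -> list of doc ids`) are hypothetical. Applying the proportion to the surviving high-idf terms, rather than to all query terms, is a design choice made here for brevity.

```python
def eliminate(query_terms, idf, index, min_idf, min_fraction):
    """Return (kept_terms, candidate_docs) after index elimination.

    1. Drop query terms whose idf is below min_idf.
    2. Keep only docs containing at least min_fraction of the kept terms.
    """
    kept = [t for t in query_terms if idf.get(t, 0.0) >= min_idf]
    needed = min_fraction * len(kept)
    counts = {}
    for term in kept:
        for doc_id in index.get(term, []):
            counts[doc_id] = counts.get(doc_id, 0) + 1
    candidates = {doc_id for doc_id, c in counts.items() if c >= needed}
    return kept, candidates

# Hypothetical idf values and postings.
idf = {"catcher": 4.0, "in": 0.1, "the": 0.0, "rye": 3.5}
index = {"catcher": [1, 2], "in": [1, 2, 3], "the": [1, 2, 3], "rye": [2, 3]}

kept, docs = eliminate("catcher in the rye".split(), idf, index,
                       min_idf=1.0, min_fraction=1.0)
print(kept, docs)
```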

55 Champion lists
- For each term in the dictionary: precompute the r documents of highest weight (in the postings of the term). These are the champion lists!
- At query time: only compute scores for documents in the union of the champion lists of all query terms
- r is chosen at indexing time
- We might use a different r for each term (more for rare terms?!)
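
A small sketch of building and using champion lists, assuming an index of the form `term -> list of (doc_id, weight)` and a single global `r`. The function names and the toy weights are illustrative.

```python
def build_champion_lists(index, r):
    """For each term, keep the doc ids of its r highest-weight postings."""
    champions = {}
    for term, postings in index.items():
        best = sorted(postings, key=lambda p: p[1], reverse=True)[:r]
        champions[term] = [doc_id for doc_id, _ in best]
    return champions

def candidate_docs(query_terms, champions):
    """Union of the champion lists of all query terms."""
    docs = set()
    for term in query_terms:
        docs.update(champions.get(term, []))
    return docs

# Hypothetical weighted postings.
index = {
    "car": [(1, 0.2), (2, 0.9), (3, 0.5)],
    "insurance": [(2, 0.4), (4, 0.8), (5, 0.1)],
}
champions = build_champion_lists(index, r=2)
print(champions)
print(candidate_docs(["car", "insurance"], champions))
```

Only the candidates returned by `candidate_docs` would then be passed to the full cosine computation.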

56 Static quality scores
- Idea 2: reorder posting lists according to expected relevance
- Query-independent quality of documents (authority)
- Examples:
  - a paper with many citations
  - many bookmarks (del.icio.us, ...)
  - PageRank (!)

57 Static quality scores
- Assign a quality ("goodness") score g(d) to each d
- Combine it with the relevance score cos(q, d): net-score(q, d) = g(d) + cos(q, d)
- Other types of combination are possible
- Return the top k documents according to net-score

58 Static quality scores
How does this help?
- Postings are ordered by g(d) (still a consistent order!)
- Traverse the postings and compute scores
- Early termination is possible:
  - stop if the minimal score cannot be improved
  - time threshold
  - threshold for the goodness score
- Can be combined with champion lists
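
One way to picture early termination is the following sketch. It assumes that g(d) and cos(q, d) both lie in [0, 1], that the candidate documents arrive in decreasing order of g(d), and that the cosine values are available in a dictionary (in a real system they would be computed during postings traversal). All names and numbers are hypothetical.

```python
import heapq

def top_k_net_score(postings, g, cos, k):
    """Top-k doc ids by net-score = g(d) + cos(q, d), with early termination.

    postings: doc ids sorted by decreasing g(d).
    """
    heap = []  # min-heap of (net_score, doc_id)
    for doc_id in postings:
        # cos <= 1, so neither this doc nor any later one can exceed g(d) + 1.
        if len(heap) == k and g[doc_id] + 1.0 <= heap[0][0]:
            break
        net = g[doc_id] + cos.get(doc_id, 0.0)
        if len(heap) < k:
            heapq.heappush(heap, (net, doc_id))
        elif net > heap[0][0]:
            heapq.heapreplace(heap, (net, doc_id))
    return [doc_id for _, doc_id in sorted(heap, reverse=True)]

# Hypothetical scores.
g = {1: 1.0, 2: 0.95, 3: 0.5, 4: 0.4}
cos = {1: 1.0, 2: 0.9, 3: 1.0, 4: 1.0}
result = top_k_net_score([1, 2, 3, 4], g, cos, k=2)
print(result)
```

In this example the traversal stops at document 3, because even a perfect cosine could not lift it above the current second-best net-score.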

59 Other ideas
"High" and "low" lists, for each term:
- high list (= champion list)
- low list (other documents)
- use the high lists first
- use the low lists if < k documents are found

60 Impact ordering
- Compute scores only for documents with high $wf_{t,d}$
- Sort each posting list by $wf_{t,d}$
- Non-consistent order of postings!
- Solution 1: early termination, for each term
  - stop after a fixed number r of documents
  - stop when $wf_{t,d}$ < threshold
  - score the documents in the union of the retrieved postings
- Solution 2: sort terms by idf
  - stop if the document scores don't change much

61 Cluster Pruning
Pre-processing (clustering):
- Pick √N docs at random (= "leaders"); random = fast + reflects the distribution well
- For every other doc, pre-compute its nearest leader and attach the doc to that leader
- Each leader has about √N followers
Query processing, given query q:
- find the nearest leader L
- seek the k nearest docs among L's followers

62 Cluster Pruning
Variants:
- clustering: attach each document to its x nearest leaders
- querying: find the y nearest leaders and consider their followers
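
The basic scheme (x = 1, y = 1) can be sketched as follows. Document vectors are plain lists, about √N leaders are sampled with a fixed seed, and each leader is counted among its own followers. These details, like the function names, are assumptions for illustration.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def build_clusters(docs, seed=0):
    """docs: doc_id -> vector. Returns leader_id -> list of follower ids."""
    rng = random.Random(seed)
    n_leaders = max(1, round(math.sqrt(len(docs))))
    leaders = rng.sample(sorted(docs), n_leaders)
    followers = {leader: [leader] for leader in leaders}
    for doc_id, vec in docs.items():
        if doc_id in followers:
            continue  # leaders already belong to their own cluster
        nearest = max(leaders, key=lambda l: cosine(vec, docs[l]))
        followers[nearest].append(doc_id)
    return followers

def search(query, docs, followers, k):
    """Find the nearest leader, then rank only that leader's followers."""
    leader = max(followers, key=lambda l: cosine(query, docs[l]))
    ranked = sorted(followers[leader],
                    key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]

# Hypothetical 2-D document vectors.
docs = {1: [1.0, 0.1], 2: [0.9, 0.2], 3: [0.1, 1.0], 4: [0.2, 0.9]}
followers = build_clusters(docs)
print(followers)
print(search([1.0, 0.0], docs, followers, k=2))
```

Because the leaders are random, the result is approximate: a relevant document attached to a different leader is never scored, which is the trade-off the variants with x and y nearest leaders try to soften.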

63 Building an IR system
Components of an IR system:
- tiered indexes
- query term proximity
- parsing and scoring functions
- ...

65 Tiered indexes
- Basic idea: create several tiers of indexes, corresponding to the importance of indexing terms
- Query processing:
  - start with the highest-tier index
  - stop if the highest-tier index returns at least k results
  - go to the next tier if we've only found < k hits (see high & low lists!)
- Tiers could also be related to zones:
  - Tier 1: index of all titles
  - Tier 2: index of the rest of the documents
  - (Pages containing the search words in the title are better hits than pages containing the search words in the body of the text)

66 Tiered index
[Figure: a three-tier index (Tier 1, Tier 2, Tier 3) over the terms auto, best, car and insurance, with postings drawn from Doc1, Doc2 and Doc3]
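
The tier fall-through can be sketched in a few lines. The representation of a tier as `term -> set of doc ids`, the conjunctive matching within a tier, and the toy postings are assumptions made here; they are not read off the figure.

```python
def tiered_search(query_terms, tiers, k):
    """Query tiers from highest to lowest until at least k hits are found.

    tiers: list of indexes (term -> set of doc ids), highest tier first.
    A document matches a tier if it contains all query terms in that tier.
    """
    hits = []
    for index in tiers:
        postings = [index.get(term, set()) for term in query_terms]
        matched = set.intersection(*postings) if postings else set()
        for doc_id in sorted(matched):
            if doc_id not in hits:
                hits.append(doc_id)
        if len(hits) >= k:
            break  # enough results: lower tiers are never touched
    return hits[:k]

# Hypothetical two-tier index, e.g. titles (tier 1) and bodies (tier 2).
tiers = [
    {"car": {1}, "insurance": {1, 2}},
    {"car": {1, 2, 3}, "insurance": {2, 3}},
]
print(tiered_search(["car", "insurance"], tiers, k=1))
print(tiered_search(["car", "insurance"], tiers, k=2))
```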

67 Query term proximity
- Free text queries: just a set of terms typed into the query box, common on the web
- Users prefer docs in which the query terms occur within close proximity of each other
- Let w be the smallest window in a doc containing all query terms
- Example: for the query "strained mercy", the smallest window in the doc "The quality of mercy is not strained" is 4 (words)
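
The window size w can be computed with a sliding window over the document's tokens. This is a sketch with an illustrative function name; it assumes lower-cased, whitespace-split tokens.

```python
def smallest_window(doc_tokens, query_terms):
    """Length in words of the smallest window containing all query terms.

    Returns None if some query term does not occur in the document.
    """
    needed = set(query_terms)
    if not needed:
        return 0
    counts = {}  # query term -> occurrences inside the current window
    best = None
    left = 0
    for right, token in enumerate(doc_tokens):
        if token in needed:
            counts[token] = counts.get(token, 0) + 1
        # Shrink from the left while the window still covers all terms.
        while len(counts) == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            left_token = doc_tokens[left]
            if left_token in counts:
                counts[left_token] -= 1
                if counts[left_token] == 0:
                    del counts[left_token]
            left += 1
    return best

doc = "the quality of mercy is not strained".split()
print(smallest_window(doc, ["strained", "mercy"]))
```

For the slide's example the window runs from "mercy" to "strained", which gives 4 words.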

68 Query parsers
A free text query from the user may fire one or more queries to the indexes, e.g. for the query "rising interest rates":
1. Run the query as a phrase query
2. If < k docs contain the phrase "rising interest rates", run the two phrase queries "rising interest" and "interest rates"
3. If we still have < k docs, run the vector space query "rising interest rates"
4. Rank the matching docs by vector space scoring
This sequence is issued by a query parser.
How to combine scores? Tuning! (experts/machine learning)
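
The cascade above can be sketched as a function that takes the two search back ends as parameters. Both `phrase_search` and `vector_search` are hypothetical callables standing in for a real index. As a simplification, the vector space ranking is always computed and used to order whichever documents matched.

```python
def parse_and_run(query, k, phrase_search, vector_search):
    """Run the phrase / biword / vector-space cascade for a free text query.

    phrase_search(terms) -> set of doc ids containing the terms as a phrase.
    vector_search(terms) -> list of doc ids ranked by vector space score.
    """
    terms = query.split()
    docs = set(phrase_search(terms))              # step 1: full phrase
    if len(docs) < k and len(terms) > 2:          # step 2: two-word phrases
        for i in range(len(terms) - 1):
            docs |= set(phrase_search(terms[i:i + 2]))
    ranked = vector_search(terms)                 # steps 3 and 4
    if len(docs) >= k:
        # Enough phrase matches: rank only those by vector space score.
        ranked = [doc_id for doc_id in ranked if doc_id in docs]
    return ranked[:k]

# Hypothetical stand-ins for real index lookups.
PHRASES = {
    ("rising", "interest", "rates"): {1},
    ("rising", "interest"): {1, 2},
    ("interest", "rates"): {1, 3},
}

def demo_phrase_search(terms):
    return PHRASES.get(tuple(terms), set())

def demo_vector_search(terms):
    return [3, 1, 2, 4]

print(parse_and_run("rising interest rates", 3, demo_phrase_search, demo_vector_search))
```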

69 IR system architecture
[Figure]

70 Watch online lecture
Watch the online lecture by Jeff Dean.

71 Summary
IR includes many components:
- Document preprocessing (linguistic and otherwise)
- Positional indexes
- Tiered indexes
- Spelling correction
- k-gram indexes for wildcard queries and spelling correction
- Query parsing & query processing
- Document scoring (including proximity ...)

72 What is next?
Lab on Boolean and ranked retrieval:
- small test collection
- create an inverted index
- test Boolean queries and ranked retrieval
