An Introduction to Term Frequency – Inverse Document Frequency (tf-idf)

tf-idf is one of the most fundamental concepts in information retrieval. It is the product of two statistics: the term frequency (tf), which measures how often a word appears in a document, and the inverse document frequency (idf), which measures how much information a word provides, i.e., whether it is common or rare across the examined documents. Despite its simplicity, it remains one of the most popular term-weighting schemes; variants of tf-idf are used extensively in search engines as ranking functions to evaluate the relevance between a user’s query and a webpage.

Term frequency

Let t be the term whose frequency we want to find in the document d. The output of the tf(t,d) function is simply the number of times the term appears in the document divided by the total number of terms in the document:

\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}

For example, take the document d with the following content: “I am learning information retrieval, and you are learning information retrieval as well”, and let the term t whose frequency we want to find be “retrieval”; then tf(t,d) = 2/13 ≈ 0.1538.

Keep in mind that there are various other ways to compute the term frequency, such as the raw count of the term itself, a boolean frequency (whether the term is in the document or not), and so on.
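
For illustration, here is a minimal sketch of two such variants; the helper names below are made up for this example:

// Raw count: the number of times the term occurs, with no normalization
public static int rawCount(String term, String[] words) {
    int freq = 0;
    for (final String word : words) {
        if (word.equals(term)) {
            freq++;
        }
    }
    return freq;
}

// Boolean frequency: 1 if the term occurs at all, 0 otherwise
public static int booleanFrequency(String term, String[] words) {
    return rawCount(term, words) > 0 ? 1 : 0;
}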

Bag of words

While computing the term frequency for documents, we often use a bag-of-words for data preprocessing. In this phase, the bag-of-words describes the frequency of each word that appears in the document, disregarding grammar and word order. For the document given in the previous section, a simple bag-of-words representation looks as follows:

{
  "as": 1, "are": 1, "and": 1, "retrieval": 2, "I": 1, "well" : 1, "learning": 2, "information": 2, "am": 1, "you": 1
}
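
Such a map can be built in a few lines of Java. The sketch below splits on whitespace only, like the rest of the code in this article; a real implementation would also strip punctuation and normalize case:

import java.util.HashMap;
import java.util.Map;

public static Map<String, Integer> bagOfWords(String document) {
    Map<String, Integer> bag = new HashMap<>();
    for (final String word : document.split("\\s+")) {
        // merge() inserts 1 for an unseen word, otherwise increments the count
        bag.merge(word, 1, Integer::sum);
    }
    return bag;
}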

Problem with TF

However, say we want to determine whether a query is related to a given document; simply counting the term frequency may not be very effective. For example, take these two documents:

Document 1 (d_1): the lazy fox jumped over the quick brown dog
Document 2 (d_2): learning is knowledge or a piece of information obtained by study or experience

Given the query q = “the learning process”, we calculate the relevance score by summing up the frequency of each term of q in documents d_1 and d_2:

\mathrm{score}(q,d) = \sum_{w \in q} \mathrm{tf}(w,d)

  • score(q, d_1) = 2/9 + 0/9 + 0/9 = 2/9 \approx 0.22
  • score(q, d_2) = 0/13 + 1/13 + 0/13 = 1/13 \approx 0.077

The document d_1 gets the higher score, while clearly document d_2 is the more relevant one.
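
To see the issue in code, here is a minimal tf-only scorer (a hypothetical helper, written for this example only):

public static double tfOnlyScore(String query, String document) {
    String[] words = document.split("\\s+");
    double score = 0;
    for (final String term : query.split("\\s+")) {
        int freq = 0;
        for (final String word : words) {
            if (word.equals(term)) {
                freq++;
            }
        }
        // Sum the term frequency of every query word in the document
        score += freq / (double) words.length;
    }
    return score;
}

// With d1 and d2 holding the two document strings above:
// tfOnlyScore("the learning process", d1) = 2/9 ≈ 0.22
// tfOnlyScore("the learning process", d2) = 1/13 ≈ 0.077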

Inverse document frequency

As we can see, using only the term frequency function can lead to situations in which unimportant terms are given more weight than important ones. In the example of the previous section, the word “the” is a stop word: it is insignificant and should be filtered out before processing.

The inverse document frequency function measures how much information a term provides across documents, i.e., whether the term is rare or common among them. Stop words and other common words can appear in most documents without adding much value, while a rare term that appears in only a small number of documents gives us more relevant information. The inverse document frequency assigns a lower score to common, unimportant terms and a higher score to more important ones. The idf function is defined as follows:

\mathrm{idf}(t, D) =  \log \frac{N}{|\{d \in D: t \in d\}|}
  • Where N is the total number of documents in the corpus D
  • The denominator is the number of documents in which the term t appears. If the term doesn’t appear in any document, this leads to a division by zero; hence it’s common to add 1 to the denominator, i.e., 1 + |\{d \in D : t \in d\}|

The inverse document frequency got its name because instead of dividing the number of documents in which the term appears by the total number of documents, we invert the ratio. We then take the logarithm (here, base e) to compute the final score.

Now, let’s compute the idf (without the +1 smoothing) of some terms over this corpus D of two documents:

Document 1 (d_1): the best way to learn something is to teach it
Document 2 (d_2): i am always looking for something to improve my writing
  • idf("something", D) = \log\frac{2}{2} = 0
  • idf("learn", D) = \log\frac{2}{1} = 0.69314

Term frequency-inverse document frequency

The two weighting schemes tf and idf can be combined to calculate the relevance score as follows:

\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)

The weighting score given by this function is high when both tf(t,d) and idf(t,D) are high, i.e., when the term occurs frequently in the given document and rarely across the collection of documents. tfidf(t,d,D) gives a score of 0 if the term t appears in every document (because idf(t,D) = 0); hence, using tfidf, we can filter out terms common to all documents.
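
For example, reusing the two documents from the previous section (d_1 contains 10 words):

  • tfidf("learn", d_1, D) = \frac{1}{10} \cdot 0.69314 \approx 0.0693
  • tfidf("something", d_1, D) = \frac{1}{10} \cdot 0 = 0

The common word “something” is weighted down to 0, while the rarer word “learn” keeps a positive score.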

A code example

Now we are ready to create a simple program that calculates the relevance of documents to a user’s query using the tfidf function. We first create a class that models our document:

record Document(int docId, String doc) {}

Then, we create a function that calculates the term frequency score:

public static double getTermFrequencyScore(String term, Document document) {
    // Split the document into words on whitespace
    String[] words = document.doc().split("\\s+");
    int size = words.length;
    int freq = 0;
    for (final String word : words) {
        if (word.equals(term)) {
            freq++;
        }
    }
    // Occurrences of the term divided by the total number of words
    return freq / (double) size;
}
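
As a quick sanity check, we can replay the example from the term frequency section (with the comma removed, since the function splits on whitespace only and would otherwise treat “retrieval,” as a different word):

Document d = new Document(1, "I am learning information retrieval and you are learning information retrieval as well");
double tf = getTermFrequencyScore("retrieval", d); // 2/13 ≈ 0.1538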

Next, we make another function to calculate the inverse document frequency score; the input will be the term along with the list of documents:

public static double getInverseDocumentFrequencyScore(String term, List<Document> documents) {
    int size = documents.size();
    int freq = 0;
    for (final Document document : documents) {
        String[] words = document.doc().split("\\s+");
        for (final String word : words) {
            if (word.equals(term)) {
                // Count each document at most once, then move on to the next one
                freq++;
                break;
            }
        }
    }
    // Add 1 to the denominator to avoid division by zero
    return Math.log(size / (double) (1 + freq));
}

Finally, we can add the function getTfIdfScore, which is the product of the values returned by the two previous functions:

public static double getTfIdfScore(String term, List<Document> allDocs, Document doc) {
      double termFrequencyScore = getTermFrequencyScore(term, doc);
      double inverseDocumentFrequencyScore = getInverseDocumentFrequencyScore(term, allDocs);
      return termFrequencyScore * inverseDocumentFrequencyScore;
}
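
A brief usage sketch follows; note that because getInverseDocumentFrequencyScore adds 1 to its denominator, the values differ from the unsmoothed hand calculation earlier, and the third document below is invented just for this example:

List<Document> documents = List.of(
        new Document(1, "the best way to learn something is to teach it"),
        new Document(2, "i am always looking for something to improve my writing"),
        new Document(3, "writing a little every day is a fine way to improve"));

// idf("learn") = ln(3 / (1 + 1)) ≈ 0.405 and tf("learn", d_1) = 1/10
double score = getTfIdfScore("learn", documents, documents.get(0)); // ≈ 0.0405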

We’re ready to make use of our methods by putting them in a utility class RelevancyScoreUtils; next, we create another class named DocumentRelevancyCalculator, which has a function that takes a user’s query as input and returns a list of relevant documents:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

record DocumentTermScore(String term, Document doc, double relScore) { }

class DocumentRelevancyCalculator {
    private final List<Document> documents;

    public DocumentRelevancyCalculator(List<Document> documents) {
        this.documents = documents;
    }

    public List<DocumentTermScore> getRelevantDocuments(String query, int k) {
        List<DocumentTermScore> relatedDocs = new ArrayList<>();
        String[] terms = query.split("\\s+");
        for (final Document doc : documents) {
            double tfidfScore = 0;
            for (final String term : terms) {
                tfidfScore += RelevancyScoreUtils.getTfIdfScore(term, documents, doc);
            }
            if (tfidfScore > 0) {
                relatedDocs.add(new DocumentTermScore(query, doc, tfidfScore));
            }
        }
        // Sort by descending score and return at most the top k documents
        relatedDocs.sort(Comparator.comparingDouble(DocumentTermScore::relScore).reversed());
        return relatedDocs.subList(0, Math.min(k, relatedDocs.size()));
    }
}

Finally, we can create a test class that harnesses the function we’ve just created:

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.List;

import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

class DocumentRelevancyCalculatorTest {
    static DocumentRelevancyCalculator documentRelevancyCalculator;

    @BeforeEach
    public void init() {
        documentRelevancyCalculator = new DocumentRelevancyCalculator(List.of(
                new Document(1, "the chains of habit are too weak to be felt until they are too strong to be broken."),
                new Document(2, "think before you speak. read before you think."),
                new Document(3, "what do you think about our improvement plan?"),
                new Document(4, "if you can do something in under 2 minutes, then just doing it right away might save you a lot of time."),
                new Document(5, "15 minutes of direct sunlight in the morning will save you hours of insomnia in the evening."))
        );
    }

    @Test
    void testGetRelevantDocuments() {
        List<DocumentTermScore> documents = documentRelevancyCalculator.getRelevantDocuments("think", 2);
        assertEquals(2, documents.get(0).doc().docId());
        assertEquals(3, documents.get(1).doc().docId());
    }
}

To wrap up, the term frequency and inverse document frequency scores are often combined to evaluate the relevance of a given term to the documents in a corpus. The tfidf function gives a higher weighting score when a term appears often in one document and rarely in the others, and it gives a score of 0 when the term appears in every document.

The code example is available here.
