Evaluating Summarization Quality with ROUGE Metrics

Author

Andres Monge

Published

March 3, 2025

Summarization is a critical task in natural language processing (NLP), but how do we measure the quality of a summary? In this article, we’ll explore ROUGE metrics, a set of evaluation tools widely used to assess the effectiveness of summarization models.

Common Libraries

To implement and evaluate ROUGE metrics, several Python libraries are commonly used. Below is a list of essential libraries and their roles in the process:

Code
from dataclasses import dataclass
import re
import textwrap
from typing import Literal
import unittest

from rouge_score import rouge_scorer

# Type alias for a ROUGE result: metric name -> score.
SCORES = dict[Literal["rouge_1", "rouge_2", "rouge_l", "rouge_lsum"], float]
# Separator used by print_case between output lines.
LINE_BREAK = "\n"
  • dataclasses: Simplifies the creation of structured data classes, which are useful for organizing evaluation results.

  • textwrap: Helps format and wrap text, ensuring summaries are presented cleanly.

  • typing.Literal: Provides type hints for specific values, improving code clarity and robustness.

  • unittest: Facilitates the creation of unit tests to validate the correctness of the ROUGE implementation.

  • re: Enables regular expression operations, which are often used for text preprocessing.

  • rouge_score: The core library for computing ROUGE metrics, offering pre-built functions for evaluating summarization quality.

  • SCORES: A type alias for ROUGE results: a dictionary mapping each metric name ("rouge_1", "rouge_2", "rouge_l", "rouge_lsum") to its float score.

  • LINE_BREAK: The separator print_case uses between output lines (a plain newline).

These libraries form the foundation for implementing and evaluating ROUGE metrics. All of them ship with the Python standard library except rouge_score, which is distributed on PyPI as rouge-score (install it with pip install rouge-score).

Helper Function: print_case

To make our results more readable, we’ll use the print_case function. This function formats and displays the original text, summary, ROUGE scores, and compression ratio in a visually appealing way.

Code
def print_case(
    case_name: str,
    text: str,
    summary: str,
    scores: SCORES,
    summary_ratio: float | None = None,
) -> None:
    # ANSI escape codes for colored terminal output.
    CYAN = "\033[96m"
    MAGENTA = "\033[95m"
    BLUE = "\033[94m"
    RESET = "\033[0m"
    GRAY = "\033[90m"
    RED = "\033[91m\033[4m"
    CLEAR = "\033[F"  # move the cursor up one line

    # Turn a test name like "test_partial_summary" into "Partial summary".
    title = " ".join(case_name.split("_")[1:]).capitalize()

    print(f"{CLEAR}{RED}{title}{RESET}", end=LINE_BREAK)
    print(f"{CYAN}Text{RESET}{LINE_BREAK}{text}{LINE_BREAK}", end=LINE_BREAK)
    print(f"{CYAN}Summary{RESET}{LINE_BREAK}{summary}{LINE_BREAK}", end=LINE_BREAK)
    print(f"{CYAN}Scores{RESET}", end=LINE_BREAK)
    print(
        f"{MAGENTA}Rouge 1           :{RESET} {BLUE}{scores['rouge_1']:5.4f}  "
        f"{GRAY}(F-measure of unigram overlap){RESET}", end=LINE_BREAK
    )
    print(
        f"{MAGENTA}Rouge 2           :{RESET} {BLUE}{scores['rouge_2']:5.4f}  "
        f"{GRAY}(F-measure of bigram overlap){RESET}", end=LINE_BREAK
    )
    print(
        f"{MAGENTA}Rouge L           :{RESET} {BLUE}{scores['rouge_l']:5.4f}  "
        f"{GRAY}(Longest common subsequence){RESET}", end=LINE_BREAK
    )
    print(
        f"{MAGENTA}Rouge L∑          :{RESET} {BLUE}{scores['rouge_lsum']:5.4f}  "
        f"{GRAY}(Sentence-level longest common subsequence){RESET}", end=LINE_BREAK
    )
    if summary_ratio is not None:
        print(
            f"{MAGENTA}Compression ratio :{RESET} {BLUE}{summary_ratio:5.4f}  "
            f"{GRAY}(Ratio of summary tokens to source tokens){RESET}", end=LINE_BREAK
        )
    print(end=LINE_BREAK)
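For instance, calling print_case with invented numbers (illustrative only; real values come from the RougeMetrics class below) renders a block like the ones in the test cases that follow:

Code
# Demo call with made-up scores, to show the expected SCORES shape.
print_case(
    "test_demo",
    text="The cat sat on the mat.",
    summary="The cat sat.",
    scores={"rouge_1": 0.75, "rouge_2": 0.5, "rouge_l": 0.75, "rouge_lsum": 0.75},
    summary_ratio=0.5,
)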

ROUGE Metrics

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics are used to evaluate how well a summary captures the content of the original text. The most commonly used ROUGE metrics are:

  • ROUGE-1: Measures the overlap of unigrams (single words) between the summary and the original text.
  • ROUGE-2: Measures the overlap of bigrams (pairs of words) between the summary and the original text.
  • ROUGE-L: Measures the longest common subsequence (LCS) between the summary and the original text, rewarding words that appear in the same order even if they are not contiguous. It is most informative when word order is preserved.
  • ROUGE-LSum: Similar to ROUGE-L, but computes the LCS sentence by sentence (splitting on newlines) and aggregates the results, making it better suited to multi-sentence summaries.

These metrics provide insight into how well a summary captures the content of the original text. However, ROUGE-L degrades when the wording is reordered even if the meaning is preserved, so a combination of ROUGE-1, ROUGE-2, and ROUGE-LSum is often more reliable.
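To make the n-gram variants concrete, here is a minimal hand-rolled sketch of the ROUGE-N F-measure. Treat it as an approximation for illustration: the real rouge_score library applies its own tokenization (and optional stemming), so exact values can differ slightly.

Code
from collections import Counter

def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """ROUGE-N F-measure from clipped n-gram overlap (illustrative sketch)."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    # Each n-gram is counted at most min(reference count, candidate count) times.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge_n_f1("A quick brown fox jumps over the lazy dog.",
                 "A fox jumps over a dog.", n=1))  # ~0.6667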

Implementing RougeMetrics

To compute ROUGE scores, we’ll implement a Python class called RougeMetrics. This class will calculate ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-LSum scores for a given text and summary.

Code
from dataclasses import dataclass
from rouge_score import rouge_scorer

@dataclass
class RougeMetrics:
    """
    A class to compute ROUGE metrics for evaluating summary quality.
    """
    text: str
    summary: str

    def __rouge(self, algorithm: str) -> float:
        # A fresh scorer per metric keeps the example simple; see the
        # one-pass variant below for computing all four at once.
        scorer = rouge_scorer.RougeScorer([algorithm])
        # score(target, prediction) returns {algorithm: Score(precision, recall, fmeasure)}
        return round(scorer.score(self.text, self.summary)[algorithm].fmeasure, 4)

    def _rouge_1(self) -> float:
        return self.__rouge("rouge1")

    def _rouge_2(self) -> float:
        return self.__rouge("rouge2")

    def _rouge_l(self) -> float:
        return self.__rouge("rougeL")

    def _rouge_lsum(self) -> float:
        return self.__rouge("rougeLsum")

    def scores(self) -> SCORES:
        """
        Compute ROUGE scores for a given text and summary.
        """
        return {
            "rouge_1": self._rouge_1(),
            "rouge_2": self._rouge_2(),
            "rouge_l": self._rouge_l(),
            "rouge_lsum": self._rouge_lsum(),
        }
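Note that rouge_score can also compute all four metrics with a single scorer instance; the per-metric construction above is just easier to read. A one-pass equivalent looks like this:

Code
# One scorer, one call: returns a dict keyed by ROUGE variant, where each
# value is a Score tuple with precision, recall, and fmeasure fields.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL", "rougeLsum"])
result = scorer.score(
    "The cat sat on the mat.",  # target (reference text)
    "The cat sat on the mat.",  # prediction (summary)
)
print({name: round(score.fmeasure, 4) for name, score in result.items()})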

Testing ROUGE Metrics with Examples

Let’s test the RougeMetrics class with a few examples to understand how ROUGE scores work.

Example 1: No summary

Code
def test_no_summary():
    text = "The cat sat on the mat."
    summary = "The cat sat on the mat."
    scores = RougeMetrics(text=text, summary=summary).scores()

    assert scores["rouge_1"] >= 1.0
    assert scores["rouge_2"] >= 1.0
    assert scores["rouge_l"] >= 1.0
    assert scores["rouge_lsum"] >= 1.0
    print_case("test_no_summary", text, summary, scores)

test_no_summary()

No summary
Text
The cat sat on the mat.

Summary
The cat sat on the mat.

Scores
Rouge 1           : 1.0000  (F-measure of unigram overlap)
Rouge 2           : 1.0000  (F-measure of bigram overlap)
Rouge L           : 1.0000  (Longest common subsequence)
Rouge L∑          : 1.0000  (Sentence-level longest common subsequence)

Explanation:

Since the summary is identical to the original text, all ROUGE scores are 1.0. This is a degenerate case: the perfect score reflects perfect overlap, not a good summary, since no summarization actually took place.

Example 2: Partial Summary

Code
def test_partial_summary():
    text = "A quick brown fox jumps over the lazy dog."
    summary = "A fox jumps over a dog."
    metrics = RougeMetrics(text=text, summary=summary)
    scores = metrics.scores()
    expected_scores = (0.6, 0.3, 0.6, 0.6)

    assert scores["rouge_1"] >= expected_scores[0]
    assert scores["rouge_2"] >= expected_scores[1]
    assert scores["rouge_l"] >= expected_scores[2]
    assert scores["rouge_lsum"] >= expected_scores[3]
    print_case("test_partial_summary", text, summary, scores)

test_partial_summary()

Partial summary
Text
A quick brown fox jumps over the lazy dog.

Summary
A fox jumps over a dog.

Scores
Rouge 1           : 0.6667  (F-measure of unigram overlap)
Rouge 2           : 0.3077  (F-measure of bigram overlap)
Rouge L           : 0.6667  (Longest common subsequence)
Rouge L∑          : 0.6667  (Sentence-level longest common subsequence)

Explanation:

  • ROUGE-1: The summary shares 5 unigrams with the text (“a”, “fox”, “jumps”, “over”, “dog”), giving a precision of 5/6, a recall of 5/9, and an F-measure of 0.6667.
  • ROUGE-2: Only 2 bigrams are shared (“fox jumps”, “jumps over”), giving a precision of 2/5, a recall of 2/8, and an F-measure of 0.3077.
  • ROUGE-L: The longest common subsequence is “a fox jumps over dog” (five words), which yields the same precision and recall as ROUGE-1 and therefore the same F-measure of 0.6667.

Example 3: Poor Summary

Code
def test_poor_summary():
    text = "Artificial intelligence is transforming industries like healthcare and finance."
    summary = "AI is changing the world."
    metrics = RougeMetrics(text=text, summary=summary)
    scores = metrics.scores()
    print_case("test_poor_summary", text, summary, scores)

test_poor_summary()

Poor summary
Text
Artificial intelligence is transforming industries like healthcare and finance.

Summary
AI is changing the world.

Scores
Rouge 1           : 0.1429  (F-measure of unigram overlap)
Rouge 2           : 0.0000  (F-measure of bigram overlap)
Rouge L           : 0.1429  (Longest common subsequence)
Rouge L∑          : 0.1429  (Sentence-level longest common subsequence)

Explanation:

  • ROUGE-1: Only one unigram (“is”) is shared, giving a precision of 1/5, a recall of 1/9, and an F-measure of 0.1429. Note that “AI” does not match “Artificial intelligence” at the token level.
  • ROUGE-2: No bigrams are shared, resulting in a score of 0.0.
  • ROUGE-L: The longest common subsequence is just “is”, so ROUGE-L equals ROUGE-1 at 0.1429.

Token Counting and Compression Ratio

While ROUGE metrics are useful, they don’t tell us everything about the quality of a summary. For example, a summary could have high ROUGE scores but still be too long or too short. To address this, we’ll introduce token counting and compression ratio.

Token Counting

Token counting measures the number of unique words in the text and summary. This helps us understand how concise the summary is.

Compression Ratio

The compression ratio is the ratio of the number of tokens in the summary to the number of tokens in the original text. A good summary typically has a compression ratio between 10% and 50%.

Extending RougeMetrics with Token Counting and Compression Ratio

We’ll extend the RougeMetrics class to include token counting and compression ratio.

Code
import re

class RougeMetricsExtended(RougeMetrics):
    """
    Extends the RougeMetrics class with tokenization and compression ratio.
    """

    @staticmethod
    def regex_tokenizer_counter(text: str) -> int:
        """
        Tokenize text using regex and count the number of unique tokens.
        """
        tokens = re.findall(r"\b\w+\b", text)
        unique_tokens = set(tokens)
        return len(unique_tokens)

    def compression_ratio(self) -> float:
        """
        Calculate the compression ratio of the summary compared to the original text.
        """
        source_tokens = self.regex_tokenizer_counter(self.text)
        summary_tokens = self.regex_tokenizer_counter(self.summary)
        return summary_tokens / source_tokens
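A quick check against the partial-summary example used earlier (6 unique summary tokens over 9 unique source tokens, counted case-sensitively):

Code
metrics = RougeMetricsExtended(
    text="A quick brown fox jumps over the lazy dog.",
    summary="A fox jumps over a dog.",
)
print(round(metrics.compression_ratio(), 4))  # 0.6667, i.e. 6 / 9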

Testing Token Counting and Compression Ratio

Let’s test the extended class with the same examples.

Example 1: Partial Summary (with ratio)

Code
def test_partial_summary():
    text = "A quick brown fox jumps over the lazy dog."
    summary = "A fox jumps over a dog."
    metrics = RougeMetricsExtended(text=text, summary=summary)
    scores = metrics.scores()
    ratio = metrics.compression_ratio()
    expected_scores = (0.6, 0.3, 0.6, 0.6)

    assert scores["rouge_1"] >= expected_scores[0]
    assert scores["rouge_2"] >= expected_scores[1]
    assert scores["rouge_l"] >= expected_scores[2]
    assert scores["rouge_lsum"] >= expected_scores[3]
    assert 0.6 <= ratio

    print_case("test_partial_summary", text, summary, scores, ratio)

test_partial_summary()

Partial summary
Text
A quick brown fox jumps over the lazy dog.

Summary
A fox jumps over a dog.

Scores
Rouge 1           : 0.6667  (F-measure of unigram overlap)
Rouge 2           : 0.3077  (F-measure of bigram overlap)
Rouge L           : 0.6667  (Longest common subsequence)
Rouge L∑          : 0.6667  (Sentence-level longest common subsequence)
Compression ratio : 0.6667  (Ratio of summary tokens to source tokens)

Explanation:

The compression ratio is 0.6667, meaning the summary contains about two-thirds as many unique tokens as the original text, so it is barely a compression at all.


Multi-line and More Complex Summarization

First, let's define our multi-line sample text.

Code
MULTILINE = """
Artificial intelligence (AI) is transforming industries across the globe.
From healthcare to finance, AI-powered tools are enabling faster
decision-making, reducing costs, and improving efficiency.

In healthcare, AI is being used to diagnose diseases, predict patient outcomes,
and personalize treatment plans. For example, machine learning algorithms can
analyze medical images to detect cancer earlier than traditional methods.

However, the adoption of AI is not without challenges. Ethical concerns, such
as bias in algorithms and data privacy, must be addressed to ensure fair and
responsible use of AI technologies.

Despite these challenges, the future of AI looks promising. As technology
advances, AI will continue to revolutionize industries, creating new
opportunities and improving quality of life for people worldwide.\
"""
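Before scoring any summaries, it is worth knowing roughly how large this source is. A quick measurement with the unique-token counter defined above (the exact number depends on the regex; it works out to about 92 unique tokens, consistent with the 0.1413 compression ratio reported below for a 13-unique-token summary):

Code
# Unique-token size of the source passage, using the counter from above.
print(RougeMetricsExtended.regex_tokenizer_counter(MULTILINE))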

Example 2: A bad summary should score better than a wrong one

Code
def test_multiline_better_than_wrong():
    text = MULTILINE
    wrong_summary = "The weather is sunny today, and I went for a walk in the park."
    bad_summary = """\
    AI is changing industries like healthcare.
    It has challenges but a bright future.\
    """
    bad_summary = textwrap.dedent(bad_summary)

    wrong_scores = RougeMetrics(text=text, summary=wrong_summary).scores()
    bad_scores = RougeMetrics(text=text, summary=bad_summary).scores()

    assert bad_scores["rouge_1"] > wrong_scores["rouge_1"]
    assert bad_scores["rouge_2"] > wrong_scores["rouge_2"]
    assert bad_scores["rouge_l"] > wrong_scores["rouge_l"]
    assert bad_scores["rouge_lsum"] > wrong_scores["rouge_lsum"]

Explanation:

All four ROUGE scores are higher for the bad summary than for the wrong one: even a terse, low-coverage summary about the right content outscores a fluent summary about something else entirely.

Example 3: Poor Summary (with ratio)

Code
def test_multiline_bad_summary():
    text = MULTILINE
    summary = """\
    AI is changing industries like healthcare.
    It has challenges but a bright future.\
    """
    summary = textwrap.dedent(summary)
    metric = RougeMetricsExtended(text=text, summary=summary)
    scores = metric.scores()
    expected_scores = (0.09, 0.0, 0.09, 0.09)
    ratio = metric.compression_ratio()

    assert scores["rouge_1"] >= expected_scores[0]
    assert scores["rouge_2"] >= expected_scores[1]
    assert scores["rouge_l"] >= expected_scores[2]
    assert scores["rouge_lsum"] >= expected_scores[3]
    assert ratio <= 0.2
    print_case("test_multiline_bad_summary", text, summary, scores, ratio)

test_multiline_bad_summary()

Multiline bad summary
Text
Artificial intelligence (AI) is transforming industries across the globe.
From healthcare to finance, AI-powered tools are enabling faster
decision-making, reducing costs, and improving efficiency.

In healthcare, AI is being used to diagnose diseases, predict patient outcomes,
and personalize treatment plans. For example, machine learning algorithms can
analyze medical images to detect cancer earlier than traditional methods.

However, the adoption of AI is not without challenges. Ethical concerns, such
as bias in algorithms and data privacy, must be addressed to ensure fair and
responsible use of AI technologies.

Despite these challenges, the future of AI looks promising. As technology
advances, AI will continue to revolutionize industries, creating new
opportunities and improving quality of life for people worldwide.

Summary
AI is changing industries like healthcare.
It has challenges but a bright future.

Scores
Rouge 1           : 0.0916  (F-measure of unigram overlap)
Rouge 2           : 0.0155  (F-measure of bigram overlap)
Rouge L           : 0.0916  (Longest common subsequence)
Rouge L∑          : 0.0916  (Sentence-level longest common subsequence)
Compression ratio : 0.1413  (Ratio of summary tokens to source tokens)

Explanation:

The compression ratio is 0.1413, meaning the summary contains about 14% as many unique tokens as the original text, comfortably inside the suggested range of 10% to 50%.

However, the low ROUGE scores suggest that the summary is not very informative.

Example 4: Excellent Summary (with ratio)

Code
def test_multiline_excellent_summary():
    text = MULTILINE
    summary = """\
    AI is transforming industries like healthcare and finance, enabling faster
    decision-making and improving efficiency. In healthcare, AI is used to diagnose
    diseases and personalize treatments. However, challenges like ethical concerns
    and data privacy must be addressed. Despite these, AI has a promising future,
    revolutionizing industries and improving quality of life worldwide.\
    """
    summary = textwrap.dedent(summary)
    metrics = RougeMetricsExtended(text=text, summary=summary)
    scores = metrics.scores()
    expected_scores = (0.5, 0.3, 0.5, 0.5)
    ratio = metrics.compression_ratio()

    assert scores["rouge_1"] >= expected_scores[0]
    assert scores["rouge_2"] >= expected_scores[1]
    assert scores["rouge_l"] >= expected_scores[2]
    assert scores["rouge_lsum"] >= expected_scores[3]
    assert 0.3 <= ratio <= 0.5
    print_case("test_multiline_excelent_summary", text, summary, scores, ratio)

test_multiline_excellent_summary()

Multiline excellent summary
Text
Artificial intelligence (AI) is transforming industries across the globe.
From healthcare to finance, AI-powered tools are enabling faster
decision-making, reducing costs, and improving efficiency.

In healthcare, AI is being used to diagnose diseases, predict patient outcomes,
and personalize treatment plans. For example, machine learning algorithms can
analyze medical images to detect cancer earlier than traditional methods.

However, the adoption of AI is not without challenges. Ethical concerns, such
as bias in algorithms and data privacy, must be addressed to ensure fair and
responsible use of AI technologies.

Despite these challenges, the future of AI looks promising. As technology
advances, AI will continue to revolutionize industries, creating new
opportunities and improving quality of life for people worldwide.

Summary
AI is transforming industries like healthcare and finance, enabling faster
decision-making and improving efficiency. In healthcare, AI is used to diagnose
diseases and personalize treatments. However, challenges like ethical concerns
and data privacy must be addressed. Despite these, AI has a promising future,
revolutionizing industries and improving quality of life worldwide.

Scores
Rouge 1           : 0.5412  (F-measure of unigram overlap)
Rouge 2           : 0.3214  (F-measure of bigram overlap)
Rouge L           : 0.5176  (Longest common subsequence)
Rouge L∑          : 0.5176  (Sentence-level longest common subsequence)
Compression ratio : 0.4457  (Ratio of summary tokens to source tokens)

Explanation:

The compression ratio is 0.4457, meaning the summary contains about 45% as many unique tokens as the original text, near the upper end of the 10% to 50% range.

The ROUGE scores are also high, indicating that the summary covers the source content well.

Conclusion

ROUGE metrics provide a robust way to evaluate summarization quality, but they are not sufficient on their own. ROUGE-L is useful when word order is preserved, but it loses value when the wording is rearranged. Therefore, a combination of ROUGE-1, ROUGE-2, and ROUGE-LSum is often more reliable.

Additionally, the compression ratio is crucial for assessing the conciseness of a summary. A good summary should have a compression ratio between 10% and 50%, ensuring it is both concise and informative.

By combining ROUGE metrics with tokenization and compression analysis, we can build more effective summarization models and ensure they meet the desired quality standards.
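As a closing sketch, both signals can be combined into a single acceptance check. The thresholds below are illustrative only, taken from the ranges discussed in this article rather than from any standard:

Code
def is_acceptable_summary(text: str, summary: str,
                          min_rouge_1: float = 0.5) -> bool:
    """Combine ROUGE coverage with the compression-ratio range suggested above."""
    metrics = RougeMetricsExtended(text=text, summary=summary)
    scores = metrics.scores()
    ratio = metrics.compression_ratio()
    return scores["rouge_1"] >= min_rouge_1 and 0.1 <= ratio <= 0.5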