Code
def test_no_summary():
    text = "The cat sat on the mat."
    summary = "The cat sat on the mat."
    scores = RougeMetrics(text=text, summary=summary).scores()

    assert scores["rouge_1"] >= 1.0
    assert scores["rouge_2"] >= 1.0
    assert scores["rouge_l"] >= 1.0
    assert scores["rouge_lsum"] >= 1.0
    print_case("test_no_summary", text, summary, scores)

test_no_summary()
No summary

Text
The cat sat on the mat.

Summary
The cat sat on the mat.

Scores
Rouge 1  : 1.0000 (Precision of unigrams)
Rouge 2  : 1.0000 (Precision of bigrams)
Rouge L  : 1.0000 (Longest common subsequence)
Rouge L∑ : 1.0000 (Average of longest common subsequences)
Explanation:
Since the summary is identical to the original text, all ROUGE scores are 1.0, falsely indicating a perfect match.
Example 2: Partial Summary
Code
def test_partial_summary():
    text = "A quick brown fox jumps over the lazy dog."
    summary = "A fox jumps over a dog."
    metrics = RougeMetrics(text=text, summary=summary)
    scores = metrics.scores()
    expected_scores = (0.6, 0.3, 0.6, 0.6)

    assert scores["rouge_1"] >= expected_scores[0]
    assert scores["rouge_2"] >= expected_scores[1]
    assert scores["rouge_l"] >= expected_scores[2]
    assert scores["rouge_lsum"] >= expected_scores[3]
    print_case("test_partial_summary", text, summary, scores)

test_partial_summary()
Partial summary

Text
A quick brown fox jumps over the lazy dog.

Summary
A fox jumps over a dog.

Scores
Rouge 1  : 0.6667 (Precision of unigrams)
Rouge 2  : 0.3077 (Precision of bigrams)
Rouge L  : 0.6667 (Longest common subsequence)
Rouge L∑ : 0.6667 (Average of longest common subsequences)
Explanation:
- ROUGE-1: Five of the summary's six unigrams match the text (“a”, “fox”, “jumps”, “over”, “dog”; the second “a” has no remaining match). Balancing that precision (5/6) against recall over the text's nine unigrams (5/9) gives 0.6667.
- ROUGE-2: Only two of the summary's five bigrams match (“fox jumps”, “jumps over”). With precision 2/5 and recall 2/8, the score is 0.3077.
- ROUGE-L: The longest common subsequence is “a fox jumps over dog” (five tokens), giving the same 0.6667 as ROUGE-1.
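These numbers can be reproduced by hand in a few lines of Python. The sketch below is standalone (it does not use the RougeMetrics class) and assumes lowercase word tokenization and F1-style scoring, which matches the values printed above:

```python
import re
from collections import Counter

def rouge_n(text: str, summary: str, n: int) -> float:
    """F1-style n-gram overlap, assuming lowercase word tokenization."""
    ref = re.findall(r"\w+", text.lower())
    hyp = re.findall(r"\w+", summary.lower())
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    # Clipped overlap: each reference n-gram can be matched at most once.
    overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    if not overlap:
        return 0.0
    precision = overlap / sum(hyp_ngrams.values())
    recall = overlap / sum(ref_ngrams.values())
    return 2 * precision * recall / (precision + recall)

text = "A quick brown fox jumps over the lazy dog."
summary = "A fox jumps over a dog."
print(round(rouge_n(text, summary, 1), 4))  # 0.6667
print(round(rouge_n(text, summary, 2), 4))  # 0.3077
```

This is only a back-of-the-envelope check; production code should use an established ROUGE implementation.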
Example 3: Poor Summary
Code
def test_poor_summary():
    text = "Artificial intelligence is transforming industries like healthcare and finance."
    summary = "AI is changing the world."
    metrics = RougeMetrics(text=text, summary=summary)
    scores = metrics.scores()
    print_case("test_poor_summary", text, summary, scores)

test_poor_summary()
Poor summary

Text
Artificial intelligence is transforming industries like healthcare and finance.

Summary
AI is changing the world.

Scores
Rouge 1  : 0.1429 (Precision of unigrams)
Rouge 2  : 0.0000 (Precision of bigrams)
Rouge L  : 0.1429 (Longest common subsequence)
Rouge L∑ : 0.1429 (Average of longest common subsequences)
Explanation:
- ROUGE-1: Only one of the summary's five unigrams (“is”) appears in the text. With precision 1/5 and recall 1/9, the score is 0.1429.
- ROUGE-2: No bigrams are captured, resulting in a score of 0.0.
- ROUGE-L: The longest common subsequence is just “is” (one token), again giving 0.1429.
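ROUGE-L can likewise be checked by hand with the classic longest-common-subsequence dynamic program. As with the earlier sketch, this is standalone code that assumes lowercase word tokenization and F1-style scoring:

```python
import re

def rouge_l(text: str, summary: str) -> float:
    """F1 over the longest common subsequence of lowercased word tokens."""
    ref = re.findall(r"\w+", text.lower())
    hyp = re.findall(r"\w+", summary.lower())
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, start=1):
        for j, h in enumerate(hyp, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == h else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(ref)][len(hyp)]
    if not lcs:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

text = "Artificial intelligence is transforming industries like healthcare and finance."
summary = "AI is changing the world."
print(round(rouge_l(text, summary), 4))  # 0.1429
```

The LCS here has length 1 (“is”), so precision is 1/5, recall is 1/9, and the F1 works out to 0.1429, matching the printed score.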
Token Counting and Compression Ratio
While ROUGE metrics are useful, they don’t tell us everything about the quality of a summary. For example, a summary could have high ROUGE scores but still be too long or too short. To address this, we’ll introduce token counting and compression ratio.
Token Counting
Token counting measures the number of unique words in the text and summary. This helps us understand how concise the summary is.
Compression Ratio
The compression ratio is the ratio of the number of unique tokens in the summary to the number of unique tokens in the original text. A good summary typically has a compression ratio between 10% and 50%.
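As a quick sanity check, the ratio can be computed with nothing more than a regex tokenizer. This is a minimal standalone sketch that counts unique tokens with a `\b\w+\b` pattern:

```python
import re

def compression_ratio(text: str, summary: str) -> float:
    """Ratio of unique summary tokens to unique source tokens."""
    def unique_count(s: str) -> int:
        # Unique word-like tokens, case-sensitive.
        return len(set(re.findall(r"\b\w+\b", s)))
    return unique_count(summary) / unique_count(text)

ratio = compression_ratio(
    "A quick brown fox jumps over the lazy dog.",  # 9 unique tokens
    "A fox jumps over a dog.",                     # 6 unique tokens ("A" != "a")
)
print(round(ratio, 4))  # 0.6667
```

Note that because the regex is case-sensitive, “A” and “a” count as distinct tokens; lowercasing first would change the counts.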
Extending RougeMetrics with Token Counting and Compression Ratio
We'll extend the RougeMetrics class to include token counting and compression ratio.
Code
import re


class RougeMetricsExtended(RougeMetrics):
    """
    Extends the RougeMetrics class with tokenization and compression ratio.
    """

    @staticmethod
    def regex_tokenizer_counter(text: str) -> int:
        """
        Tokenize text using regex and count the number of unique tokens.
        """
        tokens = re.findall(r"\b\w+\b", text)
        unique_tokens = set(tokens)
        return len(unique_tokens)

    def compression_ratio(self) -> float:
        """
        Calculate the compression ratio of the summary compared to the original text.
        """
        source_tokens = self.regex_tokenizer_counter(self.text)
        summary_tokens = self.regex_tokenizer_counter(self.summary)
        return summary_tokens / source_tokens
Testing Token Counting and Compression Ratio
Let’s test the extended class with the same examples.
Example 1: Partial Summary (with ratio)
Code
def test_partial_summary():
    text = "A quick brown fox jumps over the lazy dog."
    summary = "A fox jumps over a dog."
    metrics = RougeMetricsExtended(text=text, summary=summary)
    scores = metrics.scores()
    ratio = metrics.compression_ratio()
    expected_scores = (0.6, 0.3, 0.6, 0.6)

    assert scores["rouge_1"] >= expected_scores[0]
    assert scores["rouge_2"] >= expected_scores[1]
    assert scores["rouge_l"] >= expected_scores[2]
    assert scores["rouge_lsum"] >= expected_scores[3]
    assert 0.6 <= ratio
    print_case("test_partial_summary", text, summary, scores, ratio)

test_partial_summary()
Partial summary

Text
A quick brown fox jumps over the lazy dog.

Summary
A fox jumps over a dog.

Scores
Rouge 1           : 0.6667 (Precision of unigrams)
Rouge 2           : 0.3077 (Precision of bigrams)
Rouge L           : 0.6667 (Longest common subsequence)
Rouge L∑          : 0.6667 (Average of longest common subsequences)
Compression ratio : 0.6667 (Ratio of summary tokens to source tokens)
Explanation:
The compression ratio is 0.6667: the summary retains about two thirds of the source's unique tokens, well above the 10–50% range we expect from a good summary, so it is barely compressing at all.
Multi-line and More Complex Summarization
First, let's define our multi-line sample text.
Code
MULTILINE = """\
Artificial intelligence (AI) is transforming industries across the globe.
From healthcare to finance, AI-powered tools are enabling faster
decision-making, reducing costs, and improving efficiency.

In healthcare, AI is being used to diagnose diseases, predict patient outcomes,
and personalize treatment plans. For example, machine learning algorithms can
analyze medical images to detect cancer earlier than traditional methods.

However, the adoption of AI is not without challenges. Ethical concerns, such
as bias in algorithms and data privacy, must be addressed to ensure fair and
responsible use of AI technologies.

Despite these challenges, the future of AI looks promising. As technology
advances, AI will continue to revolutionize industries, creating new
opportunities and improving quality of life for people worldwide.\
"""
Example 2: Bad summary should be better than no summary
Code
import textwrap


def test_multiline_better_than_wrong():
    text = MULTILINE
    wrong_summary = "The weather is sunny today, and I went for a walk in the park."
    bad_summary = """\
    AI is changing industries like healthcare.
    It has challenges but a bright future.\
    """
    bad_summary = textwrap.dedent(bad_summary)

    wrong_scores = RougeMetrics(text=text, summary=wrong_summary).scores()
    bad_scores = RougeMetrics(text=text, summary=bad_summary).scores()

    assert bad_scores["rouge_1"] > wrong_scores["rouge_1"]
    assert bad_scores["rouge_2"] > wrong_scores["rouge_2"]
    assert bad_scores["rouge_l"] > wrong_scores["rouge_l"]
    assert bad_scores["rouge_lsum"] > wrong_scores["rouge_lsum"]

test_multiline_better_than_wrong()
Explanation:
On every metric, the bad summary scores higher than the irrelevant one: even a weak summary that shares vocabulary with the source beats a fluent sentence about something else entirely.
Example 3: Poor Summary (with ratio)
Code
def test_multiline_bad_summary():
    text = MULTILINE
    summary = """\
    AI is changing industries like healthcare.
    It has challenges but a bright future.\
    """
    summary = textwrap.dedent(summary)
    metric = RougeMetricsExtended(text=text, summary=summary)
    scores = metric.scores()
    expected_scores = (0.09, 0.0, 0.09, 0.09)
    ratio = metric.compression_ratio()

    assert scores["rouge_1"] >= expected_scores[0]
    assert scores["rouge_2"] >= expected_scores[1]
    assert scores["rouge_l"] >= expected_scores[2]
    assert scores["rouge_lsum"] >= expected_scores[3]
    assert ratio <= 0.2
    print_case("test_multiline_bad_summary", text, summary, scores, ratio)

test_multiline_bad_summary()
Multiline bad summary

Text
Artificial intelligence (AI) is transforming industries across the globe. From healthcare to finance, AI-powered tools are enabling faster decision-making, reducing costs, and improving efficiency.

In healthcare, AI is being used to diagnose diseases, predict patient outcomes, and personalize treatment plans. For example, machine learning algorithms can analyze medical images to detect cancer earlier than traditional methods.

However, the adoption of AI is not without challenges. Ethical concerns, such as bias in algorithms and data privacy, must be addressed to ensure fair and responsible use of AI technologies.

Despite these challenges, the future of AI looks promising. As technology advances, AI will continue to revolutionize industries, creating new opportunities and improving quality of life for people worldwide.

Summary
AI is changing industries like healthcare. It has challenges but a bright future.

Scores
Rouge 1           : 0.0916 (Precision of unigrams)
Rouge 2           : 0.0155 (Precision of bigrams)
Rouge L           : 0.0916 (Longest common subsequence)
Rouge L∑          : 0.0916 (Average of longest common subsequences)
Compression ratio : 0.1413 (Ratio of summary tokens to source tokens)
Explanation:
The compression ratio is 0.1413, indicating that the summary is about 14% as long as the original text — within the desired range.
However, the low ROUGE scores show that the summary, while concise, is not very informative.
Example 4: Excellent Summary (with ratio)
Code
def test_multiline_excelent_summary():
    text = MULTILINE
    summary = """\
    AI is transforming industries like healthcare and finance, enabling faster
    decision-making and improving efficiency. In healthcare, AI is used to diagnose
    diseases and personalize treatments. However, challenges like ethical concerns
    and data privacy must be addressed. Despite these, AI has a promising future,
    revolutionizing industries and improving quality of life worldwide.\
    """
    summary = textwrap.dedent(summary)
    metrics = RougeMetricsExtended(text=text, summary=summary)
    scores = metrics.scores()
    expected_scores = (0.5, 0.3, 0.5, 0.5)
    ratio = metrics.compression_ratio()

    assert scores["rouge_1"] >= expected_scores[0]
    assert scores["rouge_2"] >= expected_scores[1]
    assert scores["rouge_l"] >= expected_scores[2]
    assert scores["rouge_lsum"] >= expected_scores[3]
    assert 0.3 <= ratio <= 0.5
    print_case("test_multiline_excelent_summary", text, summary, scores, ratio)

test_multiline_excelent_summary()
Multiline excelent summary

Text
Artificial intelligence (AI) is transforming industries across the globe. From healthcare to finance, AI-powered tools are enabling faster decision-making, reducing costs, and improving efficiency.

In healthcare, AI is being used to diagnose diseases, predict patient outcomes, and personalize treatment plans. For example, machine learning algorithms can analyze medical images to detect cancer earlier than traditional methods.

However, the adoption of AI is not without challenges. Ethical concerns, such as bias in algorithms and data privacy, must be addressed to ensure fair and responsible use of AI technologies.

Despite these challenges, the future of AI looks promising. As technology advances, AI will continue to revolutionize industries, creating new opportunities and improving quality of life for people worldwide.

Summary
AI is transforming industries like healthcare and finance, enabling faster decision-making and improving efficiency. In healthcare, AI is used to diagnose diseases and personalize treatments. However, challenges like ethical concerns and data privacy must be addressed. Despite these, AI has a promising future, revolutionizing industries and improving quality of life worldwide.

Scores
Rouge 1           : 0.5412 (Precision of unigrams)
Rouge 2           : 0.3214 (Precision of bigrams)
Rouge L           : 0.5176 (Longest common subsequence)
Rouge L∑          : 0.5176 (Average of longest common subsequences)
Compression ratio : 0.4457 (Ratio of summary tokens to source tokens)
Explanation:
The compression ratio is 0.4457, indicating that the summary is about 45% as long as the original text — within the 10–50% range of a good summary.
The ROUGE scores are also high, indicating that the summary is both informative and faithful to the source.
Conclusion
ROUGE metrics provide a robust way to evaluate summarization quality, but they are not sufficient on their own. ROUGE-L is informative when word order is preserved, but loses value when the summary rephrases or reorders the source. A combination of ROUGE-1, ROUGE-2, and ROUGE-LSum is therefore often more reliable.
Additionally, the compression ratio is crucial for assessing the conciseness of a summary. A good summary should have a compression ratio between 10% and 50%, ensuring it is both concise and informative.
By combining ROUGE metrics with tokenization and compression analysis, we can build more effective summarization models and ensure they meet the desired quality standards.
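Putting these ideas together, a simple acceptance check might combine ROUGE thresholds with a compression-ratio band. The helper below is hypothetical and its threshold values are illustrative, not prescriptive; tune them to your own corpus:

```python
def summary_passes(scores: dict, ratio: float,
                   min_rouge_1: float = 0.5,
                   min_rouge_2: float = 0.3,
                   min_rouge_lsum: float = 0.5,
                   ratio_band: tuple = (0.10, 0.50)) -> bool:
    """Accept a summary only if overlap is high enough AND it actually compresses."""
    low, high = ratio_band
    return (scores["rouge_1"] >= min_rouge_1
            and scores["rouge_2"] >= min_rouge_2
            and scores["rouge_lsum"] >= min_rouge_lsum
            and low <= ratio <= high)

# Scores taken from the multi-line examples above:
print(summary_passes({"rouge_1": 0.5412, "rouge_2": 0.3214,
                      "rouge_lsum": 0.5176}, 0.4457))  # True  (excellent summary)
print(summary_passes({"rouge_1": 0.0916, "rouge_2": 0.0155,
                      "rouge_lsum": 0.0916}, 0.1413))  # False (bad summary)
```

A gate like this catches both failure modes discussed above: summaries that overlap well but barely compress, and summaries that compress well but say little.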