In content operations, accurately counting the words in an article matters for SEO, content length control, and even fee calculation. AnQi CMS provides the convenient `wordcount` filter to help us do this quickly. However, if the content is not cleaned first, extra whitespace characters can subtly skew the count.
This article looks at how to use AnQi CMS's `wordcount` filter while avoiding interference from these whitespace characters, so that the word count you get is as accurate as possible.
Understanding how `wordcount` works
First, we need to understand how the `wordcount` filter identifies and counts words. According to the AnQi CMS template filter documentation, `wordcount` calculates the number of words in a string: it "will differentiate words by spaces. If it does not contain any spaces, it is considered a word." It returns an integer. This means the filter relies mainly on spaces as word separators: any continuous run of non-space characters (including Chinese characters and punctuation marks) is counted as one "word".
This mechanism works well in most cases, but deviations can occur when non-standard whitespace characters or HTML tags are present.
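
To make the splitting rule concrete, here is a minimal Python sketch that models the behavior described above. It is not AnQi CMS's actual implementation: the function name `count_words` is hypothetical, and the sketch assumes that any run of whitespace acts as a single separator.

```python
def count_words(text: str) -> int:
    """Model of a whitespace-delimited word count (an assumption, not AnQi CMS code)."""
    # str.split() with no arguments splits on runs of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())


print(count_words("Hello World"))       # two space-separated tokens
print(count_words("安企CMS内容管理系统"))  # no spaces at all, so a single token
print(count_words("Hello, World!"))     # punctuation stays attached to its word
```

Under this model a Chinese sentence without spaces counts as one "word", which matches the rule quoted from the documentation.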
Why does whitespace affect the accuracy of the count?
Whitespace characters, as the name implies, are characters that display no visible content. They typically show up in three ways:
- Leading and trailing whitespace: There may be spaces or newline characters at the beginning or end of the article, for example `" Hello World "`.
- Redundant internal spaces: There may be multiple spaces between words or sentences, or non-standard whitespace characters mixed in, such as tabs or full-width spaces. For example, `Hello   World` (several consecutive spaces) or `Hello　World` (a full-width space, U+3000).
- Spaces caused by HTML tags: In a rich text editor, content is usually stored as HTML. After the HTML tags are removed, unnecessary spaces or line breaks may appear between text blocks that were originally adjacent, which can affect the result of `wordcount`. For example, `<div>Hello</div><div>World</div>` may become `Hello World` after the tags are removed, but with a more complex HTML structure it may also end up with extra spaces or line breaks between or around the two words.
If these situations are not preprocessed, `wordcount` may count the empty strings around extra whitespace characters as words, or incorrectly split Chinese text that should stay together because of irregular spaces, leading to inaccurate results.
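
The failure mode is easiest to see with a deliberately naive splitter. The Python sketch below is only an illustration. The helpers `naive_count` and `cleaned_count` are hypothetical and do not reflect how `wordcount` is implemented. It splits on a single literal space, so every extra space produces an empty piece that gets counted.

```python
def naive_count(text: str) -> int:
    """Count pieces produced by splitting on one literal space (illustrative only)."""
    # Splitting on " " keeps empty strings for leading, trailing,
    # and repeated spaces, so they inflate the count.
    return len(text.split(" "))


def cleaned_count(text: str) -> int:
    """Same splitter, but empty pieces are discarded before counting."""
    return len([piece for piece in text.split(" ") if piece])


sample = " Hello  World "
print(naive_count(sample))    # counts the empty pieces as well
print(cleaned_count(sample))  # counts only the real words
```

The gap between the two numbers is exactly the kind of error that cleaning the content beforehand is meant to remove.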
Solution: clean the content to improve counting accuracy
To get an accurate word count, we need to clean the content before applying `wordcount`. AnQi CMS provides several filters that help with this.
1. Remove extra whitespace at the beginning and end: the `trim` filter
This is the most common and most direct fix. The `trim` filter removes all whitespace characters (spaces, newlines, and so on) from the beginning and end of a string.
Usage:
```twig
{# Assume archive.Content is the article content you want to count #}
{{ archive.Content | trim | wordcount }}
```
After `trim`, a string like `" Hello World "` becomes `"Hello World"`, so leading and trailing whitespace no longer affects the count.
2. Process rich text content: the `striptags` filter
If your content comes from a rich text editor, it is likely to contain many HTML tags. These tags are not words themselves, and removing them may leave extra spaces behind. The `striptags` filter removes all HTML and XML tags from a string.
Usage:
```twig
{# Remove all HTML tags first, then trim leading and trailing whitespace, and finally count the words #}
{{ archive.Content | striptags | trim | wordcount }}
```
For example, `"<p>Hello <b>World</b></p>"` becomes `"Hello World"` after `striptags`. If the original content is `" <p>Hello</p> <p>World</p> "`, `striptags` may produce `" Hello World "`; combining it with `trim` then gives `"Hello World"`.
If you only need to remove specific HTML tags (for example, only the `<i>` tag), you can use the `removetags` filter.
Usage:
```twig
{# Remove all i tags first, then continue with the rest of the processing #}
{{ archive.Content | removetags:"i" | striptags | trim | wordcount }}
```
3. Normalize redundant internal spaces: the `replace` filter (optional but recommended)
Although `wordcount` generally treats several consecutive spaces as one delimiter, the `replace` filter is useful if your content contains full-width spaces (such as those entered with a Chinese input method) or other non-standard whitespace characters. We can use it to replace these characters with standard half-width spaces and to normalize consecutive spaces to a single space.
Usage:
```twig
{# Replace full-width spaces (U+3000) with half-width spaces, then replace double half-width spaces with a single one #}
{{ archive.Content | replace:"　, " | replace:"  , " | wordcount }}
```
The `replace` filter is written here in the single-argument `"old,new"` form described in the AnQi CMS filter documentation; verify the exact syntax against your version. Note that the double-space replacement may need to be chained several times, because a single pass may not collapse longer runs of spaces completely. In most cases the way `wordcount` handles consecutive spaces is already sufficient, but if you need maximum precision or have to handle specific non-standard whitespace characters, `replace` is a good supplement.
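
Why one replacement pass is not always enough can be shown with a short Python sketch. It models the idea only. The function `normalize_spaces` and the constant `FULL_WIDTH_SPACE` are hypothetical names, and the sketch assumes that a replacement substitutes every non-overlapping match in a single pass, as Python's `str.replace` does.

```python
FULL_WIDTH_SPACE = "\u3000"  # ideographic space typed by many CJK input methods


def normalize_spaces(text: str) -> str:
    """Convert full-width spaces to ordinary ones and collapse runs of spaces."""
    text = text.replace(FULL_WIDTH_SPACE, " ")
    # One pass of replace("  ", " ") only halves a run (4 spaces -> 2),
    # so repeat until no double space remains.
    while "  " in text:
        text = text.replace("  ", " ")
    return text


one_pass = "Hello    World".replace("  ", " ")
print(repr(one_pass))                                      # a double space is still left
print(repr(normalize_spaces("Hello\u3000\u3000  World")))  # fully collapsed
```

In a template there is no loop, which is why the double-space replacement is chained a few times instead.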
Combining the filters for an accurate count
To get the most accurate word count, it is recommended to combine the filters above into a content-cleaning pipeline:
- Remove HTML tags: use `striptags` or `removetags` to convert rich text content to plain text.
- Clean leading and trailing whitespace: use `trim` to remove redundant whitespace at the beginning and end of the text.
- Standardize internal spaces (optional but recommended): use `replace` as needed to handle full-width spaces or to replace multiple consecutive spaces with a single one.
- Count words: apply `wordcount` last.
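
The four steps can also be sketched end to end in Python. This is a model under stated assumptions rather than AnQi CMS code. The name `clean_and_count` and the regular expression are hypothetical, tags are removed with a simplified pattern, and only ordinary and full-width spaces are normalized.

```python
import re

# Simplified assumption: anything between "<" and ">" is a tag.
TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_and_count(html: str) -> int:
    """Strip tags, trim, normalize spaces, then count space-separated words."""
    text = TAG_PATTERN.sub("", html)     # 1. remove HTML tags
    text = text.strip()                  # 2. trim leading/trailing whitespace
    text = text.replace("\u3000", " ")   # 3a. full-width -> half-width spaces
    while "  " in text:                  # 3b. collapse runs of spaces
        text = text.replace("  ", " ")
    # 4. count; an empty string contains no words
    return len(text.split(" ")) if text else 0


sample = " <p>Hello\u3000World</p>  <p>from   AnQi CMS</p> "
print(clean_and_count(sample))
```

Because the final step splits on a single literal space, the count is only correct when the earlier cleaning steps have run, which mirrors why the order of the filters matters.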
**Practical Example:**
```twig
{# Strip tags, trim, then replace full-width spaces (U+3000) with half-width spaces #}
{% set cleaned_content = archive.Content | striptags | trim | replace:"　, " %}
{% set word_count = cleaned_content | wordcount %}
Total number of words in the article: {{ word_count }}
```