In content operations, accurately counting the words in an article is crucial for SEO, content length control, and even royalty calculation. Anqi CMS provides a convenient wordcount filter that helps us do this quickly. However, if the content is not properly preprocessed, excess whitespace characters can subtly affect the accuracy of the count.

This article looks at how to use the Anqi CMS wordcount filter while avoiding interference from these whitespace characters, so that the word count you get is as accurate as possible.

How wordcount works

First, we need to understand how the wordcount filter identifies and counts words. According to the Anqi CMS template filter documentation, wordcount is used "to calculate the number of words in a string": it "will distinguish words by spaces", and a string that contains no spaces is counted as a single word. The filter returns an integer. In other words, it relies on spaces as word separators: any continuous run of non-space characters (including Chinese characters and punctuation) is treated as one "word".
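The rule described above can be modeled in a few lines of Python. This is only a sketch: count_words is a hypothetical helper, and it assumes the filter splits on runs of whitespace, as Django-style wordcount filters commonly do. The actual Anqi CMS implementation may differ.

```python
def count_words(text: str) -> int:
    """Hypothetical model of a whitespace-delimited word count."""
    # str.split() with no arguments splits on runs of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())


print(count_words("Hello World"))   # 2: two tokens separated by a space
print(count_words("你好世界"))       # 1: no spaces, so a single "word"
print(count_words("Hello, 世界!"))   # 2: punctuation stays attached to its token
```

Note that a Chinese sentence without spaces counts as one word under this rule, so the result is a count of space-separated tokens rather than of characters.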

This mechanism works well in most cases, but the result can be skewed when the string contains non-standard whitespace characters or HTML tags.

Why does whitespace affect counting accuracy?

Whitespace characters are characters that render no visible content. They typically cause trouble in the following situations:

  • Leading and trailing whitespace: The article may begin or end with spaces or newline characters, for example " Hello World ".
  • Redundant internal whitespace: There may be more than one space between words or sentences, or a mix of non-standard whitespace characters such as tabs and full-width spaces, for example "hello   world".
  • Whitespace caused by HTML tags: Rich text editors typically store content as HTML. After the HTML tags are removed, unnecessary spaces or line breaks may appear between text blocks that were originally adjacent, which can affect the wordcount result. For example, <div>Hello</div><div>World</div> may become Hello World after the tags are removed, but with a more complex HTML structure it may also become Hello \n World or Hello  World.

If these cases are not preprocessed, wordcount may count the empty strings around excess whitespace as words, or split Chinese text that belongs together because of stray spaces, leading to inaccurate statistics.
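To see why this matters, consider a deliberately naive counter that splits on every single half-width space. The function naive_count is hypothetical and is not a claim about how wordcount is implemented; it only illustrates the worst case that preprocessing protects against.

```python
def naive_count(text: str) -> int:
    """Hypothetical counter that splits on every single half-width space."""
    return len(text.split(" "))


samples = {
    "clean": "Hello World",
    "padded": " Hello World ",
    "double space": "Hello" + " " * 2 + "World",
    "full-width space": "你好\u3000世界",
}

for label, text in samples.items():
    print(f"{label}: {naive_count(text)}")
# clean: 2
# padded: 4            (empty strings before and after are counted)
# double space: 3      (an empty string between the two spaces is counted)
# full-width space: 1  (U+3000 is not a half-width space, so nothing is split)
```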

Solution: Purify content, improve count accuracy

To obtain an accurate word count, we need to clean the content in several steps before applying wordcount. Anqi CMS provides several filters that help with this.

1. Remove leading and trailing whitespace: the trim filter

This is the most common and direct optimization. The trim filter removes all whitespace characters (spaces, newline characters, and so on) from the beginning and end of a string.

Usage:

{# Assume archive.Content is the article content you want to count #}
{{ archive.Content | trim | wordcount }}

After trim, a string such as ' 你好 世界 ' becomes '你好 世界', so leading and trailing spaces no longer affect the count.

2. Process rich text content: the striptags filter

If your content comes from a rich text editor, it probably contains a large number of HTML tags. These tags are not words themselves, but removing them may introduce extra spaces. The striptags filter removes all HTML and XML tags from a string.

Usage:

{# Remove all HTML tags first, then trim leading and trailing whitespace, and finally count the words #}
{{ archive.Content | striptags | trim | wordcount }}

For example, "<p>Hello <b>World</b></p>" becomes "Hello World" after striptags. If the original content is " <p>Hello</p> <p>World</p> ", striptags may turn it into " Hello World ", and combining it with trim then yields "Hello World".
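How the text looks after tag removal depends on what the tags are replaced with. The Python sketch below uses a rough regular expression (not a full HTML parser) and a hypothetical strip_tags helper to compare two strategies. Which one striptags corresponds to in your Anqi CMS version is an assumption you should verify on real content.

```python
import re

# Rough pattern for illustration only; real HTML needs a proper parser.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(html: str, replacement: str = "") -> str:
    """Hypothetical tag remover: replace every tag with `replacement`."""
    return TAG_RE.sub(replacement, html)


html = "<div>Hello</div><div>World</div>"

glued = strip_tags(html)         # tags deleted outright
spaced = strip_tags(html, " ")   # each tag replaced by one space

print(repr(glued), len(glued.split()))     # 'HelloWorld' 1
print(repr(spaced), len(spaced.split()))   # ' Hello  World ' 2
```

If adjacent blocks end up glued together, the count is too low. If tags turn into spaces, the text gains leading, trailing, and doubled spaces, which is why trim and space normalization follow in the pipeline.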

If you only need to remove specific HTML tags (for example, only <i>), you can use the removetags filter.

Usage:

{# Remove all i tags, then continue with the remaining steps #}
{{ archive.Content | removetags:"i" | striptags | trim | wordcount }}

3. Normalize redundant internal spaces: the replace filter (optional but recommended)

Although wordcount usually treats multiple consecutive spaces as a single separator, the replace filter becomes useful if your content contains full-width spaces (such as those entered with a Chinese input method) or other non-standard whitespace characters. We can use it to replace these characters with standard half-width spaces and to normalize redundant consecutive spaces to a single space.

Usage:

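{# Replace full-width spaces with half-width spaces, then replace double half-width spaces with a single one #}
{# The replace filter takes one parameter in the form "old,new"; the first call below contains a full-width space (U+3000) before the comma #}
{{ archive.Content | replace:"　, " | replace:"  , " | wordcount }}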

Note that the double-space replacement may need to be chained several times to reduce every run of consecutive spaces to a single space, because each call makes only one pass over the text, so a long run of spaces may not collapse completely in a single call. In most cases, wordcount handles consecutive spaces well enough on its own. However, if you are after maximum precision, or need to handle specific non-standard whitespace characters, replace is a good supplement.
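The single-pass behavior is easy to reproduce in Python, whose str.replace also substitutes non-overlapping matches in one pass. The sketch below is an illustration under that assumption, with a hypothetical normalize_spaces helper that simply repeats the replacement until nothing changes.

```python
def normalize_spaces(text: str) -> str:
    """Replace full-width spaces, then collapse runs of half-width spaces."""
    text = text.replace("\u3000", " ")  # full-width space -> half-width space
    # One pass turns a run of four spaces into two, so repeat until stable.
    while "  " in text:
        text = text.replace("  ", " ")
    return text


four_spaces = "a" + " " * 4 + "b"
print(repr(four_spaces.replace("  ", " ")))   # 'a  b': two spaces remain
print(repr(normalize_spaces(four_spaces)))    # 'a b'
print(repr(normalize_spaces("你好\u3000\u3000世界")))  # '你好 世界'
```

In a template there is no loop of this kind, which is why chaining the same replacement two or three times is the practical equivalent.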

Combine the filters for a precise count

To ensure the highest accuracy of word counting, it is recommended to combine the above filters into a content purification pipeline:

  1. Remove HTML tags: Use striptags or removetags to convert rich text content to plain text.
  2. Clean leading and trailing whitespace: Use trim to remove redundant whitespace at the beginning and end of the text.
  3. Standardize internal spacing (optional but recommended): Use replace as needed to convert full-width spaces or to replace multiple consecutive spaces with a single one.
  4. Count words: Apply wordcount last.
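The four steps can be prototyped outside the CMS to check what count to expect for a given piece of content. The sketch below is a hypothetical Python model, not the Anqi CMS implementation: clean_and_count is an invented name, tags are removed with a rough regular expression that turns each tag into a space, and step 3 uses a single regular expression in place of chained replace calls.

```python
import re


def clean_and_count(html: str) -> int:
    """Hypothetical model of: striptags | trim | replace | wordcount."""
    text = re.sub(r"<[^>]*>", " ", html)  # 1. drop tags (each becomes a space)
    text = text.strip()                   # 2. trim leading/trailing whitespace
    text = re.sub(r"\s+", " ", text)      # 3. any whitespace run -> one space
    # 4. count; after cleaning, even a strict single-space split is safe
    return len(text.split(" ")) if text else 0


print(clean_and_count(" <p>Hello</p> <p>World</p> "))          # 2
print(clean_and_count("<p>你好\u3000世界</p>\n<p>Anqi  CMS</p>"))  # 4
print(clean_and_count("<p> </p>"))                              # 0
```

Because the text is fully normalized before counting, the result no longer depends on whether the final counter splits on single spaces or on whitespace runs.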

Practical example:

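{# The replace parameter has the form "old,new"; here it contains a full-width space (U+3000) before the comma #}
{% set cleaned_content = archive.Content | striptags | trim | replace:"　, " %}
{% set word_count = cleaned_content | wordcount %}

Total number of words in the article: {{ word_count }}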