When managing website content in AnQi CMS, we often need to count the words in an article, whether for content planning, SEO, or simply to meet publication requirements. The wordcount filter is a practical tool for this: it quickly calculates the number of words in a text and gives content operations a concrete figure to work with.

However, when the text contains HTML entities, such as the common &nbsp; (non-breaking space), what does the wordcount filter count? This is a frequent source of confusion.

The wordcount Filter: A Review of Basic Functions

In the AnQi CMS template system, the wordcount filter counts the number of words in a given string. It works by recognizing spaces in the text and treating each sequence of characters separated by spaces as an individual 'word'.

{{ "Hello World"|wordcount }} {# result: 2 #}
{{ "安企CMS 是一个高效的内容管理系统"|wordcount }} {# result: 2 (one space, so two space-separated "words") #}

As these examples show, wordcount also uses spaces as delimiters when processing Chinese text. A run of Chinese characters with no spaces in between is treated as a single 'word'.
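The space-delimited rule described above can be modeled outside the CMS. The following is a minimal Python sketch, assuming the filter behaves like a plain whitespace split. The function name wordcount is a stand-in chosen for illustration and is not AnQi CMS's actual implementation.

```python
def wordcount(text: str) -> int:
    """Model of the rule above: count whitespace-separated tokens.

    Assumption: the real filter behaves like a plain whitespace split.
    """
    # str.split() with no arguments splits on runs of whitespace and
    # ignores leading and trailing whitespace.
    return len(text.split())


if __name__ == "__main__":
    print(wordcount("Hello World"))    # 2
    print(wordcount("内容管理系统"))     # 1: no spaces, so one "word"
    print(wordcount("内容 管理 系统"))   # 3: two spaces, so three "words"
```

The Chinese samples make the point of the paragraph above concrete: the count changes only when spaces are added, not when characters are added.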

When HTML entities appear in the text

Now, back to the core question: when the text contains HTML entities, what does wordcount count?

HTML entities such as &nbsp; (non-breaking space), &lt; (less-than sign), and &amp; (ampersand) are parsed into their corresponding characters when the browser renders the page. Inside a program, however, and especially when counting words in plain text, whether an entity is treated as a standard space or decoded before counting depends on how the filter is implemented.

In AnQi CMS, the wordcount filter does not automatically parse or decode HTML entities before counting. It treats an entity such as &nbsp; as a literal sequence of characters.

Let's look at a few examples to understand this behavior:

  1. The text contains an HTML entity but no actual spaces:

    {{ "Hello&nbsp;World"|wordcount }} {# result is likely 1 #}

    In this case, wordcount treats Hello&nbsp;World as one continuous string with no standard space, so it is likely counted as 1 word. The &nbsp; entity is not treated as a space separating "Hello" and "World".

  2. The text contains both actual spaces and an HTML entity:

    {{ "Hello &nbsp; World"|wordcount }} {# result is likely 3 #}

    Here, wordcount recognizes "Hello" and "World" as separated by actual spaces. The &nbsp; character sequence, which has a space on each side, is also counted as a separate 'word'. The filter therefore counts three words: "Hello", "&nbsp;", and "World".

  3. The text contains other HTML entities or tags:

    {{ "安企CMS<strong>非常强大</strong>"|wordcount }} {# result is likely 1 (the string contains no spaces) #}

    In this example, the <strong> and </strong> tags (or their escaped forms &lt;strong&gt; and &lt;/strong&gt;) are likewise treated as plain character sequences. They are counted as separate words only when spaces separate them from the surrounding text. Because the string above contains no spaces, it is likely counted as one long word. With a space after "安企CMS", the parts "安企CMS" and "<strong>非常强大</strong>" would be counted as two words.
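The cases above can be reproduced with a small Python model. This sketch assumes, as the article describes, that counting is a whitespace split with no entity decoding. model_wordcount is a hypothetical helper, not the CMS's own code, so results in a real template should still be verified.

```python
def model_wordcount(text: str) -> int:
    """Whitespace split with no entity decoding (assumed behavior)."""
    return len(text.split())


samples = [
    "Hello&nbsp;World",      # entity only, no real space
    "Hello &nbsp; World",    # entity surrounded by real spaces
    "安企CMS &lt;strong&gt;非常强大&lt;/strong&gt;",  # escaped tags, one real space
]

for sample in samples:
    # The entity text is never decoded, so it only matters whether
    # real whitespace surrounds it.
    print(f"{sample!r}: {model_wordcount(sample)}")
```

Under this model the three samples yield 1, 3, and 2 respectively, because only real whitespace creates word boundaries.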

Practical suggestions: How to get an 'actual' word count

This literal handling of HTML entities and tags is not always what we want. We usually want the word count to reflect what the reader actually sees, not the raw string with its HTML markup.

To get a count that better matches human reading, that is, a plain-text word count that excludes HTML tags and entities, we can clean the content with other filters before applying wordcount.

AnQi CMS provides the striptags filter, which removes all HTML tags from the text. If the content still contains entities such as &nbsp;, we usually want them treated as spaces rather than as words.

The following is a more practical word count method:

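{# Assume the content variable holds text with HTML tags and entities #}
{% set cleanContent = content|striptags %} {# remove all HTML tags #}
{% set finalWordCount = cleanContent|wordcount %} {# count words in the plain text #}

<p>Actual word count of this article (excluding HTML tags and entities): {{ finalWordCount }} words.</p>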

With this combination, striptags first removes HTML tags such as <p> and <strong>. It also tends to replace entities such as &nbsp; with actual spaces, although the exact behavior may vary with the content and context, so it is worth checking the output on your own content. wordcount then runs on relatively clean plain text and gives a result closer to what we intuitively expect.
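To make the idea of "clean first, then count" concrete, here is a hedged Python sketch of the same pipeline run outside the CMS, for example on exported content. It is not how striptags is implemented. The regular expression is a deliberately naive tag matcher, and the entity-decoding step with html.unescape is an assumption about what a "reader-facing" count should do. clean_wordcount is a hypothetical name.

```python
import html
import re

# Naive tag matcher for illustration only; real HTML needs a proper parser.
TAG_RE = re.compile(r"<[^>]*>")


def clean_wordcount(raw: str) -> int:
    """Strip tags, decode entities, then count whitespace-separated tokens."""
    # Replace each tag with a space so words on either side do not merge.
    no_tags = TAG_RE.sub(" ", raw)
    # Decode entities: &nbsp; becomes U+00A0, &amp; becomes "&", and so on.
    decoded = html.unescape(no_tags)
    # str.split() treats U+00A0 as whitespace, so a decoded &nbsp;
    # now separates words instead of being counted as one.
    return len(decoded.split())


if __name__ == "__main__":
    print(clean_wordcount("<p>Hello&nbsp;World</p>"))
    print(clean_wordcount("<p>AnQi CMS is <strong>very powerful</strong></p>"))
```

Tags are removed before entities are decoded so that escaped markup such as &lt;strong&gt; is not mistaken for a real tag and stripped.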

Summary

The wordcount filter in AnQi CMS treats text containing HTML entities as a plain character sequence and does not decode or specially handle them. An entity such as &nbsp; that is separated from its neighbors by spaces may therefore be counted as a word.

For a more accurate, reader-oriented word count, preprocess the content with striptags or similar filters before applying wordcount, so that HTML tags and entities are removed and only plain text is counted. This gives a better picture of the real volume of the content and more accurate data for site operations and SEO strategy.

Common Questions and Answers (FAQ)

1. Does the wordcount filter count Chinese characters?

Yes, the wordcount filter handles Chinese text, but it does not count individual characters. It follows the rule that words are separated by spaces, so consecutive Chinese characters with no spaces between them are counted as one 'word'. For example, "内容管理系统" is counted as 1 word because it contains no spaces.

2. How can I make the wordcount filter ignore punctuation?

The wordcount filter itself cannot ignore punctuation. It treats punctuation as part of the adjacent word, unless the punctuation is separated by spaces, in which case it is counted as a word of its own. To ignore punctuation, you would need to preprocess the text before wordcount, for example by using the replace filter or regular expressions to remove or replace punctuation, but this usually requires custom template functions or more advanced programming. For general needs, it is recommended to accept the default behavior.
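As an illustration of the kind of preprocessing mentioned above, the following Python sketch removes punctuation before counting. It runs outside the template system and is not an AnQi CMS feature. The helper name and the choice to delete every character in a Unicode punctuation category are assumptions made for this example.

```python
import unicodedata


def wordcount_ignoring_punctuation(text: str) -> int:
    """Delete punctuation characters, then count whitespace-separated tokens."""
    # Unicode general categories starting with "P" cover ASCII punctuation
    # as well as full-width marks such as "，" and "。".
    cleaned = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return len(cleaned.split())


if __name__ == "__main__":
    # Stand-alone punctuation no longer counts as a word.
    print(wordcount_ignoring_punctuation("Hello , World !"))
    # Attached punctuation never added a word, so the count is unchanged.
    print(wordcount_ignoring_punctuation("Hello, World!"))
```

Deleting punctuation, rather than replacing it with a space, keeps a token like "don't" as a single word. Replacing with a space would split it in two.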

3. If my content contains images, will wordcount count them?

The wordcount filter only counts words in text. If an image is inserted with an <img> tag, the image itself is not counted, but any text inside the tag, such as the value of the alt attribute, is counted if it reaches the filter as part of the string. To ignore HTML tags and their attribute content entirely, apply striptags before wordcount.
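The effect of an <img> tag on the count can be seen in a short Python model. As before, this assumes a whitespace-split counter and uses a naive regular expression in place of striptags. Both helper names are hypothetical, and the regex would mis-handle an attribute value containing ">".

```python
import re

TAG_RE = re.compile(r"<[^>]*>")  # naive tag matcher, for illustration only


def raw_count(text: str) -> int:
    """Count tokens in the raw string, markup included."""
    return len(text.split())


def stripped_count(text: str) -> int:
    """Remove tags (and their attributes) first, then count."""
    return len(TAG_RE.sub(" ", text).split())


sample = 'Before <img src="a.png" alt="A red car"> after'

# Raw: the spaces inside the tag split it into several "words".
print(raw_count(sample))
# Stripped: only the two visible words remain.
print(stripped_count(sample))
```

In the raw string, the spaces inside the tag turn its attributes into five extra tokens, which is why stripping tags first matters for image-heavy content.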