In content operations, accurately counting the words or characters in an article is an important step in measuring content length, estimating reading time, and doing SEO work. AnQiCMS (AnQi Content Management System) provides a convenient wordcount filter for this. In practice, however, one detail often causes confusion: does punctuation at the start or end of a string, or punctuation attached to a word, affect the wordcount result? Understanding the answer, and knowing how to handle it, is key to getting more accurate statistics.
How the wordcount filter works
The AnQiCMS wordcount filter is designed to count the "words" in a string simply and efficiently. According to its description, it mainly uses spaces to separate words. This means that, by default, any run of characters with no space in it counts as a single word: "Hello world" is counted as 2 words, while "HelloWorld" is counted as 1. Chinese text usually has no spaces between words, so "你好世界" is also counted as 1 word.
The problem arises when punctuation is attached directly to a word, as in "Hello!", "世界。", or "(AnQiCMS)": wordcount treats each of these as a single word, punctuation included.
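The space-based rule described above can be sketched in Go. This is a minimal illustration, not the AnQiCMS source: countWords is a hypothetical stand-in that assumes the filter simply splits on runs of whitespace, as the filter's description suggests.

```go
package main

import (
	"fmt"
	"strings"
)

// countWords is a hypothetical stand-in for the wordcount filter.
// Assumption: the filter splits on runs of whitespace and counts the pieces.
func countWords(s string) int {
	return len(strings.Fields(s))
}

func main() {
	samples := []string{
		"Hello world", // two space-separated tokens
		"HelloWorld",  // no space, one token
		"你好世界",        // no space, one token
		"Hello!",      // punctuation stays attached to the word
		"(AnQiCMS)",   // same here
	}
	for _, s := range samples {
		fmt.Printf("%q -> %d\n", s, countWords(s))
	}
}
```

Under that assumption, only whitespace ever creates a word boundary, which is why attached punctuation never adds to the count.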
How punctuation affects the count, and how to handle it
To get a word count closer to what we expect, we need to preprocess the punctuation in the string before applying the wordcount filter. AnQiCMS provides several flexible filters that can help clean up unwanted punctuation.
1. Use the replace filter for precise replacement
The replace filter is the main tool for this kind of problem. It replaces a specified "old keyword" in a string with a "new keyword". We can use it to replace common punctuation marks with an empty string, which removes them from the text.
Suppose a text contains commas, periods, exclamation marks, question marks, parentheses, and other punctuation that we want to ignore when counting words. The replace filter has to be applied separately for each punctuation mark to be replaced. For example:
{% set text_with_punctuation = "你好,世界!AnQiCMS (内容管理系统) 真不错。" %}
{# Raw count: punctuation may be counted as part of a word #}
<p>Original word count: {{ text_with_punctuation | wordcount }}</p>
{# Strip punctuation first, then count #}
{% set cleaned_text = text_with_punctuation | replace:",," | replace:"!," | replace:"。," | replace:"(," | replace:")," | replace:"?," | replace:"!," %}
<p>Word count after cleanup: {{ cleaned_text | wordcount }}</p>
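In the example above, a chain of replace filters swaps Chinese commas, exclamation marks, periods, and parentheses, plus English question marks and exclamation marks, for empty strings, so that wordcount runs on a cleaner text. Keep in mind that, because wordcount separates words by spaces, deleting punctuation that is attached to a word does not change the count by itself; the count changes only when a punctuation mark stands alone between spaces. If punctuation should act as a word boundary, replace it with a space instead of an empty string.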
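To see how the replacement target matters, here is a small Go sketch. It assumes, as above, that counting is a plain whitespace split; countWords, replaceAll, marks, and sampleText are illustrative names, not part of AnQiCMS.

```go
package main

import (
	"fmt"
	"strings"
)

// sampleText is an illustrative sentence with full-width punctuation.
const sampleText = "你好,世界!AnQiCMS (内容管理系统) 真不错。"

// marks is an illustrative set of punctuation to clean.
var marks = []string{",", "!", "。", "(", ")", "?", "!"}

// replaceAll swaps every mark in s for the given replacement, one mark at a
// time, which mirrors a chain of replace filters.
func replaceAll(s, with string) string {
	for _, m := range marks {
		s = strings.ReplaceAll(s, m, with)
	}
	return s
}

// countWords assumes a whitespace split, as the filter description suggests.
func countWords(s string) int { return len(strings.Fields(s)) }

func main() {
	fmt.Println("raw:    ", countWords(sampleText))
	fmt.Println("deleted:", countWords(replaceAll(sampleText, "")))
	fmt.Println("spaced: ", countWords(replaceAll(sampleText, " ")))
}
```

In this sketch the raw and deleted counts are both 3, while the spaced count is 5, because only the space replacement introduces new word boundaries.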
2. Consider the cut filter (for individual characters)
The replace filter is powerful, but if the punctuation marks you need to remove are all single characters and there are many kinds of them, you can also consider the cut filter. The cut filter removes the specified characters from any position in a string. Unlike replace, which substitutes one "old keyword" per call, cut removes one or more characters. When dealing with several different punctuation marks its usage is similar to replace: the calls still have to be chained, or you would write a custom function that handles all target characters (if the system supports it). In AnQiCMS's current filter system, though, the replace expression may be clearer and more direct.
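As a rough idea of what a custom function that handles all target characters could look like, the following Go sketch drops every rune in Unicode's punctuation category in a single pass. stripPunct is a hypothetical helper; whether and how AnQiCMS lets you register such a function as a filter is not covered here and would need to be checked.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripPunct is a hypothetical helper that removes every rune classified as
// punctuation by Unicode (category P), including full-width marks.
func stripPunct(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsPunct(r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	fmt.Println(stripPunct("Hello, world! (AnQiCMS)"))
	fmt.Println(stripPunct("你好,世界!真不错。"))
}
```

One design caveat: unicode.IsPunct covers punctuation only, so symbols such as $ or + (Unicode category S) are left in place.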
3. Combine the trim filter to handle leading and trailing whitespace
The wordcount filter normally copes with whitespace at both ends of the string on its own, but in more complex text-processing flows it can help to apply the trim filter first. It removes whitespace (spaces, newlines, and so on) from both ends of the string, so that the text is in the best possible state when it enters the later cleanup and counting steps. This only targets whitespace, not punctuation attached to words, so replace remains the core solution.
{% set text_raw = " Hello, world! " %}
{# Remove leading and trailing whitespace first #}
{% set trimmed_text = text_raw | trim %}
{# Then strip punctuation and count #}
{% set final_count = trimmed_text | replace:",," | replace:"!," | wordcount %}
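The sketch below illustrates why trimming is mostly a tidiness step for counting purposes. It assumes a whitespace-split count; countWords and countTrimmed are hypothetical names used only for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// countWords assumes a whitespace split, as the filter description suggests.
func countWords(s string) int { return len(strings.Fields(s)) }

// countTrimmed trims both ends first, mirroring `trim | wordcount`.
func countTrimmed(s string) int { return countWords(strings.TrimSpace(s)) }

func main() {
	raw := " Hello, world! "
	// Leading and trailing whitespace never forms a token, so both counts match.
	fmt.Println(countWords(raw), countTrimmed(raw))
}
```

Under this assumption, the two numbers are identical, so trim matters mainly when the cleaned string is also displayed or passed to other filters.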
Practical considerations and best practices
In practice, the set of punctuation marks to clean depends on the site's actual content and on how strictly a "word" needs to be defined. For example:
- English content: usually commas (,), periods (.), exclamation marks (!), question marks (?), colons (:), semicolons (;), quotation marks (' and "), parentheses (()), and square brackets ([]) need to be removed.
- Chinese content: the full-width comma (,), period (。), exclamation mark (!), question mark (?), colon (:), semicolon (;), quotation marks (‘’ and “”), parentheses (()), and so on.
To get the most accurate statistics, it is recommended that you:
- Determine the list of punctuation marks to exclude: list every punctuation mark that should not be counted, based on your content type and counting requirements.
- Create a cleaned text variable: use the set tag to create a new variable, and chain replace filter calls to remove all target punctuation step by step.
- Apply wordcount to the cleaned text: apply the wordcount filter to the cleaned text variable to get a more accurate result.
- Test and verify: always test with real content fragments, and compare the counts before and after cleanup to make sure they match expectations.
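The steps above can be prototyped outside the template before committing to a replace chain. The Go sketch below is one possible way to do that, under the same whitespace-split assumption; the punctuation list, the sample type, and the expected counts are all illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// cleaner replaces an illustrative list of punctuation marks with spaces.
var cleaner = strings.NewReplacer(
	",", " ", ".", " ", "!", " ", "?", " ", "(", " ", ")", " ",
)

// sample pairs a content fragment with the word count we expect after cleanup.
type sample struct {
	text string
	want int
}

// mismatches returns one message per sample whose cleaned count differs from
// the expected value; an empty result means every sample passed.
func mismatches(samples []sample) []string {
	var out []string
	for _, s := range samples {
		got := len(strings.Fields(cleaner.Replace(s.text)))
		if got != s.want {
			out = append(out, fmt.Sprintf("%q: got %d, want %d", s.text, got, s.want))
		}
	}
	return out
}

func main() {
	samples := []sample{
		{"Hello, world!", 2},
		{"AnQiCMS (content management system)", 4},
		{"Is it fast?Yes.", 4},
	}
	fmt.Println(len(mismatches(samples)), "mismatches")
}
```

Once the list of marks produces the expected counts on real fragments, it can be translated into the corresponding chain of replace filters.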
With these preprocessing steps, you can make better use of the AnQiCMS wordcount filter and give your content operations more accurate data.
Common Questions (FAQ)
1. Why is "AnQiCMS!" in my content counted by wordcount as 1 word, rather than as 2 words, "AnQiCMS" and "!"?
The AnQiCMS wordcount filter mainly uses spaces to separate words. When there is no space between a punctuation mark (such as the exclamation mark !) and the text (such as AnQiCMS), wordcount counts them as a whole, so "AnQiCMS!" is recognized as one complete "word" rather than two. If you want punctuation to be separated from the text and counted separately, use the replace filter before wordcount to remove the punctuation or replace it with spaces.
2. Can I remove all common punctuation marks with one filter?
In AnQiCMS's current template filters, the replace filter can only replace one specified "old keyword" with a "new keyword" per call. This means you cannot use a character class (such as [.,!?]) to match and remove several punctuation marks at once. To remove several different punctuation marks, chain multiple replace filters, each handling one specific punctuation mark or sequence of marks. For example: {{ content | replace:",," | replace:".," | replace:"!," }}.
3. What are the wordcount filter's counting rules for Chinese characters?
For Chinese characters, the wordcount filter follows the same "separated by spaces" principle. Chinese text with no spaces, such as "安企CMS内容管理系统", is counted as 1 word. If the Chinese content contains a space, for example "安企CMS 内容管理系统", it is counted as 2 words. As with English words, Chinese characters that are directly attached to punctuation (such as "安企CMS!") are treated as a whole. For a finer-grained Chinese word count, you may need an external word-segmentation tool, or at least remove punctuation before applying wordcount.
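If a rough length measure for Chinese text is enough, counting Han characters is a simple alternative to word segmentation. The Go sketch below does exactly that. It is a character count, not a word count, and countHan is a hypothetical helper rather than an AnQiCMS filter.

```go
package main

import (
	"fmt"
	"unicode"
)

// countHan counts the runes that belong to the Han script. Latin letters,
// digits, spaces, and punctuation (including full-width marks) are ignored.
func countHan(s string) int {
	n := 0
	for _, r := range s {
		if unicode.Is(unicode.Han, r) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countHan("安企CMS内容管理系统")) // Han characters only, "CMS" is skipped
	fmt.Println(countHan("安企CMS!"))
}
```

For the two samples this yields 8 and 2 respectively, independent of spacing and punctuation.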