How to safely truncate article content that contains HTML tags without breaking its structure

In website content management, we often need to show part of an article in article lists, homepage summaries, or related-recommendation areas to encourage readers to click through. However, truncating content that contains HTML tags by a raw character or word count can easily break its HTML structure, which garbles the page display and can affect the overall layout and user experience.

AnQiCMS addresses this problem with built-in template filters that truncate HTML content safely, so truncated summaries do not break the page. The sections below explain how to safely truncate article content containing HTML tags in AnQiCMS.

Why does naive truncation break the HTML structure?

Suppose the content of an article looks like this:

<p>This text contains <strong>important information</strong>.</p>
<div class="image-box">
  <img src="/path/to/image.jpg" alt="Image description">
</div>
<p>More great content.</p>

If you simply truncate the string, for example to the first 50 characters, the cut may fall inside the <strong> element, or after <div class="image-box"> but before its closing </div>. The code that reaches the page could then be <p>This text contains <strong>important informatio, or could end with an unclosed <div class="image-box">. Such incomplete HTML causes the browser to render incorrectly: at best some styling is lost, and at worst the page layout breaks and other elements stop displaying properly.
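The effect of a naive cut can be checked with a small script. The following is a minimal Python sketch, not AnQiCMS code: the sample string is an English stand-in for the first paragraph above, and the helper names OpenTagTracker and unclosed_tags are hypothetical, introduced only for illustration.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, enough for this sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class OpenTagTracker(HTMLParser):
    """Records which tags are still open after parsing a fragment."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Drop the most recent matching tag and anything opened after it.
            last = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[last:]


def unclosed_tags(fragment):
    """Return the tags that a fragment opens but never closes."""
    tracker = OpenTagTracker()
    tracker.feed(fragment)
    tracker.close()
    return tracker.open_tags


sample = "<p>This text contains <strong>important information</strong>.</p>"
naive = sample[:50]  # plain string slicing, unaware of tags

print(naive)                  # <p>This text contains <strong>important informatio
print(unclosed_tags(naive))   # ['p', 'strong']
print(unclosed_tags(sample))  # []
```

The sliced fragment leaves both <p> and <strong> open, which is exactly the kind of output that breaks the surrounding page.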

The AnQiCMS solution: HTML-safe truncation filters

AnQiCMS uses a template engine syntax similar to Django's and provides a rich set of filters for processing content. For safely truncating HTML content, it has two built-in filters: truncatechars_html and truncatewords_html. Their core advantage is that they recognize HTML tags while truncating and close any tags left open, so the output remains structurally complete.
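The tag-closing idea behind these filters can be sketched roughly as follows. This is a minimal Python illustration of the general technique, not AnQiCMS's actual implementation. The names HtmlTruncator and truncate_html are hypothetical, and details such as counting only visible text characters and appending "..." at the cut are assumptions of the sketch.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, enough for this sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class HtmlTruncator(HTMLParser):
    """Copies HTML to the output until `limit` visible characters are reached."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit = limit   # maximum number of visible text characters
        self.count = 0       # visible characters emitted so far
        self.out = []        # output pieces
        self.stack = []      # tags currently open in the output
        self.done = False    # True once the limit has been hit

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())  # keep the tag as written
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.done:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.done:
            return
        remaining = self.limit - self.count
        if len(data) > remaining:
            # Cut inside this text node, then stop copying anything further.
            self.out.append(escape(data[:remaining], quote=False) + "...")
            self.count = self.limit
            self.done = True
        else:
            self.out.append(escape(data, quote=False))
            self.count += len(data)

    def result(self):
        # Close whatever is still open, innermost tag first.
        closing = "".join(f"</{tag}>" for tag in reversed(self.stack))
        return "".join(self.out) + closing


def truncate_html(html, limit):
    truncator = HtmlTruncator(limit)
    truncator.feed(html)
    truncator.close()
    return truncator.result()


sample = "<p>This text contains <strong>important information</strong>.</p>"
print(truncate_html(sample, 28))
# <p>This text contains <strong>important...</strong></p>
```

Because tags are never counted toward the limit and every open tag is closed at the end, the truncated fragment stays well-formed no matter where the cut falls in the text.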

1. Safe truncation by character count: truncatechars_html

If you need precise control over the length, for example a summary that must stay within 100 characters even if that cuts a word in half, truncatechars_html is the right choice. It truncates at the number of characters you specify and ensures that every HTML tag still open at the truncation point is properly closed.

Usage example: assuming the article content is stored in the item.Content variable and you want the first 120 characters as a summary.

{{ item.Content|truncatechars_html:120|safe }}

Here, 120 is the number of characters to keep. The |safe filter is crucial: it tells the template engine that the output is safe HTML that should not be automatically escaped, so the browser can parse the truncated HTML structure normally.
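To see what automatic escaping does to a fragment when |safe is left out, here is a small Python illustration using the standard library's html.escape. It only mimics the effect; the template engine's own escaping routine may differ in detail.

```python
from html import escape

fragment = "<p>Summary text</p>"

# What auto-escaping produces: the browser shows the tags as literal text.
escaped = escape(fragment)
print(escaped)   # &lt;p&gt;Summary text&lt;/p&gt;

# What |safe amounts to: the fragment is emitted unchanged and parsed as HTML.
print(fragment)  # <p>Summary text</p>
```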

2. Safe truncation by word count: truncatewords_html

If semantic integrity matters more and you want to avoid cutting words in half, even though the actual character count will vary somewhat, truncatewords_html is the better choice. It truncates at the number of words you specify and handles the closing of HTML tags in the same way.

Usage example: if you want the article summary to show about 30 words.

{{ item.Content|truncatewords_html:30|safe }}

Here, 30 is the number of words to keep. As before, the |safe filter is required.

Practical example

In AnQiCMS templates, you typically apply these filters while looping over an article list (for example, with archiveList). A common pattern is this: if the article has a separate summary field (such as item.Description), use it first; if the summary is empty, safely truncate part of the body in item.Content instead.

{# Assume we are looping over a list of articles #}
{% archiveList archives with type="list" limit="10" %}
    {% for item in archives %}
    <article class="article-item">
        <h2><a href="{{item.Link}}">{{item.Title}}</a></h2>
        <div class="article-meta">
            <span>Published: {{stampToDate(item.CreatedTime, "2006-01-02")}}</span>
            <span>Category: <a href="{% categoryDetail with name='Link' id=item.CategoryId %}">{% categoryDetail with name='Title' id=item.CategoryId %}</a></span>
        </div>
        <div class="article-summary">
            {# Prefer the article's description field; if it is empty, safely truncate the content to 180 characters #}
            <p>{{ item.Description|default:item.Content|truncatechars_html:180|safe }}</p>
        </div>
        <a href="{{item.Link}}" class="read-more">Read more &gt;&gt;</a>
    </article>
    {% endfor %}
{% endarchiveList %}

In this example, item.Description|default:item.Content means: if item.Description has a value, use it; otherwise, use item.Content. Whichever value is chosen, summary or body, is then passed through truncatechars_html:180|safe for safe truncation and output. This keeps content configuration flexible while keeping the page structure stable.

Points to note

  • The |safe filter cannot be omitted: this is the most important point. The output of truncatechars_html and truncatewords_html is still a string containing HTML tags. For the browser to parse those tags instead of displaying them as plain text, you must add the |safe filter at output time. Otherwise, you may see escaped tags such as &lt;p&gt; on the page.
  • Prefer a separate summary field: the best practice in content operations is to fill in a dedicated summary or description field (item.Description) when publishing an article. This keeps the summary accurate and appealing, and it spares the template engine from parsing and truncating HTML on every page load. Use the truncatechars_html and truncatewords_html filters as a fallback when no separate summary exists.
  • Choose the truncation length carefully: set the character or word limit according to your site design and page layout. Too short a limit may not convey enough information, and too long a limit defeats the purpose of a summary.

With the template filters that AnQiCMS provides, you can safely truncate article content containing HTML tags, which keeps list and summary pages well-formed, improves the reading experience, and makes site content easier to manage.


Frequently Asked Questions (FAQ)

Q1: What is the essential difference between truncatechars_html and truncatechars?

A1: Both truncatechars_html and truncatechars truncate a string by character count, but they treat HTML content very differently. truncatechars performs plain string truncation and does not care whether the content contains HTML tags; if the truncation point falls inside a tag, it breaks the HTML structure. truncatechars_html is HTML-aware: after truncating, it closes every tag left open, so the resulting HTML is complete and valid. When processing article content that contains HTML, always use truncatechars_html.

Q2: Why is the content still displayed incorrectly, with the HTML tags visible, after I use truncatechars_html?

A2: This usually happens because the |safe filter was left out when outputting the truncated content. For security reasons, AnQiCMS (like most modern template engines) escapes all output by default to prevent XSS attacks. The output of truncatechars_html is still a string containing HTML tags. Without |safe, those tags (such as <p>) are escaped into &lt;p&gt; and shown as plain text on the page. Adding |safe tells the template engine that the content is safe HTML and can be output directly without escaping.

Q3: Can I set a global summary truncation length directly in the AnQiCMS backend?

A3: AnQiCMS provides custom fields for content models and encourages you to give articles a separate "summary" or "description" field that you fill in manually when publishing. This gives you the most control over summary quality. If you have no separate summary field and need to extract one automatically from the article content, AnQiCMS favors the template-level truncatechars_html and truncatewords_html filters for controlling the truncation length. This means you set and apply the truncation logic yourself in the template files, according to the needs of each area (such as the homepage, category pages, and search results pages), rather than through one global backend setting. This approach offers more flexibility and room for customization.