How do you truncate HTML content rendered from Markdown without breaking the tag structure?

In content operations, we often run into this need: an article list page or a special topic page has to display a summary of each article. These articles are usually written in a Markdown editor, and their content may include images, links, bold text, and other rich HTML structures. If we simply truncate the HTML string rendered from Markdown, we often break the original tag structure. The result can be a chaotic page layout or even unclosed tags, which seriously harms the user experience.

AnQi CMS is an efficient and flexible content management system designed with this kind of display challenge in mind. Through its template engine and built-in filters, it provides a clean solution that keeps the tag structure intact when truncating HTML rendered from Markdown.

Understanding Markdown content rendering in AnQi CMS

First, we need to understand how AnQi CMS handles Markdown content. When we write an article in the Markdown editor in the admin backend, the system stores the Markdown text. When that content has to be displayed on a front-end page, especially a full-content page such as the document detail page, the Markdown text is usually rendered into HTML.

In AnQi CMS templates, we can use the archiveDetail tag to get the various fields of an article, including Content, the field that holds the Markdown editor's content. The Content field has a very practical render parameter: when we set render=true, the system automatically converts the stored Markdown text into standard HTML. If rendering is not required, set render=false, or omit the parameter when the Markdown editor is turned off; in that case the Content field outputs the original Markdown text.

For example, to get the rendered article content, we can write:
{% archiveDetail articleContent with name="Content" render=true %}

Note that when the rendered HTML is output in a template, it must also be paired with the |safe filter; otherwise the template engine escapes it and the browser displays the markup as plain text. Automatic escaping is a common security practice in web development, and |safe tells the template engine that this content is safe HTML that can be output directly.

Core Strategy: Truncating HTML Content Intelligently

Now we have the rendered HTML, but problems arise if we truncate the HTML string directly by characters or words. Take a fragment such as <p>This is an <strong>important</strong> passage.</p>: if we cut in the middle of the word "important", the <strong> tag is never closed. The browser will try to repair the markup, but the result is often unpredictable and leads to layout errors.
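The failure mode is easy to reproduce outside the CMS. The following Python sketch is purely illustrative (it is not AnQi CMS code; the sample fragment and the naive_truncate helper are hypothetical) and shows what a plain string slice does to nested tags:

```python
# Hypothetical sample fragment; any HTML with nested inline tags behaves the same way.
html_fragment = "<p>This is an <strong>important</strong> passage.</p>"


def naive_truncate(html: str, limit: int) -> str:
    """Cut the raw string at `limit` characters, ignoring the tag structure."""
    return html[:limit] + "..."


broken = naive_truncate(html_fragment, 28)
print(broken)
# <p>This is an <strong>import...
```

Both the <strong> and the <p> element are left open, so everything that follows the summary on the page may inherit the bold style or end up nested inside the paragraph.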

To solve this problem, AnQi CMS has two built-in truncation filters designed for HTML content: truncatechars_html and truncatewords_html.

truncatechars_html:number: This filter truncates HTML content to the specified number of characters and closes any HTML tags left unclosed by the cut. It ensures that the truncated HTML is still a valid, structurally sound fragment, and it adds an ellipsis ("…") at the truncation position.
truncatewords_html:number: Similar to truncatechars_html, but it truncates HTML content by the specified number of words. It also closes HTML tags and adds an ellipsis.

These two filters are crucial for extracting HTML content without destroying the tag structure.
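To make the tag-closing behavior concrete, here is a minimal Python sketch of HTML-aware character truncation. It is not the AnQi CMS implementation, and the names TruncatingParser and truncate_html are hypothetical. The sketch assumes well-formed input, counts only visible text toward the limit, and appends a three-dot ellipsis; the real filter may differ in these details.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TruncatingParser(HTMLParser):
    """Copies HTML to `out` until `limit` visible characters have been emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit = limit
        self.count = 0        # visible characters emitted so far
        self.out = []         # output fragments
        self.open_tags = []   # stack of elements that are currently open
        self.done = False     # set once the limit has been reached

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close any elements still open inside the one being closed.
        while True:
            current = self.open_tags.pop()
            self.out.append(f"</{current}>")
            if current == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        remaining = self.limit - self.count
        if len(data) > remaining:
            # Text arrives unescaped (convert_charrefs=True), so re-escape it.
            self.out.append(escape(data[:remaining], quote=False) + "...")
            self.done = True
        else:
            self.out.append(escape(data, quote=False))
            self.count += len(data)


def truncate_html(html, limit):
    """Truncate `html` to `limit` visible characters and close open tags."""
    parser = TruncatingParser(limit)
    parser.feed(html)
    parser.close()
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


print(truncate_html("<p>This is an <strong>important</strong> passage.</p>", 17))
# <p>This is an <strong>import...</strong></p>
```

The key design choice is the stack of open elements: whatever is still on the stack when the limit is reached gets closed in reverse order, which is what keeps the fragment well-formed.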

Practical Example: Truncating HTML Content Rendered from Markdown

Suppose we are building an article list page where each article needs to display a summary of about 150 characters, and Markdown styling such as bold and italics should be preserved.

In the template file, we can write:

{# Assume we are iterating over an article list; item is the current article object #}
{% for item in archives %}
    <div class="article-summary">
        <h3><a href="{{ item.Link }}">{{ item.Title }}</a></h3>
        <div class="summary-content">
            {# First fetch the Markdown content and render it to HTML #}
            {%- archiveDetail fullContent with name="Content" id=item.Id render=true %}
            {# Truncate the rendered HTML by characters and output it as safe HTML #}
            {{ fullContent|truncatechars_html:150|safe }}
        </div>
        <a href="{{ item.Link }}" class="read-more">Read more &gt;</a>
    </div>
{% endfor %}

In the code above:

First, {% archiveDetail fullContent with name="Content" id=item.Id render=true %} retrieves the Content field of the specified article and forces it to be rendered as HTML. The rendered HTML is assigned to the fullContent variable.
Then we apply the |truncatechars_html:150 filter to the fullContent variable. This filter truncates the HTML content to 150 characters (in Django-style implementations of this filter, the limit applies to the visible text and the tag markup itself is not counted) and, most importantly, it closes any tags that the cut would otherwise leave unclosed.
Finally, we apply the |safe filter so that the truncated HTML summary is parsed and displayed by the browser normally, rather than being output as plain text.

With this approach, the article list page shows a brief summary of each article that preserves the original HTML formatting and avoids the broken tag structure caused by naive truncation, so the page layout stays tidy.

Further Thoughts: Choosing a Truncation Method

Truncate by characters (truncatechars_html): When you have a strict limit on summary length, for example requiring every summary to stay within 100 characters regardless of whether the text is Chinese or English, truncatechars_html is the more precise choice.
Truncate by words (truncatewords_html): If your site content is mainly in English and you want the summary to avoid breaking in the middle of a word, truncatewords_html is more suitable. It cuts at word boundaries, which makes the summary more readable.
Get a plain-text summary (striptags): Sometimes we do not need to keep any HTML styling and just want a plain-text summary. In that case, use the |striptags filter to remove all HTML tags, then truncate the resulting plain text with |truncatechars or |truncatewords. For example: {{ fullContent|striptags|truncatechars:150 }}.
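The plain-text route (strip the tags first, then truncate) can be sketched in Python as follows. This is an illustration of the idea only, not the AnQi CMS filter code; TextExtractor, strip_tags, and truncate_chars are hypothetical names, and the three-dot ellipsis is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of `html` with all tags removed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.parts)


def truncate_chars(text, limit):
    """Cut plain text at `limit` characters and mark the cut with an ellipsis."""
    if len(text) <= limit:
        return text
    return text[:limit] + "..."


plain = strip_tags("<p>This is an <strong>important</strong> passage.</p>")
summary = truncate_chars(plain, 17)
print(summary)
# This is an import...
```

Because no tags survive the first step, a simple slice is safe here; the trade-off is that bold, links, and images are lost from the summary.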

These built-in features of AnQi CMS are a great convenience for content operators. There is no need to clean HTML by hand or wrestle with complex regular expressions: just call the corresponding tags and filters in the template to get a high-quality summary display.

Common Questions (FAQ)

1. How do I get the original Markdown text instead of the rendered HTML?
If you want the original Markdown text on the front-end page rather than the rendered HTML, set the render parameter of the archiveDetail tag to false. For example: {% archiveDetail rawMarkdown with name="Content" render=false %} The rawMarkdown variable then holds the original Markdown text without any conversion.

2. How can I get a plain-text summary without any HTML tags?
If you want the summary to be plain text without any HTML tags, first use the |striptags filter to remove all HTML tags, then truncate by characters or words. For example, to get a 150-character plain-text summary: {% archiveDetail fullContent with name="Content" render=true %} {{ fullContent|striptags|truncatechars:150 }} Here the Markdown is rendered into HTML, the HTML tags are stripped, and finally the plain text is truncated.

3. Why is the |safe filter still needed after truncatechars_html or truncatewords_html?
The AnQi CMS template engine (which is similar to Django's) escapes all output by default to prevent cross-site scripting (XSS) and other security issues. This means that even though truncatechars_html or truncatewords_html has closed the HTML tags and produced a valid HTML fragment, without the |safe filter those tags (such as <p> and </p>) would still be escaped into entities (for example &lt;p&gt; and &lt;/p&gt;), so the browser could not parse and render them as markup. Adding |safe explicitly tells the template engine that this content has been vetted and can be output directly as HTML.
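The effect of automatic escaping can be demonstrated with Python's standard html.escape function. This is only an analogy for what an auto-escaping template engine does to a value that is not marked safe; it is not the AnQi CMS code path, and the sample string is hypothetical.

```python
from html import escape

# A well-formed, already truncated summary fragment (hypothetical sample).
summary_html = "<p>This is an <strong>import...</strong></p>"

# Escaping turns the markup characters into entities, so the browser
# shows the tags as literal text instead of interpreting them.
escaped = escape(summary_html, quote=False)
print(escaped)
# &lt;p&gt;This is an &lt;strong&gt;import...&lt;/strong&gt;&lt;/p&gt;
```

The fragment is structurally valid either way; escaping only changes whether the browser treats it as markup or as text.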
