How to filter sensitive words on a website: ideas and practices for sensitive word filtering

Author: AnqiCMS

Running environment: AnqiCMS 3.0.5 or above

Installation method: Log in to your website's admin backend and install it under function management

Price: Free

Sensitive word filtering is a content moderation technique used by websites, applications, and platforms to stop users from publishing content that is inappropriate, illegal, or against policy. When running a website, we often have to worry that user-submitted content contains sensitive words. Such content can get the site reported by other users, blocked by the server operator, or lead to a summons and fines from the relevant authorities. To prevent this, we need to filter sensitive words.

Implementing sensitive word filtering involves several steps, covering both the technical implementation and the filtering policy. The following uses the sensitive word filtering design of Anqi CMS as an example.

Defining the sensitive word list

A sensitive word list typically covers pornography, politics, subversive or violent content, and terms restricted by advertising law. Depending on what your website is for, you can collect only some of these categories or all of them. The words can usually be downloaded from online sources or collected by hand.

For collecting the word list, Anqi CMS supports two modes: manual entry and system synchronization. By default the system contains no sensitive words. You can synchronize the preset sensitive word list from the official website, or add your own custom sensitive words manually.
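
The two collection modes could be combined along the following lines. This is only a minimal sketch, not the Anqi CMS implementation: the function name MergeLexicon and the idea of passing both lists as slices are assumptions made for illustration.

```go
package main

import "strings"

// MergeLexicon combines a synchronized word list with manually added words.
// The name and signature are hypothetical. Entries are trimmed, empty
// entries are dropped, and duplicates are removed while keeping the order
// in which each word first appears.
func MergeLexicon(synced, custom []string) []string {
	seen := make(map[string]struct{}, len(synced)+len(custom))
	merged := make([]string, 0, len(synced)+len(custom))
	for _, list := range [][]string{synced, custom} {
		for _, word := range list {
			word = strings.TrimSpace(word)
			if word == "" {
				continue
			}
			if _, ok := seen[word]; ok {
				continue
			}
			seen[word] = struct{}{}
			merged = append(merged, word)
		}
	}
	return merged
}
```

De-duplicating at merge time means a word that exists in both the synchronized list and the custom list is only matched once per filtering pass.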

Algorithm design for sensitive word filtering

An ordinary corporate website does not need AI techniques such as context analysis or semantic analysis. For simplicity, we can use the most common and simplest approach, keyword matching. To cover broader rules, we can add fuzzy matching based on regular expressions.

Anqi CMS combines keyword matching with regular-expression fuzzy matching. Because Anqi CMS is developed in Go, the replacement code below is written in Go:

import (
	"bytes"
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

// ReplaceSensitiveWords replaces every sensitive word in content with asterisks.
func ReplaceSensitiveWords(content []byte, sensitiveWords []string) []byte {
	// Return early if the word list or the content is empty.
	if len(sensitiveWords) == 0 || len(content) == 0 {
		return content
	}
	// replaceType pairs a placeholder with the HTML tag it temporarily replaces.
	type replaceType struct {
		Key   []byte
		Value []byte
	}
	var replacedMatch []*replaceType
	numCount := 0
	// Swap every HTML tag, attributes included, for a placeholder so that
	// masking cannot turn part of a tag into * and break the page.
	reg := regexp.MustCompile("(?i)<!?/?[a-z0-9-]+(\\s+[^>]+)?>")
	content = reg.ReplaceAllFunc(content, func(s []byte) []byte {
		key := []byte(fmt.Sprintf("{$%d}", numCount))
		replacedMatch = append(replacedMatch, &replaceType{
			Key:   key,
			Value: s,
		})
		numCount++

		return key
	})
	// Replace every sensitive word with asterisks.
	for _, word := range sensitiveWords {
		if len(word) == 0 {
			continue
		}
		if bytes.Contains(content, []byte(word)) {
			content = bytes.ReplaceAll(content, []byte(word), bytes.Repeat([]byte("*"), utf8.RuneCountInString(word)))
		} else if strings.HasPrefix(word, "{") && strings.HasSuffix(word, "}") && len(word) > 2 {
			// Entries wrapped in braces are regular expressions, e.g. {[1-9]\d{4,10}}.
			// Strip the outer braces before compiling; invalid patterns are skipped.
			re, err := regexp.Compile(word[1 : len(word)-1])
			if err != nil {
				continue
			}
			// Mask each match with one asterisk per character of the matched text.
			content = re.ReplaceAllFunc(content, func(match []byte) []byte {
				return bytes.Repeat([]byte("*"), utf8.RuneCount(match))
			})
		}
	}
	// Restore the HTML tags that were swapped out above.
	for i := len(replacedMatch) - 1; i >= 0; i-- {
		content = bytes.Replace(content, replacedMatch[i].Key, replacedMatch[i].Value, 1)
	}

	return content
}
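
The function above compiles each regular-expression entry again on every call. One possible refinement, shown here only as a sketch, is to split the word list into literal keywords and compiled patterns once and reuse the result. The Filter type, NewFilter, and Mask are hypothetical names, not part of Anqi CMS. Unlike the function above, this sketch always treats a brace-wrapped entry as a regular expression and leaves out HTML tag protection to stay short.

```go
package main

import (
	"bytes"
	"regexp"
	"strings"
	"unicode/utf8"
)

// Filter holds a word list that has been classified and compiled once.
// The type and its methods are hypothetical, for illustration only.
type Filter struct {
	literals [][]byte
	patterns []*regexp.Regexp
}

// NewFilter compiles entries of the form {pattern} as regular expressions
// and keeps every other non-empty entry as a literal keyword. Entries whose
// pattern does not compile are skipped.
func NewFilter(words []string) *Filter {
	f := &Filter{}
	for _, w := range words {
		if strings.HasPrefix(w, "{") && strings.HasSuffix(w, "}") && len(w) > 2 {
			if re, err := regexp.Compile(w[1 : len(w)-1]); err == nil {
				f.patterns = append(f.patterns, re)
			}
			continue
		}
		if w != "" {
			f.literals = append(f.literals, []byte(w))
		}
	}
	return f
}

// stars returns one asterisk per character (rune) of b.
func stars(b []byte) []byte {
	return bytes.Repeat([]byte("*"), utf8.RuneCount(b))
}

// Mask replaces every literal keyword and every pattern match with asterisks.
func (f *Filter) Mask(content []byte) []byte {
	for _, lit := range f.literals {
		content = bytes.ReplaceAll(content, lit, stars(lit))
	}
	for _, re := range f.patterns {
		content = re.ReplaceAllFunc(content, stars)
	}
	return content
}
```

A filter built this way would need to be rebuilt whenever the word list changes, for example after a synchronization or a manual edit.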

The timing of sensitive word replacement

Sensitive words can be replaced at any of the following points:

Real-time filtering on submission: when a user submits content, the system detects and filters sensitive words immediately.
Batch filtering: the system periodically scans the content in the database and filters sensitive words in bulk.
Filtering on display: when content is displayed, the system detects and filters sensitive words on the fly.

Anqi CMS mainly uses the third option: the system filters sensitive words automatically when it renders a page. This accounts for the fact that content comes from different input sources and that the sensitive word list changes over time. If content is filtered only at submission, sensitive words added to the list later never apply to content that is already stored, and batch filtering can leave sensitive words visible until the next scan runs. Filtering at display time is therefore more thorough, although it costs some performance.
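
The reasoning for display-time filtering can be illustrated with a small sketch. It assumes a hypothetical Article type that stores content exactly as submitted and applies whatever word list is current when the content is rendered. It uses plain keyword matching only and is not the Anqi CMS code.

```go
package main

import (
	"strings"
	"unicode/utf8"
)

// Article is a hypothetical record that keeps content exactly as submitted.
type Article struct {
	Raw string
}

// Render masks the raw content using the word list supplied at display
// time, so words added after submission still take effect.
func (a Article) Render(words []string) string {
	out := a.Raw
	for _, w := range words {
		if w == "" {
			continue
		}
		out = strings.ReplaceAll(out, w, strings.Repeat("*", utf8.RuneCountInString(w)))
	}
	return out
}
```

Rendering the same stored article before and after a word is added to the list gives different output without touching the stored data, which is the property that submission-time filtering cannot provide.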

To filter sensitive words at display time, Anqi CMS overrides the ExecuteWriter output function. The code is as follows:

func (s *DjangoEngine) ExecuteWriter(w io.Writer, filename string, _ string, bindingData interface{}) error {
	// In debug mode, re-parse the templates on every render.
	if s.reload {
		if err := s.LoadStart(true); err != nil {
			return err
		}
	}
	ctx := w.(iris.Context)
	currentSite := provider.CurrentSite(ctx)
	if tmpl := s.fromCache(currentSite.Id, filename); tmpl != nil {
		data, err := tmpl.ExecuteBytes(getPongoContext(bindingData))
		if err != nil {
			return err
		}
		// Replace sensitive words in the rendered output before writing it.
		data = currentSite.ReplaceSensitiveWords(data)
		buf := bytes.NewBuffer(data)
		_, err = buf.WriteTo(w)
		return err
	}
	// The template does not exist: return an error.
	return view2.ErrNotExist{Name: filename, IsLayout: false, Data: bindingData}
}

Those are the ideas and practices for filtering sensitive words. In real use, you should tune and adjust them to your actual needs. On top of automatic filtering, add manual review and regular inspections for some content, especially content that is prone to ambiguity or requires deeper semantic analysis.

Sensitive word filtering is a complex, ongoing process that needs both efficient technical means and flexible policies to keep up with changing language and policy requirements. We hope this article helps.
