Sensitive word filtering is a content review technique used by websites, applications, and platforms to prevent users from posting content that is inappropriate, illegal, or in violation of policy. When operating a website, we often have to worry about users posting content that contains sensitive words. Such content can get the site reported by other users, suspended by the hosting provider, or lead to the operator being summoned and fined by the relevant authorities. To prevent this, we need to filter sensitive words.

Implementing sensitive word filtering involves several steps, covering both technical implementation and policy decisions. The following uses the sensitive word filtering design of AnQiCMS as an example.

Define sensitive word library

A sensitive word library usually includes words related to pornography, political issues, reactionary or violent activities, and terms restricted by advertising law. Depending on the nature of the website, we can use a subset of these categories or the entire vocabulary. Such word lists can generally be downloaded from the internet or collected manually.

For building the library, AnQiCMS offers two modes: manual collection and system synchronization. By default the system ships with no sensitive words. You can synchronize the preset sensitive word library from the official website, or manually add custom sensitive words.
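
The two sources can be pictured as one merged list. The following is a minimal sketch, not the actual AnQiCMS implementation: it assumes the synchronized preset list and the custom list are simply combined, and `MergeWordLists` with its parameters is a hypothetical name. It trims whitespace, skips empty entries, and removes duplicates while keeping custom words first.

```go
package main

import "strings"

// MergeWordLists combines manually added words with a synchronized preset
// list. The function and parameter names are hypothetical (illustration only).
// Entries are trimmed, empty entries are dropped, and duplicates are removed
// while keeping first-seen order, with custom words ahead of preset words.
func MergeWordLists(custom, preset []string) []string {
	seen := make(map[string]struct{}, len(custom)+len(preset))
	merged := make([]string, 0, len(custom)+len(preset))
	for _, list := range [][]string{custom, preset} {
		for _, w := range list {
			w = strings.TrimSpace(w)
			if w == "" {
				continue
			}
			if _, ok := seen[w]; ok {
				continue
			}
			seen[w] = struct{}{}
			merged = append(merged, w)
		}
	}
	return merged
}
```

Deduplicating up front keeps the later replacement loop from scanning the content twice for the same word.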


Algorithm design for sensitive word filtering

For everyday corporate websites, AI techniques such as context analysis and semantic analysis are unnecessary. For simplicity, we can use the most common approach, keyword matching, and add regular expression matching to cover fuzzier rules.

AnQiCMS combines keyword matching with regular expression fuzzy matching. Because AnQiCMS is developed in Go, the replacement code below is written in Go:

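import (
	"bytes"
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

func ReplaceSensitiveWords(content []byte, sensitiveWords []string) []byte {
	// If the sensitive word library or the content is empty, return the content unchanged.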
	if len(sensitiveWords) == 0 || len(content) == 0 {
		return content
	}
	// Define a struct that records each placeholder and the HTML tag it replaced
	type replaceType struct {
		Key   []byte
		Value []byte
	}
	var replacedMatch []*replaceType
	numCount := 0
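	// Swap every HTML tag (including its attributes) for a placeholder so that
	// attribute values are never replaced with asterisks, which would break the page.
	reg := regexp.MustCompile(`(?i)<!?/?[a-z0-9-]+(\s+[^>]+)?>`)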
	content = reg.ReplaceAllFunc(content, func(s []byte) []byte {
		key := []byte(fmt.Sprintf("{$%d}", numCount))
		replacedMatch = append(replacedMatch, &replaceType{
			Key:   key,
			Value: s,
		})
		numCount++

		return key
	})
	// Replace every sensitive word with asterisks
	for _, word := range sensitiveWords {
		if len(word) == 0 {
			continue
		}
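		if bytes.Contains(content, []byte(word)) {
			// Literal match: mask with one asterisk per character of the word.
			content = bytes.ReplaceAll(content, []byte(word), bytes.Repeat([]byte("*"), utf8.RuneCountInString(word)))
		} else if strings.HasPrefix(word, "{") && strings.HasSuffix(word, "}") && len(word) > 2 {
			// Entries wrapped in braces are regular expressions, e.g. {[1-9]\d{4,10}}.
			// Strip the outer braces and compile the rest; invalid patterns are skipped.
			re, err := regexp.Compile(word[1 : len(word)-1])
			if err == nil {
				// Mask each match with one asterisk per character of the matched
				// text, rather than the length of the pattern itself.
				content = re.ReplaceAllFunc(content, func(m []byte) []byte {
					return bytes.Repeat([]byte("*"), utf8.RuneCount(m))
				})
			}
		}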
	}
	// Restore the HTML tags that were swapped for placeholders above
	for i := len(replacedMatch) - 1; i >= 0; i-- {
		content = bytes.Replace(content, replacedMatch[i].Key, replacedMatch[i].Value, 1)
	}

	return content
}
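
The function above compiles its regular expressions on every call. One possible refinement is to prepare the word list once and reuse it. The following is a sketch under that assumption and is not part of AnQiCMS: `Filter`, `NewFilter`, `Mask`, and `stars` are hypothetical names, brace-wrapped entries are always treated as patterns, and HTML tag protection is omitted for brevity.

```go
package main

import (
	"bytes"
	"regexp"
	"strings"
	"unicode/utf8"
)

// Filter holds a word list that has been prepared once for repeated use.
// All names in this block are hypothetical (illustration only).
type Filter struct {
	literals [][]byte         // plain keywords
	patterns []*regexp.Regexp // entries written as {regex}
}

// NewFilter splits the word list into literal keywords and regular
// expressions (entries wrapped in braces). Invalid patterns are skipped.
func NewFilter(words []string) *Filter {
	f := &Filter{}
	for _, w := range words {
		if w == "" {
			continue
		}
		if len(w) > 2 && strings.HasPrefix(w, "{") && strings.HasSuffix(w, "}") {
			if re, err := regexp.Compile(w[1 : len(w)-1]); err == nil {
				f.patterns = append(f.patterns, re)
			}
			continue
		}
		f.literals = append(f.literals, []byte(w))
	}
	return f
}

// stars returns one asterisk per character (rune) of b.
func stars(b []byte) []byte {
	return bytes.Repeat([]byte("*"), utf8.RuneCount(b))
}

// Mask replaces every literal keyword and every pattern match with asterisks.
func (f *Filter) Mask(content []byte) []byte {
	for _, lit := range f.literals {
		content = bytes.ReplaceAll(content, lit, stars(lit))
	}
	for _, re := range f.patterns {
		content = re.ReplaceAllFunc(content, stars)
	}
	return content
}
```

A filter built this way would need to be rebuilt whenever the word library changes, which is the price of skipping the per-call compilation.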


The timing of sensitive word replacement

Sensitive word replacement can be performed at the following times:

Real-time filtering on submission: The system will automatically detect and filter sensitive words when the user submits content.
Bulk filtering: The system scans the content in the database at regular intervals and filters out sensitive words.
Filtering while displaying: When content is displayed, the system automatically detects and filters sensitive words.

AnQiCMS mainly uses the third option: sensitive words are filtered automatically when the page is rendered. This accounts for content arriving from different input sources and for the sensitive word library being updated over time. If filtering happened only at submission, sensitive words added later would not apply to content that was already stored, and bulk filtering leaves a window between scans during which newly added words are not yet applied. Filtering at display time is therefore the most thorough option, although it sacrifices some performance.
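
The difference between the timings can be illustrated with a small sketch: content is stored unmodified and masked only when it is read for display, so a word added to the library later still affects content submitted earlier. `WordStore`, `Add`, and `Display` are hypothetical names, and only literal keywords are handled here.

```go
package main

import (
	"strings"
	"sync"
	"unicode/utf8"
)

// WordStore is a hypothetical, concurrency-safe sensitive word library.
type WordStore struct {
	mu    sync.RWMutex
	words []string
}

// Add appends a word to the library at runtime. Empty words are ignored.
func (s *WordStore) Add(word string) {
	if word == "" {
		return
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.words = append(s.words, word)
}

// Display masks content at read time using the current state of the library.
// The caller's stored content is never modified.
func (s *WordStore) Display(raw string) string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for _, w := range s.words {
		raw = strings.ReplaceAll(raw, w, strings.Repeat("*", utf8.RuneCountInString(w)))
	}
	return raw
}
```

With submission-time filtering, the stored text would have been fixed before `Add` was called, and the new word would never be applied to it.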

To filter sensitive words at display time, AnQiCMS has rewritten the ExecuteWriter output function. The code is as follows:

func (s *DjangoEngine) ExecuteWriter(w io.Writer, filename string, _ string, bindingData interface{}) error {
	// In debug mode, re-parse the templates on every render.
	if s.reload {
		if err := s.LoadStart(true); err != nil {
			return err
		}
	}
	// The writer is expected to be the Iris request context; this assertion panics otherwise.
	ctx := w.(iris.Context)
	currentSite := provider.CurrentSite(ctx)
	if tmpl := s.fromCache(currentSite.Id, filename); tmpl != nil {
		data, err := tmpl.ExecuteBytes(getPongoContext(bindingData))
		if err != nil {
			return err
		}
		// Apply sensitive word replacement to the rendered output.
		data = currentSite.ReplaceSensitiveWords(data)
		_, err = w.Write(data)
		return err
	}
	// If the template does not exist, return an error
	return view2.ErrNotExist{Name: filename, IsLayout: false, Data: bindingData}
}


The above covers the approach to and practice of sensitive word filtering. In actual use, we should optimize and adjust it according to real needs. On top of automatic machine filtering, add manual review and regular inspections for some content, especially content that is prone to ambiguity or requires deeper semantic analysis.
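
One way to support manual review is to report which words matched instead of masking them silently, so that flagged content can be queued for a human reviewer. The following is a minimal sketch under that assumption. `FindSensitiveWords` is a hypothetical helper, and it handles literal keywords only.

```go
package main

import "strings"

// FindSensitiveWords returns the words from the list that occur in content,
// in list order, so a reviewer can see why the content was flagged.
// It returns nil when nothing matches. The name is hypothetical.
func FindSensitiveWords(content string, words []string) []string {
	var hits []string
	for _, w := range words {
		if w != "" && strings.Contains(content, w) {
			hits = append(hits, w)
		}
	}
	return hits
}
```

Content with a non-empty result could be held for review, while content with no matches is published directly.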

Sensitive word filtering is a complex and dynamic process. It requires efficient technical means as well as flexible strategies that adapt to a constantly changing language environment and policy requirements. I hope this article helps.