Sensitive word filtering is a content-review technique used by websites, applications, and platforms to prevent users from posting content that is inappropriate, illegal, or non-compliant with policy. When operating a website, we often have to worry that content published by some users contains sensitive words. Such words can lead to the site being reported by users, banned by the server operator, or even to the operator being summoned by the relevant authorities and fined. To prevent this, we need to filter sensitive words.

Implementing sensitive word filtering involves several steps, covering both the technical implementation and the filtering strategy. The following uses the sensitive word filtering design of AnQiCMS as an example.

Define sensitive word library

A sensitive word library generally includes words related to pornography, political issues, violence, and advertising-law violations. Depending on what our own website defines as sensitive, we can collect only some of these categories or the whole vocabulary. These words can usually be downloaded from the internet or collected manually.




Algorithm design for sensitive word filtering

For an everyday corporate website, AI techniques such as context analysis and semantic analysis are not necessary. For simplicity, we can use the most common and simplest approach, keyword matching. For greater flexibility, we can also add regular expression matching to support fuzzy matches.

AnQiCMS uses a dual mode: keyword matching plus regular-expression fuzzy matching. Because AnQiCMS is developed in Go, the replacement code below is written in Go:

import (
	"bytes"
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

func ReplaceSensitiveWords(content []byte, sensitiveWords []string) []byte {
	// Return early if the sensitive word library or the content is empty.
	if len(sensitiveWords) == 0 || len(content) == 0 {
		return content
	}
	// replaceType records a placeholder (Key) and the HTML tag it stands for (Value).
	type replaceType struct {
		Key   []byte
		Value []byte
	}
	var replacedMatch []*replaceType
	numCount := 0
	// Temporarily swap every HTML tag, including its attributes, for a placeholder.
	// This keeps tag attributes from being replaced with * and breaking the page.
	reg, _ := regexp.Compile("(?i)<!?/?[a-z0-9-]+(\\s+[^>]+)?>")
	content = reg.ReplaceAllFunc(content, func(s []byte) []byte {
		key := []byte(fmt.Sprintf("{$%d}", numCount))
		replacedMatch = append(replacedMatch, &replaceType{
			Key:   key,
			Value: s,
		})
		numCount++

		return key
	})
	// Replace every sensitive word with asterisks.
	for _, word := range sensitiveWords {
		if len(word) == 0 {
			continue
		}
		if bytes.Contains(content, []byte(word)) {
			// Literal match: one asterisk per character of the word.
			content = bytes.ReplaceAll(content, []byte(word), bytes.Repeat([]byte("*"), utf8.RuneCountInString(word)))
		} else if strings.HasPrefix(word, "{") && strings.HasSuffix(word, "}") && len(word) > 2 {
			// Regular expressions are supported: an entry that starts with { and
			// ends with } is treated as a pattern, e.g. {[1-9]\d{4,10}}.
			// Strip the surrounding braces before compiling.
			newWord := word[1 : len(word)-1]
			re, err := regexp.Compile(newWord)
			if err == nil {
				// Mask each match with one asterisk per matched character,
				// rather than using the length of the pattern text.
				content = re.ReplaceAllFunc(content, func(m []byte) []byte {
					return bytes.Repeat([]byte("*"), utf8.RuneCount(m))
				})
			}
		}
	}
	// Restore the HTML tags that were swapped out above.
	for i := len(replacedMatch) - 1; i >= 0; i-- {
		content = bytes.Replace(content, replacedMatch[i].Key, replacedMatch[i].Value, 1)
	}

	return content
}


The timing of sensitive word replacement

Sensitive word replacement can be performed at the following points:

Real-time filtering upon submission: When users submit content, the system automatically detects and filters sensitive words.
Batch filtering: The system scans the database content at regular intervals and filters sensitive words in bulk.
Display filtering: When displaying content, the system automatically detects and filters sensitive words.

AnQiCMS mainly uses the third option: sensitive words are filtered automatically when the page is rendered. This choice accounts for content arriving from different input sources and for the sensitive word library being updated over time. If filtering happens only at submission, words added to the library later never take effect on existing content. Batch filtering can miss content because of its delay: anything published, or any word added, between scans stays unfiltered until the next run. Filtering at display time is therefore the more rigorous approach, although it sacrifices some performance.

To filter sensitive words at display time, AnQiCMS rewrites the ExecuteWriter output function. The code is as follows:
func (s *DjangoEngine) ExecuteWriter(w io.Writer, filename string, _ string, bindingData interface{}) error {
	// If debug mode is enabled, re-parse the templates on every render.
	if s.reload {
		if err := s.LoadStart(true); err != nil {
			return err
		}
	}
	ctx := w.(iris.Context)
	currentSite := provider.CurrentSite(ctx)
	if tmpl := s.fromCache(currentSite.Id, filename); tmpl != nil {
		data, err := tmpl.ExecuteBytes(getPongoContext(bindingData))
		if err != nil {
			return err
		}
		// Replace sensitive words in the rendered data.
		data = currentSite.ReplaceSensitiveWords(data)
		buf := bytes.NewBuffer(data)
		_, err = buf.WriteTo(w)
		return err
	}
	// If the template does not exist, return an error.
	return view2.ErrNotExist{Name: filename, IsLayout: false, Data: bindingData}
}


Thoughts and practices for sensitive word filtering

In actual use, we should optimize and adjust the filtering according to real needs. On top of automatic machine filtering, add manual review for part of the content and carry out regular inspections, especially for content that is prone to ambiguity or requires deeper semantic analysis.

Sensitive word filtering is a complex and dynamic process. It requires both efficient technical means and flexible strategies to adapt to a constantly changing linguistic environment and changing policy requirements. I hope this article helps.