As an experienced website operations engineer, I am well aware that the health of server resources is crucial to the stable operation of a CMS, especially one like AnQiCMS that emphasizes high performance and concurrent processing. When server resources, particularly memory, run short, the kernel may terminate the processes using the most memory without hesitation in order to keep the system as a whole stable; this is what we commonly call a process being 'OOM (Out Of Memory) killed'.
So when the AnQiCMS process is unfortunately killed by the system's OOM Killer, how do we capture those 'exceptional signals' in the logs? Unlike a program that exits on its own, an OOM-killed process usually leaves no detailed 'last words', but by examining the AnQiCMS guardian-script logs together with the system kernel logs, we can still piece together what happened.
OOM-Killed AnQiCMS: A quiet 'death' and a hurried 'rebirth'
When the system is on the verge of exhausting its memory, the Linux kernel's OOM Killer is triggered. Based on a scoring algorithm (the OOM score), it selects and kills one or more high-memory processes to free resources and keep the system from collapsing entirely. For a Go application such as AnQiCMS, being OOM-killed means the process terminates abruptly without warning: it has no chance to perform any cleanup, let alone record the reason for its death in its own application logs.
However, AnQiCMS is usually kept alive by a guardian script such as start.sh (mentioned in the documentation). This script periodically checks whether the AnQiCMS process exists and, if it finds the process has terminated unexpectedly, immediately attempts to restart it. It is precisely this 'back from the dead' mechanism that leaves the key clues for tracing OOM events in the logs.
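For readers unfamiliar with this pattern, the sketch below shows what such a guardian loop might look like. It only illustrates the behaviour described above and is not the actual start.sh shipped with AnQiCMS; the binary path, the log file names, and the 60-second interval are all assumptions.
#!/bin/bash
# Hypothetical watchdog loop (illustrative only, not the real start.sh).
APP=./anqicms          # assumed binary path
LOG_DIR=.              # assumed log directory
while true; do
    # Count running anqicms processes: 1 means alive, 0 means gone.
    ALIVE=$(ps -ef | grep -c '[a]nqicms')
    echo "$(date '+%Y%m%d %H:%M:%S') anqicms PID check: $ALIVE" >> "$LOG_DIR/check.log"
    if [ "$ALIVE" -eq 0 ]; then
        echo "$(date '+%Y%m%d %H:%M:%S') anqicms NOT running" >> "$LOG_DIR/check.log"
        # Relaunch the binary and send its output to running.log.
        nohup "$APP" >> "$LOG_DIR/running.log" 2>&1 &
    fi
    sleep 60
done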
The 'traces' in the log: Tracking the abnormal interruption of AnQiCMS
To determine whether AnQiCMS has been OOM-killed, we need to pay attention to two types of logs: the guardian-script (daemon) logs of AnQiCMS itself, especially the ones responsible for checking and restarting the process, and the system kernel logs.
1. check.log: a heartbeat with an abnormal pulse
In a typical AnQiCMS deployment (especially a Linux deployment managed by the start.sh script), there is usually a check.log or similar log file recording the guardian script's periodic checks of the AnQiCMS process status.
Under normal operation, check.log shows that the AnQiCMS process (identified by its PID, Process ID) is continuously present, for example:
20240723 10:00:01 anqicms PID check: 1
20240723 10:01:01 anqicms PID check: 1
20240723 10:02:01 anqicms PID check: 1
...
When the AnQiCMS process is OOM-killed, however, the guardian script finds that the original process no longer exists (ps -ef can no longer find it) and tries to restart it. At that point, check.log shows a pattern like the following:
20240723 10:03:01 anqicms PID check: 1 # process running normally
20240723 10:04:01 anqicms PID check: 0 # process was killed; the check finds it missing
20240723 10:04:01 anqicms NOT running # guardian script records that the process is not running
20240723 10:04:01 (startup command...) # guardian script attempts to start a new process
20240723 10:05:01 anqicms PID check: 1 # new process started; a new PID appears
...
You will see the check value drop from '1' to '0', followed by the 'NOT running' message and the startup command, and then a new process appearing. This 'process disappears, restart, new process appears' pattern is a strong signal that the AnQiCMS process was interrupted abnormally and automatically revived by the guardian script. It does not state OOM directly, but combined with the sudden stop of the application log, it points to the process having been forcibly terminated by the system.
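If check.log has grown large, you can jump straight to these interruption points by searching for the restart marker with a little surrounding context. A small example, assuming check.log sits in the current directory:
grep -n -B 2 -A 2 'NOT running' check.log
Each hit marks a moment when the guardian script found the process missing; note the timestamps so you can match them against the kernel log later.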
2. running.log: a sudden silence
running.log (or whatever file the AnQiCMS output is redirected to) normally records the application's running status, request handling, error messages and so on. When the AnQiCMS process is OOM-killed, the termination is so abrupt that the application has no opportunity to write any shutdown or error message.
As a result, running.log shows two characteristic symptoms:
- The log output stops abruptly: right up to the OOM event, entries may be written normally, and then they simply stop, with no 'graceful shutdown' or 'error exit' message.
- The log timestamps jump: when the guardian script restarts AnQiCMS, a new process starts writing to the log. The timestamps suddenly leap from one point to the time after the restart, and the new entries have no connection to the old ones, as if they belong to a completely separate session.
This 'cliff-edge' interruption of the log is strong evidence that the application process was forcibly terminated from outside.
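A quick way to see this symptom is to compare the last entries in the log with the file's last-modified time. The commands below assume running.log is in the current directory and GNU coreutils are available:
tail -n 50 running.log   # the final lines written before the interruption
stat running.log         # the 'Modify' time shows when the last line was written
If that last modification time lines up with a kill record in the kernel log described below, the picture is essentially complete.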
3. System kernel log: conclusive 'death certificate'
To obtain the ultimate, most convincing evidence that the AnQiCMS process was killed by OOM, we need to check the system kernel logs. These record all important system-level events, including the activity of the OOM Killer.
On a Linux system, the relevant logs can be found in the following places:
- /var/log/syslog or /var/log/messages (the exact file varies between Linux distributions): the main system log files.
- The dmesg command output: dmesg prints the kernel ring buffer, which includes detailed records of OOM events.
You can use grep with a few keywords to find OOM events. For example:
grep -Ei 'oom|out of memory|killed process' /var/log/syslog
or
dmesg | grep -Ei 'oom|out of memory|killed process'
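On systemd-based distributions, /var/log/syslog or /var/log/messages may not exist at all; in that case the same kernel messages can usually be read through journald. This assumes journalctl is available and collecting kernel output (the default on most modern distributions):
journalctl -k | grep -Ei 'oom|out of memory|killed process'
journalctl -k -b | grep -Ei 'killed process'   # limit the search to the current boot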
When AnQiCMS is OOM-killed, the kernel log usually contains records similar to the following:
kernel: Out of memory: Kill process 12345 (anqicms) score 999 or sacrifice child
kernel: Killed process 12345 (anqicms) total-vm:4123456kB, anon-rss:3987654kB, file-rss:123456kB, shmem-rss:0kB
In these entries:
- process 12345 (anqicms) clearly identifies the name and PID of the killed process.
- Out of memory or Killed process directly states the nature of the event.
- total-vm / anon-rss show how much memory the process was using before it was killed, helping you analyze which type of memory was being consumed.
These kernel logs are the gold standard for diagnosing OOM issues, providing direct evidence and context about the forcibly terminated process.
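As a preventive check, you can also look at how attractive the running AnQiCMS process currently is to the OOM Killer. The paths below are standard /proc entries; the pidof call assumes the binary runs under the name anqicms, as in the kernel log sample above:
cat /proc/$(pidof anqicms)/oom_score       # higher score = more likely to be picked next time
cat /proc/$(pidof anqicms)/oom_score_adj   # manual adjustment applied to the score, if any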
Why might a Go application like AnQiCMS still be OOM-killed?
Although Go is known for its efficient memory management and lightweight goroutines, this does not make Go applications immune to OOM. In the following scenarios, even a Go application such as AnQiCMS can face OOM risk:
- High concurrency and instantaneous traffic peaks: goroutines are lightweight, but if a large number of requests flood in at once, each goroutine may allocate only a small amount of memory while the cumulative total quickly exceeds the available physical memory.
- Processing large files or datasets: when AnQiCMS has to handle large uploaded files, run heavy content collection or batch imports, or manipulate very large data structures in memory, it can consume a great deal of memory in a short time.
- Memory leaks (uncommon but possible): even though Go has garbage collection, poorly designed program logic, such as holding references to objects that are no longer needed, or mishandled Cgo interactions, can still keep memory from being released in time and effectively leak it.
- Insufficient server resources: the most direct cause is simply that the memory available to AnQiCMS on the server is not enough to cover normal operation plus business peaks. (A memory-limit mitigation sketch follows this list.)
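One mitigation worth sketching for the scenarios above: since Go 1.19 the runtime honours a soft memory limit set through the GOMEMLIMIT environment variable, which makes the garbage collector reclaim memory more aggressively before the process grows into OOM territory. A minimal sketch, assuming AnQiCMS is built with Go 1.19+ and launched from a shell script; the 1GiB value is purely illustrative and should be sized to your server:
# Set a soft heap limit for the Go runtime before launching AnQiCMS.
export GOMEMLIMIT=1GiB
nohup ./anqicms >> running.log 2>&1 &
This does not make OOM impossible (the limit is soft, and memory allocated outside the Go runtime, for example via Cgo, is not covered), but it reduces the chance that a traffic spike pushes the process past the physical memory ceiling.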
Summary
When the AnQiCMS process is OOM-killed, the most direct clues are the sudden change in the PID check in check.log and the abrupt stop of output in running.log. To confirm an OOM event, however, the system kernel logs (such as syslog or the dmesg output) are the decisive evidence, since they record exactly which process was killed and why.
Common Questions (FAQ)
- Q: What is AnQiCMS? A: AnQiCMS is a content management system written in Go that emphasizes high performance and concurrent processing; on Linux it is typically kept running by a guardian script such as start.sh.
- Q: check.log shows the PID check value changing frequently; does that necessarily mean OOM? A: Not necessarily. You also need to check whether running.log stopped abruptly without any error stack or exit message, and whether the system kernel log contains an OOM record; only the kernel log gives conclusive confirmation.