As a website operator who has worked with AnQiCMS for many years, I know that logs are our most loyal partners when facing system stability challenges under high load. When the AnQiCMS process unfortunately crashes under high load, the PID check log, `check.log`, can provide key initial clues and guide us toward deeper problem diagnosis.
Understanding AnQiCMS's process management and logging mechanism
AnQiCMS is an enterprise-level content management system developed in the Go programming language and known for its high concurrency and performance. To keep the system running stably under all conditions, AnQiCMS is usually paired with a guardian script such as `start.sh` that monitors the running status of the main process.
The core responsibilities of the `start.sh` script are to periodically check whether the main AnQiCMS process exists and, if the process has stopped, to attempt a restart; the result of each check is written to the `check.log` file. At the same time, when the AnQiCMS process is started, its standard output and standard error are redirected to `running.log`, which contains detailed information about the application's runtime behavior and any error reports.
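To make these responsibilities concrete, below is a minimal sketch of what such a guardian script might look like. This is an illustration only: the installation path, binary name, and exact log wording are assumptions, and your actual `start.sh` may differ.

```bash
#!/bin/bash
# Minimal guardian-script sketch; paths, names, and log wording are assumptions.
APP_DIR=/www/anqicms        # assumed installation directory
APP=./anqicms               # assumed binary name

cd "$APP_DIR" || exit 1

# Count running anqicms processes and append the result to check.log.
COUNT=$(pgrep -x -c anqicms)
echo "$(date '+%Y-%m-%d %H:%M:%S') PID anqicms check: $COUNT" >> check.log

# If no process was found, record the fact and try to restart in the background,
# redirecting stdout and stderr to running.log.
if [ "$COUNT" -eq 0 ]; then
    echo "$(date '+%Y-%m-%d %H:%M:%S') anqicms NOT running, restarting" >> check.log
    nohup "$APP" >> running.log 2>&1 &
fi
```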
Preliminary clues provided by the PID check log (`check.log`)
The `check.log` file plays the role of a sentinel: it records whether the AnQiCMS process is alive. In the `start.sh` script, each check records the current number of AnQiCMS processes, usually in the form `PID anqicms check: X`, where X is the number of `anqicms` processes currently found.
Direct indication of process status
When AnQiCMS is running normally, the value of X recorded in `check.log` should always be greater than 0 (usually 1, meaning the main process is running). If we find a `PID anqicms check: 0` record in the log, it is a clear signal that the AnQiCMS process had stopped at the time of that check. In a high-load scenario, if the process was healthy before and a record with X=0 suddenly appears, we can be almost certain that the AnQiCMS process has crashed.
The time point and frequency of the crash
Each record in `check.log` carries a timestamp. By analyzing these timestamps, we can pinpoint exactly when the AnQiCMS process crashed. If `check.log` shows the process repeatedly stopping and recovering (that is, frequent `PID anqicms check: 0` records followed by a return to normal), this points to instability that may be associated with peak website traffic, scheduled task execution, or other predictable external events. These time points are crucial for correlating the problem with specific operations, system load patterns, or fluctuations in external dependencies.
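A quick way to pull out these crash records and see how they cluster over time is to grep `check.log` directly. The timestamp prefix assumed below (`YYYY-MM-DD HH:MM:SS`) is not guaranteed; adjust the character range to match your actual log lines.

```bash
# List every record where the process count dropped to zero.
grep "PID anqicms check: 0" check.log

# Count crash records per hour (assumes each line starts with "YYYY-MM-DD HH:MM:SS").
grep "PID anqicms check: 0" check.log | cut -c1-13 | sort | uniq -c
```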
Identify abnormal restarts
Combined with the execution frequency of the startup script (for example, if `start.sh` is run once every minute by a cron job), `check.log` helps us determine whether the process has gone through unexpected restarts. An X=0 record is usually followed by an `anqicms NOT running` message and a `nohup ... &` restart attempt. By observing these patterns, we can confirm whether the crash was detected by the guardian script and a recovery was attempted.
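For reference, a cron entry that runs the guardian script once per minute might look like the following; the path is an assumption and should match wherever your `start.sh` actually lives.

```bash
# crontab entry: run the guardian script every minute (assumed path /www/anqicms/start.sh).
* * * * * /bin/bash /www/anqicms/start.sh >/dev/null 2>&1
```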
In-depth analysis with `running.log`
Although `check.log` tells us that the AnQiCMS process has crashed, it cannot explain why. To understand the root cause, we need to turn to `running.log`.
`running.log` contains the standard output and standard error produced by the AnQiCMS application at runtime. When a Go application hits a runtime error (such as a panic) or another critical problem, detailed stack traces and error messages are usually recorded in this file. In high-load scenarios, common causes of crashes include:
- Go runtime panic: the most direct form of crash for a Go program, usually accompanied by detailed goroutine stack traces. It can be caused by nil pointer dereferences, slice or array indexes out of range, or concurrency safety issues (such as unsynchronized concurrent reads and writes to a map).
- Out of memory (OOM): although Go has a garbage collector, under high concurrency a sharp growth in the number of goroutines, a memory leak, or processing large volumes of data can still exhaust system memory, after which the process is killed by the operating system. `running.log` may not show the OOM directly, but a drop in log volume just before the application stops, or errors indicating failed resource allocation, can be clues.
- Database connection pool exhaustion or deadlock: AnQiCMS depends on a database. Under high load, an improperly configured connection pool can be exhausted so that new connections cannot be obtained, triggering application errors. Database deadlocks can also block requests for long periods, and too many piled-up requests may eventually bring the application down.
- External service timeouts or errors: AnQiCMS may depend on external APIs, caching services, or file storage. If these services respond slowly or return errors under high load and AnQiCMS does not handle this gracefully, the application itself may crash.
- Goroutine leak: goroutines are lightweight threads in Go. If goroutines are created but never exit and keep consuming resources, the accumulation over time can exhaust system resources. `running.log` may not show the leak directly, but it will show other resource-related errors or panics.
By searching `running.log` for keywords such as `panic`, `fatal error`, `runtime error`, `out of memory`, `timeout`, or `too many connections`, we can locate the specific error message and the code location where the crash occurred, which provides the most valuable basis for diagnosis.
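In practice this can be done with a couple of grep commands like the ones below; the keywords follow the list above, and the 40-line window is just a reasonable default for a goroutine stack trace.

```bash
# Search running.log for common crash signatures, with line numbers.
grep -n -E "panic|fatal error|runtime error|out of memory|timeout|too many connections" running.log

# Print 40 lines starting from the last panic, which usually contains the goroutine stack trace.
LAST=$(grep -n "panic:" running.log | tail -n 1 | cut -d: -f1)
[ -n "$LAST" ] && tail -n "+$LAST" running.log | head -n 40
```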
Comprehensive troubleshooting steps and suggestions
When `check.log` indicates that the AnQiCMS process has crashed under high load, I recommend the following steps for a comprehensive investigation:
- Confirm crash times and frequency: carefully review `check.log`, record every `PID anqicms check: 0` time point, and compare them with the website traffic curve, system resource monitoring data (such as CPU, memory, network I/O, and disk I/O), and the execution times of any known scheduled tasks.
- Analyze `running.log` in depth: locate the entries in `running.log` around each crash time and search for any anomalies, errors, warnings, or panic stack traces. This is the key step in diagnosing the root cause.
- Monitor system resources: check CPU utilization, memory usage, network bandwidth, and disk I/O before and after each crash (a simple sampling loop is sketched after this list). Under high load, the memory growth curve of a Go application deserves special attention, as it may hint at goroutine leaks or an impending OOM.
- Check database performance: review database logs and performance metrics, including connection counts, slow query logs, and lock waits. Make sure the database can stably serve AnQiCMS requests under high load.
- Check the health of external dependencies: if AnQiCMS depends on other services, check their logs and health status, for example caching services (such as Redis), message queues, image processing services, or third-party APIs.
- Review the AnQiCMS configuration: check the relevant settings in `config.json`, especially those related to concurrency, caching, timeouts, and the database connection pool. Unreasonable configuration can become a bottleneck under high load.
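For the system resource monitoring step, a minimal sampling loop such as the one below can be left running (or driven from cron) to record the process footprint over time; it assumes the process is named `anqicms` and writes to a file name chosen here for illustration.

```bash
# Sample CPU and memory of the anqicms process once a minute and append to a CSV-like file.
# Stop with Ctrl+C, or run the body once per minute from cron instead of looping.
while true; do
    ps -C anqicms -o pid=,%cpu=,%mem=,rss= | \
        awk -v ts="$(date '+%F %T')" '{print ts","$1","$2","$3","$4}' >> anqicms_resources.csv
    sleep 60
done
```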
With this systematic approach, in which `check.log` serves as the initial signal of an anomaly and `running.log` as the core diagnostic tool, combined with comprehensive monitoring of system resources and external dependencies, we can locate and resolve AnQiCMS crashes under high load far more efficiently.
Common Questions and Answers (FAQ)
Q1: `check.log` frequently shows the process count dropping to 0, but `running.log` contains no obvious error message. What should I do?
A1: This may be because the application was forcibly terminated under high load, for example by the OOM Killer when memory was exhausted, or because a process resource limit was reached. In such cases the application may not have had the chance to write detailed error information to `running.log`. You should check the system's kernel logs (for example, on Linux, `/var/log/messages` or the output of the `dmesg` command) for records of the OOM Killer or other notifications that the process was terminated. It is also worth checking whether `ulimit` or other system-level resource limits are set too low. In addition, increasing the verbosity of `running.log` (if AnQiCMS supports it) or adding more robust error handling and logging may help.
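On most Linux systems these checks boil down to a few standard commands:

```bash
# Look for OOM Killer activity in the kernel log.
dmesg -T | grep -i -E "out of memory|oom-killer|killed process"

# On systemd-based systems the kernel ring buffer is also available via journalctl.
journalctl -k | grep -i "killed process"

# Inspect the resource limits of the running anqicms process (assumes one anqicms process).
cat /proc/$(pgrep -x anqicms | head -n 1)/limits
```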
Q2: After a restart, AnQiCMS runs for a while and then crashes again. What could be the reasons?
A2: This phenomenon usually points to a resource accumulation problem, or to a defect in a specific code path that only manifests under high load. Common causes include the following; a few sampling commands that can help confirm such accumulation are sketched after the list:
- Memory leak: the application may leak memory, so that memory usage climbs gradually after each restart until system resources are exhausted.
- Database connections not released: connections may not be closed properly or returned to the pool, leading to a rapid rise in the connection count under high concurrency and ultimately a crash.
- Goroutine leak: goroutines that never exit keep consuming memory and other resources. Under high load, a large accumulation of goroutines can cause system performance to degrade sharply and eventually crash the process.
- Unstable external services: external services AnQiCMS relies on (such as Redis or remote APIs) may be intermittently unstable, and the effects can accumulate as the system keeps running.
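One way to confirm this kind of accumulation is to sample a few indicators repeatedly after a restart and see whether they only ever go up. The commands below are a sketch that assumes a MySQL database on port 3306 and a process named `anqicms`; adapt them to your environment.

```bash
# Resident memory (KB) of the anqicms process; a value that only ever rises suggests a leak.
ps -C anqicms -o pid=,rss=

# Established connections held by anqicms toward the database (assumes MySQL on port 3306).
ss -tnp state established '( dport = :3306 )' | grep -c anqicms

# Connection count as reported by MySQL itself (assumes local credentials are configured).
mysql -e "SHOW STATUS LIKE 'Threads_connected';"
```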
Q3: Besides `check.log` and `running.log`, what other logs can help troubleshoot the problem?
A3: In addition to AnQiCMS's own logs, the following system and application logs are also worth checking; a small correlation example follows the list:
- System logs: on Linux, `/var/log/syslog` or `/var/log/messages`, or viewed via `journalctl`. These record operating system events, including OOM Killer activity, hardware failures, and network issues.
- Web server logs (Nginx/Apache): if you use Nginx or Apache as a reverse proxy, their access logs (`access.log`) and error logs (`error.log`) help you understand traffic patterns, response times, and errors at the proxy level.
- Database logs (e.g. MySQL): the database error log, slow query log, and binary log (if enabled) are crucial for diagnosing performance bottlenecks or errors at the database level.
- Cloud monitoring and logging: if AnQiCMS is deployed on a cloud platform, the monitoring and log services provided by the cloud vendor (such as AWS CloudWatch or Alibaba Cloud Log Service) offer more comprehensive resource metrics and aggregated application log analysis.
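As a small example of tying these sources together, the commands below pull entries around an assumed crash time from the system log and the Nginx error log; the timestamp value and log paths are placeholders to replace with your own.

```bash
CRASH_TIME="2024-05-01 14:23"   # placeholder: copy from a "PID anqicms check: 0" record

# System log entries from that minute (path may be /var/log/messages on some distributions).
grep "$(date -d "$CRASH_TIME" '+%b %e %H:%M')" /var/log/syslog

# Nginx errors from the same minute (default error log path; adjust for your setup).
grep "$(date -d "$CRASH_TIME" '+%Y/%m/%d %H:%M')" /var/log/nginx/error.log
```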