Practical Guide to Analyzing Linux System Calls with strace
When it comes to debugging Linux systems, strace is one of the most useful tools available. While troubleshooting, I often run into programs that mysteriously slow down, or services that suddenly stop working while the logs offer no clues. This is where strace comes in handy: it shows what the program is actually doing at the system call level.
What exactly is strace?
Simply put, strace is an "eavesdropper". It listens in on every conversation between your program and the Linux kernel. Whenever your program reads a file, sends a network packet, or asks for more memory, it has to go through the kernel via a system call. strace records each call with its arguments and return value, so you can see exactly what was requested and what came back.
For example, when you run a simple ls command:
strace ls
You'll see a lot of output, like this:
execve("/bin/ls", ["ls"], 0x7fff8b8e2220) = 0
openat(AT_FDCWD, "/lib64/ld-linux-x86-64.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0"..., 832) = 832
...
openat(AT_FDCWD, ".", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
getdents64(3, /* 8 entries */, 32768) = 240
write(1, "file1.txt file2.txt README.md\n", 30) = 30
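Every line in that output follows the same shape: the call name, its arguments in parentheses, then the return value (plus an error name when the call fails). As a rough sketch of how such a line could be split apart in a script, here is a small Python parser. The name parse_line and the regular expression are my own illustrative assumptions, not part of strace, and the pattern ignores PID prefixes from -f, timestamps, and "<unfinished ...>" lines.

```python
import re

# Assumed line shape: name(args) = result [ERRNO (description)]
# Not handled: PID prefixes from -f, timestamps, "<unfinished ...>" lines.
LINE_RE = re.compile(
    r'^(?P<name>\w+)\((?P<args>.*)\)\s+=\s+'
    r'(?P<ret>0x[0-9a-f]+|-?\d+|\?)'
    r'(?:\s+(?P<errno>E[A-Z0-9]+))?'
)


def parse_line(line):
    """Split one strace line into name, args, return value, and errno.

    Returns None when the line does not look like a completed call.
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "args": m.group("args"),
        "ret": m.group("ret"),      # kept as a string, e.g. "3" or "-1"
        "errno": m.group("errno"),  # None when the call succeeded
    }
```

Calling parse_line on the getdents64 line above would give a dictionary whose "name" is "getdents64" and whose "ret" is "240". The return value stays a string because some calls return addresses or "?" instead of plain integers.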
What problems can strace help you solve?
strace is particularly useful in these scenarios:
1. Program startup exceptions or performance issues
I once had a Java program that started very slowly. With strace, I found it spending its startup time on configuration file lookups that kept failing:
strace -f -e trace=openat java -jar myapp.jar 2>&1 | grep ENOENT
It turned out the program was trying to open hundreds of configuration files that did not exist, and each failed attempt added to the startup time.
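When there are hundreds of failed lookups, it helps to tally them per path and see which ones dominate. The following is a minimal sketch of that idea in Python, assuming the trace was saved to a text file and that the openat lines look like the ones in this article. The function name missing_paths and the regular expression are illustrative assumptions.

```python
import re
from collections import Counter

# Assumed line shape: openat(AT_FDCWD, "path", FLAGS) = -1 ENOENT (...)
# search() is used so a leading "[pid N]" prefix from -f does not matter.
ENOENT_RE = re.compile(
    r'\bopenat\([^,]+,\s*"(?P<path>[^"]*)".*\)\s+=\s+-1\s+ENOENT\b'
)


def missing_paths(lines):
    """Count how often each path failed to open with ENOENT."""
    counts = Counter()
    for line in lines:
        m = ENOENT_RE.search(line)
        if m:
            counts[m.group("path")] += 1
    return counts
```

With a saved trace, something like missing_paths(open("trace.log")).most_common(10) would list the ten most frequently probed missing files (trace.log is a placeholder name here).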
2. File operation problems
People often report that "my program can't find its files" without knowing which paths it is actually looking for. With strace, it becomes clear:
strace -e trace=openat,access ./myprogram
3. Network connection issues
Can't connect to a database or an API? See where the program is actually trying to connect:
strace -e trace=connect,socket ./network_app
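Once the trace is captured, the destination addresses can be pulled out of the connect lines. The sketch below is one hedged way to do that in Python. It assumes IPv4 lines in the format strace prints for AF_INET sockets (the same format used later in this article), and it skips IPv6 and Unix sockets entirely. The name connect_targets is an illustrative assumption.

```python
import re

# Assumed line shape (IPv4 only):
# connect(3, {sa_family=AF_INET, sin_port=htons(3306),
#             sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED
CONNECT_RE = re.compile(
    r'\bconnect\(\d+, \{sa_family=AF_INET, sin_port=htons\((?P<port>\d+)\), '
    r'sin_addr=inet_addr\("(?P<addr>[^"]+)"\)\}, \d+\)\s+=\s+'
    r'(?P<ret>-?\d+)(?:\s+(?P<errno>E[A-Z0-9]+))?'
)


def connect_targets(lines):
    """Return (address, port, outcome) for each IPv4 connect() call found.

    outcome is the errno name for failed calls and "ok" otherwise.
    """
    targets = []
    for line in lines:
        m = CONNECT_RE.search(line)
        if m:
            outcome = m.group("errno") or "ok"
            targets.append((m.group("addr"), int(m.group("port")), outcome))
    return targets
```

Note that non-blocking sockets usually report EINPROGRESS here, which is not a real failure, so the outcome column needs to be read with that in mind.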
4. Memory and process problems
Does the program keep growing in memory or create too many processes? Trace its memory mappings and process creation:
strace -e trace=mmap,munmap,clone,fork ./memory_app
Real Case: Troubleshooting a Slow Web Service Response
Once a web service was responding very slowly, but CPU and memory looked normal. We used strace to investigate:
# First find the process ID
ps aux | grep web_service
# Start tracing
strace -p 12345 -f -e trace=read,write,openat,close -o web_service.strace
The strace output showed that every request re-read the same huge configuration file from disk. Problem found!
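A pattern like this can also be surfaced with a short script that follows file descriptors from openat through read to close and adds up the bytes read per path. The following Python sketch shows the idea under some loud assumptions: the log was written with -o (so lines may start with a PID), all traced threads share one descriptor table, and there are no "<unfinished ...>" or "resumed" lines. The name read_stats and the sample path in the usage note are hypothetical.

```python
import re
from collections import defaultdict

# Successful openat: captures the path and the returned file descriptor.
OPEN_RE = re.compile(r'\bopenat\([^,]+,\s*"([^"]*)".*\)\s+=\s+(\d+)')
# Successful read: captures the file descriptor and the byte count.
READ_RE = re.compile(r'\bread\((\d+),.*\)\s+=\s+(\d+)')
# Successful close: captures the file descriptor.
CLOSE_RE = re.compile(r'\bclose\((\d+)\)\s+=\s+0')


def read_stats(lines):
    """Map each opened path to [times opened, total bytes read]."""
    fd_to_path = {}
    stats = defaultdict(lambda: [0, 0])
    for line in lines:
        m = OPEN_RE.search(line)
        if m:
            path, fd = m.group(1), int(m.group(2))
            fd_to_path[fd] = path
            stats[path][0] += 1
            continue
        m = READ_RE.search(line)
        if m:
            path = fd_to_path.get(int(m.group(1)))
            if path is not None:  # ignore sockets, pipes, inherited fds
                stats[path][1] += int(m.group(2))
            continue
        m = CLOSE_RE.search(line)
        if m:
            fd_to_path.pop(int(m.group(1)), None)
    return dict(stats)
```

Sorting the result by total bytes puts the heaviest files first. A path such as /etc/app/big.conf showing up with thousands of opens would point at exactly the kind of repeated reading described here.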
That said, raw strace output is overwhelming. Tens of thousands of lines are hard to read by eye, so you need better tools to analyze them.
Using ctbots.com Visualization Tools for Efficient Analysis
Reading raw strace output line by line is painful. I built the strace online analysis tool at ctbots.com to make this kind of investigation easier.
Tool URL: https://ctbots.com/en/linux/performance/strace.html
Data Collection
First, collect the strace data. The tool page generates the standard commands for you, so I won't go into details here; just follow the guide on the page.
Advantages of Online Analysis
After you upload strace logs to the ctbots tool, you can search them directly instead of combing through the raw output line by line.
Common Suspicious Points in Problem Troubleshooting
In my experience, these are the areas where problems show up most often:
File system related
# Common problem patterns
openat(AT_FDCWD, "/nonexistent/path/config.conf", O_RDONLY) = -1 ENOENT
# Program is looking for non-existent configuration files
stat("/usr/share/locale/zh_CN/LC_MESSAGES/app.mo", {st_mode=S_IFREG|0644, st_size=45678, ...}) = 0
# If this line repeats many times, the program keeps checking the same language file, which suggests a caching issue
Network connection problems
connect(3, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED
# Database connection refused
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("8.8.8.8")}, 16) = -1 ETIMEDOUT
# Network timeout
Memory allocation exceptions
mmap(NULL, 1073741824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM
# The program tried to allocate 1 GiB of memory and the request failed
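Large requests like this one are easy to miss among thousands of small mappings. As a hedged sketch, the Python below filters mmap lines by requested size. The 256 MiB default threshold is an arbitrary assumption chosen for illustration, and the name large_mmaps is hypothetical.

```python
import re

# Assumed line shape: mmap(ADDR, SIZE, PROT, FLAGS, FD, OFFSET) = RESULT
MMAP_RE = re.compile(r'\bmmap\(\w+, (?P<size>\d+), .*\)\s+=\s+(?P<ret>\S+)')


def large_mmaps(lines, threshold=256 * 1024 * 1024):
    """Return (size_in_bytes, succeeded) for mmap requests >= threshold."""
    hits = []
    for line in lines:
        m = MMAP_RE.search(line)
        if m and int(m.group("size")) >= threshold:
            # strace prints -1 for a failed mmap and an address otherwise.
            hits.append((int(m.group("size")), m.group("ret") != "-1"))
    return hits
```

Dividing a reported size by 2**20 converts it to MiB, so the 1073741824-byte request above comes out as 1024 MiB.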
Process creation problems
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f1234567890) = -1 EAGAIN
# The system cannot create a new process, possibly because a process or thread limit was reached
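All of the patterns above share one trait: the call returns -1 followed by an error name. A quick way to get an overview of a large log is to count failures per call and error. The Python sketch below illustrates this, assuming lines in the format shown in this article, optionally prefixed with a PID. The name failure_summary is an illustrative assumption.

```python
import re
from collections import Counter

# Assumed line shape: [optional PID prefix] name(args) = -1 ERRNO (...)
FAIL_RE = re.compile(
    r'^(?:\[pid\s+\d+\]\s+|\d+\s+)?'
    r'(?P<name>\w+)\(.*\)\s+=\s+-1\s+(?P<errno>E[A-Z0-9]+)'
)


def failure_summary(lines):
    """Count failed calls, keyed by (syscall name, errno name)."""
    counts = Counter()
    for line in lines:
        m = FAIL_RE.match(line.strip())
        if m:
            counts[(m.group("name"), m.group("errno"))] += 1
    return counts
```

Not every entry in such a summary is a problem. Many programs probe for optional files and expect ENOENT, so the counts are a starting point for investigation rather than a verdict.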
Remember that strace is a powerful tool, but it should be used with care. Be cautious in production environments, because tracing slows the target program down noticeably. Combined with the online analysis tool at ctbots.com, it lets you track down many strange problems with far less effort!