Use dtrace to analyze Apache process consuming 100% CPU

Written by - 3 comments

Published on - Listed in Linux Internet Apache


A couple of days ago I was able to follow a course on dtrace, which is a powerful and mighty tool to debug and troubleshoot (almost) any kind of issues on a Unix host.
The original dtrace program was developed from Sun for Solaris and it's now slowly getting ported on Linux (check out dtrace4linux). Once compiled, dtrace can then be loaded as a kernel module and the executable can be used to trace processes, i/o operations, etc.

Today I had a case where an Apache child process used 100% of a CPU for a long time (and kept eating CPU).

# ps auxf | grep www
root     19428  0.0  0.0 112312   852 pts/0    S+   19:17   0:00                      \_ grep www
www-data 18980  0.0  2.2 1516220 362164 ?      Sl   Oct01   0:20  \_ /usr/sbin/apache2 -k start
www-data 31152 99.1  1.9 1192780 323240 ?      R    16:42 154:13  \_ /usr/sbin/apache2 -k start
www-data  5476  0.0  0.1 239548 28844 ?        S    17:30   0:00  \_ /usr/sbin/apache2 -k start
www-data 19336 15.2  1.6 1400428 266288 ?      Sl   19:17   0:05  \_ /usr/sbin/apache2 -k start

What I usually do in such cases is a quick look at the server-status table, which, in 90% of the times, shows the website to blame. But in this case, the whole process was full with W entries (Sending Reply). So let's use dtrace to find the bad guy!

First I wanted to see what kind of function is used the most, so I counted all functions, which happen (live) in this process and output the summary as number:

# dtrace -n 'syscall:::entry { @[probefunc] = count() }' -p 31152
proc-stub:rd_event_enable
proc-stub:rd_errstr err=26
dtrace: description 'syscall:::entry ' matched 633 probes
^C
  read                                                           1095
  munmap                                                         2072
  mmap                                                           2073
  close                                                          2421
  open                                                           3047
  stat                                                           3225
  fstat                                                          4107
  lstat                                                        211141

So the lstat function is the one used the most often - 50x more than fstat.
Let's see who (which file) is using lstat so often.

# dtrace -n 'syscall::lstat:entry { @[copyinstr(arg0)] = count() } END {trunc(@, 5);}' -p 31152
proc-stub:rd_event_enable
proc-stub:rd_errstr err=26
dtrace: description 'syscall::lstat:entry ' matched 3 probes
^C
CPU     ID                    FUNCTION:NAME
  6      2                             :END

  /var/www/web7/html/example.com                           22143
  /var/www/web7                                            30231
  /var/www/web7/html                                       30231
  /var                                                     37380
  /var/www                                                 37380

What I do in the above dtrace command is simply to check for all lstat functions which enter the system from PID 31152. Instead of counting the functions, this time I'm counting the filenames. Because the filename is used in user-space (Apache is a user process) and we're "debugging in kernel space" (simple description) the filename (a string) needs to be copied (copyinstr). arg0 is the filename itself. Instead of having a list of thousands of files, I wanna truncate the result and only show the top 5.

As this is a shared hosting and all websites are in /var/www, it is no surprise that this path is the most often called. But right behind is example.com which resided in /var/www/web7.

Let's check in the access log what's happening:

# tail -f /var/www/web7/log/access_log
203.170.85.135 - - [04/Oct/2012:20:01:46 +0200] "GET /administrator/index.php HTTP/1.1" 200 4167 "http://example.com/administrator/index.php" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.170.85.135 - - [04/Oct/2012:20:01:47 +0200] "GET /administrator/ HTTP/1.1" 200 4028 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.170.85.135 - - [04/Oct/2012:20:01:47 +0200] "POST /administrator/index.php HTTP/1.1" 303 - "http://example.com/administrator/index.php" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.170.85.135 - - [04/Oct/2012:20:01:49 +0200] "GET /administrator/index.php HTTP/1.1" 200 4167 "http://example.com/administrator/index.php" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.170.85.135 - - [04/Oct/2012:20:01:49 +0200] "GET /administrator/ HTTP/1.1" 200 4028 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.170.85.135 - - [04/Oct/2012:20:01:50 +0200] "POST /administrator/index.php HTTP/1.1" 303 - "http://example.com/administrator/index.php" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.170.85.135 - - [04/Oct/2012:20:01:51 +0200] "GET /administrator/index.php HTTP/1.1" 200 4167 "http://example.com/administrator/index.php" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.170.85.135 - - [04/Oct/2012:20:01:52 +0200] "GET /administrator/ HTTP/1.1" 200 4028 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

Oh indeed, the IP 203.170.85.135 makes several requests per second (GET and POST) on a Joomla Administrator login. Probably a brute-force attack. The HTTP Header shows bingbot though but that's probably fake, as I already saw this (see post GoogleBot used as HTTP Agend by Hackers/Botscripts). A quick nslookup will show if the IP is really from bing:

# nslookup 203.170.85.135
cpanel30.per.syra.net.au.

ERRR! Nope, that's definitely not bing. Looks like a hacked shared hosting server (due to the name of cpanel).

Let's see how many TCP connections there are from this IP:

# netstat -an | grep "203.170.85.135" | head
tcp6       0      0 1.2.3.4:80      203.170.85.135:37295    TIME_WAIT
tcp6       0      0 1.2.3.4:80      203.170.85.135:37291    TIME_WAIT
tcp6       0      0 1.2.3.4:80      203.170.85.135:37100    TIME_WAIT
tcp6       0      0 1.2.3.4:80      203.170.85.135:37068    TIME_WAIT
tcp6       0      0 1.2.3.4:80      203.170.85.135:37043    TIME_WAIT

# netstat -an | grep "203.170.85.135" -c
101

The TIME_WAIT entries explain why this Apache child process hasn't been killed as has been run for such a long time (and is still running). As long as there are TIME_WAIT's (or others as ESTABLISHED) in the child process, Apache (master) won't kill it.

I don't want this website be allowed to be hacked and I hate CPU intensive processes which run forever, so lets stop the attack through a simple .htaccess file and a deny:

# cat /var/www/web7/html/.htaccess
order allow,deny
deny from 203.170.85.135
allow from all

Shortly after that the TCP connections went down:

# netstat -an | grep "203.170.85.135" -c
0

Solved!


Add a comment

Show form to leave a comment

Comments (newest first)

Claudio from Switzerland wrote on Oct 6th, 2012:

That's a very good idea, will apply that. Thanks.


Warper from wrote on Oct 6th, 2012:

To avoid this brute force attack, that is very popular nowadays, you should add this rule to your /administrator/.htaccess instead to ban every individual ip:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|Baiduspider) [NC]
RewriteRule .* - [R=403,L]

The bot attackers are using bingbot always as user referer, so I think this is a good way to avoid it.
Anyway, those bots have nothing to do in /administrator/ folder, so block them is safe.

Regards,
warper


Dennis B. from Germany wrote on Oct 5th, 2012:

Very nice example ! I'll bookmark this :)


RSS feed

Blog Tags:

  AWS   Android   Ansible   Apache   Apple   Atlassian   BSD   Backup   Bash   Bluecoat   CMS   Chef   Cloud   Coding   Consul   Containers   CouchDB   DB   DNS   Database   Databases   Docker   ELK   Elasticsearch   Filebeat   FreeBSD   Galera   Git   GlusterFS   Grafana   Graphics   HAProxy   HTML   Hacks   Hardware   Icinga   Influx   Internet   Java   KVM   Kibana   Kodi   Kubernetes   LVM   LXC   Linux   Logstash   Mac   Macintosh   Mail   MariaDB   Minio   MongoDB   Monitoring   Multimedia   MySQL   NFS   Nagios   Network   Nginx   OSSEC   OTRS   Office   OpenSearch   PGSQL   PHP   Perl   Personal   PostgreSQL   Postgres   PowerDNS   Proxmox   Proxy   Python   Rancher   Rant   Redis   Roundcube   SSL   Samba   Seafile   Security   Shell   SmartOS   Solaris   Surveillance   Systemd   TLS   Tomcat   Ubuntu   Unix   VMWare   VMware   Varnish   Virtualization   Windows   Wireless   Wordpress   Wyse   ZFS   Zoneminder