A couple of days ago I was able to follow a course on dtrace, which is a powerful and mighty tool to debug and troubleshoot (almost) any kind of issues on a Unix host.
The original dtrace program was developed from Sun for Solaris and it's now slowly getting ported on Linux (check out dtrace4linux). Once compiled, dtrace can then be loaded as a kernel module and the executable can be used to trace processes, i/o operations, etc.
Today I had a case where an Apache child process used 100% of a CPU for a long time (and kept eating CPU).
# ps auxf | grep www
root 19428 0.0 0.0 112312 852 pts/0 S+ 19:17 0:00 \_ grep www
www-data 18980 0.0 2.2 1516220 362164 ? Sl Oct01 0:20 \_ /usr/sbin/apache2 -k start
www-data 31152 99.1 1.9 1192780 323240 ? R 16:42 154:13 \_ /usr/sbin/apache2 -k start
www-data 5476 0.0 0.1 239548 28844 ? S 17:30 0:00 \_ /usr/sbin/apache2 -k start
www-data 19336 15.2 1.6 1400428 266288 ? Sl 19:17 0:05 \_ /usr/sbin/apache2 -k start
What I usually do in such cases is a quick look at the server-status table, which, in 90% of the times, shows the website to blame. But in this case, the whole process was full with W entries (Sending Reply). So let's use dtrace to find the bad guy!
First I wanted to see what kind of function is used the most, so I counted all functions, which happen (live) in this process and output the summary as number:
# dtrace -n 'syscall:::entry { @[probefunc] = count() }' -p 31152
proc-stub:rd_event_enable
proc-stub:rd_errstr err=26
dtrace: description 'syscall:::entry ' matched 633 probes
^C
read 1095
munmap 2072
mmap 2073
close 2421
open 3047
stat 3225
fstat 4107
lstat 211141
So the lstat function is the one used the most often - 50x more than fstat.
Let's see who (which file) is using lstat so often.
# dtrace -n 'syscall::lstat:entry { @[copyinstr(arg0)] = count() } END {trunc(@, 5);}' -p 31152
proc-stub:rd_event_enable
proc-stub:rd_errstr err=26
dtrace: description 'syscall::lstat:entry ' matched 3 probes
^C
CPU ID FUNCTION:NAME
6 2 :END
/var/www/web7/html/example.com 22143
/var/www/web7 30231
/var/www/web7/html 30231
/var 37380
/var/www 37380
What I do in the above dtrace command is simply to check for all lstat functions which enter the system from PID 31152. Instead of counting the functions, this time I'm counting the filenames. Because the filename is used in user-space (Apache is a user process) and we're "debugging in kernel space" (simple description) the filename (a string) needs to be copied (copyinstr). arg0 is the filename itself. Instead of having a list of thousands of files, I wanna truncate the result and only show the top 5.
As this is a shared hosting and all websites are in /var/www, it is no surprise that this path is the most often called. But right behind is example.com which resided in /var/www/web7.
Let's check in the access log what's happening:
# tail -f /var/www/web7/log/access_log
203.170.85.135 - - [04/Oct/2012:20:01:46 +0200] "GET /administrator/index.php HTTP/1.1" 200 4167 "http://example.com/administrator/index.php" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.170.85.135 - - [04/Oct/2012:20:01:47 +0200] "GET /administrator/ HTTP/1.1" 200 4028 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.170.85.135 - - [04/Oct/2012:20:01:47 +0200] "POST /administrator/index.php HTTP/1.1" 303 - "http://example.com/administrator/index.php" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.170.85.135 - - [04/Oct/2012:20:01:49 +0200] "GET /administrator/index.php HTTP/1.1" 200 4167 "http://example.com/administrator/index.php" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.170.85.135 - - [04/Oct/2012:20:01:49 +0200] "GET /administrator/ HTTP/1.1" 200 4028 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.170.85.135 - - [04/Oct/2012:20:01:50 +0200] "POST /administrator/index.php HTTP/1.1" 303 - "http://example.com/administrator/index.php" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.170.85.135 - - [04/Oct/2012:20:01:51 +0200] "GET /administrator/index.php HTTP/1.1" 200 4167 "http://example.com/administrator/index.php" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.170.85.135 - - [04/Oct/2012:20:01:52 +0200] "GET /administrator/ HTTP/1.1" 200 4028 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
Oh indeed, the IP 203.170.85.135 makes several requests per second (GET and POST) on a Joomla Administrator login. Probably a brute-force attack. The HTTP Header shows bingbot though but that's probably fake, as I already saw this (see post GoogleBot used as HTTP Agend by Hackers/Botscripts). A quick nslookup will show if the IP is really from bing:
# nslookup 203.170.85.135
cpanel30.per.syra.net.au.
ERRR! Nope, that's definitely not bing. Looks like a hacked shared hosting server (due to the name of cpanel).
Let's see how many TCP connections there are from this IP:
# netstat -an | grep "203.170.85.135" | head
tcp6 0 0 1.2.3.4:80 203.170.85.135:37295 TIME_WAIT
tcp6 0 0 1.2.3.4:80 203.170.85.135:37291 TIME_WAIT
tcp6 0 0 1.2.3.4:80 203.170.85.135:37100 TIME_WAIT
tcp6 0 0 1.2.3.4:80 203.170.85.135:37068 TIME_WAIT
tcp6 0 0 1.2.3.4:80 203.170.85.135:37043 TIME_WAIT
# netstat -an | grep "203.170.85.135" -c
101
The TIME_WAIT entries explain why this Apache child process hasn't been killed as has been run for such a long time (and is still running). As long as there are TIME_WAIT's (or others as ESTABLISHED) in the child process, Apache (master) won't kill it.
I don't want this website be allowed to be hacked and I hate CPU intensive processes which run forever, so lets stop the attack through a simple .htaccess file and a deny:
# cat /var/www/web7/html/.htaccess
order allow,deny
deny from 203.170.85.135
allow from all
Shortly after that the TCP connections went down:
# netstat -an | grep "203.170.85.135" -c
0
Solved!
Claudio from Switzerland wrote on Oct 6th, 2012:
That's a very good idea, will apply that. Thanks.
Warper from wrote on Oct 6th, 2012:
To avoid this brute force attack, that is very popular nowadays, you should add this rule to your /administrator/.htaccess instead to ban every individual ip:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|Baiduspider) [NC]
RewriteRule .* - [R=403,L]
The bot attackers are using bingbot always as user referer, so I think this is a good way to avoid it.
Anyway, those bots have nothing to do in /administrator/ folder, so block them is safe.
Regards,
warper
Dennis B. from Germany wrote on Oct 5th, 2012:
Very nice example ! I'll bookmark this :)
AWS Android Ansible Apache Apple Atlassian BSD Backup Bash Bluecoat CMS Chef Cloud Coding Consul Containers CouchDB DB DNS Database Databases Docker ELK Elasticsearch Filebeat FreeBSD Galera Git GlusterFS Grafana Graphics HAProxy HTML Hacks Hardware Icinga Influx Internet Java KVM Kibana Kodi Kubernetes LVM LXC Linux Logstash Mac Macintosh Mail MariaDB Minio MongoDB Monitoring Multimedia MySQL NFS Nagios Network Nginx OSSEC OTRS Office OpenSearch PGSQL PHP Perl Personal PostgreSQL Postgres PowerDNS Proxmox Proxy Python Rancher Rant Redis Roundcube SSL Samba Seafile Security Shell SmartOS Solaris Surveillance Systemd TLS Tomcat Ubuntu Unix VMWare VMware Varnish Virtualization Windows Wireless Wordpress Wyse ZFS Zoneminder