I previously wrote in the article "Googlebot freezes Apache and server load increase" about a weird interaction between Googlebot (Google's spider/crawler) and the Apache web server. Short summary: Googlebot opens a connection, Apache answers, but the connection is never closed properly. Even though new Apache child processes were spawned in the meantime, the old process stayed alive because of this open connection, which in the end caused a huge increase in server load.
After further analysis of this problem this weekend, I figured out that it was always the same website (out of hundreds) whose connections to Googlebot were not closed properly.
Once more, top shows which processes use the most CPU:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1242 www-data 20 0 706m 166m 51m S 126 4.2 175:39.77 apache2
32486 www-data 20 0 551m 130m 43m S 49 3.3 102:12.12 apache2
We use lsof to look for the connections held by these processes:
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
apache2 1145 root 3u IPv6 1215939860 TCP *:www (LISTEN)
apache2 1242 www-data 26u IPv6 1216870094 TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:55825 (CLOSE_WAIT)
apache2 1242 www-data 28u IPv6 1216884422 TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:42851 (CLOSE_WAIT)
apache2 16944 www-data 3u IPv6 1215939860 TCP *:www (LISTEN)
apache2 16944 www-data 26u IPv6 1217132916 TCP area-1.ch:www->crawl-66-249-71-197.googlebot.com:34349 (ESTABLISHED)
apache2 17207 www-data 3u IPv6 1215939860 TCP *:www (LISTEN)
apache2 17207 www-data 26u IPv6 1217132929 TCP area-1.ch:www->net66-219-58-45.static-customer.corenap.com:57036 (ESTABLISHED)
apache2 32486 www-data 34u IPv6 1216853652 TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:47004 (CLOSE_WAIT)
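With hundreds of connections, picking out the stuck ones by eye gets tedious. The following is a minimal sketch that counts CLOSE_WAIT sockets per PID. It assumes the whitespace-separated column layout shown in the listing above (PID in the second column, TCP state in parentheses at the end); the function name count_close_wait is made up for this example, and other lsof versions may print different columns.

```python
from collections import Counter

# Sample taken from the lsof listing above (shortened).
SAMPLE = """\
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
apache2 1145 root 3u IPv6 1215939860 TCP *:www (LISTEN)
apache2 1242 www-data 26u IPv6 1216870094 TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:55825 (CLOSE_WAIT)
apache2 1242 www-data 28u IPv6 1216884422 TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:42851 (CLOSE_WAIT)
apache2 16944 www-data 26u IPv6 1217132916 TCP area-1.ch:www->crawl-66-249-71-197.googlebot.com:34349 (ESTABLISHED)
apache2 32486 www-data 34u IPv6 1216853652 TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:47004 (CLOSE_WAIT)
"""


def count_close_wait(lsof_output: str) -> Counter:
    """Return a Counter mapping PID -> number of sockets in CLOSE_WAIT."""
    counts = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip the header line, blank lines, and anything without a numeric PID.
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        # Assumption: the TCP state is the last field, e.g. "(CLOSE_WAIT)".
        if fields[-1] == "(CLOSE_WAIT)":
            counts[int(fields[1])] += 1
    return counts


if __name__ == "__main__":
    for pid, number in count_close_wait(SAMPLE).most_common():
        print(pid, number)
```

Run against the sample, this reports only the two PIDs that top already flagged as CPU hogs.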
And now we take a look at the extended Apache status (ExtendedStatus On) for further information about the Apache child processes:
0-1 32486 0/34/438 W 20.03 7057 0 0.0 2.86 20.77 66.249.66.136 BADWEBSITE GET /category.php?id_category=2&id_lang=1 HTTP/1.1
2-1 1242 0/32/349 W 291.95 6310 0 0.0 5.52 21.18 66.249.66.136 BADWEBSITE GET /category.php?id_category=2&id_lang=1 HTTP/1.1
2-1 - 0/0/361 . 520.53 6133 40 0.0 0.00 24.98 84.22.49.237 AGOODWEBSITE POST /?_task=mail&_action=autocomplete HTTP/1.1
2-1 - 0/0/362 . 529.47 6133 0 0.0 0.00 22.47 88.65.227.133 ANOTHERGOODONE GET /images/klein-shop-vfb-ksc-poster-200x133.jpg HTTP/1.1
2-1 - 0/0/367 . 523.36 6133 274 0.0 0.00 16.91 84.22.49.237 AGOODWEBSITE POST / HTTP/1.1
2-1 - 0/0/363 . 511.97 6133 30 0.0 0.00 20.74 91.8.246.156 GOODWEBSITE2 GET /images/product_images/info_images/670_0.jpg HTTP/1.1
2-1 - 0/0/361 . 536.08 6132 458 0.0 0.00 30.53 207.46.199.23 BGOODWEBSITE GET /wp-content/uploads/shadowbox-js/d46661d2c927dea304addb1b47
2-1 - 0/0/359 . 515.40 6133 40 0.0 0.00 56.72 178.82.216.64 CGOODWEBSITE GET /_images/nav/beratung_b.gif HTTP/1.1
2-1 1242 0/18/344 W 4.81 6682 0 0.0 2.05 16.81 66.249.66.136 BADWEBSITE GET /category.php?id_category=2&id_lang=1 HTTP/1.1
In the extended status we find the same process IDs again that we saw in top and lsof. And what a surprise: it is always BADWEBSITE that is still holding the connection to Googlebot (66.249.66.136). Note the W, which stands for "Sending Reply".
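The same grouping can be done by a script instead of by eye. This is only a sketch: it assumes the rows are whitespace-separated in the order they appear in the status lines above (mode in the fourth field, seconds since the request started in the sixth, client in the eleventh, vhost in the twelfth). The function name stuck_workers and the min_seconds parameter are invented for this example.

```python
from collections import Counter

# A few rows copied from the extended status output above.
STATUS_ROWS = """\
0-1 32486 0/34/438 W 20.03 7057 0 0.0 2.86 20.77 66.249.66.136 BADWEBSITE GET /category.php?id_category=2&id_lang=1 HTTP/1.1
2-1 1242 0/32/349 W 291.95 6310 0 0.0 5.52 21.18 66.249.66.136 BADWEBSITE GET /category.php?id_category=2&id_lang=1 HTTP/1.1
2-1 - 0/0/361 . 520.53 6133 40 0.0 0.00 24.98 84.22.49.237 AGOODWEBSITE POST /?_task=mail&_action=autocomplete HTTP/1.1
2-1 1242 0/18/344 W 4.81 6682 0 0.0 2.05 16.81 66.249.66.136 BADWEBSITE GET /category.php?id_category=2&id_lang=1 HTTP/1.1
"""


def stuck_workers(status_rows: str, min_seconds: int = 0) -> Counter:
    """Count workers in state W ("Sending Reply") per (vhost, client)."""
    counts = Counter()
    for row in status_rows.splitlines():
        fields = row.split()
        # Assumption: at least 13 fields, laid out as in the rows above.
        if len(fields) < 13 or fields[3] != "W":
            continue
        # Sixth field: seconds since the start of the most recent request.
        if not fields[5].isdigit() or int(fields[5]) < min_seconds:
            continue
        client, vhost = fields[10], fields[11]
        counts[(vhost, client)] += 1
    return counts


if __name__ == "__main__":
    for (vhost, client), number in stuck_workers(STATUS_ROWS).most_common():
        print(vhost, client, number)
```

Setting min_seconds to something like 3600 filters out workers that are legitimately busy sending a reply and leaves only those that have been stuck for a long time.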
I don't know what the programmer of this website did in the file category.php, but I definitely don't want Googlebot to crawl these pages anymore. So the solution is to use a robots.txt to "talk" to Googlebot:
# Googlebot ain't allowed to check the pages here
User-Agent: Googlebot
Disallow: /
# All other bots go ahead
User-agent: *
Disallow:
Since then the server load has stayed steadily below 1 and there have been no more problems!
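Before relying on a robots.txt like the one above, it can be worth checking that the rules really mean what you intend. The sketch below feeds the same rules to Python's standard urllib.robotparser module and asks whether a given user agent may fetch the problematic URL. "SomeOtherBot" is a made-up agent name for illustration, and Python's parser is not guaranteed to interpret every rule exactly the way Googlebot does.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from this article.
ROBOTS_TXT = """\
# Googlebot ain't allowed to check the pages here
User-Agent: Googlebot
Disallow: /
# All other bots go ahead
User-agent: *
Disallow:
"""

# The URL that kept the Apache processes busy.
BAD_URL = "/category.php?id_category=2&id_lang=1"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # "SomeOtherBot" is a hypothetical crawler name.
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, parser.can_fetch(agent, BAD_URL))
```

An empty Disallow line means "nothing is disallowed", so every crawler except Googlebot still has full access.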