Googlebot and Apache CLOSE_WAITs: SOLVED!


Listed in: Linux, Internet, Apache


I previously wrote, in the article "Googlebot freezes Apache and server load increase", about a weird interaction between Googlebot (Google's spider/crawler) and the Apache web server. Short summary: Googlebot opens a connection, Apache sends its answer, but the connection is never closed correctly. Even though new Apache child processes were spawned in the meantime, the old process stayed alive because of this open connection, which in the end caused a huge increase in server load.

After further analysis of the problem this weekend, I figured out that it was always the same website (out of hundreds) that failed to properly close its connection to Googlebot.

Once more, top shows which processes use the most CPU:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1242 www-data  20   0  706m 166m  51m S  126  4.2 175:39.77 apache2
32486 www-data  20   0  551m 130m  43m S   49  3.3 102:12.12 apache2

Next, lsof lists the connections held by these processes:

COMMAND   PID     USER   FD   TYPE     DEVICE SIZE NODE NAME
apache2  1145     root    3u  IPv6 1215939860       TCP *:www (LISTEN)
apache2  1242 www-data   26u  IPv6 1216870094       TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:55825 (CLOSE_WAIT)
apache2  1242 www-data   28u  IPv6 1216884422       TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:42851 (CLOSE_WAIT)
apache2 16944 www-data    3u  IPv6 1215939860       TCP *:www (LISTEN)
apache2 16944 www-data   26u  IPv6 1217132916       TCP area-1.ch:www->crawl-66-249-71-197.googlebot.com:34349 (ESTABLISHED)
apache2 17207 www-data    3u  IPv6 1215939860       TCP *:www (LISTEN)
apache2 17207 www-data   26u  IPv6 1217132929       TCP area-1.ch:www->net66-219-58-45.static-customer.corenap.com:57036 (ESTABLISHED)
apache2 32486 www-data   34u  IPv6 1216853652       TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:47004 (CLOSE_WAIT)
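To avoid counting by eye, the stuck connections can be tallied per process. The following is only a minimal sketch in Python: it assumes the lsof output has been captured as text, and the function name count_close_wait is my own invention, not part of lsof or Apache.

```python
from collections import Counter

# A few lines copied from the lsof listing above.
SAMPLE = """\
COMMAND   PID     USER   FD   TYPE     DEVICE SIZE NODE NAME
apache2  1145     root    3u  IPv6 1215939860       TCP *:www (LISTEN)
apache2  1242 www-data   26u  IPv6 1216870094       TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:55825 (CLOSE_WAIT)
apache2  1242 www-data   28u  IPv6 1216884422       TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:42851 (CLOSE_WAIT)
apache2 16944 www-data   26u  IPv6 1217132916       TCP area-1.ch:www->crawl-66-249-71-197.googlebot.com:34349 (ESTABLISHED)
apache2 32486 www-data   34u  IPv6 1216853652       TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:47004 (CLOSE_WAIT)
"""


def count_close_wait(lsof_output: str) -> Counter:
    """Count connections in CLOSE_WAIT per PID (hypothetical helper)."""
    counts = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip the header and anything without a numeric PID in column 2.
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        # lsof prints the TCP state in parentheses as the last field.
        if fields[-1] == "(CLOSE_WAIT)":
            counts[int(fields[1])] += 1
    return counts


if __name__ == "__main__":
    for pid, n in count_close_wait(SAMPLE).most_common():
        print(f"PID {pid}: {n} connection(s) in CLOSE_WAIT")
```

On the sample lines this reports two connections for PID 1242 and one for PID 32486, the same two processes that top showed at the head of the list.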

And now we take a look at the extended Apache status (ExtendedStatus on) for further Apache child information:

0-1    32486    0/34/438    W     20.03    7057    0    0.0    2.86    20.77     66.249.66.136    BADWEBSITE    GET /category.php?id_category=2&id_lang=1 HTTP/1.1

2-1    1242    0/32/349    W     291.95    6310    0    0.0    5.52    21.18     66.249.66.136    BADWEBSITE    GET /category.php?id_category=2&id_lang=1 HTTP/1.1

2-1    -    0/0/361    .     520.53    6133    40    0.0    0.00    24.98     84.22.49.237    AGOODWEBSITE    POST /?_task=mail&_action=autocomplete HTTP/1.1

2-1    -    0/0/362    .     529.47    6133    0    0.0    0.00    22.47     88.65.227.133    ANOTHERGOODONE    GET /images/klein-shop-vfb-ksc-poster-200x133.jpg HTTP/1.1

2-1    -    0/0/367    .     523.36    6133    274    0.0    0.00    16.91     84.22.49.237    AGOODWEBSITE    POST / HTTP/1.1

2-1    -    0/0/363    .     511.97    6133    30    0.0    0.00    20.74     91.8.246.156    GOODWEBSITE2    GET /images/product_images/info_images/670_0.jpg HTTP/1.1

2-1    -    0/0/361    .     536.08    6132    458    0.0    0.00    30.53     207.46.199.23    BGOODWEBSITE    GET /wp-content/uploads/shadowbox-js/d46661d2c927dea304addb1b47

2-1    -    0/0/359    .     515.40    6133    40    0.0    0.00    56.72     178.82.216.64    CGOODWEBSITE    GET /_images/nav/beratung_b.gif HTTP/1.1

2-1    1242    0/18/344    W     4.81    6682    0    0.0    2.05    16.81     66.249.66.136    BADWEBSITE    GET /category.php?id_category=2&id_lang=1 HTTP/1.1

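In the extended status we find the same process IDs again that we saw in top and lsof. And what a surprise: it is always BADWEBSITE that is still holding the connection to Googlebot (66.249.66.136). Take a look at the W, which stands for "Sending Reply". CLOSE_WAIT means that Googlebot has already closed its end of the connection, while the Apache process has not yet closed its own side.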
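The same filtering can be done on the status rows. This is a rough sketch only: it assumes the column order seen in the rows above (Srv, PID, Acc, M, CPU, SS, Req, Conn, Child, Slot, Client, VHost, Request), which may differ between Apache versions, and the names Worker and workers_in_mode are hypothetical.

```python
from typing import NamedTuple

# Three rows copied from the extended status output above.
SAMPLE_ROWS = """\
0-1    32486    0/34/438    W     20.03    7057    0    0.0    2.86    20.77     66.249.66.136    BADWEBSITE    GET /category.php?id_category=2&id_lang=1 HTTP/1.1
2-1    1242    0/32/349    W     291.95    6310    0    0.0    5.52    21.18     66.249.66.136    BADWEBSITE    GET /category.php?id_category=2&id_lang=1 HTTP/1.1
2-1    -    0/0/361    .     520.53    6133    40    0.0    0.00    24.98     84.22.49.237    AGOODWEBSITE    POST /?_task=mail&_action=autocomplete HTTP/1.1
"""


class Worker(NamedTuple):
    pid: int
    client: str
    vhost: str
    request: str


def workers_in_mode(status_rows: str, mode: str = "W") -> list[Worker]:
    """Return the workers whose M column equals `mode` (assumed layout)."""
    found = []
    for line in status_rows.splitlines():
        fields = line.split()
        # Need at least the 12 columns up to VHost; idle slots show "-" as PID.
        if len(fields) < 12 or fields[3] != mode or not fields[1].isdigit():
            continue
        found.append(
            Worker(int(fields[1]), fields[10], fields[11], " ".join(fields[12:]))
        )
    return found


if __name__ == "__main__":
    for w in workers_in_mode(SAMPLE_ROWS):
        print(w.pid, w.client, w.vhost, w.request)
```

Matching the PIDs returned here against the PIDs holding CLOSE_WAIT connections in lsof is what points the finger at one virtual host.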

I don't know what the programmer of this website did in the file category.php, but I definitely don't want Googlebot to crawl these pages anymore. So the solution is to use a robots.txt to "talk" to Googlebot:

# Googlebot ain't allowed to check the pages here
User-Agent: Googlebot
Disallow: /

# All other bots go ahead
User-agent: *
Disallow:
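Before relying on these rules, they can be checked offline. The sketch below feeds the robots.txt above to Python's standard urllib.robotparser module. The second bot name, "bingbot", is only an illustrative stand-in for "all other bots", and Googlebot's own parser may of course behave differently from Python's in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from above, verbatim.
ROBOTS_TXT = """\
# Googlebot ain't allowed to check the pages here
User-Agent: Googlebot
Disallow: /

# All other bots go ahead
User-agent: *
Disallow:
"""

# The URL that kept the Apache workers busy.
URL = "/category.php?id_category=2&id_lang=1"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    rp = build_parser(ROBOTS_TXT)
    for agent in ("Googlebot", "bingbot"):
        verdict = "allowed" if rp.can_fetch(agent, URL) else "blocked"
        print(f"{agent}: {verdict}")
```

With these rules Googlebot is blocked from the whole site while any other user agent is still allowed, which is exactly the intention here.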

Since then the server load has stayed steadily below 1 and there have been no more problems!

