ELK stack not sending notifications anymore because of DNS cache


In our primary ELK stack we enabled XPack to send notifications to a Slack channel:

# Slack config for data team
xpack.notification.slack:
  account:
    datateam-watcher:
      url: https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXX
      message_defaults:
        from: watcher

But one day these notifications suddenly stopped. As the config shows, xpack.notification is supposed to connect to https://hooks.slack.com.

When we checked the firewall logs, we saw that Elasticsearch always connected to the same IP address, yet our own DNS lookup returned a different (new) IP address. In other words, Slack had changed the public IP for hooks.slack.com, but Elasticsearch, which uses the local Java settings, was not aware of that change. The reason is that the JVM caches successful DNS lookups forever when a security manager is set, and Elasticsearch runs with a security manager (see DNS cache settings).
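To see which cache policy a JVM is actually running with, the two security properties can be read at runtime. The following is only a minimal sketch: the class name DnsCacheCheck and the describe helper are made up for illustration, and it reports the settings of the JVM it runs in, not those of an already running Elasticsearch process.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

class DnsCacheCheck {
    // Translate a TTL value into the wording used by the java.security comments.
    static String describe(String ttl) {
        if (ttl == null) {
            return "unset (JVM default applies)";
        }
        int seconds = Integer.parseInt(ttl.trim());
        if (seconds < 0) {
            return "cache forever";
        }
        if (seconds == 0) {
            return "do not cache";
        }
        return "cache for " + seconds + " seconds";
    }

    public static void main(String[] args) throws UnknownHostException {
        // Host to resolve; defaults to the Slack webhook host from this post.
        String host = args.length > 0 ? args[0] : "hooks.slack.com";

        System.out.println("successful lookups: "
                + describe(Security.getProperty("networkaddress.cache.ttl")));
        System.out.println("failed lookups:     "
                + describe(Security.getProperty("networkaddress.cache.negative.ttl")));

        // Addresses as this JVM resolves them right now.
        for (InetAddress address : InetAddress.getAllByName(host)) {
            System.out.println(host + " -> " + address.getHostAddress());
        }
    }
}
```

Comparing the printed addresses with the destination IP in the firewall logs shows whether a long-running process is still holding on to an old address.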

To change this, I checked $JAVA_HOME/jre/lib/security/java.security and the defaults were the following:

# The Java-level namelookup cache policy for successful lookups:
#
# any negative value: caching forever
# any positive value: the number of seconds to cache an address for
# zero: do not cache
#
# default value is forever (FOREVER). For security reasons, this
# caching is made forever when a security manager is set. When a security
# manager is not set, the default behavior in this implementation
# is to cache for 30 seconds.
#
# NOTE: setting this to anything other than the default value can have
#       serious security implications. Do not set it unless
#       you are sure you are not exposed to DNS spoofing attack.
#
#networkaddress.cache.ttl=-1



# The Java-level namelookup cache policy for failed lookups:
#
# any negative value: cache forever
# any positive value: the number of seconds to cache negative lookup results
# zero: do not cache
#
# In some Microsoft Windows networking environments that employ
# the WINS name service in addition to DNS, name service lookups
# that fail may take a noticeably long time to return (approx. 5 seconds).
# For this reason the default caching policy is to maintain these
# results for 10 seconds.
#
#
networkaddress.cache.negative.ttl=10

I changed these settings so that successful lookups are cached for 5 minutes (300 seconds) and failed lookups are not cached at all:

# grep ttl $JAVA_HOME/jre/lib/security/java.security
networkaddress.cache.ttl=300
networkaddress.cache.negative.ttl=0

Elasticsearch then has to be restarted for the new settings to take effect.
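As a cross-check of the grep above, the edited file can also be parsed the way Java reads property files, which makes it obvious whether a line is still commented out. This is a rough sketch under assumptions: the class name JavaSecurityTtl is hypothetical, and the path to java.security is passed in as an argument instead of being guessed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

class JavaSecurityTtl {
    // Parse java.security content; lines starting with '#' are ignored as comments.
    static Properties parse(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: first argument is the file edited in this post,
        // e.g. $JAVA_HOME/jre/lib/security/java.security
        if (args.length != 1) {
            System.err.println("usage: java JavaSecurityTtl.java <path to java.security>");
            System.exit(2);
        }
        try (Reader reader = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.ISO_8859_1)) {
            Properties props = parse(reader);
            String[] keys = {"networkaddress.cache.ttl", "networkaddress.cache.negative.ttl"};
            for (String key : keys) {
                System.out.println(key + "=" + props.getProperty(key, "(commented out or missing)"));
            }
        }
    }
}
```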

