ELK stack not sending notifications anymore because of DNS cache

Written by Claudio Kuenzler - 0 comments

Published on November 20th 2018 - Listed in ELK Java DNS

In our primary ELK stack we enabled XPack to send notifications to a Slack channel:

# Slack config for data team
xpack.notification.slack:
account:
    datateam-watcher:
      url: https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXX
      message_defaults:
        from: watcher

But one day these notifications suddenly stopped. As you can see from the config, the xpack.notification is supposed to connect to https://hooks.slack.com.

When we checked the firewal logs we saw that ElasticSearch always connected to the same IP address, yet our DNS resolution check pointed towards another (new) IP address. Means: Slack has changed the public IP for hooks.slack.com. But ElasticSearch, which uses the local Java settings, wasn't aware of that change. This is because, by default, DNS is cached forever in the JVM (see DNS cache settings).

To change this, I checked $JAVA_HOME/jre/lib/security/java.security and the defaults were the following:

# The Java-level namelookup cache policy for successful lookups:
#
# any negative value: caching forever
# any positive value: the number of seconds to cache an address for
# zero: do not cache
#
# default value is forever (FOREVER). For security reasons, this
# caching is made forever when a security manager is set. When a security
# manager is not set, the default behavior in this implementation
# is to cache for 30 seconds.
#
# NOTE: setting this to anything other than the default value can have
# serious security implications. Do not set it unless
# you are sure you are not exposed to DNS spoofing attack.
#
#networkaddress.cache.ttl=-1

# The Java-level namelookup cache policy for failed lookups:
#
# any negative value: cache forever
# any positive value: the number of seconds to cache negative lookup results
# zero: do not cache
#
# In some Microsoft Windows networking environments that employ
# the WINS name service in addition to DNS, name service lookups
# that fail may take a noticeably long time to return (approx. 5 seconds).
# For this reason the default caching policy is to maintain these
# results for 10 seconds.
#
#
networkaddress.cache.negative.ttl=10

And changed it to use an internal DNS cache of 5 minutes (300s) but failed resolutions should not be cached at all:

# grep ttl $JAVA_HOME/jre/lib/security/java.security
networkaddress.cache.ttl=300
networkaddress.cache.negative.ttl=0

After this a restart of ElasticSearch was needed.

Add a comment

Show form to leave a comment

Comments (newest first)

No comments yet.

Blog Tags:

AWS Android Ansible Apache Apple Atlassian BSD Backup Bash Bluecoat CMS Chef Cloud Coding Consul Containers CouchDB DB DNS Databases Docker ELK Elasticsearch Filebeat FreeBSD Galera Git GlusterFS Grafana Graphics HAProxy HTML Hacks Hardware Icinga Influx Internet Java KVM Kibana Kodi Kubernetes LVM LXC Linux Logstash Mac Macintosh Mail MariaDB Minio MongoDB Monitoring Multimedia MySQL NFS Nagios Network Nginx OSSEC OTRS Observability Office OpenSearch PHP Perl Personal PostgreSQL PowerDNS Proxmox Proxy Python Rancher Rant Redis Roundcube SSL Samba Seafile Security Shell SmartOS Solaris Surveillance Systemd TLS Tomcat Ubuntu Unix VMware Varnish Virtualization Windows Wireless Wordpress Wyse ZFS Zoneminder