While doing performance troubleshooting analysis on a central web application infrastructure, one of the possible culprits was found in the backend running Varnish cache. A lot of CPU spikes and ESI errors were noticed in that Varnish instance.
Within 24 hours, around 400'000 ESI errors were registered and added to the counter (MAIN.esi_errors)!
Note: If you'd like to monitor and graph your Varnish server, read this article: Monitoring multiple Varnish instances and create a Grafana Varnish dashboard.
Spoiler alert: At the end this Varnish server was not the culprit causing the performance problems initially analyzed, however the findings in Varnish needed further clarification what's happening. It definitely didn't look normal.
As mentioned, a lot of ESI errors were found in varnishstat. Time to investigate on which exact path these errors appear. The varnishlog command was run for a couple of minutes to identify the BereqURL causing the errors:
root@varnish:~# varnishlog -q "(*Error) or (BerespStatus >= 500)" | egrep "(BereqURL|Storage|ESI_xmlerror)"
- BereqURL /includes/esiContent
- Storage malloc Transient
- ESI_xmlerror ERR after 15 VEP ended inside a tag
- BereqURL /includes/esiContent
- Storage malloc Transient
- ESI_xmlerror ERR after 15 VEP ended inside a tag
- BereqURL /includes/esiContent
- Storage malloc Transient
- ESI_xmlerror ERR after 15 VEP ended inside a tag
- BereqURL /includes/esiContent
- Storage malloc Transient
- ESI_xmlerror ERR after 15 VEP ended inside a tag
- BereqURL /includes/esiContent
- Storage malloc Transient
- ESI_xmlerror ERR after 15 VEP ended inside a tag
- BereqURL /includes/esiContent
- Storage malloc Transient
- ESI_xmlerror ERR after 15 VEP ended inside a tag
- BereqURL /includes/esiContent
- Storage malloc Transient
- ESI_xmlerror ERR after 15 VEP ended inside a tag
[...]
You probably guessed it right: It's always the same request path causing these ESI errors.
Even more interesting is the fact, that these requests resulting in errors are using transient memory (different memory block than defined with -s malloc parameter!). In case of stress situations (high traffic), this could lead to excessive additional memory usage (besides the malloc reserved for cached objects) and finally cause an OOM event. This definitely needs to be investigated and solved!
We already made our first experiences with ESI (Edge Side Includes) a couple of months ago, when we introduced a new technology (Nuxt JS) as the backend web application behind Varnish. This application uses ESI to load personalized content, therefore it could not be cached and a pass rule was created in VCL (see article Troubleshooting (and solving) high load spikes and timeouts on Varnish using backend with ESI for more details). Back then we also learned that Varnish tries to check the ESI content if it is valid XML. However in this particular scenario the included content does not serve XML.
When the new backend was introduced, we still ran on Varnish Cache 4.1. The Varnish ESI documentation of 4.1 only mentioned the following:
Doing ESI on JSON and other non-XML'ish content
Please
note that Varnish will peek at the included content. If it doesn't
start with a "<" Varnish assumes you didn't really mean to include it
and disregard it. You can alter this behaviour by setting the
'esi_syntax' parameter (see ref:ref-varnishd).
This information led to the use the following parameter:
feature=+esi_disable_xml_check
But obviously this wasn't enough.
Once again it was time to put Systems and Developer heads together. How exactly does the content look like which is included through ESI? How does the include syntax look like?
It turned out the developers used the ESI tags the following way in the source code:
<!--esi <esi:include src=“/includes/esiContent”/> -->
And the loaded/included content would be a JavaScript content.
Meanwhile moved on to Varnish 6.0, the ESI documentation was read again and interestingly this contains much more information now:
Doing ESI on JSON and other non-XML'ish content
Varnish will peek at the first byte of an object and if it is not a "<" Varnish assumes you didn't really mean to ESI process. You can alter this behaviour by:
param.set feature +esi_disable_xml_check
Ignoring BOM in ESI objects
If you backend spits out a Unicode Byte-Order-Mark as the first bytes of the reponse, the "<" check will fail unless you set:
param.set feature +esi_remove_bom
ESI on invalid XML
The ESI parser expects the XML to be reasonably well formed, but this may fail if you are ESI including non-XML files. You can make the ESI parser disrecard anything but ESI tags by setting:
param.set feature +esi_ignore_other_elements
ESI includes with HTTPS protocol
If ESI:include tags specify HTTPS protocol, it will be ignored by default, because varnish has no way to fetch it encryption enabled. If you want to treat HTTPS in ESI:include tags as if it were HTTP, set:
param.set feature +esi_ignore_https
Interesting, another parameter esi_ignore_other_elements is mentioned, specifically when using non-xml files!
The parameter was added using varnishadm:
root@varnish:~# varnishadm -n varnish-instance
200
-----------------------------
Varnish Cache CLI 1.0
-----------------------------
Linux,4.4.0-168-generic,x86_64,-junix,-smalloc,-sdefault,-hcritbit
varnish-plus-6.0.5r3 revision b70503bcbcb5ddf303a515ffc8c07637b6e9329b
Type 'help' for command list.
Type 'quit' to close CLI session.
varnish> param.set feature +esi_ignore_other_elements
200
varnish> param.show feature
200
feature
Value is: +esi_disable_xml_check,+esi_ignore_other_elements
Default is: none
Enable/Disable various minor features.
none Disable all features.
Use +/- prefix to enable/disable individual feature:
short_panic Short panic message.
wait_silo Wait for persistent silo.
no_coredump No coredumps.
esi_ignore_https Treat HTTPS as HTTP in
ESI:includes
esi_disable_xml_check Don't check of body looks like
XML
esi_ignore_other_elements Ignore non-esi XML-elements
esi_remove_bom Remove UTF-8 BOM
https_scheme Also split https URIs
http2 Support HTTP/2 protocol
http_date_postel Relax parsing of timestamps in
HTTP headers
After this (live) change, time to sit back and see whether or not this would help bring down the ESI errors.
Now that a nice Varnish dashboard in Grafana is available, the graph of the MAIN.esi_errors counter stopped growing already! Even after now ~20 hours later, the number of ESI errors did not increase anymore and the web application still works correctly.
Of course using varnishadm to set the feature parameter only set it temporarily. To make this change permanent (survive a restart of Varnish), the additional feature flag was added in the Systemd service unit file of Varnish (usually /etc/systemd/system/varnish.service if a single Varnish instance is used):
root@varnish:~# grep ExecStart /etc/systemd/system/varnish.service
ExecStart=/usr/sbin/varnishd -j unix,user=vcache -n varnish-leodegar -F -a :6081 -T localhost:6082 -f /etc/varnish/default.vcl -S /etc/varnish/secret -s malloc,4096m -p feature=+esi_disable_xml_check,+esi_ignore_other_elements
Followed, of course, by a Systemd daemon reload:
root@varnish:~# systemctl daemon-reload
Great, problem solved - right in time for the weekend!
No comments yet.
AWS Android Ansible Apache Apple Atlassian BSD Backup Bash Bluecoat CMS Chef Cloud Coding Consul Containers CouchDB DB DNS Database Databases Docker ELK Elasticsearch Filebeat FreeBSD Galera Git GlusterFS Grafana Graphics HAProxy HTML Hacks Hardware Icinga Influx Internet Java KVM Kibana Kodi Kubernetes LVM LXC Linux Logstash Mac Macintosh Mail MariaDB Minio MongoDB Monitoring Multimedia MySQL NFS Nagios Network Nginx OSSEC OTRS Observability Office OpenSearch PGSQL PHP Perl Personal PostgreSQL Postgres PowerDNS Proxmox Proxy Python Rancher Rant Redis Roundcube SSL Samba Seafile Security Shell SmartOS Solaris Surveillance Systemd TLS Tomcat Ubuntu Unix VMWare VMware Varnish Virtualization Windows Wireless Wordpress Wyse ZFS Zoneminder