Last week I experienced several Varnish panics/crashes on a high-traffic website. Before this Varnish was taken into production, stress-tests (with up to 3200 concurrent connections) were all successful. However as soon as this Varnish got into production, it took around 20 minutes and the first crash happened.
The log entry was unfortunately not very helpful:
Child (6970) Panic at: Tue, 12 Sep 2017 18:23:34 GMT#012Assert error in VEP_Init(), cache/cache_esi_parse.c line 1033:#012 Condition((vep) != 0) not true.#012version = varnish-5.1.3 revision 05c5ac6b9, vrt api = 6.0#012ident = Linux,4.4.0-89-generic,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll#012now = 3055882.671440 (mono), 1505240579.638393 (real)#012Backtrace:#012 0x438c25: /usr/sbin/varnishd() [0x438c25]#012 0x4256c2: /usr/sbin/varnishd(VEP_Init+0x172) [0x4256c2]#012 0x42324b: /usr/sbin/varnishd() [0x42324b]#012 0x42a2f5: /usr/sbin/varnishd(VFP_Open+0x85) [0x42a2f5]#012 0x427811: /usr/sbin/varnishd() [0x427811]#012 0x4520a2: /usr/sbin/varnishd() [0x4520a2]#012 0x45254b: /usr/sbin/varnishd() [0x45254b]#012 0x7feff213a6ba: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7feff213a6ba]#012 0x7feff1e703dd: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7feff1e703dd]#012thread = (cache-worker)#012thr.req = (nil) {#012},#012thr.busyobj = 0x7fefb779f020 {#012 ws = 0x7fefb779f0a0 {#012 OVERFLOWED id = \"Bo\",#012 {s, f, r, e} = {0x7fefb77a0f60, +57472, (nil), +57496},#012 },#012 retries = 1, failed = 0, flags = {do_esi, is_gzip},#012 http_conn = 0x7fefb77a8b30 {#012 fd = 137 (@0x7fefcd3887f4),#012 doclose = NULL,#012 ws = 0x7fefb779f0a0 {#012 [Already dumped, see above]#012 },#012 {rxbuf_b, rxbuf_e} = {0x7fefb77a8b90, 0x7fefb77a8eb4},#012 {pipeline_b, pipeline_e} = {0x7fefb77a8eb4, 0x7fefb77aeef0},#012 content_length = -1,#012 body_status = chunked,#012 first_byte_timeout = 60.000000,#012 between_bytes_timeout = 60.000000,#012 },#012 filters = ESI_GZIP=0 GUNZIP=0#012 director_req = 0x7fefcf76e2f8 {#012 vcl_name = niobe,#012 type = backend {#012 display_name = reload_20170821_100741.niobe,#012 ipv4 = 127.0.0.1,#012 port = 8080,#012 hosthdr = 127.0.0.1,#012 health = healthy,#012 admin_health = probe, changed = 1505067128.961975,#012 n_conn = 82,#012 },#012 },#012 director_resp = director_req,#012 http[bereq] = 0x7fefb779f670 {#012 ws = 0x7fefb779f0a0 {#012 [Already dumped, see above]
[...]
After a restart, the next crash happened just a few minutes later. Because this was a going-live, there was not much time to find the cause and I switched the traffic to a standby machine on which I installed Varnish 4.1.
Note: As of this writing in September 2017, Varnish 4.1 is considered the "stable" version while 5.1 is the "latest" version.
But even with Varnish 4.1 there was a crash, although it took much longer:
Child (1826) Last panic at: Tue, 12 Sep 2017 20:41:32 GMT#012"Assert error in http_EstimateWS(), cache/cache_http.c line 793:#012 Condition((fm->hd[u].b) != 0) not true.#012thread = (cache-worker)#012version = varnish-4.1.8 revision d266ac5c6#012ident = Linux,4.4.0-89-generic,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll#012now = 2968968.394245 (mono), 1505248890.997950 (real)#012Backtrace:#012 0x434122: pan_ic+0x182#012 0x42d4f7: http_EstimateWS+0x3e7#012 0x422a81: vbf_beresp2obj+0x51#012 0x4233ff: vbf_fetch_thread+0x37f#012 0x44a5d9: WRK_Thread+0x4c9#012 0x44aa4b: pool_thread+0x2b#012 0x7f8f0038d6ba: libpthread.so.0(+0x76ba) [0x7f8f0038d6ba]#012 0x7f8f000c33dd: libc.so.6(clone+0x6d) [0x7f8f000c33dd]#012busyobj = 0x7f8ee7124020 {#012 ws = 0x7f8ee71240e0 {#012 OVERFLOWED id = \"Bo\",#012 {s,f,r,e} = {0x7f8ee7125f98,+57440,(nil),+57440},#012 },#012 refcnt = 2,#012 retries = 1, failed = 1, state = 1,#012 flags = {do_esi, is_gzip},#012 director_req = 0x7f8eff553ab8 {#012 vcl_name = niobe,#012 type = backend {#012 display_name = boot.niobe,#012 ipv4 = 127.0.0.1,#012 port = 8080,#012 hosthdr = 127.0.0.1,#012 health=healthy, admin_health=probe, changed=1505248796.1,#012 n_conn = 34,#012 },#012 },#012 director_resp = director_req,#012 http[bereq] = 0x7f8ee71246a8 {#012 ws[Bo] = 0x7f8ee71240e0,#012
Here the error message is different (Assert error in http_EstimateWS) than before. But at least Varnish 4.1 behaved much more stable and didn't crash every few minutes like Varnish 5.1. This finally gave me the time to troubleshoot.
On my research I came across an old bug (trac ticket 1352, meanwhile migrated into Github issue 1352) from 2013 which wasn't exactly the same error but it looked kind of similar due to the cache_esi_parse file and vep condition:
Child (22377) Panic message: Assert error in VEP_Init(), cache_esi_parse.c line 1001:
Condition((sp->wrk->vep) != 0) not true.
thread = (cache-worker)
The ticket was closed back then as "invalid" with a comment, to increase the sess_workspace:
"This assert is caused by the session workspace running out. When doing ESI deliveries, the session workspace usage will increase, especially when doing nested ESI includes. Increase the sess_workspace runtime parameter.
Regards, Martin Blix Grydeland"
Martin Blix Grydeland is a senior developer at Varnish. So if he pinpoints this error message to the session workspace, I believe him (who else is able to understand the panic message anyway?).
Whlie researching "sess_workspace" and its usage, I came across a helpful article (Varnish "sess_workspace" and why it is important) by Cosimo Streppone. It basically explains what this parameter means and how to set it during Varnish startup (-p sess_workspace=N).
This description also matches the environment of my Varnish setup:
"We found out we had to increase the default (16kb), especially since we're doing quite a bit of HTTP header copying and rewriting around."
Indeed the website running through my panicked Varnish also has a lot more HTTP headers (and rewrites) than a usual website.
I finally thought this could be the cause, however the sess_workspace parameter does not exist anymore in Varnish 5.1. As by the upgrade notes of Varnish 4.0:
sess_workspace
In 3.0 it was often necessary to increase sess_workspace if a lot of VMODs, complex header operations or ESI were in use.
This is no longer necessary, because ESI scratch space happens elsewhere in 4.0.
If you are using a lot of VMODs, you may need to increase either workspace_backend and workspace_client based on where your VMOD is doing its work.
So instead I have to increase two parameters: workspace_backend and workspace_client. For this I adapted the Varnish's SystemD unit file:
claudio@varnish02:~$ cat /etc/systemd/system/varnish.service | grep ExecStart
ExecStart=/usr/sbin/varnishd -a :6081 -T localhost:6082 -f /etc/varnish/niobe.vcl -S /etc/varnish/secret -s malloc,20480m -p workspace_client=256k -p workspace_backend=256k
This worked fine on Varnish 4.1 and after several hours without any panic/crash I decided to try the same with Varnish 5.1, too.
Once I adapted these parameters, followed by a systemctl daemon-reload and restart of Varnish, I haven't seen any Varnish panics/crashes anymore for several days.
No comments yet.
AWS Android Ansible Apache Apple Atlassian BSD Backup Bash Bluecoat CMS Chef Cloud Coding Consul Containers CouchDB DB DNS Database Databases Docker ELK Elasticsearch Filebeat FreeBSD Galera Git GlusterFS Grafana Graphics HAProxy HTML Hacks Hardware Icinga Influx Internet Java KVM Kibana Kodi Kubernetes LVM LXC Linux Logstash Mac Macintosh Mail MariaDB Minio MongoDB Monitoring Multimedia MySQL NFS Nagios Network Nginx OSSEC OTRS Observability Office OpenSearch PGSQL PHP Perl Personal PostgreSQL Postgres PowerDNS Proxmox Proxy Python Rancher Rant Redis Roundcube SSL Samba Seafile Security Shell SmartOS Solaris Surveillance Systemd TLS Tomcat Ubuntu Unix VMWare VMware Varnish Virtualization Windows Wireless Wordpress Wyse ZFS Zoneminder