Varnish panic / crash due to low workspace sizing



Last week I experienced several Varnish panics/crashes on a high-traffic website. Before this Varnish instance went into production, stress tests (with up to 3200 concurrent connections) all passed. But once it received production traffic, the first crash happened after around 20 minutes.
The log entry was unfortunately not very helpful:

Child (6970) Panic at: Tue, 12 Sep 2017 18:23:34 GMT
Assert error in VEP_Init(), cache/cache_esi_parse.c line 1033:
  Condition((vep) != 0) not true.
version = varnish-5.1.3 revision 05c5ac6b9, vrt api = 6.0
ident = Linux,4.4.0-89-generic,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
now = 3055882.671440 (mono), 1505240579.638393 (real)
Backtrace:
  0x438c25: /usr/sbin/varnishd() [0x438c25]
  0x4256c2: /usr/sbin/varnishd(VEP_Init+0x172) [0x4256c2]
  0x42324b: /usr/sbin/varnishd() [0x42324b]
  0x42a2f5: /usr/sbin/varnishd(VFP_Open+0x85) [0x42a2f5]
  0x427811: /usr/sbin/varnishd() [0x427811]
  0x4520a2: /usr/sbin/varnishd() [0x4520a2]
  0x45254b: /usr/sbin/varnishd() [0x45254b]
  0x7feff213a6ba: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7feff213a6ba]
  0x7feff1e703dd: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7feff1e703dd]
thread = (cache-worker)
thr.req = (nil) {
},
thr.busyobj = 0x7fefb779f020 {
  ws = 0x7fefb779f0a0 {
    OVERFLOWED id = "Bo",
    {s, f, r, e} = {0x7fefb77a0f60, +57472, (nil), +57496},
  },
  retries = 1, failed = 0, flags = {do_esi, is_gzip},
  http_conn = 0x7fefb77a8b30 {
    fd = 137 (@0x7fefcd3887f4),
    doclose = NULL,
    ws = 0x7fefb779f0a0 {
      [Already dumped, see above]
    },
    {rxbuf_b, rxbuf_e} = {0x7fefb77a8b90, 0x7fefb77a8eb4},
    {pipeline_b, pipeline_e} = {0x7fefb77a8eb4, 0x7fefb77aeef0},
    content_length = -1,
    body_status = chunked,
    first_byte_timeout = 60.000000,
    between_bytes_timeout = 60.000000,
  },
  filters = ESI_GZIP=0 GUNZIP=0
  director_req = 0x7fefcf76e2f8 {
    vcl_name = niobe,
    type = backend {
      display_name = reload_20170821_100741.niobe,
      ipv4 = 127.0.0.1,
      port = 8080,
      hosthdr = 127.0.0.1,
      health = healthy,
      admin_health = probe, changed = 1505067128.961975,
      n_conn = 82,
    },
  },
  director_resp = director_req,
  http[bereq] = 0x7fefb779f670 {
    ws = 0x7fefb779f0a0 {
      [Already dumped, see above]
[...]
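
By the way: instead of fishing such a panic out of syslog (where it arrives as a single long line), a running varnishd can also display its last panic through the management CLI:

claudio@varnish02:~$ varnishadm panic.show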

After a restart, the next crash happened just a few minutes later. Because this was a go-live, there was not much time to find the cause, so I switched the traffic to a standby machine on which I installed Varnish 4.1.
Note: As of this writing in September 2017, Varnish 4.1 is considered the "stable" version while 5.1 is the "latest" version.
But even with Varnish 4.1 there was a crash, although it took much longer:

Child (1826) Last panic at: Tue, 12 Sep 2017 20:41:32 GMT
"Assert error in http_EstimateWS(), cache/cache_http.c line 793:
  Condition((fm->hd[u].b) != 0) not true.
thread = (cache-worker)
version = varnish-4.1.8 revision d266ac5c6
ident = Linux,4.4.0-89-generic,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
now = 2968968.394245 (mono), 1505248890.997950 (real)
Backtrace:
  0x434122: pan_ic+0x182
  0x42d4f7: http_EstimateWS+0x3e7
  0x422a81: vbf_beresp2obj+0x51
  0x4233ff: vbf_fetch_thread+0x37f
  0x44a5d9: WRK_Thread+0x4c9
  0x44aa4b: pool_thread+0x2b
  0x7f8f0038d6ba: libpthread.so.0(+0x76ba) [0x7f8f0038d6ba]
  0x7f8f000c33dd: libc.so.6(clone+0x6d) [0x7f8f000c33dd]
busyobj = 0x7f8ee7124020 {
  ws = 0x7f8ee71240e0 {
    OVERFLOWED id = "Bo",
    {s,f,r,e} = {0x7f8ee7125f98,+57440,(nil),+57440},
  },
  refcnt = 2,
  retries = 1, failed = 1, state = 1,
  flags = {do_esi, is_gzip},
  director_req = 0x7f8eff553ab8 {
    vcl_name = niobe,
    type = backend {
      display_name = boot.niobe,
      ipv4 = 127.0.0.1,
      port = 8080,
      hosthdr = 127.0.0.1,
      health=healthy, admin_health=probe, changed=1505248796.1,
      n_conn = 34,
    },
  },
  director_resp = director_req,
  http[bereq] = 0x7f8ee71246a8 {
    ws[Bo] = 0x7f8ee71240e0,
[...]

Here the error message (Assert error in http_EstimateWS) is different from before. In hindsight, both panics already contained the decisive hint: the line OVERFLOWED id = "Bo" means the busyobj (backend side) workspace had overflowed. But at least Varnish 4.1 behaved much more stably and didn't crash every few minutes like Varnish 5.1, which finally gave me the time to troubleshoot.

During my research I came across an old bug (Trac ticket 1352, meanwhile migrated to GitHub issue 1352) from 2013. It wasn't exactly the same error, but it looked quite similar because of the cache_esi_parse file and the vep condition:

Child (22377) Panic message: Assert error in VEP_Init(), cache_esi_parse.c line 1001:
Condition((sp->wrk->vep) != 0) not true.
thread = (cache-worker)

The ticket was closed back then as "invalid", with a comment advising to increase sess_workspace:

"This assert is caused by the session workspace running out. When doing ESI deliveries, the session workspace usage will increase, especially when doing nested ESI includes. Increase the sess_workspace runtime parameter.
Regards, Martin Blix Grydeland"

Martin Blix Grydeland is a senior developer at Varnish Software. So if he pinpoints this error message to the session workspace, I believe him (who else would be able to understand the panic message anyway?).

While researching "sess_workspace" and its usage, I came across a helpful article (Varnish "sess_workspace" and why it is important) by Cosimo Streppone. It explains what this parameter means and how to set it at Varnish startup (-p sess_workspace=N).
This description also matches the environment of my Varnish setup:

"We found out we had to increase the default (16kb), especially since we're doing quite a bit of HTTP header copying and rewriting around."

Indeed, the website running through my panicking Varnish uses a lot more HTTP headers (and header rewrites) than a usual website.
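
On a Varnish 3.x installation this would have been fixed with a startup parameter along these lines (the value is purely illustrative):

/usr/sbin/varnishd -a :6081 -f /etc/varnish/default.vcl -p sess_workspace=262144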

I finally thought this could be the cause. However, the sess_workspace parameter does not exist anymore in Varnish 5.1. According to the Varnish 4.0 upgrade notes:

sess_workspace

In 3.0 it was often necessary to increase sess_workspace if a lot of VMODs, complex header operations or ESI were in use.
This is no longer necessary, because ESI scratch space happens elsewhere in 4.0.
If you are using a lot of VMODs, you may need to increase either workspace_backend and workspace_client based on where your VMOD is doing its work.
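
The upgrade notes mention two replacement parameters, workspace_client and workspace_backend. On a running instance, their current values can be checked through the management CLI (in Varnish 4.1 and 5.1 the default for both should be 64k, if I'm not mistaken):

claudio@varnish02:~$ varnishadm param.show workspace_client
claudio@varnish02:~$ varnishadm param.show workspace_backend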

So instead of sess_workspace, I had to increase these two parameters: workspace_backend and workspace_client. For this I adapted Varnish's systemd unit file:

claudio@varnish02:~$ cat /etc/systemd/system/varnish.service  | grep ExecStart
ExecStart=/usr/sbin/varnishd -a :6081 -T localhost:6082 -f /etc/varnish/niobe.vcl -S /etc/varnish/secret -s malloc,20480m -p workspace_client=256k -p workspace_backend=256k
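
Alternatively the same parameters can be changed on the fly with varnishadm, without a restart; such a change does not survive a restart though, which is why I put the values into the unit file:

claudio@varnish02:~$ varnishadm param.set workspace_client 256k
claudio@varnish02:~$ varnishadm param.set workspace_backend 256k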

This worked fine on Varnish 4.1, and after several hours without any panic/crash I decided to try the same with Varnish 5.1, too.
Once I had adapted these parameters, followed by a systemctl daemon-reload and a restart of Varnish, the panics/crashes stopped: I haven't seen a single one in several days.
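
To keep an eye on this in the future: Varnish counts workspace overflows, and these counters can be checked with varnishstat. If they keep increasing, the workspaces are (still) too small:

claudio@varnish02:~$ varnishstat -1 -f MAIN.ws_client_overflow -f MAIN.ws_backend_overflow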

