Today I've seen a strange SMF (Service Management Facility) issue on a SmartOS machine running joyent_20120614T001014Z.
All svc commands, like svcs or svcadm, didn't run anymore:
# svcs
svcs: Could not bind to repository server: repository server unavailable. Exiting.
Short summary: A reboot of the zone solved the problem and the commands like svcs were running again.
But I wanted to find out the reason for this of course. So if you have a little bit of time and curiosity, stick with me...
On my online research I came across Solaris bug reports, mailing list entries, etc. but none had the actual solution to the problem. One mailing list post however pointed me to the right direction by asking whether "svc.configd" is running or not. To my confirmation, configd was not running:
# ps -ef | grep svc
root 72946 72871 0 Oct 30 ? 5569:20 /lib/svc/bin/svc.startd
root 73017 72871 0 Oct 30 ? 0:05 /sbin/sh /lib/svc/method/svc-dlmgmtd
root 54594 51575 0 10:34:58 pts/2 0:00 grep svc
There's svc.startd but as mentioned, there should also be a process /lib/svc/bin/svc.configd.
I tried to manually start the configd (config daemon) to see what would happen:
# /lib/svc/bin/svc.configd
# svcs
svc.configd: Fatal error: invalid integer "-2147483463" in field "lnk_prop_id"
svcs: svcs.c:389: Unexpected libscf error: connection to repository broken. Exiting.
The manual start actually worked without an error and the process appeared again in the processlist. But as soon as I re-used svcs, the configd process crashed and showed the error above.
This means that there is a problem with the repository. And now it get's a little bit complicated if no SMF knowledge is around.
Luckily I found this blog post (Racing in the depths of SMF) which explains the whole SMF architecture in a technical and graphical way.
What's necessary to know for this case is:
At the boot of the machine, startd and configd are launched. Commands like svcs are using startd while startd itself talks with configd. configd is using a SQLite database in the background to store service states, variables, etc.
Now that I knew there is an SQLite database in the background of configd, the error message makes much more sense (field "lnk_prop_id"). It looks like an invalid integer (in this case a negative number) was inserted into the database which configd doesn't seem to handle well and crashes. On the system I found two SQLite databases which could be responsible for this error message:
# find /etc/svc -name "*.db"
/etc/svc/repository.db
/etc/svc/volatile/svc_nonpersist.db
I began my research with /etc/svc/repository.db but I can already tell you this was not the one containing the negative number, so I skip that part and go directly to /etc/svc/volatile/svc_nonpersist.db.
SmartOS comes with an embedded SQLite client. I opened the database using /lib/svc/bin/sqlite and first took a look at the existing tables:
# /lib/svc/bin/sqlite /etc/svc/volatile/svc_nonpersist.db
SQLite version 2.8.15-repcached-joyent_20120614T001014Z
Enter ".help" for instructions
sqlite> .tables
id_tbl pg_tbl prop_lnk_tbl schema_version value_tbl
The table called prop_lnk_tbl sounds promising as the svcs error message pointed to a field called "lnk_prop_id". So I wanted to see what kind of columns this table holds:
sqlite> explain select * from prop_lnk_tbl;
0|ColumnName|0|0|lnk_prop_id
1|ColumnName|1|0|lnk_pg_id
2|ColumnName|2|0|lnk_gen_id
3|ColumnName|3|0|lnk_prop_name
4|ColumnName|4|0|lnk_prop_type
5|ColumnName|5|1|lnk_val_id
[...]
Seems I got the right table - the mentioned field/column lnk_prop_id is here. Now I just wanted to see if there are such entries as seen in the error message:
sqlite> select * from prop_lnk_tbl where lnk_prop_id < 0;
-2147483463|151|247577566|start_method_waitstatus|i|577678377
-2147483230|151|247577566|auxiliary_state|s|577678403
-2147483155|151|247577566|next_state|s|577678404
-2147483071|151|247577566|state|s|577678405
-2147483018|151|247577566|state_timestamp|t|577678406
Jackpot! There are actually several entries with a negative number. The first one matches the number of the error message. Unfortunately from this information I could not see which service is responsible for these entries. I tried to get search three tables with referencing the lnk_pg_id (151) but still couldn't find the name. Here's for documentation purpose only, might help someone for a similar case:
table prop_lnk_tbl
Query: select * from prop_lnk_tbl where lnk_prop_id < 0;
lnk_prop_id | lnk_pg_id |
lnk_gen_id | lnk_prop_name | lnk_prop_type | lnk_val_id |
-2147483463 | 151 | 247577566 | start_method_waitstatus | i |
577678377 |
-2147483230 | 151 | 247577566 | auxiliary_state | s |
577678403 |
-2147483155 | 151 | 247577566 | next_state | s |
577678404 |
-2147483071 | 151 | 247577566 | state |
s |
577678405 |
-2147483018 | 151 | 247577566 | state_timestamp | t |
577678406 |
The field lnk_pg_id references to the table pg_tbl. lnk_val_id references to the table value_tbl.
table pg_tbl
Query: sqlite> select * from pg_tbl where pg_id = '151';
pg_id |
pg_parent_id |
pg_name | pg_type | pg_flags | pg_gen_id |
151 | 823 | restarter |
framework | 1 |
247577566 |
At first I thought, I finally found the responsible service name, but there were actually many entries restarter/framework in the database.
table value_tbl
Query: select * from value_tbl where value_id = '577678377' OR value_id = '577678403' OR value_id = '577678404' OR value_id =
value_id | value_type |
value_value | value_order |
577678377 | i | 0 | 0 |
577678403 | s | ct_ev_exit | 0 |
577678404 | s | offline | 0 |
577678405 | s | offline | 0 |
577678406 | t | 577678405 | 0 |
At this point I knew I had two ways how to solve the problem:
1) Try it with a simple reboot
and if that wouldn't work:
2) Manually delete these entries from the database
As written at the begin of the post, the reboot worked for me and svcs ran again without error. I also checked the database for negative numbers and they were all gone.
I'm sure though that the second option would have worked in case the reboot wouldn't have helped.
Update December 3rd 2013:
Today I ran into this very same issue. Time to test if option 2 (Manually delete these entries from the database) would work - and to answer the critics of Ahmed's comment ;-):
Search for the negative value entries in the SMF database:
/lib/svc/bin/sqlite /etc/svc/volatile/svc_nonpersist.db
sqlite> select * from prop_lnk_tbl where lnk_prop_id < 0;
-2147483513|128|253154846|state_timestamp|t|590693950
sqlite> select * from pg_tbl where pg_id = '128';
128|819|restarter|framework|1|253154846
sqlite> select * from value_tbl where value_id = '590693950';
590693950|t|1386059595.646420000|0
sqlite> delete from prop_lnk_tbl where lnk_prop_id < 0;
sqlite> delete from value_tbl where value_id = '590693950';
Verification if the negative values are gone and then exit:
sqlite> select * from prop_lnk_tbl where lnk_prop_id < 0;
sqlite> .exit
Would svcs work again? Let's try it:
svcs
svcs: Could not bind to repository server: repository server unavailable.Exiting.
Narf. Let's try a restart (as mentioned in the troubleshooting above):
/lib/svc/bin/svc.configd
Another try:
svcs
STATE STIME FMRI
online* - svc:/site/wsapp:movelogspam
online Sep_12 svc:/system/svc/restarter:default
online Sep_12 svc:/system/early-manifest-import:default
online Sep_12 svc:/system/filesystem/root:default
online Sep_12 svc:/network/datalink-management:default
online Sep_12 svc:/network/ip-interface-management:default
[...]
Well here we go. Now I'm kind of in the mood to say "See, I told you so!".
Italo Penna from wrote on Dec 5th, 2022:
Thank you very much for this detailed content.
Slava B from New York wrote on Jul 13th, 2017:
It seems on our zone the /var run out of space and created the problem. Was enough to restart the svc.configd though to take care of it.
Calixto Mercado from Bolivia wrote on Jun 25th, 2016:
I found this... it is easiest
Restoring svc repository
svcs: Could not bind to repository server: repository server unavailable. Exiting.
What this means is that your SMF repo might be currupt and broken. Luckily backups are kept and you can go on restoring the repository by running the following command:
# /lib/svc/bin/restore_repository
A menu of options will come up asking for what restore point you would like to go back to by default its “boot”, and that will go to the most recent one – thats usually what I do. Just hit ENTER and it will restore it. The on screen menu will make sense and explain your options and show all of the restore points.
It will then auto reboot your system, or you will have to manually do it
# reboot
francois scheurer from zurich wrote on Jul 17th, 2015:
(previous comment continued..)
delete from value_tbl where value_tbl.value_id in (select lnk_val_id from prop_lnk_tbl where lnk_prop_id < 0);
delete from prop_lnk_tbl where lnk_prop_id < 0;
.exit
/lib/svc/bin/svc.configd
svcs -xv
francois scheurer from zurich wrote on Jul 17th, 2015:
Thx Claudio, great research!
It worked! but i also needed to kill /lib/svc/bin/svc.configd that had zombies, else svcs was working but enabling/disabling services was not (it did nothing without showing an error):
svcs -xv
svcs: Could not bind to repository server: repository server unavailable. Exiting.
prstat -s rss
pgrep -lf svc.configd
pkill -f /lib/svc/bin/svc.configd
ptree
pkill -f /lib/svc/bin/svc.startd
/lib/svc/bin/sqlite /etc/svc/volatile/svc_nonpersist.db .dump >/root/svc_nonpersist.txt
/lib/svc/bin/sqlite /etc/svc/volatile/svc_nonpersist.db
.schema
.headers on
select * from prop_lnk_tbl where lnk_prop_id < 0;
lnk_prop_id|lnk_pg_id|lnk_gen_id|lnk_prop_name|lnk_prop_type|lnk_val_id
-2147483647|127|247807630|auxiliary_state|s|578217144
-2147483450|127|247807630|next_state|s|578217145
-2147483379|127|247807630|state|s|578217146
-2147483330|127|247807630|state_timestamp|t|578217147
-2147483113|128|247807631|logfile|s|578217097
-2147483078|128|247807631|contract|c|578217098
-2147482828|128|247807631|start_pid|c|578217099
-2147482699|128|247807631|start_method_timestamp|t|578217100
-2147482505|128|247807631|start_method_waitstatus|i|578217101
-2147482424|128|247807631|auxiliary_state|s|578217148
-2147482239|128|247807631|next_state|s|578217149
-2147482115|128|247807631|state|s|578217150
-2147482075|128|247807631|state_timestamp|t|578217151
select distinct pg_tbl.* from pg_tbl where pg_id in (select lnk_pg_id from prop_lnk_tbl where lnk_prop_id < 0);
127|823|restarter|framework|1|247807630
128|821|restarter|framework|1|247807631
select distinct value_tbl.* from value_tbl where value_tbl.value_id in (select lnk_val_id from prop_lnk_tbl where lnk_prop_id < 0);
578217097|s|/var/svc/log/site-wsapp:movelogsend.log|0
578217098|c|27664308|0
578217099|c|68991|0
578217100|t|1437114522.078017000|0
578217101|i|0|0
5782
Andre Sachs from Johannesburg wrote on May 7th, 2014:
Thanks that worked a treat !
I did however need to invoke /lib/svc/bin/svc.startd for services to become managed again.
Claudio from Switzerland wrote on Jul 1st, 2013:
Hi Ahmed. No, the final conclusion is that for some reason a negative value is added to the sqlite database. By deleting the affected rows and restarting /lib/svc/bin/svc.configd, this issue should be fixed.
However my superior preferred that I do a reboot rather than manually deleting entries in the sqlite database. That's why I wrote at the begin that a reboot solves the issue. But it is still unclear where it came from.
Ahmed Kamal from Egypt wrote on Jun 30th, 2013:
So that was some cool analysis, but the final conclusion was to yet again simply reboot ?! I was wishing for a more insightful resolution .. If you later find out more info about this issue, let me know
Thanks
AWS Android Ansible Apache Apple Atlassian BSD Backup Bash Bluecoat CMS Chef Cloud Coding Consul Containers CouchDB DB DNS Database Databases Docker ELK Elasticsearch Filebeat FreeBSD Galera Git GlusterFS Grafana Graphics HAProxy HTML Hacks Hardware Icinga Influx Internet Java KVM Kibana Kodi Kubernetes LVM LXC Linux Logstash Mac Macintosh Mail MariaDB Minio MongoDB Monitoring Multimedia MySQL NFS Nagios Network Nginx OSSEC OTRS Observability Office OpenSearch PGSQL PHP Perl Personal PostgreSQL Postgres PowerDNS Proxmox Proxy Python Rancher Rant Redis Roundcube SSL Samba Seafile Security Shell SmartOS Solaris Surveillance Systemd TLS Tomcat Ubuntu Unix VMWare VMware Varnish Virtualization Windows Wireless Wordpress Wyse ZFS Zoneminder