Many roads lead to Rome, and the same applies to text manipulation in Bash. This short article compares how awk, sed and cut can be used to remove the first word of a line.
The line in question is the output of another command, but it doesn't matter whether the input comes from a file or from another command's stdout. While working on a fix for issue 9 of check_netio, I was looking for a generic way to remove the first word of a line:
root@linux:~# cat /proc/net/netstat
TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed EmbryonicRsts PruneCalled RcvPruned OfoPruned OutOfWindowIcmps LockDroppedIcmps ArpFilter TW TWRecycled TWKilled PAWSActive PAWSEstab DelayedACKs DelayedACKLocked DelayedACKLost ListenOverflows ListenDrops TCPHPHits TCPPureAcks TCPHPAcks TCPRenoRecovery TCPSackRecovery TCPSACKReneging TCPSACKReorder TCPRenoReorder TCPTSReorder TCPFullUndo TCPPartialUndo TCPDSACKUndo TCPLossUndo TCPLostRetransmit TCPRenoFailures TCPSackFailures TCPLossFailures TCPFastRetrans TCPSlowStartRetrans TCPTimeouts TCPLossProbes TCPLossProbeRecovery TCPRenoRecoveryFail TCPSackRecoveryFail TCPRcvCollapsed TCPDSACKOldSent TCPDSACKOfoSent TCPDSACKRecv TCPDSACKOfoRecv TCPAbortOnData TCPAbortOnClose TCPAbortOnMemory TCPAbortOnTimeout TCPAbortOnLinger TCPAbortFailed TCPMemoryPressures TCPMemoryPressuresChrono TCPSACKDiscard TCPDSACKIgnoredOld TCPDSACKIgnoredNoUndo TCPSpuriousRTOs TCPMD5NotFound TCPMD5Unexpected TCPMD5Failure TCPSackShifted TCPSackMerged TCPSackShiftFallback TCPBacklogDrop PFMemallocDrop TCPMinTTLDrop TCPDeferAcceptDrop IPReversePathFilter TCPTimeWaitOverflow TCPReqQFullDoCookies TCPReqQFullDrop TCPRetransFail TCPRcvCoalesce TCPOFOQueue TCPOFODrop TCPOFOMerge TCPChallengeACK TCPSYNChallenge TCPFastOpenActive TCPFastOpenActiveFail TCPFastOpenPassive TCPFastOpenPassiveFail TCPFastOpenListenOverflow TCPFastOpenCookieReqd TCPFastOpenBlackhole TCPSpuriousRtxHostQueues BusyPollRxPackets TCPAutoCorking TCPFromZeroWindowAdv TCPToZeroWindowAdv TCPWantZeroWindowAdv TCPSynRetrans TCPOrigDataSent TCPHystartTrainDetect TCPHystartTrainCwnd TCPHystartDelayDetect TCPHystartDelayCwnd TCPACKSkippedSynRecv TCPACKSkippedPAWS TCPACKSkippedSeq TCPACKSkippedFinWait2 TCPACKSkippedTimeWait TCPACKSkippedChallenge TCPWinProbe TCPKeepAlive TCPMTUPFail TCPMTUPSuccess TCPWqueueTooBig
TcpExt: 0 0 0 105 0 0 0 0 0 0 2202215 0 0 0 6 211917 2726 446896 0 3 5747904 26883443 3968155 0 9122 0 704 0 0 0 0 83 333 7 0 21 2 9728 192 1702 12938 3042 0 72 0 446953 9 7961 11 518150 10 0 88 0 0 0 0 0 11 5637 52 0 0 0 22379 10366 21027 0 0 0 0 0 0 0 0 0 858733 631 0 8 0 0 0 0 0 0 0 0 0 1 0 1343155 0 0 0 2199 62739777 3573 69078 88 3242 3 0 8 0 0 0 7 0 0 0 0
IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT1Pkts InECT0Pkts InCEPkts ReasmOverlaps
IpExt: 0 0 21 8885283 227435 0 27399435386 27248510872 812 390952368 49094631 0 0 106478999 0 462174 0 0
Note that each line starts with an informational prefix, either "TcpExt:" or "IpExt:". These prefixes need to be removed. Generally speaking: the first word of each line needs to be removed.
With awk, the obvious approach is to print the wanted fields manually and leave out the first field/word:
root@linux:~# echo "first second third fourth fifth" | awk '{ print $2" "$3" "$4" "$5 }'
second third fourth fifth
But this method obviously only works if you know the exact number of words/columns in a line, and if you really like to type.
A better way is to use a for loop and tell awk where to start:
root@linux:~# echo "first second third fourth fifth" | awk '{for (i=2; i<NF; i++) printf "%s ", $i; print $NF}'
second third fourth fifth
The for loop starts at the second field (i=2) and prints every field, followed by a space, up to but not including the last one. NF is a built-in awk variable that holds the number of fields in the current line, so $NF refers to the last field (the word "fifth" in this case); the final print $NF outputs it together with the newline. I agree, it looks complicated, but it works for all kinds of files or output, no matter how many words a line has.
Source: Stackoverflow
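To see the loop at work on multi-line input, here is a minimal sketch. The sample text is a shortened, made-up excerpt in the style of /proc/net/netstat, and the function name strip_first_awk is only chosen for this illustration. printf is given an explicit "%s " format here so that a field containing a % sign is printed literally.

```shell
# Shortened, illustrative input in the style of /proc/net/netstat.
sample='TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed
TcpExt: 0 0 105
IpExt: InNoRoutes InTruncatedPkts
IpExt: 0 21'

# Hypothetical helper: print every field except the first one.
strip_first_awk() {
  awk '{ for (i = 2; i < NF; i++) printf "%s ", $i; print $NF }'
}

printf '%s\n' "$sample" | strip_first_awk
```

Note that a line consisting of a single word is printed unchanged by this loop, because $NF is then the first field.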
sed is another powerful command which comes with more functions than anyone would think of. The problem: using these functions is sometimes pretty "weird" and complicated, depending on what one wants to achieve (well, awk is not much better in this regard). However, for this particular use case of removing the first word of a line, the sed command is pretty simple:
root@linux:~# echo "first second third fourth fifth" | sed "s/^[^ ]* //"
second third fourth fifth
Basically, sed is told to use the substitution (= search and replace) function and to look for any number of characters that are not a space at the beginning of the line, followed by a single space. The "anything but" part is defined by a special bracket expression: [^ ] . From the sed documentation:
A bracket expression is a list of characters enclosed by ‘[’ and ‘]’. It matches any single character in that list; if the first character of the list is the caret ‘^’, then it matches any character not in the list.
This means the substitution matches everything from the beginning of the line up to and including the first space, and replaces it with nothing. In this case, that is the first word of the line.
Based on this question.
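The expression above only treats the space character as a separator. As a sketch of a variant (not part of the original command), the POSIX character class [:space:] can be used inside the bracket expression so that tabs and repeated spaces are handled as well. The function name strip_first_sed is made up for this example.

```shell
# Hypothetical helper: remove the first word plus the whitespace after it.
# [^[:space:]]* matches the leading run of non-whitespace characters,
# [[:space:]]*  matches the spaces or tabs that follow it.
strip_first_sed() {
  sed 's/^[^[:space:]]*[[:space:]]*//'
}

printf 'first second third\n' | strip_first_sed
printf 'first\tsecond third\n' | strip_first_sed
printf 'first    second third\n' | strip_first_sed
```

Be aware that this variant behaves differently for a line that consists of a single word: the original expression leaves such a line untouched (there is no space to match), while this one empties it.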
Judging by its name alone, "cut" sounds like the obvious command to use. Simply cut the first word off, right? And yes, it basically is that simple. There are two ways to achieve this with cut:
root@linux:~# echo "first second third fourth fifth" | cut -d ' ' -f 2-
second third fourth fifth
In the above example, cut is told to use a single space as the field delimiter with -d ' ' (to separate the words) and to print field 2 and all following fields (-f 2-).
The other method is to "reverse" the cut command by telling it to print everything except the first field. This can be achieved with the additional parameter --complement (an extension of GNU cut):
root@linux:~# echo "first second third fourth fifth" | cut -d ' ' -f 1 --complement
second third fourth fifth
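One detail worth knowing: cut treats every single delimiter character as a field boundary, so two consecutive spaces produce an empty field. The following sketch, using made-up input, shows the effect and one possible workaround that squeezes repeated spaces with tr -s first.

```shell
# Two spaces after "first": field 2 is empty, so the output starts with a space.
printf 'first  second third\n' | cut -d ' ' -f 2-

# Squeezing runs of spaces into one before cutting avoids the empty field.
printf 'first  second third\n' | tr -s ' ' | cut -d ' ' -f 2-
```

Lines that contain no space at all are printed unchanged by cut unless the -s option is given.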
That's the nice part: every command is a winner. The goal was achieved with each of them, and every admin or developer should use the command they prefer. But if a measurement is needed to declare a winner, it's the runtime.
On a Debian 9 (Stretch) system with a system load of almost 0, the different commands were run with the time command.
ck@linux:~$ time echo "first second third fourth fifth" | awk '{for (i=2; i<NF; i++) printf $i " "; print $NF}'; \
> time echo "first second third fourth fifth" | sed "s/^[^ ]* //"; \
> time echo "first second third fourth fifth" | cut -d ' ' -f 2-; \
> time echo "first second third fourth fifth" | cut -d ' ' -f 1 --complement
second third fourth fifth
real 0m0.004s
user 0m0.000s
sys 0m0.000s
second third fourth fifth
real 0m0.005s
user 0m0.000s
sys 0m0.000s
second third fourth fifth
real 0m0.003s
user 0m0.000s
sys 0m0.000s
second third fourth fifth
real 0m0.003s
user 0m0.000s
sys 0m0.000s
The same set of commands was run ten times with a random sleep time in between. The measured real times (in seconds) are listed in the following table:
Run | awk | sed | cut | cut reverse |
1 | 0.004 | 0.005 | 0.003 | 0.003 |
2 | 0.004 | 0.005 | 0.003 | 0.002 |
3 | 0.004 | 0.005 | 0.004 | 0.002 |
4 | 0.004 | 0.005 | 0.004 | 0.003 |
5 | 0.004 | 0.005 | 0.004 | 0.003 |
6 | 0.004 | 0.005 | 0.004 | 0.003 |
7 | 0.004 | 0.005 | 0.003 | 0.002 |
8 | 0.004 | 0.005 | 0.003 | 0.002 |
9 | 0.004 | 0.004 | 0.003 | 0.003 |
10 | 0.004 | 0.005 | 0.004 | 0.003 |
Avg | 0.0040 | 0.0049 | 0.0035 | 0.0026 |
I'm actually quite surprised, but according to the measured runtime, the winner is the "reversed" cut command! sed, on the other hand, is the slowest command in this comparison.
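With runtimes of only a few milliseconds, a good part of the measured time is likely spent on starting the process rather than on handling five words. A rough sketch of a benchmark on larger input could look like the following. The 20000 synthetic lines and the temporary file are assumptions made for illustration only, not the setup measured above, and results will differ from system to system.

```shell
# Create a temporary file and remove it again when the shell exits.
input=$(mktemp)
trap 'rm -f "$input"' EXIT

# Synthetic input (an assumption for this sketch): 20000 identical lines.
awk 'BEGIN { for (n = 0; n < 20000; n++) print "first second third fourth fifth" }' > "$input"

# Time each candidate on the same file; the command output is discarded.
time awk '{ for (i = 2; i < NF; i++) printf "%s ", $i; print $NF }' "$input" > /dev/null
time sed 's/^[^ ]* //' "$input" > /dev/null
time cut -d ' ' -f 2- "$input" > /dev/null
```

Reading from a file instead of piping from echo keeps the echo process out of the measurement.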
Yassine Chaouche from Algiers wrote on Aug 25th, 2022:
But how to do it in pure bash?
ck from Switzerland wrote on Sep 4th, 2020:
That is correct, cut does not do any (regex) parsing. And the program itself is also much smaller (hence quicker startup):
claudio@nas:~$ du /usr/bin/cut
44 /usr/bin/cut
claudio@nas:~$ ls -la /usr/bin/awk
lrwxrwxrwx 1 root root 21 Sep 14 2018 /usr/bin/awk -> /etc/alternatives/awk
claudio@nas:~$ file /etc/alternatives/awk
/etc/alternatives/awk: symbolic link to /usr/bin/mawk
claudio@nas:~$ du /usr/bin/mawk
120 /usr/bin/mawk
claudio@nas:~$ du /bin/sed
104 /bin/sed
Michael Heiniger wrote on Sep 4th, 2020:
It's actually not surprising that the cut command wins, it does less work. It just searches one well-defined character on each line and omits anything before the first one. It does not have to apply a regex for each character.
Also the startup time of the command has to be considered. There is not much parsing in cut, while in sed and awk it first needs to parse the command you pass.
It would have been a bit more representative if you piped in a copy of your netstat than just 5 strings.