Rewrite Perl script with HTML::Strip to use newer HTML::Restrict module

Written by Claudio Kuenzler - 0 comments

Published on July 17th 2024 - last updated on July 24th 2024 - Listed in Linux Perl Coding

As this is July 2024, we are in the first month after the end of official EPEL 7 (RHEL 7, CentOS 7) support. Some of the EPEL 7 servers have been running for a long time, so it would not be a surprise if certain applications or scripts stopped working after a distribution upgrade.

In this particular example, I've come across a problematic Perl script which uses HTML::Strip and used to work fine under RHEL 7. But once the server was upgraded, the script would fail.

The purpose of HTML::Strip

The HTML::Strip module looks for HTML tags (e.g. <a ...>) and removes the markup from a given string. In the following script, that string is passed as the first command line argument:

root@rhel7 ~ $ cat /home/ck/perlscript.pl
#!/usr/bin/perl

use HTML::Strip;

my $tf = HTML::Strip->new();
my $html_dirty=$ARGV[0];
my $html_clean = $tf->parse($html_dirty);
$tf->eof();
print $html_clean."\n";

On a normal text input, this would simply show the same text again:

root@rhel7 ~ $ /home/ck/perlscript.pl "Text output"
Text output

But if the input looks like an HTML tag, the tag (including the text inside the angle brackets) is removed:

root@rhel7 ~ $ /home/ck/perlscript.pl "<Text output>"
root@rhel7 ~ $

See? No output.

Can't locate HTML/Strip.pm in @INC

But after the OS upgrade from RHEL 7 to RHEL 8, the script no longer worked and failed with the following error:

root@rhel8 ~ $ /home/ck/perlscript.pl "Text output"
Can't locate HTML/Strip.pm in @INC (you may need to install the HTML::Strip module) (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5) at ./html-tag-filter.pl line 15.
BEGIN failed--compilation aborted at ./html-tag-filter.pl line 15.

Annoying, but it can happen. Seems like not all installed packages, the Perl packages specifically, were upgraded. That's what I thought.

But as it turns out, the HTML::Strip RPM package is called perl-HTML-Strip and - surprise! - only exists for EPEL 7:

[Image: perl-html-strip packages only exist for EPEL 7]

A dnf search did not return the wanted perl-HTML-Strip package either, but it hinted at another package (perl-HTML-Restrict):

root@rhel8 ~ $ dnf search perl | grep strip
Red Hat CodeReady Linux Builder for RHEL 8 x86_ 24 MB/s | 9.7 MB 00:00
perl-HTML-Restrict.noarch : Perl module to strip unwanted HTML tags and attributes

Note: I also found packages named "perl-HTML-StripScripts" and "perl-HTML-StripScripts-Parser" using dnf, however they did not provide the needed HTML/Strip.pm file.
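Whether a module such as HTML/Strip.pm can actually be found in @INC can also be checked from Perl itself. The following is a minimal sketch using only core Perl; the helper name module_available is hypothetical and chosen for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the named module can be loaded from @INC, 0 otherwise.
# (module_available is a hypothetical helper, not part of any library.)
sub module_available {
    my ($module) = @_;
    # Convert Foo::Bar to Foo/Bar.pm, the file name require() looks up in @INC.
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (qw(HTML::Strip HTML::Restrict)) {
    printf "%-15s %s\n", $module,
        module_available($module) ? "available" : "missing";
}
```

On a server like the upgraded RHEL 8 system described here, one would expect HTML::Strip to be reported as missing, which matches the "Can't locate HTML/Strip.pm in @INC" error.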

The reason for the "missing" perl-HTML-Strip package in newer RHEL (EPEL) versions seems to be a bug in HTML::Strip with UTF-8 encoded text. Or the package maintainer just wasn't up to it anymore. Who knows.

Shifting to HTML::Restrict module

The dnf search above already pointed to another Perl module (perl-HTML-Restrict) which is available as a package. A quick look at the HTML::Restrict documentation showed that it is very similar to the old HTML::Strip module.

Rewriting the Perl script to use HTML::Restrict instead of HTML::Strip turned out to be easier than trying to get the old HTML::Strip module somehow into the RHEL 8 system!

Installing the perl-HTML-Restrict package installed a bunch of other Perl modules from the codeready and epel repositories as well:

root@rhel8 ~ $ dnf install perl-HTML-Restrict.noarch
[...]
Install 23 Packages

Total download size: 1.0 M
Installed size: 1.8 M
Is this ok [y/N]: y
[...]

The Perl script was then rewritten to use the newer HTML::Restrict module:

root@rhel8 ~ $ cat /home/ck/perlscript.pl
#!/usr/bin/perl

use HTML::Restrict;

my $hr = HTML::Restrict->new();
my $html_dirty=$ARGV[0];
my $html_clean = $hr->process($html_dirty);
print $html_clean."\n";

Told you it looks very similar! ;-)

And execution of the script works again!

root@rhel8 ~ $ /home/ck/perlscript.pl "Text output"
Text output

root@rhel8 ~ $ /home/ck/perlscript.pl "<Text output>"
root@rhel8 ~ $
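The rewritten script relies on the default behaviour of HTML::Restrict, which removes every tag. As a minimal sketch of how the same module could be told to keep selected tags, the following example passes a rules hash to the constructor. The sample input and the choice of allowing only <b> are assumptions for illustration and not part of the original script:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Restrict;

# Sample input, made up for this illustration.
my $sample = '<b>bold</b> and <i>italic</i>';

# Default object: no rules, so all tags are stripped.
my $hr_default = HTML::Restrict->new();

# Object with a rule that allows <b> without any attributes.
my $hr_bold = HTML::Restrict->new( rules => { b => [] } );

print $hr_default->process($sample), "\n";
print $hr_bold->process($sample), "\n";
```

For the use case in this article (remove everything that looks like a tag), the default object without rules is sufficient.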

Problems with greater than > and less than < characters

The previous HTML::Strip module kept the greater than ">" and less than "<" characters as standalone characters, as they were not considered HTML tags. However, the newer HTML::Restrict module encodes these characters as HTML entities:

ck@mint ~ $ ./perlscript.pl "Hello this is < yes"
Hello this is &lt; yes

According to this answer on Stack Overflow, the HTML::Strip module does an HTML decode at the end. HTML::Restrict does not seem to do that.

To handle this, the $html_clean variable needs to be HTML decoded, which can be done with HTML::Entities:

ck@mint ~ $ cat perlscript.pl
#!/usr/bin/perl

use HTML::Restrict;
use HTML::Entities;

my $hr = HTML::Restrict->new();
my $html_dirty=$ARGV[0];
my $html_clean = $hr->process($html_dirty);
$html_clean = decode_entities($html_clean); # turn entities such as &lt; back into <
print $html_clean."\n";

With the HTML decoding added, the standalone characters "greater than" and "less than" are showing up correctly again:

ck@mint ~ $ ./perlscript.pl "Hello this is < yes"
Hello this is < yes
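To look at the decoding step in isolation, here is a minimal sketch that uses only HTML::Entities, without HTML::Restrict. The sample string is made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities);

# An entity-encoded string, similar to what the stripping step returns.
my $encoded = 'Hello this is &lt; yes &amp; &gt; no';

# In non-void context decode_entities() returns a decoded copy.
my $decoded = decode_entities($encoded);
print "$decoded\n";

# encode_entities() goes the other way and escapes <, > and & again.
my $reencoded = encode_entities($decoded);
print "$reencoded\n";
```

Keep in mind that the decoded text may contain "<" and ">" again, so whether it is safe to embed it in an HTML page depends on where the output ends up.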
