Rewrite Perl script with HTML::Strip to use newer HTML::Restrict module


Published on - last updated on July 24th 2024 - Listed in Linux Perl Coding


As this is July 2024, we are in the first month after the end of official EPEL 7 (RHEL 7, CentOS 7) support. Some of the EPEL 7 servers have been running for a long time, so it would not be a surprise if certain applications or scripts stopped working after a distribution upgrade.

In this particular example, I've come across a problematic Perl script which uses HTML::Strip and used to work fine under RHEL 7. But once the server was upgraded, the script would fail.

The purpose of HTML::Strip

The HTML::Strip module looks for HTML tags (e.g. <a....) and removes them from the input text. In the following script, the input is passed as the first command line argument:

root@rhel7 ~ $ cat /home/ck/perlscript.pl
#!/usr/bin/perl

use HTML::Strip;

my $tf = HTML::Strip->new();
my $html_dirty=$ARGV[0];
my $html_clean = $tf->parse($html_dirty);
$tf->eof();
print $html_clean."\n";

With plain text input, the script simply prints the same text again:

root@rhel7 ~ $ /home/ck/perlscript.pl "Text output"
Text output

But if the text is enclosed in angle brackets, and therefore looks like an HTML tag, it is removed:

root@rhel7 ~ $ /home/ck/perlscript.pl "<Text output>"
root@rhel7 ~ $

See? No output.

Can't locate HTML/Strip.pm in @INC

But after the OS upgrade from RHEL 7 to RHEL 8, the script no longer worked and failed with the following error:

root@rhel8 ~ $ /home/ck/perlscript.pl "Text output"
Can't locate HTML/Strip.pm in @INC (you may need to install the HTML::Strip module) (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5) at ./html-tag-filter.pl line 15.
BEGIN failed--compilation aborted at ./html-tag-filter.pl line 15.

Annoying, but it can happen. Seems like not all installed packages, the Perl packages specifically, were upgraded. That's what I thought.

But as it turns out, the HTML::Strip RPM package is called perl-HTML-Strip and - surprise! - only exists for EPEL 7.

A dnf search did not return the wanted perl-HTML-Strip package either, but it hinted at another package (perl-HTML-Restrict):

root@rhel8 ~ $ dnf search perl | grep strip
Red Hat CodeReady Linux Builder for RHEL 8 x86_  24 MB/s | 9.7 MB     00:00
perl-HTML-Restrict.noarch : Perl module to strip unwanted HTML tags and attributes

Note: I also found packages named "perl-HTML-StripScripts" and "perl-HTML-StripScripts-Parser" using dnf, however they did not provide the needed HTML/Strip.pm file.
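
To quickly check which of the modules mentioned here can actually be loaded by the local Perl, something like the following minimal sketch could be used. It relies only on core Perl; the helper name module_available is made up for this example and the module list is simply the one from this article.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the named module can be loaded
# from the current @INC, 0 otherwise.
sub module_available {
    my ($module) = @_;
    # Translate Some::Module into the Some/Module.pm path that require expects.
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

# Module names taken from this article; adjust the list as needed.
for my $module (qw(HTML::Strip HTML::Restrict HTML::Entities)) {
    printf "%-16s %s\n", $module,
        module_available($module) ? 'available' : 'not found in @INC';
}
```

On a system like the upgraded RHEL 8 server, HTML::Strip would be expected to show up as not found until a replacement is installed.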

The reason for the "missing" perl-HTML-Strip package in newer RHEL (EPEL) versions seems to be a bug in HTML::Strip with UTF8 encoded text. Or the package maintainer just wasn't up to it anymore. Who knows.

Shifting to HTML::Restrict module

The dnf search above already pointed to another Perl module (perl-HTML-Restrict), which is available as a package. A quick look at the HTML::Restrict documentation showed that it is very similar to the old HTML::Strip module.

Rewriting the Perl script to use HTML::Restrict instead of HTML::Strip turned out to be easier than trying to get the old HTML::Strip module somehow into the RHEL 8 system!

Installing the perl-HTML-Restrict package installed a bunch of other Perl modules from the codeready and epel repositories as well:

root@rhel8 ~ $ dnf install perl-HTML-Restrict.noarch
[...]
Install  23 Packages

Total download size: 1.0 M
Installed size: 1.8 M
Is this ok [y/N]: y
[...]

The Perl script was then rewritten to use the newer HTML::Restrict module:

root@rhel8 ~ $ cat /home/ck/perlscript.pl
#!/usr/bin/perl

use HTML::Restrict;

my $hr = HTML::Restrict->new();
my $html_dirty=$ARGV[0];
my $html_clean = $hr->process($html_dirty);
print $html_clean."\n";

Told you it looks very similar! ;-)

And execution of the script works again!

root@rhel8 ~ $ /home/ck/perlscript.pl "Text output"
Text output

root@rhel8 ~ $ /home/ck/perlscript.pl "<Text output>"
root@rhel8 ~ $

Problems with greater than > and less than < characters

The previous HTML::Strip module kept standalone greater than ">" and less than "<" characters as they were, because they were not considered HTML tags. The newer HTML::Restrict module, however, encodes these characters as HTML entities:

ck@mint ~ $ ./perlscript.pl "Hello this is < yes"
Hello this is &lt; yes

According to an answer on Stack Overflow, the HTML::Strip module does an HTML decode at the end. HTML::Restrict does not seem to do that.
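
The encoding seen in the output above can be reproduced with HTML::Entities alone, without HTML::Restrict involved. The following is only a small sketch to illustrate the round trip between a literal character and its entity; the variable names are chosen for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities decode_entities);

my $text    = 'Hello this is < yes';
my $encoded = encode_entities($text);      # returns an encoded copy
my $decoded = decode_entities($encoded);   # turns entities back into characters

print "$encoded\n";   # Hello this is &lt; yes
print "$decoded\n";   # Hello this is < yes
```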

To handle this, the $html_clean variable needs to be HTML decoded, which can be done with HTML::Entities:

ck@mint ~ $ cat perlscript.pl
#!/usr/bin/perl

use HTML::Restrict;
use HTML::Entities;

my $hr = HTML::Restrict->new();
my $html_dirty=$ARGV[0];
my $html_clean = $hr->process($html_dirty);
$html_clean = HTML::Entities::decode($html_clean);
print $html_clean."\n";

With the HTML decoding added, standalone "greater than" and "less than" characters show up correctly again:

ck@mint ~ $ ./perlscript.pl "Hello this is < yes"
Hello this is < yes
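
For reuse in other scripts, the two steps (strip the tags, then decode the entities) could be wrapped in a small function. This is just a sketch under a few assumptions: the function name strip_html is made up here, strict and warnings are switched on, and every command line argument is processed instead of only the first one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Restrict;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: strip all HTML tags, then turn entities
# such as &lt; back into literal characters.
sub strip_html {
    my ($html_dirty) = @_;
    my $hr         = HTML::Restrict->new();    # no rules given: all tags are removed
    my $html_clean = $hr->process($html_dirty);
    return '' unless defined $html_clean;      # defensive: treat "nothing left" as empty
    return decode_entities($html_clean);
}

# Print the cleaned version of every argument, one per line.
print strip_html($_), "\n" for @ARGV;
```

Keep in mind that the decoded result is meant as plain text: an encoded tag in the input (e.g. &lt;b&gt;) becomes a literal <b> again after decoding, so the output should not be treated as sanitized HTML.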


