Monday, June 1, 2009

There's more than one way to s//

Being absolutely obsessed with perl, I tend to pipe everything to it, even when sed or bash would do the job just as well. Just because something is a habit doesn't mean it's the best way. Take, for instance, getting the username from a directory, assuming the directory '/home/john/test' translates to user 'john'.

Out of habit, I will find myself doing something like this:

username=`echo -n "/home/john/test" | perl -pe 's/\/home\/([^\/]+)\/?.*/$1/;'`

This sets username to 'john'. That is the desired output, so I would typically just move on without thinking twice.

But what if thinking twice yields the same result without relying on perl? Sure, this hypothetical shell script is probably preparing the environment to execute a perl script, but since we're speaking hypothetically, let's say the shell script is really just preparing the environment for a FORTH application. How can we accomplish the same feat?

Well, the simple answer: sed to the rescue. If you're using perl, you should be familiar with sed and awk (and if you aren't, you skipped a step in the evolution of OSS; shame on your teacher). In sed, you can obtain the same information:

username=`echo -n "/home/john/test" | sed -e 's/\/home\/\(.*\)\/.*/\1/;'`

Looks easy enough, right? (One caveat: the greedy \(.*\) captures everything up to the last slash, so this yields 'john' only because the path has exactly one component after the username.) But what if, on this custom Linux system, you don't have sed available at all (but for some reason you find yourself running Bash 3)? Well, since there's always more than one way to s//, you can just use the builtin bash regular expression engine:

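# Keep the pattern in a variable: from Bash 3.2 on, a quoted right-hand
# side of =~ is matched as a literal string, not as a regex.
re='/home/([^/]+)/?.*'
[[ "/home/john/test" =~ $re ]] && \
username="${BASH_REMATCH[1]}"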

The same principle applies here: capture the grouped result and store it in 'username'. Each of these one-liners reports 'john' as the result, which effectively gives you more than one way to s//. So now, your life is nearly complete...
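To check the approaches against more than one path, here is a minimal sketch. The helper names user_via_sed and user_via_bash and the extra sample paths are my own assumptions, not part of the scripts in this post, and both patterns are tightened slightly: they are anchored at the start and use [^/] so the capture stops at the first slash after the username.

```shell
#!/bin/bash
# Hypothetical helpers (names are illustrative): print the user for a path,
# or print nothing when the path is not under /home.

# sed variant: -n plus the p flag prints only when the substitution matched.
user_via_sed() {
    printf '%s\n' "$1" | sed -n 's|^/home/\([^/][^/]*\).*|\1|p'
}

# bash variant: the pattern lives in a variable and is expanded unquoted,
# so it is treated as a regex rather than a literal string.
user_via_bash() {
    local re='^/home/([^/]+)'
    [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

# Sample paths are assumptions chosen to exercise depth and non-matches.
for path in /home/john/test /home/john /home/john/test/deeper /var/tmp; do
    printf '%-24s sed=%-6s bash=%s\n' \
        "$path" "$(user_via_sed "$path")" "$(user_via_bash "$path")"
done
```

Using | as the sed delimiter avoids escaping every slash in the path, which is most of what makes the one-liners hard to read.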

Of course, by this time you're asking yourself, "Which is the best to use?" Assuming your system is the exact same make/model/speed as mine, including memory and CPU power, using perl's wonderful 'Benchmark' module, we have the following report:


[root@delta s-test]# ./timethem.pl
Benchmark: timing 1000 iterations of just_bash, with_perl, with_sed...
just_bash: 9.50427 wallclock secs ( 0.13 usr 0.79 sys + 3.54 cusr 5.22 csys = 9.68 CPU) @ 1086.96/s (n=1000)
with_perl: 30.0591 wallclock secs ( 0.17 usr 0.99 sys + 13.50 cusr 15.92 csys = 30.58 CPU) @ 862.07/s (n=1000)
with_sed: 19.2811 wallclock secs ( 0.21 usr 0.75 sys + 5.74 cusr 13.03 csys = 19.73 CPU) @ 1041.67/s (n=1000)
[root@delta s-test]#

According to this report, bash by itself consumes the least CPU (9.68 s in total, versus 19.73 s for sed and 30.58 s for perl) and finishes in roughly half the wallclock time of sed and a third of perl. Of course, the bash version has no overhead from executing an additional application, and my server is not running entirely from memory, but somehow I feel that running from memory would yield similar results. (Anyone have a 1T RAMDISK hanging around?)
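Every iteration in that benchmark also launches a fresh shell script through backticks, so process startup is baked into all three numbers. A rough sketch of an in-process comparison follows, assuming hypothetical function names (bash_loop, sed_loop) and an arbitrary iteration count; it is one way to look at the cost of the extra fork, not a replacement for the measurements above, and I make no claim about what it prints on your machine.

```shell
#!/bin/bash
# Hypothetical in-process timing: both loops run inside one shell,
# so no script is launched per iteration.
path=/home/john/test
re='^/home/([^/]+)'
n=200   # arbitrary iteration count (an assumption for this sketch)

# Builtin regex only: no child processes at all.
bash_loop() {
    local i
    for ((i = 0; i < n; i++)); do
        [[ $path =~ $re ]] && username=${BASH_REMATCH[1]}
    done
}

# Command substitution plus sed: forks on every pass.
sed_loop() {
    local i
    for ((i = 0; i < n; i++)); do
        username=$(printf '%s\n' "$path" | sed -n 's|^/home/\([^/][^/]*\).*|\1|p')
    done
}

# The bash 'time' keyword reports real/user/sys for each loop on stderr.
time bash_loop
time sed_loop
```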

For reference, here are the scripts that I used in my timings:

[root@delta s-test]# cat timethem.pl
#!/usr/bin/perl
use Time::HiRes ();
use Benchmark ':hireswallclock';
use warnings;
use strict;


&Benchmark::timethese( 1000, {
with_perl => sub { `./with-perl.sh`; },
with_sed => sub { `./with-sed.sh`; },
just_bash => sub { `./just-bash.sh`; } }
);
[root@delta s-test]# cat with-perl.sh
#!/bin/bash
username=`echo -n "/home/john/test" | perl -pe s'/\/home\/([^\/]+)\/?.*/$1/;'`
[root@delta s-test]# cat with-sed.sh
#!/bin/bash
username=`echo -n "/home/john/test" | sed -e 's/\/home\/\(.*\)\/.*/\1/;'`
[root@delta s-test]# cat just-bash.sh
#!/bin/bash
[[ "/home/john/test" =~ "/home/([^/]+)/?.*" ]] && username="${BASH_REMATCH[1]}"
[root@delta s-test]#


Happy BASHing!
