Interesting little regex

Alan Young alansyoungiii at gmail.com
Thu Feb 23 13:48:39 MST 2006


Updated script at bottom.
On 2/23/06, Uri Guttman <uri at stemsystems.com> wrote:
>   AY> $text =~ s{(
>   AY>              (\b\w+(?:['-]+\w+)*\b)
>
> why the multiple ['-] inside the words? could those chars ever begin or
> end words? so just [\w'-]+ should be fine there.

It's possible to have multi-hyphenated words.  I didn't think it was
worth the time to figure out how to handle that and single apostrophe
words at the same time.  Besides, I'm not verifying the accuracy of
the text.

In the spirit of testing though, I changed it to (\b[\w'-]*\b) and it
took 40 seconds and found 's and ' as words where the original did
not.
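
A likely reason for the extra "words" is backtracking: when the uniqueness check rejects a repeated word, the engine retries shorter or later-starting candidates, and the looser class lets a fragment begin with an apostrophe. The sketch below is only an illustration under assumptions: the sample text is made up, and the loose pattern uses + (as in the quoted suggestion) rather than the * that was actually timed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text: the same contraction repeated three times.
my $text = "it's it's it's";

# Original pattern: every match must start and end with a \w character.
my ( %seen_strict, @strict );
while ( $text =~ m{ (\b\w+(?:['-]+\w+)*\b)
                    (??{ !$seen_strict{$^N}++ ? "(?=)" : "(?!)" }) }xg ) {
    push @strict, $1;
}

# Looser class: apostrophes and hyphens may appear anywhere in the match.
my ( %seen_loose, @loose );
while ( $text =~ m{ (\b[\w'-]+\b)
                    (??{ !$seen_loose{$^N}++ ? "(?=)" : "(?!)" }) }xg ) {
    push @loose, $1;
}

print "strict: ", join( " | ", @strict ), "\n";
print "loose:  ", join( " | ", @loose ),  "\n";
```

With the original pattern no fragment can begin with an apostrophe, while the loose class can accept pieces such as 's once the longer candidates have already been seen.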

>   AY>              (??{!$unique{$^N}++?"(?=)":"(?!)"})
>
> i am not sure why you do that boolean trick there. i have seen it before
> (and actually use it somewhere) but what is its purpose here?

Well, as we were looking at it, we realized it wasn't really necessary
for the word parsing.  What it was originally doing, however, was
finding the unique occurrences in a string of text.

Basically, if the match was not in the hash then (?=) would force the
regex to succeed, otherwise it would force it to fail.

This is the way I understand it:

(??{<code>}) runs the <code> block and then matches its return value,
compiled as a regex, at the current position in the pattern.

If the match ($^N) was not in the hash, then it would auto-vivify
the key, increment it and return (?=), which is a positive lookahead
on nothing, which always succeeds so we continue on.

If the match ($^N) is in the hash, then it increments the value and
returns (?!), which is a negative lookahead on nothing, which always
fails so we force it to backtrack and try again.

I'm still wrapping my brain around this concept so I may have it
twisted a little.
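
A small sketch of the trick in isolation may help. It assumes a made-up sample string and a simplified \b\w+\b word pattern, and it is not the script that was timed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# An empty positive lookahead always succeeds; an empty negative one always fails.
my $always = "ab" =~ /a(?=)b/ ? 1 : 0;
my $never  = "ab" =~ /a(?!)b/ ? 1 : 0;
print "(?=) matched: $always, (?!) matched: $never\n";

# Hypothetical sample text with repeated words.
my $text = "the cat and the hat and the bat";

my ( %unique, @first_seen );
while ( $text =~ m{ (\b\w+\b)
                    (??{ !$unique{$^N}++ ? "(?=)" : "(?!)" }) }xg ) {
    # Only reached when the word had not been seen before.
    push @first_seen, $1;
}

print "first occurrences: @first_seen\n";
print map "$_ => $unique{$_}\n", sort keys %unique;
```

On this sample the loop body should run once per distinct word, while the hash still counts every occurrence, because the increment happens before the forced failure.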

Changing the regex to

  1 while $text =~ m{(
            (\b\w+(?:['-]+\w+)*\b)
            (?{!$unique{$^N}++})
           )
          }xg;

dropped the time down to 3s.
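
Note that the value of a (?{ }) block is not used as a match condition, so the leading ! there has no effect on matching. For comparison, the same counting can be done with no code block at all. This is a sketch of an alternative, not something that was timed in this thread, and the sample string is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text; the real script reads a whole file instead.
my $text = "the well-to-do man's dog bit the man's hand";

my %unique;

# Same word pattern as above; count each capture in ordinary Perl code.
$unique{$1}++ while $text =~ /(\b\w+(?:['-]+\w+)*\b)/g;

print map "$_ => $unique{$_}\n", sort keys %unique;
```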

> since you just replace the word by itself, why use s///? m// will get
> the same results and should be much faster.

There was no appreciable difference between the two types of regexes
(see my code below).

>   AY> print "$_ => $unique{$_}\n" for sort keys %unique;
>
> if you want raw speed, that makes lots of calls to print which is very
> slow as it needs to invoke stdio code for each call. this should be
> faster (even with the ram usage):
>
>         print map "$_ => $unique{$_}\n", sort keys %unique;

Didn't seem to make a difference, but I like this way better.  Seems
more perlish.
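
A quick way to check that the two forms really emit the same output is to print into in-memory filehandles and compare the results. The hash contents here are made-up sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical counts standing in for the real %unique.
my %unique = ( and => 2, cat => 1, the => 3 );

my ( $many, $single ) = ( '', '' );

# One print call per key.
open my $fh_many, '>', \$many or die "open: $!";
print {$fh_many} "$_ => $unique{$_}\n" for sort keys %unique;
close $fh_many;

# A single print call fed by map.
open my $fh_single, '>', \$single or die "open: $!";
print {$fh_single} map "$_ => $unique{$_}\n", sort keys %unique;
close $fh_single;

print $many eq $single ? "identical output\n" : "outputs differ\n";
```

The map form builds the whole list in memory before printing, which is the ram usage mentioned in the quoted text.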

With your other changes, but before simplifying the regex as described
above, the run time was still right around 7s (using time ./simple.pl).
However, memory usage was noticeably (if not significantly) improved.

#!/usr/bin/perl -w

use strict;

use File::Slurp;

my $text = read_file( './kjv10.txt' );

my %unique;

if ( 0 ) {
  print "substitution\n";

#  $text =~ s{(
#             (\b\w+(?:['-]+\w+)*\b)
#             (??{!$unique{$^N}++?"(?=)":"(?!)"})
#           )
#          }{}xg;

  $text =~ s{(
             (\b\w+(?:['-]+\w+)*\b)
             (?{$unique{$^N}++})
           )
          }{}xg;

} else {

  print "while loop\n";

#  1 while $text =~ m{(
#            (\b\w+(?:['-]+\w+)*\b)
#            (??{!$unique{$^N}++?"(?=)":"(?!)"})
#           )
#          }xg;

  1 while $text =~ m{(
            (\b\w+(?:['-]+\w+)*\b)
            (?{!$unique{$^N}++})
           )
          }xg;

}

print map "$_ => $unique{$_}\n", sort keys %unique;
--
Alan


