I complained about the speed, or lack of speed, when using regular
expressions for string compares under Python. Of course, it turns out I
was not using the resource properly. I looked at what was going on
and decided to try doing the same work in one large regex as opposed to
several small ones. It only took a few minutes to add the code that
stitches the 167 strings together so that they act as
one huge regex.
    # Stitch a list together as one long regular expression string.
    def MakeReg(sourceList):
        stringTotal = ""
        for string in sourceList:
            stringTotal = stringTotal + "|" + str(string)
        # Return the string without the leading or bar '|'
        return stringTotal[1:]
That code simply takes the existing list of strings and stitches
them together with an or bar '|' (shift + backslash on a US keyboard). This makes one
regex perform hundreds of searches in one fell swoop. What this means
is that the whole search now takes two minutes less on 69000 files than it
did with standard string searches. That is an hour and twenty minutes
down to less than ten minutes, with the same result at the end. The time
savings come because the number of regex searches went from
sixteen million to three hundred thousand.
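For anyone who wants to try the same trick, here is a minimal sketch of the stitched-together regex in use. It is not the code from my script: the names make_combined_regex, kill_list, and is_bad are made up for the example, and it assumes the list entries are meant as literal text, so it escapes them with re.escape before joining.

```python
import re

def make_combined_regex(patterns, literal=True):
    """Join patterns with '|' and compile them as one alternation.

    With literal=True (an assumption for this sketch) each entry is
    escaped so characters such as '.' or '(' match themselves.
    """
    if not patterns:
        # An empty alternation would match every string, so refuse it.
        raise ValueError("need at least one pattern")
    if literal:
        parts = [re.escape(str(p)) for p in patterns]
    else:
        parts = [str(p) for p in patterns]
    return re.compile("|".join(parts))

# Hypothetical kill list, standing in for the real 167 strings.
kill_list = ["remix", "live at", "karaoke"]
combined = make_combined_regex(kill_list)

def is_bad(filename):
    # One search call per file name, however long the kill list is.
    return combined.search(filename.lower()) is not None
```

Compiling once and reusing the compiled object keeps the per-file work down to a single search call.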
Instead of 167x69000x3x80 or whatever it was, it is now more like
69000x4x2-(a bunch of removed redundancy)+(some checking for duplicate
files). It turns out that running a few huge regexes is far less stressful
than running many small checks that accomplish nearly the same thing. I'm sure
there is more tweaking that can be done. I'm not exactly an expert.
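The difference in the number of searches can be shown on a toy scale. This is only a sketch with made-up data and hypothetical function names, and it assumes the slow version tried every pattern on every name without stopping early. It counts search calls for both approaches and checks that they flag the same names.

```python
import re

def search_separately(patterns, names):
    """One small regex per pattern: len(patterns) * len(names) calls."""
    compiled = [re.compile(re.escape(p)) for p in patterns]
    calls = 0
    hits = []
    for name in names:
        matched = False
        for rx in compiled:
            calls += 1
            if rx.search(name):
                matched = True
        hits.append(matched)
    return calls, hits

def search_combined(patterns, names):
    """One stitched regex: len(names) calls."""
    rx = re.compile("|".join(re.escape(p) for p in patterns))
    calls = 0
    hits = []
    for name in names:
        calls += 1
        hits.append(rx.search(name) is not None)
    return calls, hits

# Made-up data for illustration only.
patterns = ["remix", "live", "demo"]
names = ["a_remix.mp3", "b.mp3", "c_live.mp3", "d.mp3"]

separate_calls, separate_hits = search_separately(patterns, names)
combined_calls, combined_hits = search_combined(patterns, names)
print(separate_calls, combined_calls)  # 12 4
```

With 3 patterns and 4 names that is 12 calls against 4; the gap grows with the length of the pattern list.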
Sample output.
    2008-08-29_17-51
    -------- Summary ------------------------------
    Total files processed > ..............................
    Good files in playlist > ..............................
    Bad files out of list > ..............................
    Fantastic files > ..............................
    Number of Duplicates > ..............................
    -------- time ------------------------------
    Process seconds > ..............................
    Elapsed seconds > ..............................
    Files/second > ..............................
    reg ex count > ..............................
    CPU time of this process > ..............................
    -------- Info ------------------------------
    Good file > ..............................
    Fantastic file > ..............................
    Log file > ..................... kelly_log_2008-08-29_17-42.log
    Directory processed > ..............................
    Items in kill list > ..............................
    Items in fantastic list > ..............................
    Options > .............................. ['-o', '-f', 'kelly_fan', '-g', 'kelly_good', '-l', 'kelly_log']
    ------------------------------