

2008-08-29

Speed of Regular Expressions

I complained about the speed, or lack of it, of regular expressions for string comparisons in Python. Of course, it turns out I was not using the tool properly. I looked at what was going on and decided to try doing the same work in one large regex as opposed to several small ones. It only took a few minutes to add the code that stitches the 167 strings together so that they act as one huge regex.


  # Stitch a list together as one long regular expression string.
  def MakeReg(sourceList):
      # Join the entries with the or bar '|' so that a match on any
      # single entry counts as a match for the whole expression.
      # join() puts the bar only between entries, so there is no
      # leading '|' to strip off afterwards.
      return "|".join(str(string) for string in sourceList)
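
Here is a minimal sketch of how the stitched string might be compiled and used. It assumes the list entries are meant as literal text rather than regex syntax, which may not be true of my real list. The names make_literal_reg and kill_list, and the sample entries and paths, are made up for illustration. re.escape only matters if an entry contains characters such as '.' or '(' that should match literally.

```python
import re

def make_literal_reg(source_list):
    # Hypothetical variant of MakeReg: escape each entry so that regex
    # metacharacters in it are matched as plain text.
    return "|".join(re.escape(str(item)) for item in source_list)

# Made-up entries standing in for the real 167-item list.
kill_list = ["live", "demo", "remix (club)"]

# Compile once, then reuse the compiled pattern for every file name.
pattern = re.compile(make_literal_reg(kill_list), re.IGNORECASE)

print(bool(pattern.search("/mnt/music/band/song_Demo.mp3")))
print(bool(pattern.search("/mnt/music/band/song.mp3")))
```

If the entries are already regex fragments, the escaping step should be left out, since it would turn them into literal text.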

That code simply takes the existing list of strings and joins them with an or bar '|' (Shift plus backslash on a US keyboard). This makes one regex perform hundreds of searches in one fell swoop. What this means is that the whole search over 69000 files now takes two minutes less than it did with standard string searches. For the regex version, that is an hour and twenty minutes down to less than ten minutes with the same result at the end. The time savings come from the number of regex calls dropping from sixteen million to three hundred thousand.
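
The difference in call counts can be sketched on a toy scale. This is an illustration under assumptions, not my actual script: the function names, the three patterns, and the three file names are all made up, and the real loop does more work per file.

```python
import re

# Made-up stand-ins for the 167 patterns and the 69000 file names.
patterns = ["live", "demo", "karaoke"]
names = ["a_live.mp3", "b.mp3", "c_demo.mp3"]

def match_separately(names, patterns):
    # One small regex per pattern: up to len(patterns) searches per name.
    compiled = [re.compile(p) for p in patterns]
    calls, hits = 0, []
    for name in names:
        for rx in compiled:
            calls += 1
            if rx.search(name):
                hits.append(name)
                break  # stop at the first pattern that matches
    return hits, calls

def match_combined(names, patterns):
    # One big alternation: exactly one search per name.
    rx = re.compile("|".join(patterns))
    calls, hits = 0, []
    for name in names:
        calls += 1
        if rx.search(name):
            hits.append(name)
    return hits, calls

print(match_separately(names, patterns))
print(match_combined(names, patterns))
```

Both functions return the same list of matching names. Only the number of search calls differs, and the gap widens as the pattern list grows.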

Instead of 167x69000x3x80 or whatever it was, it is now more like 69000x4x2-(a bunch of removed redundancy)+(some checking for duplicate files). It turns out that running a few huge regexes is far less stressful than running many small checks that accomplish nearly the same thing. I'm sure there is more tweaking that can be done. I'm not exactly an expert.
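
As a rough sanity check on those numbers, the arithmetic can be worked through using only the rounded figures quoted above (167 strings, 69000 files, sixteen million and three hundred thousand regex calls). The variable names are made up, and the exact multipliers in the old script are not reconstructed here.

```python
patterns = 167
files = 69000

# One pass of every small pattern over every file, with no early exit.
one_pass_small = patterns * files
print(one_pass_small)

# Rounded call counts quoted in the post.
calls_before = 16_000_000
calls_after = 300_000

# Roughly how many times fewer regex calls the combined version makes.
print(round(calls_before / calls_after))

# Roughly how many regex calls per file remain.
print(round(calls_after / files, 1))
```

A single pass of 167 patterns over 69000 files already comes to more than eleven million searches, so a total of sixteen million for the old version is plausible once repeat passes are counted.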

Sample output.


2008-08-29_17-51
-------- Summary ---------------------------------------------------------------
Total files processed     > .............................................. 69013
Good files in playlist    > .............................................. 41310
Bad files out of list     > .............................................. 11370
Fantastic files           > ............................................... 5184
Number of Duplicates      > .............................................. 14097
-------- time ------------------------------------------------------------------
Process seconds           > ................................................ 523
Elapsed seconds           > ................................................ 523
Files/second              > ................................................ 131
reg ex count              > ............................................. 302011
CPU time of this process  > ............................................. 499.29
-------- Info ------------------------------------------------------------------
Good file                 > ..................................... kelly_good.m3u
Fantastic file            > ...................................... kelly_fan.m3u
Log file                  > ..................... kelly_log_2008-08-29_17-42.log
Directory processed       > ..................................... ['/mnt/music']
Items in kill list        > ................................................ 167
Items in fantastic list   > ................................................. 84
Options                   > .......................................... See below
['-o', '-f', 'kelly_fan', '-g', 'kelly_good', '-l', 'kelly_log']
--------------------------------------------------------------------------------

