Monday, October 2, 2017

Text pattern matching, revisited

This is more of a retro post to reflect on one of my very first code experiences.

1. Backstory

Back then, I was a student in an Eastern European city in the middle of a social paradigm change, and as a consequence, with parents struggling to make ends meet. So, even though I did receive a PC (or rather, $500 needed to assemble one) as my prom gift, that PC (with its AMD 5x86-133 processor) was very quickly out-specced by almost everyone in the class...

Long story short, when the city's telephone directory database was leaked and immediately became a very popular piece of software, I was annoyed to find out that it was slow as a snail on a system like mine. It was even more annoying to see that the database, of around 7MB in size, was quite enough to fit entirely in memory, so, given the right shell, we could potentially try to make it faster.

So I was approached by another guy with an older PC and we decided to write such a "fast shell". The first step was to reverse engineer the database, which was quite simple. Each record was 16 bytes, like so:

TT TT TT  NN NN NN  I1 I2  SS SS  SN SN SB SL AP AP 
telephone surname   inits  street  str.#/l    apt.#  

TT -- telephone number 
NN -- last name (integer key)
I1, I2 -- given name initials
SS -- street name (integer key)
SN -- street number
SB -- numeric street number suffix (for numbers like 23/1, pretty common in ex-USSR)
SL -- literal street number suffix (for numbers like 23A, less common but present)
AP -- apartment number

Note that SB and SL could be combined, resulting in a shorter 15-byte record; however the benefit was deemed small compared with having an aligned 16-byte record length, allowing the database file to be very easily browsable in a hex viewer.

The database had some 500-600k entries and was sorted by telephone number (allowing instant look-ups by phone number via binary search). The look-ups by by a specific name or street were very fast as well -- once the integer key for the name was found (again by binary searching, very fast), matching an integer through the entire database took about 1 second.

2. Problem 

What turned out a bit problematic was to look up by an approximate or patterned street name or last name, such as "find everyone whose name matches J?nes and who lives on West* street". On a platform that we had (some protected mode 32-bit Pascal or even standard Borland DPMI -- cannot remember now -- possibly with the use of Turbo Vision), even a simple string comparison for all entries in the database would run for ~30 seconds. Any conceivable pattern matching routine would take up minutes, which was prohibitively long.

To remedy this, we reverted to the assembly language and wrote this: (we assume, for simplicity, that both the pattern and the test string are null-terminated and that the length of the pattern is known in advance):


...
; preparation

mov ecx, __length     ;must be pattern length w/o trailing zero!
mov esi, __pattern    ;pattern string
mov edi, __test       ;test string
mov bh, 0x2A          ;*
mov bl, 0x3F          ;?
...

; main function

_resume:
repe cmpsb
je _pass
mov ah, [esi-1]
mov al, [edi-1]
cmp ah, bh            ;* encountered 
je _pass
cmp al, 0x00          ;test string too short
je _fail
cmp ah, bl            ;? encountered, skip char
jne _fail             ;otherwise -- failed
cmp cx, 0              
je _pass
jmp _resume
...


This routine (at the heart of which was a low-level repe cmpsb which is extremely fast) could handle "?" and the trailing "*" in the pattern (much like in MS-DOS wildcards), going through the entire DB in about 5-6 seconds. What it could not do was to handle "*" in the middle of the string, so as to be able to search for something like Wood*ck. And I wanted this functionality quite badly.

3. Solution

So the idea would be to extend the handling of "*" with the basic idea of  look what's behind "*" in the pattern; keep skipping characters in the test string until it matches or until the string runs out. Finally I came up with this: 


...
_resume:
repe cmpsb
je _pass
mov ah, [esi-1]
mov al, [edi-1]
cmp ah, bh ; * encountered
jne _cont

 inc esi
 dec cx ; skip *
 mov ah, [esi-1] ;look behind *
 cmp ah, 0x00 
 je _pass ;pass if * trailing
_keep:
 inc edi
 mov al, [edi-1]
 cmp al, 0x00 ; out of chars
 je _fail
 cmp al, ah ; next char found
 je _resume ; resume loop; keep cx intact!
 jmp _keep
  
_cont:
cmp al, 0x00 ; test runs out
je _fail
cmp ah, bl ; ? encountered
jne _fail
cmp cx, 0 ; pattern runs out
je _pass
jmp _resume


This allowed full pattern matching in less than 10 seconds, and without using any stack operations - everything is done entirely with registers and minimized memory access.
The only limitation was to disallow constructions such as "*?" (but you could use "?*" instead) and "**" (but you should use a single "*"). (Or on the flipside, this allows to search for literal "*" and "?" in the test string even though it was never practically relevant.)
You can test it out in an emulator by adding some housekeeping code for I/O (just borrowed this from their "Hello, World" example, this will be heavily platform-specific):
  _pass: ;output - pattern matched
     mov edx, 7    ;message length
     mov ecx, __pass    ;message to write
     mov ebx, 1     ;file descriptor (stdout)
     mov eax, 4     ;system call number (sys_write)
     int 0x80        ;call kernel
     mov eax, 1     ;system call number (sys_exit)
     int 0x80        ;call kernel
    
  _fail: ;output - pattern not matched
     mov edx, 11    ;message length
     mov ecx, __fail    ;message to write
     mov ebx, 1     ;file descriptor (stdout)
     mov eax, 4     ;system call number (sys_write)
     int 0x80        ;call kernel
     mov eax, 1     ;system call number (sys_exit)
     int 0x80        ;call kernel

  section .data ;input
     __pattern db 'testi*e',0,0xa 
     __test db 'testicle',0,0xa 
     __pass        db  'Matched',0xa
     __fail        db  'Not matched',0xa
     __length equ 8 ;length of pattern


One may wonder if it made any sense at all to do string matching on every DB entry, rather than generate a list of all key values for matching names or streets and then checking if the key of each entry matches that list. I would agree that the second approach would make sense under most circumstances, and even in those rare occasions where the list would have thousands of entries (like, if we only know the first letter of the name) matching can be made faster using binary search (our names list is sorted, remember?). I was, however, too lazy to write and debug such a matching routine, though.

No comments:

Post a Comment