This is more of a retro post to reflect on one of my very first code experiences.
Long story short, when the city's telephone directory database was leaked and immediately became a very popular piece of software, I was annoyed to find out that it was slow as a snail on a system like mine. It was even more annoying to see that the database, of around 7MB in size, was quite enough to fit entirely in memory, so, given the right shell, we could potentially try to make it faster.
So I was approached by another guy with an older PC and we decided to write such a "fast shell". The first step was to reverse engineer the database, which was quite simple. Each record was 16 bytes, like so:
Note that SB and SL could be combined, resulting in a shorter 15-byte record; however the benefit was deemed small compared with having an aligned 16-byte record length, allowing the database file to be very easily browsable in a hex viewer.
The database had some 500-600k entries and was sorted by telephone number (allowing instant look-ups by phone number via binary search). The look-ups by by a specific name or street were very fast as well -- once the integer key for the name was found (again by binary searching, very fast), matching an integer through the entire database took about 1 second.
To remedy this, we reverted to the assembly language and wrote this: (we assume, for simplicity, that both the pattern and the test string are null-terminated and that the length of the pattern is known in advance):
This routine (at the heart of which was a low-level repe cmpsb which is extremely fast) could handle "?" and the trailing "*" in the pattern (much like in MS-DOS wildcards), going through the entire DB in about 5-6 seconds. What it could not do was to handle "*" in the middle of the string, so as to be able to search for something like Wood*ck. And I wanted this functionality quite badly.
1. Backstory
Back then, I was a student in an Eastern European city in the middle of a social paradigm change, and as a consequence, with parents struggling to make ends meet. So, even though I did receive a PC (or rather, $500 needed to assemble one) as my prom gift, that PC (with its AMD 5x86-133 processor) was very quickly out-specced by almost everyone in the class...Long story short, when the city's telephone directory database was leaked and immediately became a very popular piece of software, I was annoyed to find out that it was slow as a snail on a system like mine. It was even more annoying to see that the database, of around 7MB in size, was quite enough to fit entirely in memory, so, given the right shell, we could potentially try to make it faster.
So I was approached by another guy with an older PC and we decided to write such a "fast shell". The first step was to reverse engineer the database, which was quite simple. Each record was 16 bytes, like so:
TT TT TT NN NN NN I1 I2 SS SS SN SN SB SL AP AP telephone surname inits street str.#/l apt.# TT -- telephone number NN -- last name (integer key) I1, I2 -- given name initials SS -- street name (integer key) SN -- street number SB -- numeric street number suffix (for numbers like 23/1, pretty common in ex-USSR) SL -- literal street number suffix (for numbers like 23A, less common but present) AP -- apartment number
Note that SB and SL could be combined, resulting in a shorter 15-byte record; however the benefit was deemed small compared with having an aligned 16-byte record length, allowing the database file to be very easily browsable in a hex viewer.
The database had some 500-600k entries and was sorted by telephone number (allowing instant look-ups by phone number via binary search). The look-ups by by a specific name or street were very fast as well -- once the integer key for the name was found (again by binary searching, very fast), matching an integer through the entire database took about 1 second.
2. Problem
What turned out a bit problematic was to look up by an approximate or patterned street name or last name, such as "find everyone whose name matches J?nes and who lives on West* street". On a platform that we had (some protected mode 32-bit Pascal or even standard Borland DPMI -- cannot remember now -- possibly with the use of Turbo Vision), even a simple string comparison for all entries in the database would run for ~30 seconds. Any conceivable pattern matching routine would take up minutes, which was prohibitively long.To remedy this, we reverted to the assembly language and wrote this: (we assume, for simplicity, that both the pattern and the test string are null-terminated and that the length of the pattern is known in advance):
... ; preparation mov ecx, __length ;must be pattern length w/o trailing zero! mov esi, __pattern ;pattern string mov edi, __test ;test string mov bh, 0x2A ;* mov bl, 0x3F ;? ... ; main function _resume: repe cmpsb je _pass mov ah, [esi-1] mov al, [edi-1] cmp ah, bh ;* encountered je _pass cmp al, 0x00 ;test string too short je _fail cmp ah, bl ;? encountered, skip char jne _fail ;otherwise -- failed cmp cx, 0 je _pass jmp _resume ...
This routine (at the heart of which was a low-level repe cmpsb which is extremely fast) could handle "?" and the trailing "*" in the pattern (much like in MS-DOS wildcards), going through the entire DB in about 5-6 seconds. What it could not do was to handle "*" in the middle of the string, so as to be able to search for something like Wood*ck. And I wanted this functionality quite badly.
3. Solution
So the idea would be to extend the handling of "*" with the basic idea of look what's behind "*" in the pattern; keep skipping characters in the test string until it matches or until the string runs out. Finally I came up with this:... _resume: repe cmpsb je _pass mov ah, [esi-1] mov al, [edi-1] cmp ah, bh ; * encountered jne _cont inc esi dec cx ; skip * mov ah, [esi-1] ;look behind * cmp ah, 0x00 je _pass ;pass if * trailing _keep: inc edi mov al, [edi-1] cmp al, 0x00 ; out of chars je _fail cmp al, ah ; next char found je _resume ; resume loop; keep cx intact! jmp _keep _cont: cmp al, 0x00 ; test runs out je _fail cmp ah, bl ; ? encountered jne _fail cmp cx, 0 ; pattern runs out je _pass jmp _resume
This allowed full pattern matching in less than 10 seconds, and without using any stack operations - everything is done entirely with registers and minimized memory access.
The only limitation was to disallow constructions such as "*?" (but you could use "?*" instead) and "**" (but you should use a single "*"). (Or on the flipside, this allows to search for literal "*" and "?" in the test string even though it was never practically relevant.)
You can test it out in an emulator by adding some housekeeping code for I/O (just borrowed this from their "Hello, World" example, this will be heavily platform-specific):
_pass: ;output - pattern matched mov edx, 7 ;message length mov ecx, __pass ;message to write mov ebx, 1 ;file descriptor (stdout) mov eax, 4 ;system call number (sys_write) int 0x80 ;call kernel mov eax, 1 ;system call number (sys_exit) int 0x80 ;call kernel _fail: ;output - pattern not matched mov edx, 11 ;message length mov ecx, __fail ;message to write mov ebx, 1 ;file descriptor (stdout) mov eax, 4 ;system call number (sys_write) int 0x80 ;call kernel mov eax, 1 ;system call number (sys_exit) int 0x80 ;call kernel section .data ;input __pattern db 'testi*e',0,0xa __test db 'testicle',0,0xa __pass db 'Matched',0xa __fail db 'Not matched',0xa __length equ 8 ;length of pattern
One may wonder if it made any sense at all to do string matching on every DB entry, rather than generate a list of all key values for matching names or streets and then checking if the key of each entry matches that list. I would agree that the second approach would make sense under most circumstances, and even in those rare occasions where the list would have thousands of entries (like, if we only know the first letter of the name) matching can be made faster using binary search (our names list is sorted, remember?). I was, however, too lazy to write and debug such a matching routine, though.
No comments:
Post a Comment