Friday, October 6, 2017

Card tricks brute forced

(Now going from lowest-level to highest-level coding)


When I was in middle school, we used a standard 36-card playing deck (from 6 onwards) to play a bunch of games. We also used this deck to do some "loves me, loves me not" type fortune telling where "loves me" was indicated by, essentially, an occurrence of two aces in a row somewhere in the deck.

Much later on I got curious about how likely that outcome was. A short back-of-the-envelope calculation gives
P = [P(A1)*P(A2|A1)]*N 
where 
P(A1) = probability of "there is an ace at position 1 in the deck", 4/36 = 1/9
P(A2|A1) = probability of "there is an ace at position 2 in the deck given there is an ace at position 1", 3/35 (3 aces left among the 35 remaining cards)
N = 35 -- the number of positions in the deck where a pair of consecutive aces can start.

Hence

P = [(1/9)*(3/35)]*35 = 1/3


This looks too high, at first sight. 
So let's write a simple brute-force proof in Wolfram Language (ex-Mathematica):
cards = "6789TJQKA";
adeck = Characters[cards<>cards<>cards<>cards];
testdeck[deck_List] := Length[StringPosition[StringJoin@@ToString/@deck,"AA"]]>0;
(*Monte-Carlo*)
n=100000;For[c=0;i=0,i<n,i++,(thedeck=RandomSample[adeck];If[testdeck[thedeck],c++])]
approx=N[c/n]

What the code does is more or less obvious: it tests a large number n of random permutations of the deck (using RandomSample[]), checks whether each resulting deck has at least two aces in a row (by searching for "AA" in the deck converted to a string), and counts the decks in which this is found.

You can run the code in Wolfram Cloud and verify that the answer is indeed quite close to 0.3.
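For readers without Wolfram access, here is the same Monte-Carlo check sketched in Python (the variable names are mine, not from the original):

```python
import random

random.seed(1)                       # fixed seed, so the run is reproducible
deck = list("6789TJQKA" * 4)         # the same 36-card deck, suits ignored

def has_adjacent_aces(d):
    """True if this deck ordering contains two aces in a row."""
    return "AA" in "".join(d)

n = 50_000
hits = sum(has_adjacent_aces(random.sample(deck, len(deck))) for _ in range(n))
print(hits / n)                      # comes out close to 0.305
```

Swapping in a different deck string or a different target sequence works exactly as in the Wolfram version.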
(In hindsight, is it so strange, really? A "loves me, loves me not" trick will only survive among kids if it has a significantly non-negligible chance of a favorable outcome. And I do believe this is how most fortune telling works.)
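In fact, this particular question does not need Monte-Carlo at all: the exact probability follows from the standard "gaps" counting argument (placements with no two aces adjacent correspond to choosing 4 of the 33 slots formed by the 32 non-aces), sketched here in Python:

```python
from math import comb

# Exact probability of at least one pair of adjacent aces in a 36-card deck:
# arrangements with NO two aces adjacent number C(36 - 4 + 1, 4) = C(33, 4)
# out of C(36, 4) ways to place the aces.
p_exact = 1 - comb(33, 4) / comb(36, 4)
print(p_exact)  # about 0.3053
```

So the back-of-envelope 1/3 overcounts slightly (summing probabilities over positions double-counts decks with more than one adjacent pair), but it lands in the right ballpark.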


Naturally, the code is easy to modify for other card tricks.
For example, you can set cards="23456789TJQKA" for a standard 52-card deck; additionally you can append <>"XX" to cards<>cards<>cards<>cards to include jokers (or <>"AA" if you want them to be wild and count as aces).
Or you can change the testing function to test for any card sequence (like "AK" or "6789"), or indeed modify the criterion to any degree of complexity.

If you want to differentiate the suits of the cards, you will need slightly more complicated code:

cards = "6789TJQKA"; suits = "SHDC";
adeck = Outer[StringJoin, cards//Characters, suits//Characters]//Flatten;
testdeck[deck_List] := Length[StringPosition[StringJoin@@(First[Characters[#]]&)/@deck,"AA"]]>0;
Isn't it just beautiful how you can combine Map (/@) and Apply (@@) along with Mathematica's pure functions to create tests on lists without unnecessary boilerplate like loops? (And we can verify that making all aces distinct does not change the probability of two of them following each other.)

Finally, I had initially planned to do a rigorous (rather than Monte-Carlo) test using Permutations[adeck], but it turns out that Mathematica won't build lists with "more than machine integer" elements -- and a 36-card deck has 36! orderings. A pity, but exhaustive enumeration would still work (and would likely be faster than Monte-Carlo) for much shorter decks.
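For a short enough deck, exhaustive enumeration is perfectly feasible. As a sanity check (sketched in Python rather than Wolfram), a toy 8-card deck with 4 aces gives a brute-force answer that agrees exactly with the "gaps" formula:

```python
from itertools import permutations
from math import comb

deck = "6789AAAA"                  # toy 8-card deck containing 4 aces
total = hits = 0
for perm in permutations(deck):    # all 8! = 40320 orderings, exhaustively
    total += 1
    hits += "AA" in "".join(perm)

p_brute = hits / total
p_gaps = 1 - comb(8 - 4 + 1, 4) / comb(8, 4)   # no-two-adjacent counting
print(p_brute, p_gaps)             # both equal 13/14
```

(Permutations of the repeated 'A's are counted multiple times, but each distinct ordering is overcounted by the same factor, so the ratio is unaffected.)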

Monday, October 2, 2017

Text pattern matching, revisited

This is more of a retro post to reflect on one of my very first code experiences.

1. Backstory

Back then, I was a student in an Eastern European city in the middle of a social paradigm change, and as a consequence, with parents struggling to make ends meet. So, even though I did receive a PC (or rather, $500 needed to assemble one) as my prom gift, that PC (with its AMD 5x86-133 processor) was very quickly out-specced by almost everyone in the class...

Long story short, when the city's telephone directory database was leaked and immediately became a very popular piece of software, I was annoyed to find that it ran slow as a snail on a system like mine. It was even more annoying because the database, at around 7 MB, was small enough to fit entirely in memory -- so, given the right shell, it could potentially be made much faster.

So I was approached by another guy with an older PC, and we decided to write such a "fast shell". The first step was to reverse engineer the database, which turned out to be quite simple. Each record was 16 bytes, like so:

TT TT TT  NN NN NN  I1 I2  SS SS  SN SN SB SL AP AP 
telephone surname   inits  street  str.#/l    apt.#  

TT -- telephone number 
NN -- last name (integer key)
I1, I2 -- given name initials
SS -- street name (integer key)
SN -- street number
SB -- numeric street number suffix (for numbers like 23/1, pretty common in ex-USSR)
SL -- literal street number suffix (for numbers like 23A, less common but present)
AP -- apartment number

Note that SB and SL could have been combined, giving a shorter 15-byte record; however, the saving was deemed small compared with the convenience of an aligned 16-byte record length, which made the database file easily browsable in a hex viewer.
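Unpacking such a fixed 16-byte record is straightforward in any language; here is a sketch in Python following the layout above (the byte order, and the test record itself, are my assumptions for illustration -- I have no copy of the original file):

```python
def parse_record(rec: bytes) -> dict:
    """Unpack one 16-byte directory record per the layout described above.
    Little-endian byte order is an assumption for illustration."""
    assert len(rec) == 16
    return {
        "phone":      int.from_bytes(rec[0:3],   "little"),  # TT TT TT
        "name_key":   int.from_bytes(rec[3:6],   "little"),  # NN NN NN
        "initials":   rec[6:8].decode("ascii", "replace"),   # I1 I2
        "street_key": int.from_bytes(rec[8:10],  "little"),  # SS SS
        "street_no":  int.from_bytes(rec[10:12], "little"),  # SN SN
        "suffix_num": rec[12],                               # SB (23/1 -> 1)
        "suffix_lit": chr(rec[13]),                          # SL (23A -> 'A')
        "apartment":  int.from_bytes(rec[14:16], "little"),  # AP AP
    }

# a hypothetical record, built only to exercise the unpacker
rec = ((1234567).to_bytes(3, "little") + (42).to_bytes(3, "little")
       + b"IV" + (7).to_bytes(2, "little") + (23).to_bytes(2, "little")
       + bytes([1]) + b"A" + (99).to_bytes(2, "little"))
print(parse_record(rec)["phone"])  # 1234567
```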

The database had some 500-600k entries and was sorted by telephone number (allowing instant look-ups by phone number via binary search). Look-ups by a specific name or street were very fast as well -- once the integer key for the name was found (again by binary search, very fast), matching an integer through the entire database took about 1 second.

2. Problem 

What turned out to be problematic was looking up an approximate or patterned street name or last name, such as "find everyone whose name matches J?nes and who lives on West* street". On the platform we had (some protected-mode 32-bit Pascal, or even standard Borland DPMI -- I cannot remember now -- possibly with the use of Turbo Vision), even a simple string comparison over all entries in the database ran for ~30 seconds. Any conceivable pattern-matching routine would take minutes, which was prohibitively long.

To remedy this, we resorted to assembly language and wrote the following (assuming, for simplicity, that both the pattern and the test string are null-terminated and that the length of the pattern is known in advance):


...
; preparation

mov ecx, __length     ;must be pattern length w/o trailing zero!
mov esi, __pattern    ;pattern string
mov edi, __test       ;test string
mov bh, 0x2A          ;*
mov bl, 0x3F          ;?
...

; main function

_resume:
repe cmpsb
je _pass
mov ah, [esi-1]
mov al, [edi-1]
cmp ah, bh            ;* encountered 
je _pass
cmp al, 0x00          ;test string too short
je _fail
cmp ah, bl            ;? encountered, skip char
jne _fail             ;otherwise -- failed
cmp cx, 0              
je _pass
jmp _resume
...


This routine, built around the extremely fast repe cmpsb instruction, could handle "?" and a trailing "*" in the pattern (much like MS-DOS wildcards), going through the entire DB in about 5-6 seconds. What it could not do was handle "*" in the middle of the pattern, to allow searching for something like Wood*ck. And I wanted that functionality quite badly.

3. Solution

So the idea was to extend the handling of "*" with a simple rule: look at the character that follows "*" in the pattern, then keep skipping characters in the test string until that character is found or the string runs out. Finally I came up with this:


...
_resume:
repe cmpsb
je _pass
mov ah, [esi-1]
mov al, [edi-1]
cmp ah, bh ; * encountered
jne _cont

 inc esi
 dec cx ; skip *
 mov ah, [esi-1] ;look behind *
 cmp ah, 0x00 
 je _pass ;pass if * trailing
_keep:
 inc edi
 mov al, [edi-1]
 cmp al, 0x00 ; out of chars
 je _fail
 cmp al, ah ; next char found
 je _resume ; resume loop; keep cx intact!
 jmp _keep
  
_cont:
cmp al, 0x00 ; test runs out
je _fail
cmp ah, bl ; ? encountered
jne _fail
cmp cx, 0 ; pattern runs out
je _pass
jmp _resume


This allowed full pattern matching in under 10 seconds, and without any stack operations: everything is done entirely in registers, with memory access kept to a minimum.
The only limitation is that constructions such as "*?" are disallowed (use "?*" instead), as is "**" (use a single "*"). (On the flip side, this makes it possible to search for a literal "*" or "?" in the test string, even though that was never practically relevant.)
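To make the control flow easier to follow, here is a Python model of the same logic (my reconstruction, not the original code; the register juggling is replaced with plain indices, plus a guard so the "*" scan cannot run past the terminator):

```python
def wild_match(pattern: str, text: str) -> bool:
    """Greedy wildcard match mirroring the assembly routine: '?' matches any
    single char; '*' skips text until the next pattern char is found
    (no backtracking, like the original)."""
    p, t = pattern + "\0", text + "\0"   # model the null terminators
    pi = ti = 0
    cx = len(pattern)                    # __length: pattern chars remaining
    while True:
        equal = True                     # models the zero flag (ZF)
        while cx > 0:                    # repe cmpsb
            equal = p[pi] == t[ti]
            pi, ti, cx = pi + 1, ti + 1, cx - 1
            if not equal:
                break
        if equal:                        # je _pass
            return True
        pc, tc = p[pi - 1], t[ti - 1]    # the mismatching pair
        if pc == "*":
            ahead = p[pi]                # look at the char after '*'
            pi, cx = pi + 1, cx - 1
            if ahead == "\0":
                return True              # trailing '*': pass
            if tc == "\0":
                return False             # text already exhausted
            while True:                  # _keep: skip text chars
                tc, ti = t[ti], ti + 1
                if tc == "\0":
                    return False         # ran out of text
                if tc == ahead:
                    break                # found it; resume the main loop
            continue
        if tc == "\0":
            return False                 # test string too short
        if pc != "?":
            return False                 # plain mismatch
        if cx == 0:
            return True                  # pattern exhausted after '?'
```

Note that, like the assembly, this binds a mid-pattern "*" to the first occurrence of the following character and never backtracks.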
You can test it out in an emulator by adding some housekeeping code for I/O (borrowed here from a typical "Hello, World" example using Linux int 0x80 system calls; this part will be heavily platform-specific):
  _pass: ;output - pattern matched
     mov edx, 7    ;message length
     mov ecx, __pass    ;message to write
     mov ebx, 1     ;file descriptor (stdout)
     mov eax, 4     ;system call number (sys_write)
     int 0x80        ;call kernel
     mov eax, 1     ;system call number (sys_exit)
     int 0x80        ;call kernel
    
  _fail: ;output - pattern not matched
     mov edx, 11    ;message length
     mov ecx, __fail    ;message to write
     mov ebx, 1     ;file descriptor (stdout)
     mov eax, 4     ;system call number (sys_write)
     int 0x80        ;call kernel
     mov eax, 1     ;system call number (sys_exit)
     int 0x80        ;call kernel

  section .data ;input
     __pattern db 'testi*e',0,0xa 
     __test db 'testicle',0,0xa 
     __pass        db  'Matched',0xa
     __fail        db  'Not matched',0xa
     __length equ 7 ;length of pattern 'testi*e', w/o the trailing zero


One may wonder whether it made any sense at all to run string matching on every DB entry, rather than generate a list of all key values for matching names or streets and then check each entry's key against that list. I agree that the second approach would make sense under most circumstances, and even on those rare occasions where the list would have thousands of entries (say, if we only knew the first letter of the name), the matching could be sped up with binary search (our names list is sorted, remember?). I was, however, too lazy to write and debug such a matching routine.