Shy gypsy, slyly spryly tryst by my crypt II

Jul. 20th, 2014 | 01:38 am

Let's have some more fun with computational linguistics. A friend mentioned the "oioioi" vowel pattern; there's no words matching that, but why not generalize to arbitrary vowels, and repeats for that matter?

foreach my $repeat (1..5) {
    foreach my $vowel1 ("a", "e", "i", "o", "u", "y") {
        foreach my $vowel2 ("a", "e", "i", "o", "u", "y") {
            next if $vowel1 eq $vowel2;

            my $pattern = ($vowel1 . $vowel2) x $repeat;
            foreach (grep { $_ =~ m/^$pattern$/ } keys %freq) {
                say "$_: ", $freq{$_}->{'count'};
                say "$_: ", join ", ", @{ $freq{$_}->{'words'} } if $freq{$_}->{'count'} < 15;
            }
        }
    }
}
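The loop above leans on a %freq hash that isn't defined in this entry. Judging from how it's used, it maps each vowel pattern to a count and a list of matching words. Here's a minimal sketch of how such a hash might be built; the helper names vowel_pattern and build_freq and the tiny inline wordlist are assumptions for illustration, not the original setup.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: reduce a word to its vowel pattern, counting "y" as a vowel.
sub vowel_pattern {
    my ($word) = @_;
    (my $pattern = lc $word) =~ tr/aeiouy//cd;    # delete every non-vowel character
    return $pattern;
}

# Hypothetical helper: build pattern => { count => N, words => [...] },
# the shape the loops in this entry appear to expect.
sub build_freq {
    my (@words) = @_;
    my %freq;
    foreach my $word (@words) {
        my $entry = $freq{ vowel_pattern($word) } //= { count => 0, words => [] };
        $entry->{count}++;
        push @{ $entry->{words} }, $word;
    }
    return %freq;
}

# Stand-in wordlist; a real run would read a dictionary file instead.
my %freq = build_freq(qw(unnatural testosterone prolegomenon anticapitalist));
say "$_: ", join ", ", @{ $freq{$_}->{words} } foreach sort keys %freq;
```

With a hash like this in place, matching a pattern is just a lookup on the keys, which is all the grep in the loop above really does.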

There are no words matching patterns with a repeat of four or five, but for two repeats, we get rich results. "ieie" (233 words) is the most common one, followed by "aeae" (217) and "aiai" (111); the least common one that exists at all, meanwhile, is "uaua", with just a single word to its credit, "unnatural" (appropriately enough).

For three repeats, there's fewer hits: two for "aiaiai", three for "ieieie", and, amazingly enough, one for "oeoeoe". Want to take a guess? Go on, I'll wait; when you're done, be sure to check the solutions: anticapitalist, anticapitalists; impertinencies, intermittencies, misdeliveries; osteoscleroses.

Those vowel patterns look unbalanced, though, so let's add an extra copy of the first vowel at the end: same code as above, but the pattern's constructed this way:

            my $pattern = (($vowel1 . $vowel2) x $repeat) . $vowel1;

This finds several vowel patterns of length 5, among them an interesting "mirrored" pair with one hit each, "eoeoe" and "oeoeo"; the words are "testosterone" and "prolegomenon", respectively. The word that started it all, "mystifyingly", is also duly identified.

But wait, there's more! The script also finds two "aeaea" ("caesarean", "caesareans"); twelve "aiaia" ("antiaircraft", "antiradical", "antiradicals", "apiarian", "marginalia", "pharisaical", "sanitaria", "sanitarian", "sanitarians", "scandinavia", "scandinavian", "scandinavians"); three "eueue" ("desuetude", "desuetudes", "enqueue"); eight "iaiai" ("bipartisanship", "biracialism", "dilapidating", "inactivating", "ingratiating", "insalivating", "invalidating", "irradiating"); two "ieiei" ("disbelieving", "interviewing"); one "ioioi" ("microbiotic"); and two "oaoao" ("collaborator", "collaborators"). Finally, there's 35 "eaeae" and 16 "eieie", too numerous to list, but armed with wordlist and script, you can easily find them yourself. Whew!

What about patterns where one vowel is repeated several times, sandwiched between one copy each of a different vowel?

foreach my $repeat (1..9) {
...
            my $pattern = $vowel1 . ($vowel2 x $repeat) . $vowel1;

Turns out that the longest such pattern that actually occurs is "eiiiie", with a whopping seven hits: "credibilities", "legibilities", "semicivilized", "semiprimitive", "sensibilities", "sensitivities", "specificities".

The opposite also works, of course: what about patterns where a single copy of one vowel is sandwiched between chains of the same repeating vowel?

            my $pattern = ($vowel1 x $repeat) . $vowel2 . ($vowel1 x $repeat);

This time, the largest repeat for which there are hits is two, but there's several beautiful ones, such as "anathemata", "malayalam" (a palindrome, to boot!) and "whencesoever". You can even construct book titles this way: I'm sure that "The Bereavement of the Senegalese Benefactresses" would find both critical and popular acclaim, and its sequel, "The Reenlargement of the Delegatee's Desperateness", would no doubt be a bestseller as well.
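All four variants so far differ only in the one line that constructs the pattern, so they can be put side by side. The following is just a sketch of that idea, not the script I actually ran: the %builders table, its key names and the patterns_for helper are made up for illustration. It only generates candidate patterns; looking them up in %freq works as before.

```perl
use strict;
use warnings;
use feature 'say';

my @vowels = qw(a e i o u y);

# Hypothetical table: one builder per pattern family discussed so far.
# Each takes two vowels and a repeat count and returns a single string.
my %builders = (
    alternating => sub { my ($v1, $v2, $n) = @_; return "$v1$v2" x $n },
    closed      => sub { my ($v1, $v2, $n) = @_; my $p = "$v1$v2" x $n; return $p . $v1 },
    sandwich    => sub { my ($v1, $v2, $n) = @_; my $run = $v2 x $n; return $v1 . $run . $v1 },
    inverse     => sub { my ($v1, $v2, $n) = @_; my $run = $v1 x $n; return $run . $v2 . $run },
);

# Return every pattern a builder produces for repeat counts 1..$max_repeat,
# skipping pairs where both vowels are the same.
sub patterns_for {
    my ($builder, $max_repeat) = @_;
    my @patterns;
    foreach my $repeat (1 .. $max_repeat) {
        foreach my $v1 (@vowels) {
            foreach my $v2 (@vowels) {
                next if $v1 eq $v2;
                push @patterns, $builder->($v1, $v2, $repeat);
            }
        }
    }
    return @patterns;
}

# Show the "e...e" sandwiches with a run of "i" in the middle.
say join " ", grep { /^ei+e$/ } patterns_for($builders{sandwich}, 4);
```

One thing to watch out for: the builders avoid parentheses directly around the left operand of "x". A parenthesized left operand in list context makes Perl repeat a list rather than a string, and the push call above would impose exactly that context on the builder's return value.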

The same friend to whom "oioioi" was due also suggested investigating vowel patterns of the form "123123". As easily done as said:

foreach my $vowel1 ("a", "e", "i", "o", "u", "y") {
    foreach my $vowel2 ("a", "e", "i", "o", "u", "y") {
        foreach my $vowel3 ("a", "e", "i", "o", "u", "y") {
            next if $vowel1 eq $vowel2;
            next if $vowel2 eq $vowel3;
            next if $vowel3 eq $vowel1;

            foreach (grep { $_ =~ m/^$vowel1$vowel2$vowel3$vowel1$vowel2$vowel3$/ } keys %freq) {
                say "$_: ", $freq{$_}->{'count'};
                say "$_: ", join ", ", @{ $freq{$_}->{'words'} } if $freq{$_}->{'count'} < 15;
            }
        }
    }
}

There are surprisingly many patterns of this form that have hits, although it's almost invariably just one, maybe two words. This time, an article in a medical journal suggests itself: "The Bounteousness of Inappreciable Tonsillectomies: a Critical Review of Clinical Practice". Coming soon to a Lancet near you!

Patterns of the form "123321" are a natural next step. But instead of The Lancet, this time you get the Medical Enquirer: "Newsflash! Doctors return to age-old cure! Cardiological Immunoglobulin Rehabilitated! Also in this issue: Exhilarative Retinoscopies! Unpropitious Nonintervention!"

In the same vein, what about patterns of the form "1234321"? There's just two, "auioiua" (two hits) and "uoieiou" (one hit). Guess if you'd like; here's the solutions: "audiovisual", "audiovisuals"; "unconscientious". Adding yet another vowel and looking for "123454321" patterns produces no results, though, alas, and neither are there any hits for "12341234".
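Nesting one more loop per extra vowel gets unwieldy quickly. An alternative (a sketch only, not what produced the results above) is to turn each vowel pattern into its numeric "shape" by numbering the distinct vowels in order of first appearance, and then to compare shapes directly. The helper name shape is an assumption for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: number each distinct vowel by order of first appearance,
# so "aiooia" becomes "123321". With only six vowels, single digits suffice.
sub shape {
    my ($pattern) = @_;
    my %seen;
    my $next = 1;
    return join "", map { $seen{$_} //= $next++ } split //, $pattern;
}

say shape("oueoue");     # vowel pattern of "bounteousness"
say shape("aiooia");     # vowel pattern of "cardiological"
say shape("auioiua");    # vowel pattern of "audiovisual"
```

Given a %freq hash keyed by vowel pattern, something like grep { shape($_) eq "1234321" } keys %freq would then pick out every pattern of that form in one pass. Distinct digits always stand for distinct vowels, so the "next if" checks aren't needed.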

Got any interesting ideas for what to look for? Share 'em and I'll be happy to investigate!
