Schnee (schnee) wrote,

Shy gypsy, slyly spryly tryst by my crypt III

Having looked at vowel patterns in the previous two posts, I decided to turn my attention to the vowel-to-letter ratios of words. A new script was quickly produced:

#!/usr/bin/perl

use strict;
use warnings;
use feature qw/say/;

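# Maps each vowel-to-letter ratio to the list of words that have it.
my %ratios = ();

while(<>) {
    chomp;
    next unless length;    # skip blank lines to avoid dividing by zero
    # Strip everything that isn't a vowel ("y" counts as one here).
    (my $vowels = $_) =~ s/[^aeiouy]//g;

    my $ratio = (length $vowels) / (length $_);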
    push @{ $ratios{$ratio} }, {
        "word" => $_,
        "vowels" => $vowels,
        "length" => length $_,
        "vlength" => length $vowels,
    };
}

# Sort numerically; a plain "sort" would compare the ratios as strings.
foreach (sort { $a <=> $b } keys %ratios) {
    my @candidates = map { $_->{'word'} } @{ $ratios{$_} };
    say "$_: (", scalar @candidates, ")";

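    # Raise the length threshold until no word passes; the last non-empty
    # list then holds the longest words with this ratio.
    my $length = 0;
    my @prevcandidates;
    while(@candidates != 0) {
        $length++;
        @prevcandidates = @candidates;
        @candidates = grep { length >= $length } @candidates;
    }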

    say "$_ (min=", ($length - 1), "): (", scalar @prevcandidates, ") ",
        (@prevcandidates < 15)
        ? join ", ", @prevcandidates
        : ""
    ;
}

This provided some interesting insights. All in all, there were 66 different vowel-to-letter ratios (in this wordlist[1]); disregarding several all-consonant abbreviations, the lowest vowel-to-letter ratio is 1/9, and there are two words with that ratio. Wanna guess? They both contain nine letters and one vowel: "tsktsking" (the act of going "tsk, tsk", in case it isn't clear what this means) and "strengths".
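To make the 1/9 figure concrete, here's a minimal sketch of the per-word calculation. It uses the same vowel set as the script above (so "y" counts as a vowel); the helper name vowel_ratio is made up for illustration and isn't part of the original script.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use feature qw/say/;

# Hypothetical helper (not in the original script): returns the number
# of vowels and the number of letters in a word.
sub vowel_ratio {
    my ($word) = @_;
    # tr/// with an empty replacement list only counts the matching
    # characters; it doesn't change $word.
    my $vowels = ($word =~ tr/aeiouy//);
    return ($vowels, length $word);
}

foreach my $word (qw/strengths tsktsking/) {
    my ($vowels, $letters) = vowel_ratio($word);
    say "$word: $vowels/$letters = ", $vowels / $letters;
}
```

Printing the ratio as "vowels/letters" next to the decimal value makes it easier to match the script's floating-point keys against the fractions quoted in the text.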

Many other ratios contain far too many words to list, so I limited the script's output to the longest ones in each class. There are precisely two words with twelve letters and two vowels, "latchstrings" and "spendthrifts"; two words with fourteen letters and three vowels, "forthrightness" and "thriftlessness"; and two with twenty-three letters and nine vowels, "disestablismentarianism" (sic; that's how the word list spells it, and the correct "disestablishmentarianism" has twenty-four letters) and "electroencephalographic". The latter reappears in slightly altered form, as "electroencephalography", if you go down to twenty-two letters and nine vowels, where it's joined by a new word with the same stats: "counterclassifications".
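The "longest ones in each class" step can also be written without the counting loop. What follows is only a sketch of an alternative, not what the script above does: it assumes List::Util (a core module) for max, and the helper name longest_words and the sample list are invented for illustration. All four sample words have a ratio of 1/6.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use feature qw/say/;
use List::Util qw/max/;

# Hypothetical helper: returns all words that tie for the greatest length.
sub longest_words {
    my @words = @_;
    return () unless @words;
    my $max = max(map { length($_) } @words);
    return grep { length($_) == $max } @words;
}

# Hand-picked sample class; every word here has one vowel per six letters.
my @class = qw/string latchstrings thrift spendthrifts/;
say join ", ", longest_words(@class);
```

This walks the list twice instead of once per length, which hardly matters for a word list of this size but reads a bit more directly.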

And sometimes, there's just a single hit: for instance, the script found precisely one word with nineteen letters and ten vowels, "antirevolutionaries", and one with fifteen letters and nine vowels, "autoinoculation".

My friend who's been providing ideas for further probing then inquired about the longest word in English with the fewest vowels. If you allow no vowels at all, the best hit is a five-letter abbreviation, "hdqrs" (="headquarters"). Otherwise, we'll have to cook up a new script to look into the matter:

#!/usr/bin/perl

use strict;
use warnings;
use feature qw/say/;

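# Maps each word length to the list of words of that length.
my %lengths = ();

while(<>) {
    chomp;
    next unless length;    # skip blank lines to avoid dividing by zero
    (my $vowels = $_) =~ s/[^aeiouy]//g;

    my $ratio = (length $vowels) / (length $_);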
    push @{ $lengths{length $_} }, {
        "word" => $_,
        "vowels" => $vowels,
        "vlength" => length $vowels,
        "ratio" => $ratio,
    };
}

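my @lengths = sort { $a <=> $b } keys %lengths;
foreach(@lengths) {
    # Sort ascending so that the lowest ratio for this length comes first.
    my @sorted = sort { $a->{'ratio'} <=> $b->{'ratio'} } @{ $lengths{$_} };
    my $bestratio = $sorted[0]->{'ratio'};
    # Collect every word of this length that ties for the best ratio.
    my @hits = map { $_->{'word'} } grep { $_->{'ratio'} == $bestratio } @{ $lengths{$_} };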

    say "$_: best ratio=$bestratio, vowels=", $sorted[0]->{'vlength'};
    say "$_: (", scalar @hits, ") ",
        (@hits < 15)
        ? join ", ", @hits
        : ""
    ;
}

There is no single answer here, as the number of vowels will go up as the words' lengths go up; it's a judgement call whether you consider a longer word with more vowels a better hit or not. As such, I think it's justified to post the script's full output (no guessing games, sorry; also, note that some linebreaks have been added manually to improve readability):

1: best ratio=1, vowels=1
1: (1) a
2: best ratio=0, vowels=0
2: (58)
3: best ratio=0, vowels=0
3: (47)
4: best ratio=0, vowels=0
4: (12) bdrm, bldg, blvd, cmdg, ctrl, dbms, kwhr, mktg, psst, rcpt, tbsp, tnpk
5: best ratio=0, vowels=0
5: (1) hdqrs
6: best ratio=0.166666666666667, vowels=1
6: (573)
7: best ratio=0.142857142857143, vowels=1
7: (71)
8: best ratio=0.125, vowels=1
8: (12) borschts, pschents, schlepps, schlocks, schmaltz, schmucks, schnapps, 
        schticks, sprights, strength, tsktsked, twelfths
9: best ratio=0.111111111111111, vowels=1
9: (2) strengths, tsktsking
10: best ratio=0.2, vowels=2
10: (217)
11: best ratio=0.181818181818182, vowels=2
11: (33)
12: best ratio=0.166666666666667, vowels=2
12: (2) latchstrings, spendthrifts
13: best ratio=0.230769230769231, vowels=3
13: (47)
14: best ratio=0.214285714285714, vowels=3
14: (2) forthrightness, thriftlessness
15: best ratio=0.266666666666667, vowels=4
15: (21)
16: best ratio=0.25, vowels=4
16: (2) shortsightedness, spendthriftiness
17: best ratio=0.294117647058824, vowels=5
17: (10) anthropomorphisms, crystallographers, disfranchisements, misunderstandings, 
         postconvalescents, prepossessingness, straightforwardly, transcendentalism, 
         transcendentalist, transcendentalizm
18: best ratio=0.277777777777778, vowels=5
18: (1) transcendentalists
19: best ratio=0.263157894736842, vowels=5
19: (1) straightforwardness
20: best ratio=0.35, vowels=7
20: (1) incomprehensibleness
21: best ratio=0.380952380952381, vowels=8
21: (3) antienvironmentalists, electroencephalograms, electroencephalograph
22: best ratio=0.363636363636364, vowels=8
22: (1) electroencephalographs
23: best ratio=0.391304347826087, vowels=9
23: (2) disestablismentarianism, electroencephalographic
25: best ratio=0.4, vowels=10
25: (1) antidisestablishmentarian
28: best ratio=0.392857142857143, vowels=11
28: (1) antidisestablishmentarianism

An interesting observation, BTW: with the exception of length one (which has one vowel), the number of vowels never increases as word length goes down.

With the above script in place, it's only natural to ask about the highest vowel-to-letter ratios as well. This only requires changing a single line:

...
    my @sorted = sort { $b->{'ratio'} <=> $a->{'ratio'} } @{ $lengths{$_} };
...
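As an aside, instead of editing the sort line by hand, the direction could be made a parameter. This is just a sketch under my own assumptions: the helper best_by_ratio and its $highest flag are hypothetical, and the sample words are taken from the eight-letter results.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use feature qw/say/;

# Hypothetical helper: returns the words that tie for the best ratio,
# where "best" is the lowest ratio, or the highest if $highest is true.
sub best_by_ratio {
    my ($highest, @words) = @_;
    return () unless @words;
    my $dir = $highest ? -1 : 1;    # -1 flips the sort order
    my @entries = map {
        my $vowels = ($_ =~ tr/aeiouy//);
        +{ 'word' => $_, 'ratio' => $vowels / length($_) }
    } @words;
    my @sorted = sort { $dir * ($a->{'ratio'} <=> $b->{'ratio'}) } @entries;
    my $best = $sorted[0]->{'ratio'};
    return map { $_->{'word'} } grep { $_->{'ratio'} == $best } @entries;
}

my @words = qw/aureolae eyepiece strength twelfths/;
say "lowest:  ", join ", ", best_by_ratio(0, @words);
say "highest: ", join ", ", best_by_ratio(1, @words);
```

Multiplying the comparison result by -1 does the same job as swapping $a and $b in the sort block.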

Again, I'll just share the full results (as before, some linebreaks were added manually for the sake of readability):

1: best ratio=1, vowels=1
1: (1) a
2: best ratio=1, vowels=2
2: (5) ai, ay, ie, ii, ye
3: best ratio=1, vowels=3
3: (9) aye, eau, eye, iii, iou, oui, yay, yea, you
4: best ratio=1, vowels=4
4: (1) ieee
5: best ratio=0.8, vowels=4
5: (26)
6: best ratio=0.666666666666667, vowels=4
6: (277)
7: best ratio=0.714285714285714, vowels=5
7: (33)
8: best ratio=0.75, vowels=6
8: (2) aureolae, eyepiece
9: best ratio=0.666666666666667, vowels=6
9: (17)
10: best ratio=0.6, vowels=6
10: (118)
11: best ratio=0.636363636363636, vowels=7
11: (8) aerobiology, audaciously, audiologies, audiovisual, auxiliaries, beauteously, 
        bourgeoisie, louisianian
12: best ratio=0.666666666666667, vowels=8
12: (1) onomatopoeia
13: best ratio=0.538461538461538, vowels=7
13: (202)
14: best ratio=0.571428571428571, vowels=8
14: (24)
15: best ratio=0.6, vowels=9
15: (1) autoinoculation
16: best ratio=0.5625, vowels=9
16: (7) autoimmunization, autointoxication, automanipulation, automanipulative, 
        editorialization, onomatopoeically, sociosexualities
17: best ratio=0.529411764705882, vowels=9
17: (6) antirevolutionary, bureaucratization, editorializations, individualization, 
        onomatopoetically, semiautomatically
18: best ratio=0.5, vowels=9
18: (9) autobiographically, inconceivabilities, influenceabilities, neurophysiological, 
        nonauthoritatively, overgeneralization, overspecialization, radiosensitivities, 
        subminiaturization
19: best ratio=0.526315789473684, vowels=10
19: (1) antirevolutionaries
20: best ratio=0.5, vowels=10
20: (5) counterrevolutionary, institutionalization, internationalization, 
        microminiaturization, neurophysiologically
21: best ratio=0.476190476190476, vowels=10
21: (2) internationalizations, microminiaturizations
22: best ratio=0.5, vowels=11
22: (1) counterrevolutionaries
23: best ratio=0.391304347826087, vowels=9
23: (2) disestablismentarianism, electroencephalographic
25: best ratio=0.4, vowels=10
25: (1) antidisestablishmentarian
28: best ratio=0.392857142857143, vowels=11
28: (1) antidisestablishmentarianism

The results for lengths 28, 25 and 23 are the same as before; there are very few English words that long. The number of vowels is not decreasing monotonically with word length this time; there are two "humps", from 23/9 to 22/11 and from 13/7 to 12/8.
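The monotonicity claims for both result sets can be checked mechanically. The sketch below copies the length/vowel pairs from the two outputs above into hashes and reports every step where the vowel count goes up as the length goes down; the helper name rises_going_down is my own invention.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use feature qw/say/;

# Word length => vowel count of the best hits, copied from the two
# outputs above (lowest ratios first, highest ratios second).
my %lowest = (
    1 => 1, 2 => 0, 3 => 0, 4 => 0, 5 => 0, 6 => 1, 7 => 1, 8 => 1,
    9 => 1, 10 => 2, 11 => 2, 12 => 2, 13 => 3, 14 => 3, 15 => 4,
    16 => 4, 17 => 5, 18 => 5, 19 => 5, 20 => 7, 21 => 8, 22 => 8,
    23 => 9, 25 => 10, 28 => 11,
);
my %highest = (
    1 => 1, 2 => 2, 3 => 3, 4 => 4, 5 => 4, 6 => 4, 7 => 5, 8 => 6,
    9 => 6, 10 => 6, 11 => 7, 12 => 8, 13 => 7, 14 => 8, 15 => 9,
    16 => 9, 17 => 9, 18 => 9, 19 => 10, 20 => 10, 21 => 10, 22 => 11,
    23 => 9, 25 => 10, 28 => 11,
);

# Hypothetical helper: walks from the longest length down to the shortest
# and reports every step at which the vowel count goes up.
sub rises_going_down {
    my ($counts) = @_;
    my @lengths = sort { $b <=> $a } keys %$counts;
    my @rises;
    foreach my $i (1 .. $#lengths) {
        my ($longer, $shorter) = @lengths[$i - 1, $i];
        next unless $counts->{$shorter} > $counts->{$longer};
        push @rises, "$longer/$counts->{$longer} to $shorter/$counts->{$shorter}";
    }
    return @rises;
}

say "lowest ratios:  ", join "; ", rises_going_down(\%lowest);
say "highest ratios: ", join "; ", rises_going_down(\%highest);
```

For the lowest ratios the only rise is the step down to length one; for the highest ratios it finds exactly the two humps mentioned above.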

Outside of that, some beautiful vowel-heavy words are found: "aureolae", "eyepiece" and "onomatopoeia". And I also couldn't help but think that "Audaciously Louisianian" would be a good title for someone's autobiography. :)

  1. BTW, I also spotted two further errors in the word list: "unpredictabilness", "transcendentalizm". I wouldn't be surprised if there's more still.
Tags: english, interesting stuff, linguistics, perl, programming