Shy gypsy, slyly spryly tryst by my crypt III

Jul. 21st, 2014 | 03:48 pm

Having looked at vowel patterns in the previous two posts, I decided to turn my attention to the vowel-to-letter ratios of words (counting "y" as a vowel). A new script was quickly produced:

#!/usr/bin/perl

use strict;
use warnings;
use feature qw/say/;

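# Maps each vowel-to-letter ratio to the list of words that have it.
my %ratios = ();

while(<>) {
    chomp;
    next unless length;    # skip blank lines to avoid dividing by zero
    # Keep only the vowels ("y" counts as a vowel here).
    (my $vowels = $_) =~ s/[^aeiouy]//g;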

    my $ratio = (length $vowels) / (length $_);
    push @{ $ratios{$ratio} }, {
        "word" => $_,
        "vowels" => $vowels,
        "length" => length $_,
        "vlength" => length $vowels,
    };
}

# Sort the ratios numerically rather than as strings.
foreach (sort { $a <=> $b } keys %ratios) {
    my @candidates = map { $_->{'word'} } @{ $ratios{$_} };
    say "$_: (", scalar @candidates, ")";

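    # Raise the length threshold until no word passes it; the last
    # non-empty set then holds the longest words for this ratio.
    my $length = 0;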
    my @prevcandidates;
    while(@candidates != 0) {
        $length++;
        @prevcandidates = @candidates;
        @candidates = grep { length >= $length } @candidates;
    }

    say "$_ (min=", ($length - 1), "): (", scalar @prevcandidates, ") ",
        (@prevcandidates < 15)
        ? join ", ", @prevcandidates
        : ""
    ;
}

This provided some interesting insights. All in all, there were 66 different vowel-to-letter ratios (in this wordlist[1]); disregarding several all-consonant abbreviations, the lowest vowel-to-letter ratio is 1/9, and there are two words with that ratio. Wanna guess? They both contain nine letters and one vowel: "tsktsking" (the act of going "tsk, tsk", in case it isn't clear what this means) and "strengths".
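The script above keys its hash on the floating-point ratio, while fractions such as 1/9 are easier to read. Here is a minimal sketch of how the reduced fraction could be computed instead; the helper names vowel_count, gcd and ratio_fraction are made up for illustration and are not part of the original script:

```perl
#!/usr/bin/perl

use strict;
use warnings;
use feature qw/say/;

# Count the vowels with tr///, treating "y" as a vowel like the script above.
sub vowel_count {
    my ($word) = @_;
    my $count = ($word =~ tr/aeiouy//);
    return $count;
}

# Greatest common divisor (Euclid's algorithm), used to reduce the fraction.
sub gcd {
    my ($x, $y) = @_;
    ($x, $y) = ($y, $x % $y) while $y;
    return $x;
}

# Return the vowel-to-letter ratio as a reduced fraction such as "1/9".
sub ratio_fraction {
    my ($word) = @_;
    my $letters = length $word;
    return unless $letters;    # nothing sensible to report for an empty string
    my $vowels  = vowel_count($word);
    my $divisor = gcd($vowels, $letters);
    return ($vowels / $divisor) . "/" . ($letters / $divisor);
}

say "$_: ", ratio_fraction($_) for qw/strengths tsktsking latchstrings hdqrs/;
# strengths: 1/9
# tsktsking: 1/9
# latchstrings: 1/6
# hdqrs: 0/1
```

Using the reduced fraction as the hash key would group 2/12 and 1/6 together just like the decimal does, without relying on how floating-point numbers are turned into strings.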

Many other ratios contain far too many words to list, so I limited the script's output to the longest ones in each class. There are precisely two words with twelve letters and two vowels, "latchstrings" and "spendthrifts"; two words with fourteen letters and three vowels, "forthrightness" and "thriftlessness"; and two with twenty-three letters and nine vowels, "disestablismentarianism" (the wordlist's spelling; the correct "disestablishmentarianism" has twenty-four letters) and "electroencephalographic". The latter appears again in slightly altered form, "electroencephalography", if you go back to twenty-two letters and nine vowels, alongside a new word with the same stats: "counterclassifications".
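The "longest ones in each class" step is done in the script by raising a length threshold in a loop. A shorter way to get the same set might look like the sketch below, which uses max from the core List::Util module. The helper longest_words and the four-word sample class are assumptions for illustration; the sample words all have a ratio of 1/6:

```perl
#!/usr/bin/perl

use strict;
use warnings;
use feature qw/say/;
use List::Util qw/max/;

# Return the greatest word length, followed by every word of that length.
# longest_words is a hypothetical helper, not part of the original script.
sub longest_words {
    my @words = @_;
    return (0) unless @words;
    my $max = max(map { length($_) } @words);
    return ($max, grep { length($_) == $max } @words);
}

# Sample class: four words whose vowel-to-letter ratio is 1/6.
my ($len, @hits) = longest_words(qw/strict thrift latchstrings spendthrifts/);
say "max=$len: (", scalar @hits, ") ", join(", ", @hits);
# prints: max=12: (2) latchstrings, spendthrifts
```

This makes two passes over the class (one for the maximum, one for the filter) instead of one pass per possible length.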

And sometimes, there's just a single hit: for instance, the script found precisely one word with nineteen letters and ten vowels, "antirevolutionaries", and one with fifteen letters and nine vowels, "autoinoculation".

My friend who's been providing ideas for further probing then inquired about the longest word in English with the fewest vowels. If you allow no vowels at all, the best hit is a five-letter abbreviation, "hdqrs" (="headquarters"). Otherwise, we'll have to cook up a new script to look into the matter:

#!/usr/bin/perl

use strict;
use warnings;
use feature qw/say/;

my %lengths = ();

while(<>) {
    chomp;
    next unless length;    # skip blank lines to avoid dividing by zero
    (my $vowels = $_) =~ s/[^aeiouy]//g;

    my $ratio = (length $vowels) / (length $_);
    push @{ $lengths{length $_} }, {
        "word" => $_,
        "vowels" => $vowels,
        "vlength" => length $vowels,
        "ratio" => $ratio,
    };
}

my @lengths = sort { $a <=> $b } keys %lengths;
foreach(@lengths) {
    # Lowest ratio first; every word tied with the first entry counts as a hit.
    my @sorted = sort { $a->{'ratio'} <=> $b->{'ratio'} } @{ $lengths{$_} };
    my $bestratio = $sorted[0]->{'ratio'};
    my @hits = map { $_->{'word'} } grep { $_->{'ratio'} == $bestratio } @{ $lengths{$_} };

    say "$_: best ratio=$bestratio, vowels=", $sorted[0]->{'vlength'};
    say "$_: (", scalar @hits, ") ",
        (@hits < 15)
        ? join ", ", @hits
        : ""
    ;
}

There is no single answer here, as the number of vowels will go up as the words' lengths go up; it's a judgement call whether you consider a longer word with more vowels a better hit or not. As such, I think it's justified to post the script's full output (no guessing games, sorry; also, note that some linebreaks have been added manually to improve readability):

1: best ratio=1, vowels=1
1: (1) a
2: best ratio=0, vowels=0
2: (58)
3: best ratio=0, vowels=0
3: (47)
4: best ratio=0, vowels=0
4: (12) bdrm, bldg, blvd, cmdg, ctrl, dbms, kwhr, mktg, psst, rcpt, tbsp, tnpk
5: best ratio=0, vowels=0
5: (1) hdqrs
6: best ratio=0.166666666666667, vowels=1
6: (573)
7: best ratio=0.142857142857143, vowels=1
7: (71)
8: best ratio=0.125, vowels=1
8: (12) borschts, pschents, schlepps, schlocks, schmaltz, schmucks, schnapps, 
        schticks, sprights, strength, tsktsked, twelfths
9: best ratio=0.111111111111111, vowels=1
9: (2) strengths, tsktsking
10: best ratio=0.2, vowels=2
10: (217)
11: best ratio=0.181818181818182, vowels=2
11: (33)
12: best ratio=0.166666666666667, vowels=2
12: (2) latchstrings, spendthrifts
13: best ratio=0.230769230769231, vowels=3
13: (47)
14: best ratio=0.214285714285714, vowels=3
14: (2) forthrightness, thriftlessness
15: best ratio=0.266666666666667, vowels=4
15: (21)
16: best ratio=0.25, vowels=4
16: (2) shortsightedness, spendthriftiness
17: best ratio=0.294117647058824, vowels=5
17: (10) anthropomorphisms, crystallographers, disfranchisements, misunderstandings, 
         postconvalescents, prepossessingness, straightforwardly, transcendentalism, 
         transcendentalist, transcendentalizm
18: best ratio=0.277777777777778, vowels=5
18: (1) transcendentalists
19: best ratio=0.263157894736842, vowels=5
19: (1) straightforwardness
20: best ratio=0.35, vowels=7
20: (1) incomprehensibleness
21: best ratio=0.380952380952381, vowels=8
21: (3) antienvironmentalists, electroencephalograms, electroencephalograph
22: best ratio=0.363636363636364, vowels=8
22: (1) electroencephalographs
23: best ratio=0.391304347826087, vowels=9
23: (2) disestablismentarianism, electroencephalographic
25: best ratio=0.4, vowels=10
25: (1) antidisestablishmentarian
28: best ratio=0.392857142857143, vowels=11
28: (1) antidisestablishmentarianism

An interesting observation, BTW: with the exception of length one (with its one vowel), the number of vowels never increases as word length goes down; it is monotonically non-increasing.

With the above script in place, it's only natural to ask about the highest vowel-to-letter ratios as well. This only requires changing a single line:

...
    my @sorted = sort { $b->{'ratio'} <=> $a->{'ratio'} } @{ $lengths{$_} };
...
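Since all words in one group have the same length, comparing ratios within a group amounts to comparing vowel counts. Under that assumption, both directions could be folded into one routine with a switch. What follows is only a sketch: extremes_by_length and its "fewest"/"most" argument are hypothetical names, and the sample words are taken from the eight-letter results in this post:

```perl
#!/usr/bin/perl

use strict;
use warnings;
use feature qw/say/;
use List::Util qw/min max/;

# Count the vowels, treating "y" as a vowel like the scripts above.
sub vowel_count {
    my ($word) = @_;
    my $count = ($word =~ tr/aeiouy//);
    return $count;
}

# For every word length, find the extreme vowel count and the words that
# reach it. $want is "fewest" or "most" (a hypothetical switch).
sub extremes_by_length {
    my ($want, @words) = @_;

    # Group the non-empty words by their length.
    my %by_length;
    push @{ $by_length{ length($_) } }, $_ for grep { length($_) } @words;

    my %result;
    for my $len (keys %by_length) {
        my @group  = @{ $by_length{$len} };
        my @counts = map { vowel_count($_) } @group;
        my $best   = $want eq 'most' ? max(@counts) : min(@counts);
        $result{$len} = {
            vowels => $best,
            words  => [ grep { vowel_count($_) == $best } @group ],
        };
    }
    return \%result;
}

my @sample = qw/aureolae eyepiece strength twelfths schmaltz/;
for my $want (qw/fewest most/) {
    my $r = extremes_by_length($want, @sample);
    for my $len (sort { $a <=> $b } keys %$r) {
        say "$want, $len: vowels=$r->{$len}{vowels} (",
            join(", ", @{ $r->{$len}{words} }), ")";
    }
}
# fewest, 8: vowels=1 (strength, twelfths, schmaltz)
# most, 8: vowels=6 (aureolae, eyepiece)
```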

Again, I'll just share the full results (as before, some linebreaks were added manually for the sake of readability):

1: best ratio=1, vowels=1
1: (1) a
2: best ratio=1, vowels=2
2: (5) ai, ay, ie, ii, ye
3: best ratio=1, vowels=3
3: (9) aye, eau, eye, iii, iou, oui, yay, yea, you
4: best ratio=1, vowels=4
4: (1) ieee
5: best ratio=0.8, vowels=4
5: (26)
6: best ratio=0.666666666666667, vowels=4
6: (277)
7: best ratio=0.714285714285714, vowels=5
7: (33)
8: best ratio=0.75, vowels=6
8: (2) aureolae, eyepiece
9: best ratio=0.666666666666667, vowels=6
9: (17)
10: best ratio=0.6, vowels=6
10: (118)
11: best ratio=0.636363636363636, vowels=7
11: (8) aerobiology, audaciously, audiologies, audiovisual, auxiliaries, beauteously, 
        bourgeoisie, louisianian
12: best ratio=0.666666666666667, vowels=8
12: (1) onomatopoeia
13: best ratio=0.538461538461538, vowels=7
13: (202)
14: best ratio=0.571428571428571, vowels=8
14: (24)
15: best ratio=0.6, vowels=9
15: (1) autoinoculation
16: best ratio=0.5625, vowels=9
16: (7) autoimmunization, autointoxication, automanipulation, automanipulative, 
        editorialization, onomatopoeically, sociosexualities
17: best ratio=0.529411764705882, vowels=9
17: (6) antirevolutionary, bureaucratization, editorializations, individualization, 
        onomatopoetically, semiautomatically
18: best ratio=0.5, vowels=9
18: (9) autobiographically, inconceivabilities, influenceabilities, neurophysiological, 
        nonauthoritatively, overgeneralization, overspecialization, radiosensitivities, 
        subminiaturization
19: best ratio=0.526315789473684, vowels=10
19: (1) antirevolutionaries
20: best ratio=0.5, vowels=10
20: (5) counterrevolutionary, institutionalization, internationalization, 
        microminiaturization, neurophysiologically
21: best ratio=0.476190476190476, vowels=10
21: (2) internationalizations, microminiaturizations
22: best ratio=0.5, vowels=11
22: (1) counterrevolutionaries
23: best ratio=0.391304347826087, vowels=9
23: (2) disestablismentarianism, electroencephalographic
25: best ratio=0.4, vowels=10
25: (1) antidisestablishmentarian
28: best ratio=0.392857142857143, vowels=11
28: (1) antidisestablishmentarianism

The results for lengths 28, 25 and 23 are the same as before; there are very few English words that long. The number of vowels does not decrease monotonically with word length this time; there are two "humps", from 23/9 to 22/11 and from 13/7 to 12/8.
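The humps can also be found mechanically. The sketch below hard-codes the length/vowel pairs copied from the two result listings above and walks from the longest length to the shortest, reporting every step where the vowel count goes up. The helper name humps and the two hash names are made up for illustration:

```perl
#!/usr/bin/perl

use strict;
use warnings;
use feature qw/say/;

# Word length => vowel count, copied from the two result listings above.
my %fewest = (
     1 => 1,  2 => 0,  3 => 0,  4 => 0,  5 => 0,  6 => 1,  7 => 1,
     8 => 1,  9 => 1, 10 => 2, 11 => 2, 12 => 2, 13 => 3, 14 => 3,
    15 => 4, 16 => 4, 17 => 5, 18 => 5, 19 => 5, 20 => 7, 21 => 8,
    22 => 8, 23 => 9, 25 => 10, 28 => 11,
);
my %most = (
     1 => 1,  2 => 2,  3 => 3,  4 => 4,  5 => 4,  6 => 4,  7 => 5,
     8 => 6,  9 => 6, 10 => 6, 11 => 7, 12 => 8, 13 => 7, 14 => 8,
    15 => 9, 16 => 9, 17 => 9, 18 => 9, 19 => 10, 20 => 10, 21 => 10,
    22 => 11, 23 => 9, 25 => 10, 28 => 11,
);

# Walk from the longest length down to the shortest and collect every step
# where the vowel count rises instead of staying level or dropping.
sub humps {
    my ($table) = @_;
    my @lengths = sort { $b <=> $a } keys %$table;
    my @humps;
    for my $i (1 .. $#lengths) {
        my ($from, $to) = @lengths[$i - 1, $i];
        push @humps, "$from/$table->{$from} -> $to/$table->{$to}"
            if $table->{$to} > $table->{$from};
    }
    return @humps;
}

say "fewest vowels: ", join(", ", humps(\%fewest));
say "most vowels:   ", join(", ", humps(\%most));
# fewest vowels: 2/0 -> 1/1
# most vowels:   23/9 -> 22/11, 13/7 -> 12/8
```

The single entry for the lowest-ratio table is the length-one exception already noted earlier.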

Outside of that, some beautiful vowel-heavy words are found: "aureolae", "eyepiece" and "onomatopoeia". And I also couldn't help but think that "Audaciously Louisianian" would be a good title for someone's autobiography. :)

  1. BTW, I also spotted two further errors in the word list: "unpredictabilness", "transcendentalizm". I wouldn't be surprised if there are more still.


Comments {4}

Transitioning into liminal space

from: stormdog
date: Jul. 23rd, 2014 05:28 pm (UTC)

This is all a little bit like looking at the NLP code that my partner danaeris is frequently working on....

Schneelocke

from: schnee
date: Jul. 23rd, 2014 05:38 pm (UTC)

NLP — natural language processing, or neurolinguistic programming? (Or something else entirely?)

Transitioning into liminal space

from: stormdog
date: Jul. 23rd, 2014 06:33 pm (UTC)

Oh, sorry. Yes, natural language processing. A lot of projects she's working on involve computer-processing of text.

Schneelocke

from: schnee
date: Jul. 23rd, 2014 06:45 pm (UTC)

Ah, interesting. Yeah, I reckon this is vaguely similar, though it's of course only done out of linguistic curiosity. :)
