In natural language analysis you sometimes need to know the distribution of word lengths in a given text, in other words a histogram of word lengths.
You can compute one with Perl as shown below. I used the text corpus from the fortunes package.
The binsize column is the expected number of words of that length in a random sample of 100 words, computed as int(100 * frequency).
cat /usr/share/games/fortunes/*.u8 | \
perl -nle '
  for $word (/(\w+)/g) { $wordcount++; $ls{length($word)}++ }
  END {
    print join("\t", qw(length count frequency binsize));
    for $length (sort { $a <=> $b } grep { $_ <= 20 } keys %ls) {
      $freq = $ls{$length} / $wordcount;
      print join("\t", $length, $ls{$length}, $freq, int(100 * $freq));
    }
  }'
length count frequency binsize
1 32334 0.0723502995016883 7
2 75544 0.169036649519253 16
3 90270 0.201987429208183 20
4 80746 0.180676603066844 18
5 50783 0.11363163418056 11
6 36790 0.0823210094224999 8
7 30286 0.0677677111000226 6
8 20079 0.0449286096274633 4
9 13215 0.0295697781875057 2
10 8457 0.0189233154848079 1
11 4376 0.00979170256137155 0
12 2243 0.00501891884030082 0
13 1075 0.00240541139247587 0
14 406 0.00090846234915833 0
15 160 0.00035801471888013 0
16 55 0.000123067559615045 0
17 16 3.5801471888013e-05 0
18 17 3.80390638810138e-05 0
19 5 1.11879599650041e-05 0
20 3 6.71277597900244e-06 0