Wednesday, July 1, 2015

Word length histogram & frequency


In natural language analysis you sometimes need to know the distribution of word lengths in a given text.

In other words, a histogram of word lengths for that text.

You can use Perl as shown below. I used the text corpus from the fortunes package.

The binsize column is the expected number of words of that length in a 100-element random sample, computed as int(100 * frequency). For example, 3-letter words have frequency of about 0.202, so a 100-word sample contains roughly 20 of them.

cat /usr/share/games/fortunes/*.u8 | \
perl -nle 'for $word (/(\w+)/g) { $wordcount++; $ls{length($word)}++; } 
  END{
    print join("\t", qw(length count frequency binsize));
    for $length (sort {$a<=>$b} grep {$_<=20} keys %ls) {
      $freq = $ls{$length} / $wordcount;
      print join("\t", $length, $ls{$length}, $freq, int(100*$freq));
    }
  }'
 
length count frequency binsize
1 32334 0.0723502995016883 7
2 75544 0.169036649519253 16
3 90270 0.201987429208183 20
4 80746 0.180676603066844 18
5 50783 0.11363163418056 11
6 36790 0.0823210094224999 8
7 30286 0.0677677111000226 6
8 20079 0.0449286096274633 4
9 13215 0.0295697781875057 2
10 8457 0.0189233154848079 1
11 4376 0.00979170256137155 0
12 2243 0.00501891884030082 0
13 1075 0.00240541139247587 0
14 406 0.00090846234915833 0
15 160 0.00035801471888013 0
16 55 0.000123067559615045 0
17 16 3.5801471888013e-05 0
18 17 3.80390638810138e-05 0
19 5 1.11879599650041e-05 0
20 3 6.71277597900244e-06 0
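
If you prefer a visual histogram instead of a table, the same counting logic can be wrapped in a standalone script that draws one '#' per expected word in a 100-element sample. This is only a sketch, not part of the original one-liner; the script name wordhist.pl is made up here, and the corpus path is the same as above.

#!/usr/bin/perl
# wordhist.pl -- a minimal sketch: same counting as the one-liner above,
# but the binsize column is drawn as a bar of '#' characters.
use strict;
use warnings;

my (%ls, $wordcount);

# Count every \w+ token on standard input, bucketed by its length.
while (my $line = <STDIN>) {
    for my $word ($line =~ /(\w+)/g) {
        $wordcount++;
        $ls{ length($word) }++;
    }
}

# Print length, count and a bar: one '#' per expected word in a
# 100-element random sample (the same int(100 * frequency) as above).
for my $length (sort { $a <=> $b } grep { $_ <= 20 } keys %ls) {
    my $freq = $ls{$length} / $wordcount;
    printf "%2d %6d %s\n", $length, $ls{$length}, '#' x int(100 * $freq);
}

Run it with the same pipe as before:

cat /usr/share/games/fortunes/*.u8 | perl wordhist.pl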