Glossary for 「諸概念の迷宮(Things got frantic)」

A summary of the logic and related terms used frequently in the main articles.

【Inferential Statistics Warehouse】On handling the frequency distribution and the mode.


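Thanks in part to the development of approximation methods since the early modern period, representative values already come with established conventions such as π = 3.141593 and sqrt(2) = 1.414214, and computers treat numbers like these as a kind of constant. But consider, for example...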

Implementation example in the statistical language R

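K <- 100   # number of repetitions
N <- 1000  # number of random points per repetition
pi.est <- numeric(0)  # name kept from the pi experiment; it stores mean distances
for (k in seq_len(K)) {
  # N uniform random points in the square [-1, 1] x [-1, 1]
  x <- runif(N, min = -1, max = 1)
  y <- runif(N, min = -1, max = 1)
  # mean distance of the N points from the origin (0, 0)
  Data <- sum(sqrt(x * x + y * y)) / N
  pi.est <- c(Data, pi.est)  # prepend the newest value
}
hist(pi.est, breaks = 50)
rug(pi.est)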

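Let us first look at the data taken from the pi computation: 100 values, each the mean distance sqrt(x^2 + y^2) from the center (0, 0) of N = 1000 points placed at random in the range xlim = (-1, 1), ylim = (-1, 1).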

Implementation example in the statistical language R

pi.est
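
Despite the variable name, each value stored by the loop above is the mean distance of N random points from the origin, not an estimate of pi. As a rough cross-check, the sketch below compares a fresh simulation with the known closed-form mean distance from a corner of the unit square, (sqrt(2) + log(1 + sqrt(2))) / 3, which by symmetry also applies to the center of the square [-1, 1] x [-1, 1]. The seed and the names `sim_means` and `theory` are assumptions made for this illustration and do not appear in the original.

```r
set.seed(1)  # arbitrary seed, assumed here only for repeatability
K <- 100
N <- 1000
# K repetitions of: mean distance of N uniform points in [-1, 1]^2 from (0, 0)
sim_means <- replicate(K, {
  x <- runif(N, min = -1, max = 1)
  y <- runif(N, min = -1, max = 1)
  mean(sqrt(x^2 + y^2))
})
# closed-form expected distance (about 0.7652)
theory <- (sqrt(2) + log(1 + sqrt(2))) / 3
print(c(simulated = mean(sim_means), theory = theory))
```

This would explain why the 100 values cluster around 0.765 rather than around 3.14.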

#Save the data to ensure reproducibility
TestData01<-c(0.7575710,0.7626254,0.7720857,0.7636143,0.7559374,0.7636892 ,0.7697000,0.7744773,0.7718450,0.7661308,0.7643701,0.7773141,0.7601370,0.7625076,0.7657443,0.7642925,0.7652136,0.7626871,0.7651091,0.7838558,0.7577962,0.7679088,0.7691799,0.7778908,0.7651222,0.7745353,0.7630110,0.7704386,0.7693323,0.7626189,0.7589299,0.7469987,0.7606626,0.7728721,0.7608035,0.7766716,0.7748260,0.7754264,0.7671104,0.7588898,0.7763099,0.7590662,0.7713618,0.7706218,0.7752811,0.7783880,0.7727876,0.7694160,0.7689929,0.7661827,0.7569213,0.7565509,0.7594678,0.7557077,0.7661841,0.7831111,0.7671762,0.7670367,0.7456229,0.7633547,0.7806836,0.7537715,0.7705932,0.7695142,0.7594232,0.7629710,0.7571859,0.7698099,0.7585800,0.7565154,0.7619804,0.7723939,0.7621805,0.7769550,0.7552302,0.7477231,0.7605412,0.7620618,0.7601785,0.7479306,0.7770422,0.7799119,0.7709682,0.7657947,0.7612600,0.7643751,0.7653982,0.7593435,0.7823729,0.7659631,0.7529565,0.7603398,0.7592395,0.7607167,0.7566061,0.7573243,0.7640064,0.7661753,0.7667829,0.7758546)

#Redisplay the histogram and the rug plot
hist(TestData01, breaks=50)
rug(TestData01)


The road to displaying a "frequency distribution table"

#Tabulating the raw values directly only produces a mess.
table(TestData01)


① For a start, try grouping the values with the round function

Numerical computation, part 1

TestData02<-round(TestData01,digits = 4)
table(TestData02)

TestData02<-round(TestData01,digits = 3)
table(TestData02)


TestData02<-round(TestData01,digits = 2)
table(TestData02)

#As the output shows, the digits of the original data are progressively discarded, so this is not the best approach.
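
To make the loss of information concrete, the following minimal sketch counts how many distinct values survive at each rounding level. It uses only the first six values of TestData01 so that it stays self-contained; the names `x` and `n_distinct` are hypothetical.

```r
# first six values of TestData01 (a small excerpt, for illustration only)
x <- c(0.7575710, 0.7626254, 0.7720857, 0.7636143, 0.7559374, 0.7636892)
# number of distinct values left after rounding to 4, 3 and 2 digits
n_distinct <- sapply(4:2, function(d) length(unique(round(x, digits = d))))
print(n_distinct)
```

The fewer digits are kept, the more values collapse into the same class, and the class boundaries are dictated by decimal places rather than chosen by the analyst.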

② Reuse the output data of the histogram.

h<-hist(TestData01)
h
$breaks
[1] 0.745 0.750 0.755 0.760 0.765 0.770 0.775 0.780 0.785

$counts
[1] 4 2 19 24 23 13 11 4

$density
[1] 8 4 38 48 46 26 22 8

$mids
[1] 0.7475 0.7525 0.7575 0.7625 0.7675 0.7725 0.7775 0.7825

$xname
[1] "TestData01"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

#breaks holds the values that delimit the classes, and counts holds the frequencies.
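
The remaining components of the output can be recovered from these two. As a small sketch, assuming equal-width classes as in the output above, density is the count divided by (total count × class width), and mids is the midpoint of each class. The variable names below are chosen for this illustration.

```r
# values copied from the hist() output shown above
breaks <- c(0.745, 0.750, 0.755, 0.760, 0.765, 0.770, 0.775, 0.780, 0.785)
counts <- c(4, 2, 19, 24, 23, 13, 11, 4)
# density: relative frequency per unit of class width
density <- counts / (sum(counts) * diff(breaks))
# mids: midpoint of each class
mids <- (head(breaks, -1) + tail(breaks, -1)) / 2
print(density)
print(mids)
```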


h <- hist(TestData01, breaks=50) # breaks=50 is only a suggestion; R picks "pretty" boundaries
n <- length(h$counts) # number of classes
# class names "lower ~ upper", built from consecutive break points
class_names <- paste(h$breaks[1:n], "~", h$breaks[2:(n+1)])
frequency_table <- data.frame(class=class_names, frequency=h$counts)

library(xtable)
print(xtable(frequency_table), type = "html")

  class frequency
1 0.745 ~ 0.746 1
2 0.746 ~ 0.747 1
3 0.747 ~ 0.748 2
4 0.748 ~ 0.749 0
5 0.749 ~ 0.75 0
6 0.75 ~ 0.751 0
7 0.751 ~ 0.752 0
8 0.752 ~ 0.753 1
9 0.753 ~ 0.754 1
10 0.754 ~ 0.755 0
11 0.755 ~ 0.756 3
12 0.756 ~ 0.757 4
13 0.757 ~ 0.758 4
14 0.758 ~ 0.759 3
15 0.759 ~ 0.76 5
16 0.76 ~ 0.761 7
17 0.761 ~ 0.762 2
18 0.762 ~ 0.763 7
19 0.763 ~ 0.764 4
20 0.764 ~ 0.765 4
21 0.765 ~ 0.766 7
22 0.766 ~ 0.767 5
23 0.767 ~ 0.768 4
24 0.768 ~ 0.769 1
25 0.769 ~ 0.77 6
26 0.77 ~ 0.771 4
27 0.771 ~ 0.772 2
28 0.772 ~ 0.773 4
29 0.773 ~ 0.774 0
30 0.774 ~ 0.775 3
31 0.775 ~ 0.776 3
32 0.776 ~ 0.777 3
33 0.777 ~ 0.778 3
34 0.778 ~ 0.779 1
35 0.779 ~ 0.78 1
36 0.78 ~ 0.781 1
37 0.781 ~ 0.782 0
38 0.782 ~ 0.783 1
39 0.783 ~ 0.784 2
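
The same kind of table can also be built without drawing a histogram. The sketch below is one possible alternative, not the method used above: it bins the values with `cut()` and counts them with `table()`. It uses the first six values of TestData01 and hand-picked class boundaries, both assumptions made for illustration; `right = TRUE` and `include.lowest = TRUE` mirror the defaults of `hist()`.

```r
# first six values of TestData01 (a small excerpt, for illustration only)
x <- c(0.7575710, 0.7626254, 0.7720857, 0.7636143, 0.7559374, 0.7636892)
# hand-picked class boundaries (hypothetical, width 0.005)
breaks <- c(0.755, 0.760, 0.765, 0.770, 0.775)
# assign each value to a class, closed on the right like hist()
classes <- cut(x, breaks = breaks, right = TRUE, include.lowest = TRUE)
freq <- table(classes)
frequency_table <- data.frame(class = names(freq), frequency = as.vector(freq))
print(frequency_table)
# cross-check against hist() with the same boundaries, without plotting
h <- hist(x, breaks = breaks, plot = FALSE)
print(h$counts)
```

Because the boundaries are passed explicitly, the classes stay the same no matter how the data change, which is not guaranteed when only a suggested number of breaks is given.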

With this, we are finally ready to work with the mode.
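
As a minimal sketch of that next step, the modal class can be read off as the class with the largest frequency. The example reuses the breaks and counts of the coarse histogram output shown earlier; the names `modal_idx`, `modal_class` and `modal_mid` are hypothetical. Using `which(counts == max(counts))` rather than `which.max()` keeps every class in case of a tie, which matters for the finer table above, where three classes share the maximum frequency of 7.

```r
# values copied from the coarse hist() output shown earlier
breaks <- c(0.745, 0.750, 0.755, 0.760, 0.765, 0.770, 0.775, 0.780, 0.785)
counts <- c(4, 2, 19, 24, 23, 13, 11, 4)
# indices of all classes that attain the maximum frequency (ties are kept)
modal_idx <- which(counts == max(counts))
# label and midpoint of the modal class(es)
modal_class <- paste(breaks[modal_idx], "~", breaks[modal_idx + 1])
modal_mid <- (breaks[modal_idx] + breaks[modal_idx + 1]) / 2
print(modal_class)
print(modal_mid)
```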