iconEuler Home

Statistics and Random Numbers

In this notebook, we demonstrate the main statistical plots, tests and distributions In Euler. See the following reference for more information about the available functions.

Statistics in Euler

Let us start with some descriptive statistics. This is not an introduction to statistics. So you might need some background to understand the details.

Assume the following measurements. We wish to compute the mean value and the measured standard deviation.

>M=[1000,1004,998,997,1002,1001,998,1004,998,997]; ...
 mean(M), dev(M),
999.9
2.72641400622

We can plot the box-and-whiskers plot for the data. In our case there are no outliers.

>boxplot(M):

12 - Statistics

We compute the probability that a value is bigger than 1005, assuming the measured values and a normal distribution.

All functions for distributions in Euler end with ...dis and compute the cumulative probability distribution (CPF).

We print the result in % with 2 digits accuracy using the print function.

>print((1-normaldis(1005,mean(M),dev(M)))*100,2,unit=" %")
      3.07 %

For the next example, we assume the following numbers of men in given a size ranges.

>r=155.5:4:187.5; v=[22,71,136,169,139,71,32,8];

Here is a plot of the distribution.

>plot2d(r,v,a=150,b=200,c=0,d=190,bar=1,style="\/"):

12 - Statistics

We can put such raw data into a table.

Tables are a method to store statistical data. Our table should contain three columns: Start of range, end of range, number of men in the range.

Tables can be printed with headers. We use a vector of strings to set the headers.

>T:=r[1:8]' | r[2:9]' | v'; writetable(T,labc=["from","to","count"])
      from        to     count
     155.5     159.5        22
     159.5     163.5        71
     163.5     167.5       136
     167.5     171.5       169
     171.5     175.5       139
     175.5     179.5        71
     179.5     183.5        32
     183.5     187.5         8

If we need the mean value and other statistics of the sizes, we need to compute the midpoint of the ranges. We can use the first two columns of our table for this.

>(T[,1]+T[,2])/2
        157.5 
        161.5 
        165.5 
        169.5 
        173.5 
        177.5 
        181.5 
        185.5 

But it is easier, to fold the ranges with the vector [1/2,1/2].

>l=fold(r,[0.5,0.5])
[157.5,  161.5,  165.5,  169.5,  173.5,  177.5,  181.5,  185.5]

Now we can compute the mean and deviation of the sample with the given frequencies.

>{m,d}=meandev(l,v); m, d,
169.901234568
5.98912964449

Let us add the normal distribution of the values to the plot.

>plot2d("qnormal(x,m,d)*sum(v)*4", ...
   xmin=min(r),xmax=max(r),thickness=3,add=1):

12 - Statistics

Tables

In the directory of this notebook you find a file with a table. The data represent the results of a survey. Here are the first four lines of the file. The data are from an German online book "Einführung in die Statistik mit R" by A. Handl.

>printfile("table.dat",4);
Person Sex Age Titanic Evaluation Tip Problem
1 m 30 n . 1.80 n
2 f 23 y g 1.80 n
3 f 26 y g 1.80 y

The table contains 7 columns of numbers or tokens (strings). We want read the table from the file. First, we use our own translation for the tokens.

Fo this, we define sets of tokens. The function strtokens() gets a string vector of tokens from a given string.

>mf:=["m","f"]; yn:=["y","n"]; ev:=strtokens("g vg m b vb");

Now we read the table with these translations.

The arguments tok2, tok4 etc. are the translations of the columns of the table. These arguments are not in the parameter list of readtable(), so you need to provide them with ":=".

>{MT,hd}=readtable("table.dat",tok2:=mf,tok4:=yn,tok5:=ev,tok7:=yn);
>load over statistics;

For printing, we need to specify the same token sets. We print the first four lines only.

>writetable(MT[1:4],labc=hd,wc=5,tok2:=mf,tok4:=yn,tok5:=ev,tok7:=yn);
 Person  Sex  Age Titanic Evaluation  Tip Problem
      1    m   30       n          .  1.8       n
      2    f   23       y          g  1.8       n
      3    f   26       y          g  1.8       y
      4    m   33       n          .  2.8       n

The dots "." represent values, which are not available.

If we do not want to specify the tokens for the translation in advance, we only need to specify, which columns contain tokens and not numbers.

>ctok=[2,4,5,7]; {MT,hd,tok}=readtable("table.dat",ctok=ctok);

The function readtable() now returns a set of tokens.

>tok
m
n
f
y
g
vg

The table contains the entries from the file with tokens translated to numbers.

The special string NA="." is interpreted as "Not Available", and gets NAN (not a number) in the table. This translation can be changed with the parameters NA, and NAval.

>MT[1]
[1,  1,  30,  2,  NAN,  1.8,  2]

Here is the content of the table with untranslated numbers.

>writetable(MT,wc=5)
    1    1   30    2    .  1.8    2
    2    3   23    4    5  1.8    2
    3    3   26    4    5  1.8    4
    4    1   33    2    .  2.8    2
    5    1   37    2    .  1.8    2
    6    1   28    4    5  2.8    4
    7    3   31    4    6  2.8    2
    8    1   23    2    .  0.8    2
    9    3   24    4    6  1.8    4
   10    1   26    2    .  1.8    2
   11    3   23    4    6  1.8    4
   12    1   32    4    5  1.8    2
   13    1   29    4    6  1.8    4
   14    3   25    4    5  1.8    4
   15    3   31    4    5  0.8    2
   16    1   26    4    5  2.8    2
   17    1   37    2    .  3.8    2
   18    1   38    4    5    .    2
   19    3   29    2    .  3.8    2
   20    3   28    4    6  1.8    2
   21    3   28    4    1  2.8    4
   22    3   28    4    6  1.8    4
   23    3   38    4    5  2.8    2
   24    3   27    4    1  1.8    4
   25    1   27    2    .  2.8    4

For convenience, you can put the output of readtable() into a list.

>Table={{readtable("table.dat",ctok=ctok)}};

Using the same token columns and the tokens read from the file, we can print the table. We can either specify ctok, tok, etc. or use the list Table.

>writetable(Table,ctok=ctok,wc=5);
 Person  Sex  Age Titanic Evaluation  Tip Problem
      1    m   30       n          .  1.8       n
      2    f   23       y          g  1.8       n
      3    f   26       y          g  1.8       y
      4    m   33       n          .  2.8       n
      5    m   37       n          .  1.8       n
      6    m   28       y          g  2.8       y
      7    f   31       y         vg  2.8       n
      8    m   23       n          .  0.8       n
      9    f   24       y         vg  1.8       y
     10    m   26       n          .  1.8       n
     11    f   23       y         vg  1.8       y
     12    m   32       y          g  1.8       n
     13    m   29       y         vg  1.8       y
     14    f   25       y          g  1.8       y
     15    f   31       y          g  0.8       n
     16    m   26       y          g  2.8       n
     17    m   37       n          .  3.8       n
     18    m   38       y          g    .       n
     19    f   29       n          .  3.8       n
     20    f   28       y         vg  1.8       n
     21    f   28       y          m  2.8       y
     22    f   28       y         vg  1.8       y
     23    f   38       y          g  2.8       n
     24    f   27       y          m  1.8       y
     25    m   27       n          .  2.8       y

The function tablecol() returns the values of columns of the table, skipping any rows with NAN values("." in the file), and the indices of the columns, which contain these values.

>{c,i}=tablecol(MT,[5,6]);

We can use this to extract columns from the table for a new table.

>j=[1,5,6]; writetable(MT[i,j],labc=hd[j],ctok=[2],tok=tok)
    Person Evaluation       Tip
         2          g       1.8
         3          g       1.8
         6          g       2.8
         7         vg       2.8
         9         vg       1.8
        11         vg       1.8
        12          g       1.8
        13         vg       1.8
        14          g       1.8
        15          g       0.8
        16          g       2.8
        20         vg       1.8
        21          m       2.8
        22         vg       1.8
        23          g       2.8
        24          m       1.8

Of course, we need to extract the table itself from the list Table in this case.

>MT=Table[1];

Of course, we can also use it to determine the mean value of a column or any other statistical value.

>mean(tablecol(MT,6))
2.175

The getstatistics() function returns the elements in a vector, and their counts. We apply it to the "m" and "f" values in the second column of our table.

>{xu,count}=getstatistics(tablecol(MT,2)); xu, count,
[1,  3]
[12,  13]

We can print the result in a new table.

>writetable(count',labr=tok[xu])
         m        12
         f        13

The function selecttable() returns a new table with the values in one column selected from a vector of indices. First we look up the indices of two of our values in the token table.

>v:=indexof(tok,["g","vg"])
[5,  6]

Now we can select the rows of the table, which have any of the values in v in their 5-th row.

>MT1:=MT[selectrows(MT,5,v)]; i:=sortedrows(MT1,5);

Now we can print the table, with extracted and sorted values in the 5-th column.

>writetable(MT1[i],labc=hd,ctok=ctok,tok=tok,wc=7);
 Person    Sex    Age Titanic Evaluation    Tip Problem
      2      f     23       y          g    1.8       n
      3      f     26       y          g    1.8       y
      6      m     28       y          g    2.8       y
     18      m     38       y          g      .       n
     16      m     26       y          g    2.8       n
     15      f     31       y          g    0.8       n
     12      m     32       y          g    1.8       n
     23      f     38       y          g    2.8       n
     14      f     25       y          g    1.8       y
      9      f     24       y         vg    1.8       y
      7      f     31       y         vg    2.8       n
     20      f     28       y         vg    1.8       n
     22      f     28       y         vg    1.8       y
     13      m     29       y         vg    1.8       y
     11      f     23       y         vg    1.8       y

For the next statistic, we want to relate two columns of the table. So we extract column 2 and 4 and sort the table.

>i=sortedrows(MT,[2,4]);  ...
   writetable(tablecol(MT[i],[2,4])',ctok=[1,2],tok=tok)
         m         n
         m         n
         m         n
         m         n
         m         n
         m         n
         m         n
         m         y
         m         y
         m         y
         m         y
         m         y
         f         n
         f         y
         f         y
         f         y
         f         y
         f         y
         f         y
         f         y
         f         y
         f         y
         f         y
         f         y
         f         y

With getstatistics(), we can also relate the counts in two columns of the table to each other.

>MT24=tablecol(MT,[2,4]); ...
 {xu1,xu2,count}=getstatistics(MT24[1],MT24[2]); ...
 writetable(count,labr=tok[xu1],labc=tok[xu2])
                   n         y
         m         7         5
         f         1        12

A table can be written to a file.

>filename=eulerhome()+"test.dat"; ...
 writetable(count,labr=tok[xu1],labc=tok[xu2],file=filename);

Then we can read the table from the file.

>{MT2,hd,tok2,hdr}=readtable(filename,>clabs,>rlabs); ...
 writetable(MT2,labr=hdr,labc=hd)
                   n         y
         m         7         5
         f         1        12

And delete the file.

>fileremove(filename);

Distributions

With plot2d, there is a very easy method to plot a distribution of experimental data.

>p=normal(1,1000); plot2d(p,distribution=20,style="\/"); ...
 plot2d("qnormal(x,0,1)",add=1):

12 - Statistics

Here is a comparison of 10 simulations of 1000 normal distributed values using a so-called box plot. This plot shows the median, the 25% and 75% quartiles, the minimal and maximal values, and the outliers.

>p=normal(10,1000); boxplot(p):

12 - Statistics

To generate random integers, Euler has intrandom. Let us simulate dice throws and plot the distribution.

We use the getmultiplicities(v,x) function, which counts how often the elements of v appear in x. Then we plot the the result using columnsplot().

>k=intrandom(1,6000,6);  ...
 columnsplot(getmultiplicities(1:6,k));  ...
 ygrid(1000,color=red):

12 - Statistics

While intrandom(n,m,k) returns uniformly distributed integers from 1 to k, it is possible to use any other given distribution of integers with randpint().

In the following example, the probabilities for 1,2,3 are 0.4,0.1,0.5 respectively.

>randpint(1,1000,[0.4,0.1,0.5]); getmultiplicities(1:3,%)
[378,  102,  520]

Euler can produce random values from more distributions. Have a look into the reference.

E.g., we try the exponential distribution.

>plot2d(randexponential(1,1000,2),>distribution):

12 - Statistics

For many distributions, Euler can compute the distribution function and the inverse.

>plot2d("normaldis",-4,4):

12 - Statistics

The following is one way to plot a quantile.

>plot2d("qnormal(x,1,1.5)",-4,6);  ...
 plot2d("qnormal(x,1,1.5)",a=2,b=5,>add,>filled):

12 - Statistics

The probability to be in the green area is the following.

>normaldis(5,1,1.5)-normaldis(2,1,1.5)
0.248662156979

This can be computed numerically with the following integral.

>gauss("qnormal(x,1,1.5)",2,5)
0.248662156979

Let us compare the binomial distribution with the normal distribution of same mean and deviation. The function invbindis() solves a linear interpolation between integer values.

>invbindis(0.95,1000,0.5), invnormaldis(0.95,500,0.5*sqrt(1000))
525.516721219
526.007419394

The function qdis() is the density of the chi-square distribution. As usual, Euler maps vectors to this function. Thus we get a plot of all chi-square distributions with degrees 5 to 30 easily in the following way.

>plot2d("qchidis(x,(5:5:50)')",0,50):

12 - Statistics

Euler has accurate functions to evaluate distributions. Let us check chidis() with an integral.

The naming tries to be consistent. E.g.,

The complements of the distribution (upper tail) is chicdis().

>chidis(1.5,2), integrate("qchidis(x,2)",0,1.5)
0.527633447259
0.527633447259

Discrete Distributions

To define your own discrete distribution, you can use the following method.

First we set the distribution function.

>wd = 0|((1:6)+[-0.01,0.01,0,0,0,0])/6
[0,  0.165,  0.335,  0.5,  0.666667,  0.833333,  1]

The meaning is that with probability wd[i+1]-wd[i] we produce the random value i.

This is almost a uniform distribution. Let us define a random number generator for this. The find(v,x) function finds the value x in the vector v. It works for vectors x too.

>function wrongdice (n,m) := find(wd,random(n,m))

The error is so subtle that we see it only with very many iterations.

>columnsplot(getmultiplicities(1:6,wrongdice(1,1000000))):

12 - Statistics

Here is a simple function to check for uniform distribution of the values 1...K in v. We accept the result, if for all frequencies

12 - Statistics

>function checkrandom (v, delta=1) ...
   K=max(v); n=cols(v);
   fr=getfrequencies(v,1:K);
   return max(fr/n-1/K)<delta/sqrt(n);
   endfunction

Indeed the function rejects the uniform distribution.

>checkrandom(wrongdice(1,1000000))
0

And it accepts the built-in random generator.

>checkrandom(intrandom(1,1000000,6))
1

We can compute the binomial distribution. First there is binomialsum(), which returns the probability of i or less hits out of n trials.

>bindis(410,1000,0.4)
0.751401349654

The inverse Beta function is used to compute a Clopper-Pearson confidence interval for the parameter p. The default level is alpha.

The meaning of this interval is that if p is outside the interval, the observed result of 410 in 1000 is rare.

>clopperpearson(410,1000)
[0.37932,  0.441212]

The following commands are the direct way to get the above result. But for large n, the direct summation is not accurate and slow.

>p=0.4; i=0:410; n=1000; sum(bin(n,i)*p^i*(1-p)^(n-i))
0.751401349655

By the way, invbinsum() computes the inverse of binomialsum().

>invbindis(0.75,1000,0.4)
409.932733047

In Bridge, we assume 5 outstanding cards (out of 52) in two hands (26 cards). Let us compute the probability of a distribution worse than 3:2 (e.g. 0:5, 1:4, 4:1 or 5:0).

>2*hypergeomsum(1,5,13,26)
0.321739130435

There is also a simulation of multinomial distributions.

>randmultinomial(10,1000,[0.4,0.1,0.5])
          381           100           519 
          376            91           533 
          417            80           503 
          440            94           466 
          406           112           482 
          408            94           498 
          395           107           498 
          399            96           505 
          428            87           485 
          400            99           501 

Plotting Data

To plot data, we try the results of the German elections since 1990, measured in seats.

>BW := [ ...
 1990,662,319,239,79,8,17; ...
 1994,672,294,252,47,49,30; ...
 1998,669,245,298,43,47,36; ...
 2002,603,248,251,47,55,2; ...
 2005,614,226,222,61,51,54; ...
 2009,622,239,146,93,68,76; ...
 2013,631,311,193,0,63,64];

For the parties, we use a string of names.

>P:=["CDU/CSU","SPD","FDP","Gr","Li"];

Let us print the percentages nicely.

First we extract the necessary columns. Columns 3 to 7 are the seats of each party, and column 2 is the total number of seats. column is the year of the election.

>BT:=BW[,3:7]; BT:=BT/sum(BT); YT:=BW[,1]';

Then we print the statistics in table form. We use the names as column headers, and the years as headers for the rows. The default width for the columns is wc=10, but we prefer a denser output. The columns will be expanded for the labels of the columns, if necessary.

>writetable(BT*100,wc=6,dc=0,>fixed,labc=P,labr=YT)
       CDU/CSU   SPD   FDP    Gr    Li
  1990      48    36    12     1     3
  1994      44    38     7     7     4
  1998      37    45     6     7     5
  2002      41    42     8     9     0
  2005      37    36    10     8     9
  2009      38    23    15    11    12
  2013      49    31     0    10    10

The following matrix multiplication extracts the sum of the percentages of the two big parties showing that the small parties have gained footage in the parliament until 2009.

>BT1:=(BT.[1;1;0;0;0])'*100
[84.29,  81.25,  81.1659,  82.7529,  72.9642,  61.8971,  79.8732]

There is also a simple statistical plot. We use it to display lines and points simultaneously. The alternative is to call plot2d twice with >add.

>statplot(YT,BT1,"b"):

12 - Statistics

Define some colors for each party.

>CP:=[rgb(0.5,0.5,0.5),red,yellow,green,rgb(0.8,0,0)];

Now we can plot the results of the 2009 election and the changes into one plot using figure. We can add a vector of columns to each plot.

>figure(2,1);  ...
 figure(1); columnsplot(BW[6,3:7],P,color=CP); ...
 figure(2); columnsplot(BW[6,3:7]-BW[5,3:7],P,color=CP);  ...
 figure(0):

12 - Statistics

Data plots combine rows of statistical data in one plot.

>J:=BW[,1]'; DP:=BW[,3:7]'; ...
 dataplot(YT,BT',color=CP);  ...
 labelbox(P,colors=CP,styles="[]",>points,w=0.2,x=0.3,y=0.4):

12 - Statistics

A 3D columns plot shows rows of statistical data in form of columns. We provide labels for the rows and the columns. angle is the viewing angle.

>columnsplot3d(BT,scols=P,srows=YT, ...
   angle=30°,ccols=CP):

12 - Statistics

Another representation is the mosaic plot. Note that the columns of the plot represent the columns of the matrix here. Because of the length of the label CDU/CSU, we take a smaller window than usual.

>shrinkwindow(>smaller);  ...
 mosaicplot(BT',srows=YT,scols=P,color=CP,style="#"); ...
 shrinkwindow():

12 - Statistics

We can also do a pie chart. Since black and yellow form a coalition, we reorder the elements.

>i=[1,3,5,4,2]; piechart(BW[6,3:7][i],color=CP[i],lab=P[i]):

12 - Statistics

Here is another kind of plot.

>starplot(normal(1,10)+4,lab=1:10,>rays):

12 - Statistics

Some plots in plot2d are good for statics. Here is an impulse plot of random data, uniformly distributed in [0,1].

>plot2d(makeimpulse(1:10,random(1,10)),>bar):

12 - Statistics

But for exponentially distributed data, we may need a logarithmic plot.

>logimpulseplot(1:10,-log(random(1,10))*10):

12 - Statistics

The function columnsplot() is easier to use, since it needs just a vector of values. Moreover, it can set its labels to anything we want, we demonstrated this already in this tutorial.

Here is another application, where we count characters in a sentence and plot a statistics.

>v=strtochar("the quick brown fox jumps over the lazy dog"); ...
 w=ascii("a"):ascii("z"); x=getmultiplicities(w,v); ...
 cw=[]; for k=w; cw=cw|char(k); end; ...
 columnsplot(x,lab=cw,width=0.05):

12 - Statistics

It is also possible to manually set axes.

>n=10; p=0.4; i=0:n; x=bin(n,i)*p^i*(1-p)^(n-i); ...
 columnsplot(x,lab=i,width=0.05,<frame,<grid); ...
 yaxis(0,0:0.1:1,style="->",>left); xaxis(0,style="."); ...
 label("p",0,0.25), label("i",11,0); ...
 textbox(["Binomial distribution","with p=0.4"]):

12 - Statistics

The following is a way to plot the frequencies of numbers in a vector.

We create a vector of integer random numbers 1 to 6.

>v:=intrandom(1,10,10)
[8,  5,  8,  8,  6,  8,  8,  3,  5,  5]

Then extract the unique numbers in v.

>vu:=unique(v)
[3,  5,  6,  8]

And plot the frequencies in a columns plot.

>columnsplot(getmultiplicities(vu,v),lab=vu,style="/"):

12 - Statistics

We want to demonstrate functions for the empirical distribution of values.

>x=normal(1,20);

The function empdist(x,vs) needs a sorted array of values. So we have to sort x before we can use it.

>xs=sort(x);

Then we plot the empirical distribution and some density bars into one plot. Instead of a bar plot for the distribution we use a sawtooth plot this time.

>figure(2,1); ...
 figure(1); plot2d("empdist",-4,4;xs); ...
 figure(2); plot2d(histo(x,v=-4:0.2:4,<bar));  ...
 figure(0):

12 - Statistics

A scatter plot is easy to do in Euler with the usual point plot. The following graph shows that the X and X+Y are clearly positively correlated.

>x=normal(1,100); plot2d(x,x+rotright(x),>points,style=".."):

12 - Statistics

Often, we wish to compare two samples of different distributions. This can be done with a quantile-quantile-plot.

For a test, we try the student-t distribution and exponential distribution.

>x=randt(1,1000,5); y=randnormal(1,1000,mean(x),dev(x)); ...
 plot2d("x",r=6,style="--",yl="normal",xl="student-t",>vertical); ...
 plot2d(sort(x),sort(y),>points,color=red,style="x",>add):

12 - Statistics

The plot clearly shows that the normal distributed values tend to be smaller at the extreme ends.

If we have two distributions of different size, we can expand the smaller one or shrink the larger one. The following function is good for both. It takes the median values with percentages between 0 and 1.

>function medianexpand (x,n) := median(x,p=linspace(0,1,n-1));

Let us compare two equal distributions.

>x=random(1000); y=random(400); ...
 plot2d("x",0,1,style="--"); ...
 plot2d(sort(medianexpand(x,400)),sort(y),>points,color=red,style="x",>add):

12 - Statistics

Regression and Correlation

For another example we read a survey of students, their ages, the ages of their parents and the number of siblings from a file.

This table contains "m" and "f" in the second column. We use the variable tok2 to set the proper translations instead of letting readtable() collect the translations.

>{MS,hd}:=readtable("table1.dat",tok2:=["m","f"]);  ...
 writetable(MS,labc=hd,tok2:=["m","f"]);
    Person       Sex       Age    Mother    Father  Siblings
         1         m        29        58        61         1
         2         f        26        53        54         2
         3         m        24        49        55         1
         4         f        25        56        63         3
         5         f        25        49        53         0
         6         f        23        55        55         2
         7         m        23        48        54         2
         8         m        27        56        58         1
         9         m        25        57        59         1
        10         m        24        50        54         1
        11         f        26        61        65         1
        12         m        24        50        52         1
        13         m        29        54        56         1
        14         m        28        48        51         2
        15         f        23        52        52         1
        16         m        24        45        57         1
        17         f        24        59        63         0
        18         f        23        52        55         1
        19         m        24        54        61         2
        20         f        23        54        55         1

How do the ages depend on each other? A first impression comes from a pairwise scatterplot.

>scatterplots(tablecol(MS,3:5),hd[3:5]):

12 - Statistics

It is clear that the age of the father and mother depend on each other. Let us determine and plot the regression line.

>cs:=MS[,4:5]'; ps:=polyfit(cs[1],cs[2],1)
[17.3789,  0.740964]

This is obviously the wrong model. The regression line would be s=17+0.74t, where t is the age of the mother and s the age of the father. The age difference may depend a little bit on the age, but not that much.

Rather, we suspect a function like s=a+t. Then a is the mean of the s-t. It is the average age difference between fathers and mothers.

>da:=mean(cs[2]-cs[1])
3.65

Let us plot this into one scatter plot.

>plot2d(cs[1],cs[2],>points);  ...
 plot2d("evalpoly(x,ps)",color=red,style=".",>add);  ...
 plot2d("x+da",color=blue,>add):

12 - Statistics

Here is a box plot of the two ages. This only shows, that the ages are different.

>boxplot(cs,["mothers","fathers"]):

12 - Statistics

It is interesting that the difference in medians is not as large as the difference in means.

>median(cs[2])-median(cs[1])
1.5

The correlation coefficient suggests a positive correlation.

>correl(cs[1],cs[2])
0.7588307236

The correlation of the ranks is a measure for the same order in both vectors. It is also quite positive.

>rankcorrel(cs[1],cs[2])
0.758925292358

Creating new Functions

Of course, the EMT language can be used to program new functions. E.g., we define the skewness function.

12 - Statistics

where m is the mean of x.

>function skew (x:vector) ...
 m=mean(x);
 return sqrt(cols(x))*sum((x-m)^3)/(sum((x-m)^2))^(3/2);
 endfunction

As you see, we can easily use the matrix language to get a very short and efficient implementation. Let us try this function.

>data=normal(20); skew(normal(10))
-0.198710316203

Here is another function, called the Pearson skewness coefficient.

>function skew1 (x) := 3*(mean(x)-median(x))/dev(x)
>skew1(data)
-0.0801873249135

Monte Carlo Simulation

Euler can be used to simulate random events. We have already seen simple examples above. Here is another one, which simulates 1000 times 3 dice throws, and asks for the distribution of the sums.

>ds:=sum(intrandom(1000,3,6))';  fs=getmultiplicities(3:18,ds)
[5,  17,  35,  44,  75,  97,  114,  116,  143,  116,  104,  53,  40,
22,  13,  6]

We can plot this now.

>columnsplot(fs,lab=3:18):

12 - Statistics

To determine the expected distribution is not so easy. We use an advanced recursion for this.

The following function counts the number of ways the number k can be represented as the sum of n numbers in the range of 1 to m. It works recursively in an obvious way.

>function map countways (k; n, m) ...
   if n==1 then return k>=1 && k<=m
   else
     sum=0; 
     loop 1 to m; sum=sum+countways(k-#,n-1,m); end;
     return sum;
   end;
 endfunction

Here is the result for three throws of dices.

>cw=countways(3:18,3,6)
[1,  3,  6,  10,  15,  21,  25,  27,  27,  25,  21,  15,  10,  6,  3,
1]

We add the expected values to the plot.

>plot2d(cw/6^3*1000,>add); plot2d(cw/6^3*1000,>points,>add):

12 - Statistics

For another simulation, the deviation of the mean value of n 0-1-normal distributed random variables is 1/sqrt(n).

>longformat; 1/sqrt(10)
0.316227766017

Let us check this with a simulation. We produce 10000 times 10 random vectors.

>M=normal(10000,10); dev(mean(M)')
0.319493614817
>plot2d(mean(M)',>distribution):

12 - Statistics

The median of 10 0-1-normal distributed random numbers has a larger deviation.

>dev(median(M)')
0.374460271535

Since we can easily generate random walks, we can simulate the Wiener process. We take 1000 steps of 1000 processes. We then plot the standard deviation and the mean of the n-th step of these processes together with the expected values in red.

>n=1000; m=1000; M=cumsum(normal(n,m)/sqrt(m)); ...
 t=(1:n)/n; figure(2,1); ...
 figure(1); plot2d(t,mean(M')'); plot2d(t,0,color=red,>add); ...
 figure(2); plot2d(t,dev(M')'); plot2d(t,sqrt(t),color=red,>add); ...
 figure(0):

12 - Statistics

Tests

Tests are an important tool in statistics. In Euler, many tests are implemented. All of these tests returns the error that we accept if we reject the zero hypothesis.

For an example, we test dice throws for uniform distribution. At 600 throws, we got the following values, which we plug into the chi-square test.

>chitest([90,103,114,101,103,89],dup(100,6)')
0.498830517952

The chi-square test also has a mode, which uses a Monte Carlo simulation to test the statistics. The result should be almost the same. The parameter >p interprets the y-vector as a vector of probabilities.

>chitest([90,103,114,101,103,89],dup(1/6,6)',>p,>montecarlo)
0.526

This error is much too large. So we cannot reject uniform distribution. This does not prove that our dice was fair. But we cannot reject our hypothesis.

Next we generate 1000 dice throws using the random number generator, and do the same test.

>n=1000; t=random([1,n*6]); chitest(count(t*6,6),dup(n,6)')
0.528028118442

Let us test for the mean value 100 with the t-test.

>s=200+normal([1,100])*10; ...
 ttest(mean(s),dev(s),100,200)
0.0218365848476

The function ttest() needs the mean value, the deviation, the number of data, and the mean value to test for.

Now let us check two measurements for the same mean. We reject the hypothesis that they have the same mean, if the result is <0.05.

>tcomparedata(normal(1,10),normal(1,10))
0.38722000942

If we add a bias to one distribution, we get more rejections. Repeat this simulation several times to see the effect.

>tcomparedata(normal(1,10),normal(1,10)+2)
5.60009101758e-07

In the next example, we generate 20 random dice throws 100 times and count the ones in it. There must be 20/6=3.3 ones on average.

>R=random(100,20); R=sum(R*6<=1)'; mean(R)
3.28

We now compare the number of ones with the binomial distribution. First we plot the distribution of ones.

>plot2d(R,distribution=max(R)+1,even=1,style="\/"):

12 - Statistics

>t=count(R,21);

Then we compute the expected values.

>n=0:20; b=bin(20,n)*(1/6)^n*(5/6)^(20-n)*100;

We have to collect several numbers to get categories, which are big enough.

>t1=sum(t[1:2])|t[3:7]|sum(t[8:21]); ...
 b1=sum(b[1:2])|b[3:7]|sum(b[8:21]);

The chi-square test rejects the hypothesis that our distribution is a binomial distribution, if its result is <0.05.

>chitest(t1,b1)
0.53921579764

The following example contains results of two groups of persons (male and female, say) voting for one out of six parties.

>A=[23,37,43,52,64,74;27,39,41,49,63,76];  ...
   writetable(A,wc=6,labr=["m","f"],labc=1:6)
           1     2     3     4     5     6
     m    23    37    43    52    64    74
     f    27    39    41    49    63    76

We wish to test for independence of the votes from the sex. The chi^2 table test does this. The result is way to large to reject independence. So we cannot say, if the voting depends on the sex from these data.

>tabletest(A)
0.990701632326

The following is the expected table, if we assume the observed frequencies of voting.

>writetable(expectedtable(A),wc=6,dc=1,labr=["m","f"],labc=1:6)
           1     2     3     4     5     6
     m  24.9  37.9  41.9  50.3  63.3  74.7
     f  25.1  38.1  42.1  50.7  63.7  75.3

We can compute the corrected contingency coefficient. Since it very close to 0, we conclude that the voting does not depend on the sex.

>contingency(A)
0.0427225484717

Writing and Reading Raw Data

Let us demonstrate how to read and write a vector of reals to a file.

>a=random(1,100); mean(a), dev(a),
0.521542031668
0.279035800623

To write the data to a file, we use the function writematrix().

Since this introduction is most likely in a directory, where the user has no write access, we write the data to the user home directory. For own notebooks, this is not necessary, since the data file will be written into the same directory.

>filename=userhome()|"test.dat";

Now we write the column vector a' to the file. This yields one number in each line of the file.

>writematrix(a',filename);

To read the data, we use readmatrix().

>a=readmatrix(filename)';

And remove the file.

>fileremove(filename);
>mean(a), dev(a),
0.521542031668
0.279035800623

The functions writematrix() or writetable() can be configured for other languages.

E.g., if you have a German system, your Excel needs values with decimal commas separated by semicolons in a csv (comma separated values) file. The following file "test.csv" should appear on your desktop.

>filename=userhome()|"Desktop\test.csv"; ...
 writematrix(random(5,3),file=filename,>comma,separator=";");

You can now open this file with a German Excel directly.

>fileremove(filename);

Sometimes we have strings with tokens like the following.

>s1:="f m m f m m m f f f m m f";  ...
 s2:="f f f m m f f";

To tokenize this, we define a vector of tokens.

>tok:=["f","m"]
f
m

Then we can count the number of times each token appears in the string, and put the result into a table.

>M:=getmultiplicities(tok,strtokens(s1))_ ...
   getmultiplicities(tok,strtokens(s2));

Write the table with the token headers.

>writetable(M,labc=tok,labr=1:2,wc=8)
               f       m
       1       6       7
       2       5       2

Some More Tests

Next we use a variance analysis (F-test) to test three samples of normally distributed data for same mean value. The method is called ANOVA (analysis of variance). In Euler, the function varanalysis() is used.

>x1=[109,111,98,119,91,118,109,99,115,109,94]; mean(x1),
106.545454545
>x2=[120,124,115,139,114,110,113,120,117]; mean(x2),
119.111111111
>x3=[120,112,115,110,105,134,105,130,121,111]; mean(x3)
116.3
>varanalysis(x1,x2,x3)
0.0138048221371

This means, we reject the hypothesis of same mean value. We do this with an error probability of 1.3%.

There is also the median test, which rejects data samples with different mean distribution testing the median of the united sample.

>a=[56,66,68,49,61,53,45,58,54];
>b=[72,81,51,73,69,78,59,67,65,71,68,71];
>mediantest(a,b)
0.0241724220052

Another test on equality is the rank test. It is much sharper than the median test.

>ranktest(a,b)
0.00199969612469

In the following example, both distributions have the same mean.

>ranktest(random(1,100),random(1,50)*3-1)
0.127932482268

Let us now try to simulate two treatments a and b applied to different persons.

>a=[8.0,7.4,5.9,9.4,8.6,8.2,7.6,8.1,6.2,8.9];
>b=[6.8,7.1,6.8,8.3,7.9,7.2,7.4,6.8,6.8,8.1];

The signum test decides, if a is better than b.

>signtest(a,b)
0.0546875

This is too much of an error. We cannot reject that a is as good as b.

The Wilcoxon test is sharper than this test, but relies on the quantitative value of the differences.

>wilcoxon(a,b)
0.0296680599405

Let us try two more tests using generated series.

>wilcoxon(normal(1,20),normal(1,20)-1)
0.0261110211933
>wilcoxon(normal(1,20),normal(1,20))
0.52975950705

Linear Regression

Linear regression can be done with the functions polyfit() or various fit functions.

For a start we find the regression line for univariate data with polyfit(x,y,1).

>x=1:10; y=[2,3,1,5,6,3,7,8,9,8]; writetable(x'|y',labc=["x","y"])
         x         y
         1         2
         2         3
         3         1
         4         5
         5         6
         6         3
         7         7
         8         8
         9         9
        10         8

We want to compare non-weighted and weighted fits. First the coefficients of the linear fit.

>p=polyfit(x,y,1)
[0.733333333333,  0.812121212121]

Now the coefficients with a weight that emphasizes the last values.

>w&="exp(-(x-10)^2/10)"; pw=polyfit(x,y,1,w=w(x))
[4.71566356925,  0.383190230089]

We put everything into one plot for the points and the regression lines, and for the weights used.

>figure(2,1);  ...
 figure(1); statplot(x,y,"b",xl="Regression"); ...
   plot2d("evalpoly(x,p)",>add,color=blue,style="--"); ...
   plot2d("evalpoly(x,pw)",5,10,>add,color=red,style="--"); ...
 figure(2); plot2d(w,1,10,>filled,style="/",fillcolor=red,xl=w); ...
 figure(0):

12 - Statistics

Random Numbers

The following is a test for the random number generator. Euler uses a very good generator, so we need not expect any problems.

First we generate ten millions of random numbers in [0,1].

>n:=10000000; r:=random(1,n);

Next we count the distances between two numbers less than 0.05.

>a:=0.05; d:=differences(nonzeros(r<a));

Finally, we plot the number of times, each distance occurred, and compare with the expected value.

>m=getmultiplicities(1:100,d); plot2d(m); ...
   plot2d("n*(1-a)^(x-1)*a^2",color=red,>add):

12 - Statistics

Clear the data.

>remvalue n;

Further Examples

There are some more examples in the examples directory of Euler.

Examples/Black Swan Distribution
Examples/US Population Forecast
Examples/Sunspot Data
Examples/Econometrics
Examples/Benfords Law

Euler Home