Re: Student's t Test

JohnNichols · ‎01-10-2021

Is the Student's t Test included in mkl?

Gennady_F_Intel · ‎01-10-2021

what do you mean by a Student Test?

mecej4 · ‎01-10-2021

It would be useful to have a function in MKL to obtain the inverse of the cumulative Student t-distribution function:

t = tinv(p, n_f)

where p is the probability and n_f is the number of degrees of freedom. This function is available in Matlab, R (tinv), IMSL (tin), etc. It is useful in obtaining the p-percent confidence intervals of the parameter values obtained from fitting an expression to data using nonlinear least squares (MKL TRNLSP).

Gennady_F_Intel · ‎01-10-2021

Thanks for the clarification, mecej. I see what John wanted to see. We will try to add this topic to further plans and will keep this thread informed.

Pavel_D_Intel1 · ‎01-11-2021

Hi John,

in oneMKL we have Random Number Generators component where we provide distribution generators (routines to generate random numbers sequences with the specific statistical properties).

Could you please clarify – are you interested in tinv(p, n_f) like function only? Or also in Student t-distribution generator?

Best regards,
Pavel

JohnNichols · ‎01-11-2021

Pavel:

The most common statistical test in experimental work after the mean and the standard deviation is the Student's t Test. It is a very simple test to use - essentially it takes 2 sets of numbers and tells you if they are different and how different. Guinness Brewery invented for quality control.

An example, the average age of a kindergarten class is 5. And the average age of the people who use this forum would be 30+. The two are distinct. Student's t Test allows you to quantify it. It has a fairly simple number >2 different <2 same, is a good approximation.

For large data sets it is the best test. The reason we need it in Fortran is where there are more than 1.2 million data points, EXCEL will not do an analysis. At the moment I am looking at sets with 6.5 million records, only Fortran or C will handle this quickly.

In searching for MKL - I was surprised how many of the many standard programs like MAPLE, NASTRAN, AutoCAD use mkl.

JohnNichols · ‎01-11-2021

A quick and dirty attempt at the Student's t Test for million element sets, where I pick out the first peak between indices of the count array from 89 to 101 provides a set test data to determine if the natural frequency changes with time. It appears to - each million elements is 100 days and each million elements contains about 4% density in the required range.

  subroutine mean(i,j,countFZ, meanA, stdevA,countFZA,meanFZA, standevFZA, flag)
    implicit none

    integer i, j, k, l
    integer countFZ(kl)
    integer countFZA(kl, mn_9)
    REAL (KIND=dp) meanFZA(mn_9),standevFZA(mn_9)
    integer flag
    REAL (KIND=dp) counter, count, counterV, countold,t, sp,sp1
    REAL (KIND=dp) meanA,  variance, stdevA, meanold, varianceold

    counter = 0.0
    count = 0.0
    counterV = 0.0
    meanA = 0.0
    variance = 0.0

    do 100 k = i,j

        counter = counter + real(countFZ(k)) * real(k)
        count = count + real(countFZ(k))
        ! write(*,1000)k, countFZ(k), counter, count
1000    Format(i7,3('  ',f12.0) )

100 end do


    meanA = real(counter)/real(count)
    count = 0.0
    do 101 k = i,j


        counterV = counterV + (real(countFZ(k))) * ((real(k) - meanA) ** 2.0)
        count = count + real(countFZ(k))

        ! write(*,1001)k, countFZ(k), counter, counterV, count
1001    Format(i7,4('  ',f12.0) )

101 end do

    variance = real(counterV)/real(count)

    write(*,500)0, meanA, variance, sqrt(variance), count
    meanold = meanA
    varianceold = variance
    countold = count

500 Format(" Data set  :: ", i4,3('  ',f10.3), ' Count :: ',f7.0, ' Sp : ', f10.3, ' sp1 : ',f10.3,' t-Stat : ',f10.3)
    stdevA = sqrt(variance)

    write(*,120)flag
120 Format("       Flag for the data size :: ", i3)

    do 300 l = 1, flag
        counter = 0
        count = 0
        do 200 k = i,j

            counter = counter + real(countFZA(k,l)) * real(k)
            count = count + real(countFZA(k,l))
            !  write(*,1000)k, countFZA(k,l), counter, count

200     end do
        meanFZA(l) = counter/count

        count = 0.0
        counterV = 0.0
        do 400 k = i,j


            counterV = counterV + (real(countFZA(k,l))) * ((real(k) - meanFZA(l)) ** 2.0)
            count = count + real(countFZA(k,l))

            !   write(*,1001)k, countFZA(k,l), counter, counterV, count

400     end do

        variance = counterV/count
        standevFZA(l) = sqrt(variance)
        sp = sqrt((((countold-1.0)*varianceold)+((count-1.0)*variance))/(countold+count-2))
        sp1 = sp*sqrt((1.0/countold)+(1.0/count))
        t = (meanFZA(l) - meanold)/sp1

        write(*,500)l,meanFZA(l),variance,sqrt(variance),count,sp,sp1,t
        meanold = meanFZA(l)
        varianceold = variance
        countold = count

300 end do



    return
    end subroutine

I took the standard algorithm from wikipedia -- not equal means or variances.

A t-Stat of 31 to 80 represents an almost zero chance they are the same - mean is 95 column and the variance is the 1.15 column. The counts make is so accurate.

Got to check it and put in error checking.

There is a temperature dependence on the stiffness, but it is impossible to discern at 100 day blocks with any accuracy.

Pavel_D_Intel1 · ‎01-12-2021

John,

thanks a lot for the detailed description.

In oneMKL we haven’t the specific routine for Student’s t-test.
But oneMKL contains Summary Statistics component which can be useful here.

oneMKL Summary Statistics provides routines that compute basic statistical estimates for single and double precision multi-dimensional datasets.

As I understand mean and variance computation routines (supported in oneMKL Summary Statistics) could help to reduce the code and increase performance of the Student’s t-test application.

Fortran APIs description can be found here: https://software.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-fortran/top/statistical-functions/summary-statistics.html

Please let me know if I could help you more here.

Best regards,
Pavel

JohnNichols · ‎01-12-2021

Pavel:

Thank you for the explanation.

The data is in a format where is is simpler to take the first moment and then the second moment to get the variance.

The interesting issue is the amplitude of the t-Stat -- it is way beyond the published tables. So I will need to encode estimating the probability as well. A 1:1000 p value has a t-Stat of about 3.5 for this number of elements, but we are at 37. Interestingly the same problem arises in EXCEL with large data sets when you start looking for 5 and 6 standard deviations p estimates.

I am pretty sure you have the Box Mueller Random Number generator. There is a standard reference that some have used for discussing the Box Mueller Theory from the 1950s. When I was writing a paper, I thought I should get a copy of this original paper if I wanted to reference it - you are supposed to read what you reference. It came from a famous statistical group at a major Eastern University. I sent them a message and asked for a pdf of the paper as I could not find it online. I wanted to see the original math, so I could follow the Fortran.

They came back and said they had no idea, but they finally tracked down a copy and scanned it for me. Apparently I was the first person to ask for it in a long time. There is a brief version in a journal in 1958, but not the entire mathematics, if I remember correctly.

Everyone assumes that someone else has seen it so it is ok to reference.

The same thing happens with a famous book on Seismology in Japan. This book is quoted often, a friend once found there is one acknowledged copy of this book in a Scottish University and no others. There are no pdf's.

JohnNichols · ‎01-12-2021

List of Applied Statistics Algorithms ASA Algorithm 5 is the Student's test - with a non-central parameter.

It is under a Public License - https://people.sc.fsu.edu/~jburkardt/f_src/asa005/asa005.html

Professor Burkardt publishes a lot of interesting code example.

jimdempseyatthecove · ‎01-15-2021

John,

I am not sure that this has anything to do with your issue...

>>A quick and dirty attempt at the Student's t Test for million element sets

integer countFZ(kl)
...
REAL (KIND=dp) counter, count, counterV, countold,t, sp,sp1
...
counter = counter + real(countFZ(k)) * real(k)
...
meanA = real(counter)/real(count)
...
counterV = counterV + (real(countFZ(k))) * ((real(k) - meanA) ** 2.0)
count = count + real(countFZ(k))
...
counter = counter + real(countFZA(k,l)) * real(k)
count = count + real(countFZA(k,l))
...
counterV = counterV + (real(countFZA(k,l))) * ((real(k) - meanFZA(l)) ** 2.0)
count = count + real(countFZA(k,l))

The code is using real(dp) for variable types of the results (lhs) contained within loops where k iterates a million (millions?) of times.

the use of real(something) as opposed to real(something, kind=dp) may introduce undesirable round off errors.

Jim Dempsey

Gennady_F_Intel · ‎02-02-2021

This request will not be implemented. The issue is closing and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.