Fisher’s conditional exact test on 2×2 matrix
Filed under: Algorithms, Probability, Statistics
Fisher’s conditional exact test on 2×2 matrix
Fisher’s exact test is a statistical significance test used in the analysis of contingency tables where sample sizes are small. It is named after its inventor, R. A. Fisher, and is one of a class of exact tests, so called because the significance of the deviation from a null hypothesis can be calculated exactly, rather than relying on an approximation that becomes exact in the limit as the sample size grows to infinity, as with many statistical tests. Fisher is said to have devised the test following a comment from Muriel Bristol, who claimed to be able to detect whether the tea or the milk was added first to her cup.
The test is useful for categorical data that result from classifying objects in two different ways; it is used to examine the significance of the association (contingency) between the two kinds of classification. So in Fisher’s original example, one criterion of classification could be whether milk or tea was put in the cup first; the other could be whether Ms Bristol thinks that the milk or tea was put in first. We want to know whether these two classifications are associated – that is, whether Ms Bristol really can tell whether milk or tea was poured in first. Most uses of the Fisher test involve, like this example, a 2 x 2 contingency table. The p-value from the test is computed as if the margins of the table are fixed, i.e. as if, in the tea-tasting example, Ms. Bristol knows the number of cups with each treatment (milk or tea first) and will therefore provide guesses with the correct number in each category. As pointed out by Fisher, this leads under a null hypothesis of independence to a hypergeometric distribution of the numbers in the cells of the table.
Consider a general 2×2 matrix
| Category 1 | Category 2 |
Row sum | |
| Condition 1 |
A | B | RS1 |
| Condition 2 |
C | D | RS2 |
| Column sum | CS1 | CS2 | total = N |
What does the Fisher’s algorithm plan?
- Enumerate all possible tables, given RS1 RS2 CS1 CS2;
- For each table compute p-value using hypergeometric distribution;
- Sum all p-values≤p-value of the observed table
Now we’ll analyze the algorithm step-by-step, observing the optimizations that can be implemented using MatLab.
Step 1: Enumerate all possible tables.
A 2×2 matrix has only one degree of freedom. What does this mean? This means that if you change the cell A all the other cells are determined: B=RS1-A; C=CS1-A; D=RS2-CS1+A.
Question: Which is the range of variation for A?
Answer: 0≤A≤min(RS1,CS1)
First optimization: rearrange the matrix (if needed) so as to obtain: RS1≤RS2 and CS1≤CS2
1 2 3 4 5 6 7 8 9 10 11 12 13 | Rs=sum(x,2); %rows sum Cs=sum(x); %columns sum N=sum(Rs); %sum of all elements %Rearrange the matrix if necessary. if ~issorted(Rs) x=flipud(x); Rs=sort(Rs); end if ~issorted(Cs) x=fliplr(x); Cs=sort(Cs); end |
Now you can enumerate all the possible tables:
1 2 | A=0:1:min(Rs(1),Cs(1)); %all possible values for the first cell z=[A;Rs(2)-Cs(1)+A;Rs(1)-A;Cs(1)-A;]; %all possible tables |
Note that in the z matrix the cells are disposed in this order [A D B C] (in the next step you will understand the reason…).
Step 2: For each table compute the p-value using the hypergeometric distribution
First of all, now we know how many tables we wrote down and so we can use the preallocation (see http://www.advancedmcode.org/writing-fast-matlab-code-3-preallocation.html).
1 | np=zeros(1,length(A)); %p-values vector preallocation |
In this step we must compute the p-value for each table using the hypergeometric distribution. In general, the p-value is computed as:
Considering that we are using a 2×2 matrix:
Remember that:
and x is a natural number.
As you can see, the factorial (!) growth very quickly. Moreover the computation of a factorial is very time expensive. Luckly, there is a very important relation between the factorial and another function: the Gamma function.
Matlab can compute the Gamma function using a matrix as input. Anyway, we didn’t yet solve the factorial growing (20!=2.4329e+18). The management of so big number is very troubling. The solution is to use the logarithm function: in fact, log(20!)=42.3356 a more tractable number.
Matlab, of course, can compute the gammaln function. Now, we are computing log(p-value), so when we’ll finish we must convert into p-value using exp function.
Remember two log properties (they will be useful later): LOG(A*B)=LOG(A)+LOG(B) and LOG(A/B)=LOG(A)-LOG(B).
Now, it is the time to observe another useful property of 2×2 matrix.
When you go from the Table Ti to the Table Ti+1:
Substituting you will find a recursive relation:
And, more in general:
At this point, the only problem is to compute the p-value of the first table. Consider the first Table:
| Category 1 | Category 2 |
Row sum | |
| Condition 1 |
0 | RS1 | RS1 |
| Condition 2 |
CS1 | D=CS2-RS1 | RS2 |
| Column sum | CS1 | CS2 | total = N |
And so:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | np=zeros(1,length(A)); %p-values vector preallocation lz=log(z); %LOG(X!)=GAMMALN(x+1) np(1)=sum(gammaln([Rs(2)+1 Cs(2)+1])-gammaln([N+1 z(2)+1])); %remember that %np(i+1)=np(i)*[B(i)*C(i)]/(A(i+1)*D(i+1)]=np(i)*f(i) %This formula is vectorizable and is recursive! %using Logarithm to improve computation %log(np(i+1))=log(np(i))+[log(B(i))+log(C(i))]-[log(A(i+1))+log(D(i+1))]=log(np(i))+f(i) %J>1 %log(np(J))=log(np(1))+sum(f(1:J-1))=log(np(1))+cumsum(f) f=sum(lz(3:4,1:end-1))-sum(lz(1:2,2:end)); np(2:end)=np(1)+cumsum(f); np=exp(np); |
Step 3: Sum all p-values≤p-value of the observed table
Now, to obtain the 2-tail p-value we must sum all p-values≤p-value of the observed table. In the z matrix, the observed table will be in the A+1 row (because you must also consider the 0) and, of course, in the np vector the p-value of the observed table will be in A+1 row.
On the contrary, if you desire the 1-tail p-value you must sum all A+1 p-values.
The code:
1 2 3 4 5 6 | W=x(1)+1; if x(1)<=round(A(end)/2) %choose direction P=[sum(np(1:W)) sum(np(W:end)) sum(np(np<=np(W)))]; else P=[sum(np(W:end)) sum(np(1:W)) sum(np(np<=np(W)))]; end |
Finally, there is another “little” question. The Fisher’s exact test is a “conditional” method and is too conservative in comparison to an “unconditional” method like the Barnard’s test; in other words is less powerful than Barnard’s test. On the contrary, the Barnard’s test is so difficult to implement that none uses it (but me… see mi post on Barnard’s unconditional exact test).
The Lancaster’s correction (mid-p) to Fisher’s exact test makes it less conservative and more powerful and the results are similar to Barnard’s Test.
The Lancaster’s correction envisages that the p-values of all tables more extreme than the observed are summed to one half of the observed p-value. In code
1 | P-mid=0.5*np(W)+sum(np(np<np(W))); |
Don’t forget to display the results!
Step 4: Compute the power
The routine computes the Power and, if necessary, the sample sizes needed to achieve a power=0.80 using a modified asymptotic normal method with continuity correction as described by Hardeo Sahai and Anwer Khurshid in Statistics in Medicine, 1996, Vol. 15, Issue 1: 1-21. Because there aren’t interesting tricks, if you are interested you can directly read the code.
If any problems occurs in execution, or if you found a bug, have a suggestion or question just contact me at: giuseppe dot cardillo-edta at poste dot it
You can visit my homepage http://home.tele2.it/cardillo
My profile on XING http://www.xing.com/go/invita/13675097
My profile on LinkedIN http://it.linkedin.com/in/giuseppecardillo
Download Now
Popularity: 1% [?]

























































