ecoli Posted April 12, 2011 Posted April 12, 2011 Hi all, I'm writing this code in R and it should be relative straight forward, but I keep running into negative numbers, which MI should NOT have! Not sure if the code is buggy or if I'm getting the math wrong. Am I correct in thinking there should be 1 value for MI between each number in an associated list? First the pseudocode: 1 initialize parameters 2 Generate 2 lists of random numbers btw 0,1 called x and y 3 cut number list into bins 4 create frequency table for number of counts in each bin 5 create function to calculate marginal distribution of binned x for every x find the frequency of that value in x save as ki Do the same for y and call it kj 6 create function to calculate joint distribution of x and y a. generate frequency table for every x and y combo for every x and associated y find the frequency of that combination in the data call it kij 7 calculate mutual information (mi) mi = 0 for each value taken by x { sum = 0 for each value taken by y: { kij = find freq where x and y are associated ki = find freq of x kj = find freq of y sum = sum + kij * log2(kij/(ki*kj)) } mi = mi + sum } Here's the actual code in R [font="Courier New"]#calculate MI of random variables. Notation is taken from Steuer, et al. 2002 #Set sample size & bin number N <- 30 bins <- 2 #Generate N random "data" points for x and y in the space [0,1] x <- runif(N, 0, 1) y <- runif(N, 0, 1) #Cut x and y into bins xcut <- cut(x, bins) #ai ycut <- cut(y, bins) #bj #create frequency table for counts per bin xfreq <- data.frame(table(xcut)) #table for kis yfreq <- data.frame(table(ycut)) #table for kjs #script for reading freq table into kis/kjs #b/c cutting data into bins leads to weird character types ki.find <- function(n) { #### n <- xcut[i] for (k in 1:bins) { if (as.character(n) == xfreq$xcut[k]) { ki <- xfreq$Freq[k] return(ki) } } } kj.find <- function(m) { #### n <- ycut[i] for (l in 1:bins) { if (as.character(m) == yfreq$ycut[l]) { kj <- yfreq$Freq[l] return(kj) } } } #kij is the joint prob distribution, #AKA freqency that 2 bins are associated at a (x,y) point cut.table <- as.data.frame(table(xcut,ycut)) kij.find <- function(p) { for (g in 1:(2*bins)) { if (as.character(xcut[p]) == cut.table$xcut[g] && as.character(ycut[p]) == cut.table$ycut[g]) { kij <- cut.table$Freq[g] } } return(kij) } #start with assumption of no MI #then calculate MI MI <- 0 for (i in 1:N) { sum.temp <- 0 for (j in 1:N) { ki <- ki.find(xcut[j]) kj <- kj.find(ycut[j]) kij <- kij.find(j) sum.temp <- sum.temp + (kij * log2((kij/(ki*kj)))) sum.temp2 <- (sum.temp/N) } MI <- c(MI, log2(N) + sum.temp2) } [/font] sample output is: > MI [1] 82.288187 44.607130 36.576861 28.348042 20.119224 11.890405 [7] 3.661586 -4.567232 -12.796051 -21.024870 -29.253688 Please help me!!!
Xittenn Posted April 13, 2011 Posted April 13, 2011 I think if you adjust your last bit of code as follows: MI <- 0 MI.run <- function() { for (i in 1:N) { sum.temp <- 0 for (j in 1:N) { ki <- ki.find(xcut[j]) kj <- kj.find(ycut[j]) kij <- kij.find(j) sum.temp <- sum.temp + (kij * log2((kij/(ki*kj)))) sum.temp2 <- (sum.temp/N) print(ki) print(kj) print(kij) print(sum.temp) print(sum.temp2) } MI <- c(MI, log2(N) + sum.temp2) } } and maybe set N<-5 you will quickly see why kij * log2((kij/(ki*kj))) is giving you a negative number :/ 2
ecoli Posted April 13, 2011 Author Posted April 13, 2011 I think if you adjust your last bit of code as follows: and maybe set N<-5 you will quickly see why kij * log2((kij/(ki*kj))) is giving you a negative number :/ So it's because i'm taking a log of a fraction < 1, which I guess means i'm doing the calculation incorrectly.
Xittenn Posted April 13, 2011 Posted April 13, 2011 You're applying a variation of Mutual Information that involves discrete variables and you are using continuous. Your cut is being made with no break and as such your bins are not all formed the same so if you are looking for the members of two sets that intersect but you are using a top half bottom half index to decide membership there might be some inconsistency. I wasn't really sure what "#b/c cutting data into bins leads to weird character types" meant exactly. Do you have a link/reference to this "#calculate MI of random variables. Notation is taken from Steuer, et al. 2002"? 2
ecoli Posted April 13, 2011 Author Posted April 13, 2011 You're applying a variation of Mutual Information that involves discrete variables and you are using continuous. Your cut is being made with no break and as such your bins are not all formed the same so if you are looking for the members of two sets that intersect but you are using a top half bottom half index to decide membership there might be some inconsistency. I wasn't really sure what "#b/c cutting data into bins leads to weird character types" meant exactly. Do you have a link/reference to this "#calculate MI of random variables. Notation is taken from Steuer, et al. 2002"? http://bioinformatics.oxfordjournals.org/content/18/suppl_2/S231.full.pdf I'm not sure that it matters the bins are not formed the same. For example if your comparing two lists of different types the only thing that should matter is the probability distribution, not the exact numbers. In R, the cut function creates a list that looks something that replaces an exact number with a range, like (0.43,0.616], which is not numeric type. I can create a frequency table that counts how many times each range is present in the list. I have to do a character match to find the associated frequency. in the table, which I think is the marginal distribution (divided by N). The probability of the intersection is the frequency that both ranges appear at the same time in both lists. Unless I've misunderstood. Thanks a bunch for your help so far, btw. 1
Xittenn Posted April 14, 2011 Posted April 14, 2011 (edited) After reading the paper the idea has formed much clearer in my head. The entire purpose of binning, as you have been, is to discretize the continuous function(which I'm sure you already knew ) so that a discrete model may be applied. What I had said before becomes even more important in that you will only have discrete values if the bins form uniformly. In case you may have thought they were uniform, having directly observed these values myself they are in fact not(sometimes.) This requires a partitioning function as described in the three sections following 'NUMERICAL ESTIMATION' and this would be applied to the break value and correspondingly to the number of breaks produced. Thanks for the paper, I think I might play with this for a little while! break: either a vector of cut points or number giving the number of intervals which x is to be cut into. Edited April 14, 2011 by Xittenn 2
ecoli Posted April 14, 2011 Author Posted April 14, 2011 After reading the paper the idea has formed much clearer in my head. The entire purpose of binning, as you have been, is to discretize the continuous function(which I'm sure you already knew ) so that a discrete model may be applied. What I had said before becomes even more important in that you will only have discrete values if the bins form uniformly. In case you may have thought they were uniform, having directly observed these values myself they are in fact not(sometimes.) This requires a partitioning function as described in the three sections following 'NUMERICAL ESTIMATION' and this would be applied to the break value and correspondingly to the number of breaks produced. Thanks for the paper, I think I might play with this for a little while! break: either a vector of cut points or number giving the number of intervals which x is to be cut into. ah shit. I hope this is the major problem I've been having. I'm looking for an easy solution in R to cut a vector into equal bins. Histogram does it, but only to plot data. weird.
Xittenn Posted April 14, 2011 Posted April 14, 2011 [math] k_i \neq freq_x \; k_i = \frac{freq_x}{N} [/math] Correct me if I'm wrong but this is in fact the issue, if you hadn't already deduced it. I had already, yesterday, but I wasn't sure what you were doing.
ecoli Posted April 14, 2011 Author Posted April 14, 2011 [math] k_i \neq freq_x \; k_i = \frac{freq_x}{N} [/math] Correct me if I'm wrong but this is in fact the issue, if you hadn't already deduced it. I had already, yesterday, but I wasn't sure what you were doing. Then is equation 17 in the paper incorrect?
Xittenn Posted April 14, 2011 Posted April 14, 2011 (edited) or better yet [math] k_i = freq_x \land p(a_i) = \frac{k_i}{N} [/math] which you weren't doing! Edited April 14, 2011 by Xittenn
ecoli Posted April 14, 2011 Author Posted April 14, 2011 or better yet [math] k_i = freq_x \land p(a) = \frac{k_i}{N} [/math] which you weren't doing! sorry i think we crossposted, but this just doesn't seem like what's going on in equation 17. He's taken N out of each k term, but adds it into the MI calculation separately.
Xittenn Posted April 14, 2011 Posted April 14, 2011 (edited) It isn't the same, I think multiple forms are being presented and you are equating them to being the same. Your code looks like it is of the form that is taken from wolfram which is not the form of 17 and where certain definitions seem to have taken different meaning on account of various mathematical assumptions. My approach on this would be to start back at the beginning from scratch and build up again and to this point of knowledge. Sometimes it takes less time to figure out a problem if you throw away everything you think you know and start from square one. This is just my humble opinion however. Sorry I can't give you a better explanation, I'm just trying this out for the first time myself. I took interest because it is something I was already working on, but in different areas. Edited April 14, 2011 by Xittenn 1
ecoli Posted April 14, 2011 Author Posted April 14, 2011 It isn't the same, I think multiple forms are being presented and you are equating them to being the same. Your code looks like it is of the form that is taken from wolfram which is not the form of 17 and where certain definitions seem to have taken different meaning on account of various mathematical assumptions. My approach on this would be to start back at the beginning from scratch and build up again and to this point of knowledge. Sometimes it takes less time to figure out a problem if you throw away everything you think you know and start from square one. This is just my humble opinion however. Sorry I can't give you a better explanation, I'm just trying this out for the first time myself. I took interest because it is something I was already working on, but in different areas. Ahh. damn I see what's going on a bit better now, and rebinning into equal portions definitely helped.
Xittenn Posted April 14, 2011 Posted April 14, 2011 (edited) >cross posting there< Yeah sorry, I see where I went off track .... :/ Edited April 14, 2011 by Xittenn
ecoli Posted April 14, 2011 Author Posted April 14, 2011 >cross posting there< Yeah sorry, I see where I went off track .... :/ Yeah but you are right that he's estimating the probability distribution differently because he's using a histogram (fixed that problem, btw, my data points are being split up into equal bins now). That assumption probably won't work with real data though.
Xittenn Posted April 14, 2011 Posted April 14, 2011 (edited) I have two code samples, one fulfills Kullback Entropy equivalent of Mutual Information and the other fulfills the histogram model. The results of each leave me asking the value of the information provided by each value for which each model results. The problem I am seeing with your original post is again that you have mixed the two concepts together. You say 'fixed' I'm curious how and if we have come to the same conclusion. I have concluded that for (i in 1:N) { <------ ****** sum.temp <- 0 for (j in 1:N) { <------ ****** ki <- ki.find(xcut[j]) <------ ****** kj <- kj.find(ycut[j]) was the cause of the problem and that in fact kij * log2((kij/(ki*kj))) should be negative and that the sum of ij kij * log2((kij/(ki*kj))) should never be greater than log(N). You should be stepping through the bin divisions not the samples .... :/ note the little note on S233 and modify this appropriately if ( ( as.character( ycut[i] ) == cut.table$xcut[g] ) && ( as.character( ycut[j] ) == cut.table$ycut[g] ) ) { Edited April 14, 2011 by Xittenn
Xittenn Posted April 14, 2011 Posted April 14, 2011 I hope you don't mind I posted my results in my blog .... w/ credits
ecoli Posted April 14, 2011 Author Posted April 14, 2011 I hope you don't mind I posted my results in my blog .... w/ credits link? edit: sorry. I'm an idiot. you meant your SFN blog.
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now