Ernst
Posted May 23, 2012

Recently I have been trying to understand machine learning, and I have a problem that machine learning may be able to solve. In principle this is my problem (in the real case there are more than 10 individuals in each set).

Every day I get 9 data sets that look like this:

    1    4    9    3    2    0
    2    1    4    6    8    1
    3    2    1    4    3    0
    4   10    3    1    5    0
    5    3    2    5    1    0
    6    5    6    8    7    0
    .
    .
    9    .    .    .    .    0
    10   .    .    .    .    0

The first column is always identical; let's call its values identities. For the 9 data sets I want a solution that predicts the positive outcome (1) in column 6. So in this single data set, row 2 is of interest. Columns 2 to 5 can be regarded as ranks.

9 sets give 10^9 possible combinations and only one is correct. 1/10^9 is a small number. I could build a logistic classification model, but I think that would give a poor solution.

On the other hand, there are very strong indications that in each combination of the 9 data sets, nearly always 4 to 7 of the 9 positive rows have a value from 1 to 3 in the first column. A typical result, listing the column 1 value of the row with a 1 in column 6 for each of the 9 sets, would be (2, 7, 1, 3, 2, 6, 1, 5, 3), i.e. 6 individuals in the interval [1, 3].

If I regard the logistic classification, which I don't think will implicitly capture "the 4-7 knowledge", as horizontal modeling, how could I get "the 4-7 knowledge", which I regard as vertical modeling, into the complete model?

Best regards
Ernst
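To put a number on how much "the 4-7 knowledge" narrows the search on its own, the combinations can simply be counted. The sketch below assumes 9 sets of 10 individuals with identities 1 to 10, as in the simplified example; the function name `count_combinations` is made up for illustration. It counts outcomes only and says nothing about how likely each one is.

```python
from math import comb

def count_combinations(n_races, n_horses, n_low, k):
    """Number of outcome combinations in which exactly k of the n_races
    positive rows have an identity among the n_low lowest identities."""
    # Choose which k sets have a low identity, then pick the identity in each set.
    return comb(n_races, k) * n_low**k * (n_horses - n_low)**(n_races - k)

total = 10**9  # 10 possible identities in each of 9 sets
in_band = sum(count_combinations(9, 10, 3, k) for k in range(4, 8))  # k = 4..7
print(in_band, in_band / total)
```

Under these assumptions the 4-7 band covers 269,907,876 combinations, roughly 27% of the 10^9. So the constraint alone discards about three quarters of the combinations, which is still far more than a few million candidates; it would have to be combined with a per-set model to get further.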
khaled
Posted May 24, 2012

Here are the machine learning models I'm aware of:

- Hidden Markov Model: Wikipedia:HMM
- Bayesian network: Wikipedia:BN
- Neural Networks: Wikipedia:NN
phillip1882
Posted May 24, 2012

I'm afraid your problem is poorly worded; I can't make heads or tails of what you're looking for. It sounds like you're looking for a probabilistic model for determining which numbers occur with what frequency in each column. Here is one way you could write such a program.

    # Read whitespace-separated integer rows from data.txt
    data = []
    with open("data.txt", "r") as f:
        for line in f:
            if line.strip():
                data.append([int(v) for v in line.split()])

    # Largest value anywhere in the table; sizes the counts list
    largest = max(max(row) for row in data)
    counts = [0] * (largest + 1)

    # Column index is 0-based (0 is the first column)
    x = int(input("which column would you like to investigate? "))
    for row in data:
        counts[row[x]] += 1

    # Convert counts to percentages and print one line per value
    total = float(sum(counts))
    print("the probabilities are:")
    for value, count in enumerate(counts):
        print(value, count / total * 100)
Ernst
Author
Posted May 24, 2012 (edited)

You can regard the 9 sets as 9 horse races and the 10 individuals as 10 different horses, with start numbers from 1 to 10. You are supposed to find the 9 winning start numbers, and there are 10^9 possible combinations.

The data are historical, and I want to predict the outcome when I get the next 9 races with 10 horses, i.e. 9 sets with 5 (five) columns of data on each of the 10 horses in each race. There are only numbers, so the horses have no real identity.

Column 1: Start numbers
Columns 2-5: Rankings of the 10 horses in each race, where rank 1 is the best and 10 the worst, based on certain given data concerning each horse.
Column 6: Winner = 1, losers = 0.

So I think you can see that I could do logistic classification on the historical data and use that model to predict the winners (1) and losers (0). I regard that as "horizontal modeling".

But I have plenty of data that strongly indicates that, for the start numbers in column 1, very often 4 to 7 of the 9 winning start numbers come from the interval [1, 3]. I regard that as "vertical knowledge".

I hope this makes it clearer what I want. I don't think a logistic classification could give a model that predicts, with any accuracy, which one of the 10^9 possible outcomes is the right one. The "vertical knowledge" could maybe narrow down the number of combinations of interest. I would be happy if I had 7 of the races correct and if the model could pick out the 5,000,000 (0.5%) best combinations to (randomly) choose among, since I of course can't afford to pay for 5,000,000 combinations.

Best regards
Ernst

Edited May 24, 2012 by Ernst
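One possible way to fold the "vertical knowledge" into per-race predictions is to reweight each combination by how often its count of low start numbers occurs historically, relative to how often the per-race model alone would produce that count. The following is only a sketch under stated assumptions: the races are treated as independent, `race_probs` stands in for whatever win probabilities a logistic model would output (here simply uniform), and `target_dist` is a made-up placeholder for the historical frequencies, not real data. All function names are hypothetical.

```python
from math import prod

LOW = {1, 2, 3}  # start numbers the historical 4-7 pattern refers to

def low_count_distribution(race_probs):
    """dist[k] = P(exactly k winners have a start number in LOW),
    assuming independent races. race_probs is a list with one dict
    {start_number: win_probability} per race, each summing to 1."""
    dist = [1.0]
    for probs in race_probs:
        p_low = sum(p for horse, p in probs.items() if horse in LOW)
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1.0 - p_low)   # this race won by a high number
            new[k + 1] += mass * p_low       # this race won by a low number
        dist = new
    return dist

def combo_score(combo, race_probs, model_dist, target_dist):
    """Model probability of a combination, rescaled so that the total mass
    on each low-number count k matches target_dist[k]."""
    base = prod(probs[horse] for horse, probs in zip(combo, race_probs))
    k = sum(1 for horse in combo if horse in LOW)
    if model_dist[k] == 0.0:
        return 0.0
    return base * target_dist[k] / model_dist[k]

# Placeholder inputs (assumptions, not real data): uniform per-race
# probabilities and an invented historical distribution over k = 0..9.
race_probs = [{h: 0.1 for h in range(1, 11)} for _ in range(9)]
target_dist = [0.0, 0.0, 0.0, 0.0, 0.25, 0.25, 0.25, 0.25, 0.0, 0.0]

model_dist = low_count_distribution(race_probs)
example = (2, 7, 1, 3, 2, 6, 1, 5, 3)  # the typical result from the first post
print(combo_score(example, race_probs, model_dist, target_dist))
```

With real per-race probabilities, combinations could then be ranked by this score instead of the raw product. Whether the reweighting helps depends entirely on how stable the 4-7 pattern is in the historical data.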
phillip1882
Posted May 25, 2012

Okay, that makes much more sense. So basically what you're looking for is the number of times a "horse" gets ranked 1-3, with the best horses given the highest chances of winning. For example, horse 3 in the data block in the initial post should receive a high probability, as should horse 5. In particular, the first non-identity column gives the best chances of winning, so horse 2 should also receive a high probability.

With the reference to learning algorithms, I'm guessing you want the computer to figure out how valuable each column is, based on the number of wins and losses each horse receives with the given data.

Give me a few days; let me see what I can drum up.
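As a very rough first look at "how valuable each column is", one could tabulate how often each rank value in a column coincides with a win. The sketch below is not a learning algorithm, only the bookkeeping such an analysis might start from; `win_rate_by_rank` is a hypothetical helper, and the sample rows are the six complete rows from the initial post, which is far too little data to conclude anything.

```python
from collections import defaultdict

def win_rate_by_rank(rows, column):
    """Fraction of rows that are winners, grouped by the rank in `column`.
    Columns are 0-based; the last column (index 5) is the 0/1 win flag."""
    wins = defaultdict(int)
    seen = defaultdict(int)
    for row in rows:
        rank = row[column]
        seen[rank] += 1
        wins[rank] += row[5]
    return {rank: wins[rank] / seen[rank] for rank in sorted(seen)}

# The six complete rows of the example race from the initial post.
rows = [
    [1, 4, 9, 3, 2, 0],
    [2, 1, 4, 6, 8, 1],
    [3, 2, 1, 4, 3, 0],
    [4, 10, 3, 1, 5, 0],
    [5, 3, 2, 5, 1, 0],
    [6, 5, 6, 8, 7, 0],
]

# Index 1 is the first ranking column ("column 2" in the posts).
for column in range(1, 5):
    print(column, win_rate_by_rank(rows, column))
```

Run over many historical races, a column whose win rate drops steeply as the rank gets worse would be more informative than one whose win rate stays flat.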