How to use machine learning for this problem.

Ernst · May 23, 2012

Recently I have been trying to understand machine learning. I have a problem that machine learning might be able to solve.

In principle this is my problem (in the real case there are more than 10 individuals in each set).

Every day I get 9 data sets that look like this:

1 4 9 3 2 0
2 1 4 6 8 1
3 2 1 4 3 0
4 10 3 1 5 0
5 3 2 5 1 0
6 5 6 8 7 0
.
9 . . . . 0
10 . . . . 0

The first column is always identical across the sets; let's call its values identities. Across the 9 data sets I want a solution that predicts the positive outcome (1) in column 6. In this single data set, row 2 is the one of interest.

Columns 2 to 5 could be regarded as ranks.

9 sets give 10 ^ 9 possible combinations and only one is correct, so a blind guess is right with probability 1 / 10 ^ 9.

I could build a logistic classification model, but I think that would give a poor solution.

On the other hand, there are very strong indications that in each combination of the 9 data sets, nearly always 4 to 7 of the 9 positive outcomes have a column 1 value between 1 and 3. A typical result, listing for each of the 9 sets the column 1 value of the row that has a 1 in column 6, would be:

(2, 7, 1, 3, 2, 6, 1, 5, 3), i.e. 6 individuals in the interval [1, 3].
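To get a feel for how much the 4-7 rule narrows the search on its own, the combinations that satisfy it can be counted directly. The sketch below is only an illustration: it treats every combination as equally likely, and the function name and parameters are made up for this example. With 9 sets, 10 individuals and the interval [1, 3], about 27 % of the 10 ^ 9 combinations remain.

```python
from math import comb

def count_combinations(races=9, horses=10, low=3, k_min=4, k_max=7):
    """Count outcome combinations in which between k_min and k_max of the
    positive outcomes have a column 1 value in 1..low."""
    high = horses - low  # column 1 values outside the interval [1, low]
    return sum(
        # choose which k sets fall in the interval, then pick a value in each set
        comb(races, k) * low**k * high**(races - k)
        for k in range(k_min, k_max + 1)
    )

total = 10**9
allowed = count_combinations()
print(allowed, allowed / total)
```

So the rule alone removes roughly three quarters of the combinations, which is useful but still leaves far too many to cover without a model that ranks them.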

If I regard the logistic classification as horizontal modeling, which I don't think will implicitly capture "the 4-7 knowledge", how could I get that knowledge, which I regard as vertical modeling, into the complete model?

Best regards

Ernst

khaled · May 24, 2012

Here are machine learning models I'm aware of:

- Hidden Markov Model: Wikipedia:HMM

- Bayesian network: Wikipedia:BN

- Neural Networks: Wikipedia:NN

phillip1882 · May 24, 2012

i'm afraid your problem is poorly worded, i can't make much heads or tails of what you're looking for.

it sounds like you're looking for a probabilistic model for determining what numbers occur with what frequency in each column.

here would be one way you could write such a program.

# Count how often each value occurs in one column of data.txt
# and print the relative frequencies as percentages.
data = []
with open("data.txt", "r") as f:
    for line in f:
        fields = line.split()
        if fields:  # skip blank lines
            data.append([int(v) for v in fields])

# Largest value anywhere in the table; it sizes the counting array.
largest = max(max(row) for row in data)

counts = [0] * (largest + 1)  # one counter per value 0..largest
x = int(input("which column would you like to investigate? (0-based) "))
for row in data:
    counts[row[x]] += 1

total = float(sum(counts))
print("the probabilities are:")
for value, count in enumerate(counts):
    print(value, count / total * 100)

Ernst · May 24, 2012

You can regard the 9 sets as 9 horse races and the 10 individuals as 10 different horses, with start numbers from 1 to 10. You are supposed to find the 9 winning start numbers, and there are 10 ^ 9 possible combinations.

The data are historical data and I want to predict the outcome when I get the next 9 races with 10 horses, i.e. 9 sets with 5 (five) columns of data on each of the 10 horses in each race. There are only numbers, so the horses have no real identity.

Column 1: Start numbers

Columns 2 - 5: Rankings of the 10 horses in each race, where rank 1 is the best and 10 the worst, based on certain given data concerning each horse.

Column 6: Winner = 1, losers = 0.

Then I think you understand that I could do logistic classification on the historical data and use that model to predict the winners (1) and losers (0). I regard that as "horizontal modeling".
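One detail of the horizontal model is worth sketching: a plain logistic classifier scores each horse on its own, while each race has exactly one winner. A common way to respect that is to normalise the scores within a race so they sum to 1 (a softmax). This is a minimal sketch, not a fitted model; the function name and the scores are hypothetical stand-ins for whatever the logistic model produces.

```python
import math

def race_win_probabilities(scores):
    """Turn per-horse scores (for example logits from a logistic model)
    into win probabilities that sum to 1 within one race (a softmax)."""
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical scores for the 10 horses of one race, start numbers 1..10.
scores = [0.2, 1.5, 0.9, -0.3, 0.4, -1.0, -0.8, -1.2, -0.5, -1.1]
probs = race_win_probabilities(scores)
favourite = probs.index(max(probs)) + 1  # start number of the most likely winner
print(favourite, round(sum(probs), 6))
```

The probability of a full 9-race combination is then the product of the 9 per-race probabilities, assuming the races are independent.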

But I have plenty of data that strongly indicates that, for the start numbers in column 1, very often 4 - 7 of the 9 winning start numbers come from the interval [1, 3]. I regard that as "vertical knowledge".
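Whether a horizontal model already captures this vertical knowledge can be checked rather than guessed. If the model gives, for each race, the probability that the winner has a start number in [1, 3], the distribution of the number of such winners over 9 races follows from a small dynamic program (a Poisson binomial distribution). The sketch assumes the races are independent, and the probabilities used are hypothetical.

```python
def low_winner_distribution(p_low):
    """p_low[i] is the model probability that race i is won by a start
    number in the low interval. Returns dist, where dist[k] is the
    probability that exactly k races are won from that interval."""
    dist = [1.0]  # before any race: zero low winners with certainty
    for p in p_low:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # this race is won from outside the interval
            nxt[k + 1] += mass * p      # this race is won from inside the interval
        dist = nxt
    return dist

p_low = [0.55] * 9  # hypothetical per-race probabilities from a model
dist = low_winner_distribution(p_low)
p_4_to_7 = sum(dist[4:8])  # probability of 4, 5, 6 or 7 low winners
print(p_4_to_7)
```

If the value for 4 to 7 low winners is close to the historical frequency, the model already reflects the pattern; a clear gap suggests the vertical knowledge adds information that could be used to reweight combinations.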

I hope this makes it more clear what I want.

I don't think a logistic classification could give a model that predicts, with any accuracy, which one of the 10 ^ 9 possible outcomes is the right one. The "vertical knowledge" could maybe narrow down the number of combinations of interest. I would be happy if I had 7 of the races correct and if the model could pick out the 5 000 000 (0.5 %) best combinations to (randomly) choose among, since I of course can't afford to pay for 5 000 000 combinations.
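One simple way to combine the two kinds of knowledge when picking combinations to play is rejection sampling: draw each race's winner from the model's per-race probabilities and keep only the combinations that satisfy the 4-7 rule. This is a sketch under assumptions, not a tuned method; the function name, the uniform probabilities and the hard cut-off at 4 and 7 are all illustrative choices.

```python
import random

def sample_combinations(race_probs, n, k_min=4, k_max=7, low=3, seed=0):
    """Draw up to n distinct winner combinations from per-race win
    probabilities, keeping only those with k_min..k_max winners whose
    start number is in 1..low."""
    rng = random.Random(seed)
    start_numbers = list(range(1, len(race_probs[0]) + 1))
    kept = set()
    attempts = 0
    while len(kept) < n and attempts < 100 * n:  # cap the work if few combinations pass
        attempts += 1
        combo = tuple(rng.choices(start_numbers, weights=p)[0] for p in race_probs)
        n_low = sum(1 for s in combo if s <= low)
        if k_min <= n_low <= k_max:
            kept.add(combo)
    return sorted(kept)

# Hypothetical input: 9 races, 10 horses, every horse equally likely.
race_probs = [[0.1] * 10 for _ in range(9)]
combos = sample_combinations(race_probs, 5)
print(combos)
```

Likely combinations are drawn more often than unlikely ones, so a handful of samples is a random pick from the better part of the allowed space, without having to enumerate millions of combinations first.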

Best regards

Ernst

Edited May 24, 2012 by Ernst

phillip1882 · May 25, 2012

okay that makes much more sense.

so basically what you're looking for is the number of times a "horse" gets ranked 1-3, with the best horses given the highest chances of winning.

for example, horse 3 in the data block at the initial post should receive a high probability, as should horse 5.

in particular, the first non-identity column gives the best chances for winning, so 2 should also receive a high probability.

with the reference to learning algorithms, I'm guessing you want the computer to figure out how valuable each column is, based on the number of wins and losses each horse receives with the given data.
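A very rough first measure of how valuable each column is: for every ranking column, count how often the winner was ranked 1 to 3 in that column. The sketch below assumes each race is a list of rows in the layout from the thread (start number, four ranks, winner flag); the function name is made up, and the only data used are the six complete rows from the opening post.

```python
def top3_hit_rate(races, n_rank_columns=4, top=3):
    """For each ranking column, return the fraction of races whose winner
    was ranked 1..top in that column.
    Each row is [start_number, rank_1, ..., rank_4, winner_flag]."""
    hits = [0] * n_rank_columns
    for race in races:
        winner = next(row for row in race if row[-1] == 1)
        for c in range(n_rank_columns):
            if winner[1 + c] <= top:
                hits[c] += 1
    return [h / len(races) for h in hits]

# The six complete rows from the opening post, treated as one race.
race = [
    [1, 4, 9, 3, 2, 0],
    [2, 1, 4, 6, 8, 1],
    [3, 2, 1, 4, 3, 0],
    [4, 10, 3, 1, 5, 0],
    [5, 3, 2, 5, 1, 0],
    [6, 5, 6, 8, 7, 0],
]
rates = top3_hit_rate([race])
print(rates)
```

With many historical races, columns with a high hit rate are the ones a learned model would be expected to weight most heavily; a single race, as here, says nothing reliable.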

give me a few days, let me see what i can drum up.
