I am trying to use k-means clustering to profile the usage behaviour of mobile device users. My data consists of system-level and user-level readings such as the number of calls/SMS, CPU/memory usage, and the number of user and system applications/services. The readings are taken every 5 minutes on the mobile device and scaled to the range 0-100. The clustering is done in MATLAB on a computer.
My idea is to use, say, one month of data for training (i.e. clustering), and then compare future data against the existing clusters to measure the (dis)similarity between the two. The assumption is that different users have different usage patterns, so readings from USER B should not fit into the clusters built from USER A's data.
I have two questions:
After training (clustering), how do I compare new data with the existing clusters to determine (dis)similarity, i.e. whether the new data belongs to the same user or not? I am thinking of finding the nearest cluster and then checking whether the point lies within that cluster's boundary.
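To make the idea concrete, here is a rough sketch of what I have in mind. It is written in plain Python rather than MATLAB just to show the logic. The function names, the toy data, and the choice of "boundary" (the largest distance from any training point to its centroid) are my own assumptions, not an established method; a percentile of the training distances might be a less outlier-sensitive alternative.

```python
import math

def nearest_centroid(point, centroids):
    """Return (index, distance) of the centroid closest to `point`."""
    distances = [math.dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=distances.__getitem__)
    return idx, distances[idx]

def cluster_radii(points, labels, centroids):
    """Assumed boundary per cluster: the largest Euclidean distance
    from any training point to the centroid of its own cluster."""
    radii = [0.0] * len(centroids)
    for p, k in zip(points, labels):
        radii[k] = max(radii[k], math.dist(p, centroids[k]))
    return radii

def fits_profile(point, centroids, radii):
    """True if `point` lies within the boundary of its nearest cluster."""
    idx, d = nearest_centroid(point, centroids)
    return d <= radii[idx]

# Toy training data (hypothetical, two features scaled to 0-100).
train = [(8, 10), (12, 10), (10, 8), (10, 12),
         (78, 80), (82, 80), (80, 78), (80, 82)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
centroids = [(10.0, 10.0), (80.0, 80.0)]  # as k-means would return them

radii = cluster_radii(train, labels, centroids)
same_user = fits_profile((11, 10), centroids, radii)   # close to cluster 0
other_user = fits_profile((50, 50), centroids, radii)  # far from both clusters
```

The fraction of new readings for which `fits_profile` is false could then serve as a crude dissimilarity score between the new data and the trained profile.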
I am using a silhouette plot to assess the clustering quality. I get some negative values (see the attached figure). I have read that a negative value means the record is more similar to the records of its neighbouring cluster than to the other members of its own cluster.
Should I be concerned about my results, or is it normal to have some negative values? If it needs to be fixed, how do I detect the readings causing this problem?
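For reference, this is how I understand the silhouette value and how the offending readings could be listed. It is a minimal sketch in plain Python using Euclidean distance; the function name and the toy data are hypothetical, and it assumes there are at least two clusters.

```python
import math

def silhouette_values(points, labels):
    """Silhouette value s = (b - a) / max(a, b) for every point, where
    a = mean distance to the other members of the point's own cluster and
    b = smallest mean distance to the members of any other cluster."""
    clusters = {}
    for i, k in enumerate(labels):
        clusters.setdefault(k, []).append(i)

    values = []
    for i, k in enumerate(labels):
        own = [j for j in clusters[k] if j != i]
        if not own:
            # Convention: a point alone in its cluster gets s = 0.
            values.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != k
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

# Toy 1-D example: the third point is assigned to cluster 0
# but sits much closer to the members of cluster 1.
points = [(0, 0), (1, 0), (4, 0), (5, 0), (6, 0)]
labels = [0, 0, 0, 1, 1]

s = silhouette_values(points, labels)
negative = [i for i, v in enumerate(s) if v < 0]  # indices of suspect readings
```

As far as I can tell, MATLAB's `silhouette` function also returns the per-point values as its first output, so the same `s < 0` filter should identify the readings there.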