Xittenn Posted January 20, 2011 Posted January 20, 2011 (edited) I am needing to create a 128bit hashing function for a look up table of running objects that are indexed by UUID! The typical hash<T> is not liking the 128bit unsigned integer I am passing it as it cannot be converted to the 64bit size_t internally specified in the algorithm. Why not just store a pointer to the UUID in question? Well this would be terribly convenient wouldn't it? Regardless of the underlying problem I wish to learn the best design policies of developing a hashing function as this is something I haven't directly attempted as of yet. I am analyzing this from a purely Set and Group Theoretic standpoint and am using the very few Abstract Algebra tools available to my disposal. I am not really interested in the code itself as the code tends to hold little to no consequence to my ability in understanding. I guess I am starting with, I wish to find a class [math] S [/math] a binary relation whose [math] Dom(S) [/math] is the class [math] \left \{ x | \exists y \left ( \left \langle x, \; y \right \rangle \in S \right ) \iff \left | \left \langle x, \; S \right \rangle \right | = 1 \right \} [/math] .... I'm sure there are a number of ways to represent this. I just wish to know how something like this is approached on a deeper level first by defining it logically and then by associating the proper algorithms to the underlying class. As for the algorithm, modulo prime seems to be the birthing place of the hashing function but is far from adequate for my purposes. This is where my question of design policy comes into play, is there one? I wasn't sure if this should go in math I just assumed a greater likely hood of understanding in computer sciences .... -------------- I'm pretty sure my iff is wrong and am trying to revise to make it right ... Edited January 20, 2011 by Xittenn
khaled Posted January 20, 2011 Posted January 20, 2011 hashing functions are based on what you will use them for, if you use them for security, then you can use MD5, or MD6 hashing which is already implemented and can be found on the web ... but if you use them for a Hash Table, then you need to work out one for yourself, or lookup the internet for an implemented one, Hash Table is a 1Dimensional Array Of K Pointers To Data Structures, example: Array of K Binary-Tree Pointers, initially empty Hash Function is a function that maps any input to one of the K Data Structures, by returning an output (the index) from 1 to K ... The hash function have to be: 1. secure: given the same input, it returns the same index 2. distributed: given H entries, it should balance H over K Data Structures where the average of every Data Structure Size becomes H/K .. it's then normalized .. Good luck on writing a good hash function and by the way, you look cute,
Xittenn Posted January 20, 2011 Author Posted January 20, 2011 hashing functions are based on what you will use them for, Well as I mentioned I am hashing the UUIDs of running objects! More specifically I am trying to make a substitution for the default supplied hasher for VC10s <unordered_map> .... I guess the above is really just a decoy though for my hidden agenda which is developing my ability to discuss such matters in a more formal manner :/
khaled Posted January 20, 2011 Posted January 20, 2011 (edited) The unordered_map Class: template < class Key, class Ty, class Hash = std::hash<Key>, class Pred = std::equal_to<Key>, class Alloc = std::allocator< std::pair<const Key, Ty> > > class unordered_map; where: + Key: The key type + Ty: The mapped type + Hash: The hash function object type + Pred: The equality comparison function object type. + Alloc: The allocator class. .. So, to Create a Standard unordered_map, that works with your hash function: unordered_map m = new unordered_map<long,MyDataStructure,MyHash<long>>(); where template <class t> class MyHash { MyHash(){} size_t operator()(const t& x) { size_t h = 0; // calculate hash return h; } }; good luck ... I guess the above is really just a decoy though for my hidden agenda which is developing my abilityto discuss such matters in a more formal manner :/ hmm, I know a good formal way of having a discussion ;] Edited January 20, 2011 by khaled
Xittenn Posted January 20, 2011 Author Posted January 20, 2011 (edited) I need a standard hash function that will hash a 128bit unsigned integer .... and when I have finally created it I wish to be capable of formally presenting it from declaration through testing and solution. I have never seen something like this formally tackled and am trying to assess my situation. I am reading through some online literature and it is suggested that Poisson Distribution is a good place to start when testing the collision rate of such algorithms! I'm probably shooting pretty big with this as I am looking for the perfect hasher for a UUID into a 64bit pointer which is probably a pretty ugly thing to ask .... Edited January 20, 2011 by Xittenn
khaled Posted January 20, 2011 Posted January 20, 2011 listen cutie, when we talk about Algorithms, we talk about simplicity, So, how can we represent a 128bit unsigned integer, that would unsigned long so, you simple put your hash function code in here unordered_map m = new unordered_map<unsigned long,MyDataStructure,MyHash<unsigned long>>(); with, template <class t> class MyHash { MyHash(){} size_t operator()(const t& x) { size_t h = 0; // calculate hash as if X is unsigned long return h; } }; about your hash function, it can anything ... it can be MD5, SHA1, or anything you can make up, I once wrote a simple hash function: size_t sum_digits(size_t a) { size_t b = 0; while(a > 0) { b += a % 10; a = a / 10; } return b; } // this hash function is based on digits sum loop // author: khaled khunaifer, 2008 size_t hash(unsigned long n) { size_t h = n; while(h > 9) h = sum_digits(h); return h; } Good luck, .. note that my real name is written in the code
Xittenn Posted January 20, 2011 Author Posted January 20, 2011 /* ----- with a bucket size ~ million up to five ----- */ So, how can we represent a 128bit unsigned integer, that would unsigned long That sentence makes no sense to me .. If you are suggesting I truncate the UUID I think this would be bad because someone might not random generate and might produce a series of UUIDs with a segment of common numbers .... noted :|
khaled Posted January 20, 2011 Posted January 20, 2011 /* ----- with a bucket size ~ million up to five ----- */ That sentence makes no sense to me .. If you are suggesting I truncate the UUID I think this would be bad because someone might not random generate and might produce a series of UUIDs with a segment of common numbers .... noted :| I was afraid of misunderstanding, what I meant simply is ... If the UUID is an unsigned long, then simply convert it to unsigned long to be a valid input to the Hash function ... UUID u; unsigned long n = Convert_UUID_to_unsigned_long(u); size_t h = hash(n); and If the UUID is a structure that contain the unsigned number, then just take it off the structure ... UUID MyUUID(); unsigned long n = MyUUID.get(); size_t h = hash(n); This shouldn't be that difficult ...
Xittenn Posted January 20, 2011 Author Posted January 20, 2011 What if I make a set m which represents the UUID as 32 4bit numbers and find the cartesian product of this set and then multiply that by the 4bit number that is formed when the bits numbered 32, 64, 96 and 128 are assembled? I would generate an index of a rough size of 5million? but I don't see any input in the function for the size of the array :/
khaled Posted January 20, 2011 Posted January 20, 2011 What if I make a set m which represents the UUID as 32 4bit numbers and find the cartesian product of this set and then multiply that by the 4bit number that is formed when the bits numbered 32, 64, 96 and 128 are assembled? I would generate an index of a rough size of 5million? but I don't see any input in the function for the size of the array :/ So, you are working with an array, you should've told me that it was an array from the beginning, This is a complex hash function for a given array of 8-bit array, // this means hash() return values from 0 ~ 1000 const unsigned long MAX_PARAMETER = 1000; // hash function i wrote for you Xittenn size_t hash(unsigned char[] key, size_t size) { unsigned long t = 0; size_t h = 0; // step 1: summation for(int i=0; i<size; i++) t += key[i]; // step 2: reduction while(t > MAX_PARAMETER) t = t MOD (t % (10 * ( t % 10))); h = t; return h; } Good luck,
Xittenn Posted January 20, 2011 Author Posted January 20, 2011 (edited) I could operate on two 64 bit unsigned integers masking &0x000000FF000000FF and shifting >> to optimize?? or other better combinations ... So, you are working with an array, you should've told me that it was an array from the beginning, no, an <unordered_map> I'm assuming they are implementing internally with an array but I will have to figure that out ... now I'm just lost! Edited January 20, 2011 by Xittenn
khaled Posted January 20, 2011 Posted January 20, 2011 (edited) I could operate on two 64 bit unsigned integers masking &0x000000FF000000FF and shifting >> to optimize?? That seem like you are using the final byte of 128-bit number, by masking &0x000000FF000000FF and shifting to retrieve that bytes ... but it's a waste of time, you can use a 1-byte number from the begining, but if for example you are using 128-bit numbers, you can hash it down into the 1-byte number, because if you do masking then all entries with same digits on the F-masked parts are considered to be the same, and that is a main leak in the hash function ... good luck, no, an <unordered_map> I'm assuming they are implementing internally with an array but I will have to figure that out ... now I'm just lost! don't be sad, sometimes it's easier to implement everything from scratch, if your project doesn't depend on some library that does something you cannot implement by yourself. so why not designing your own hash table, your own data structure, and your own hash function ... If standard methods makes your work complex with their constraints, then make your own version with your own constraints ... Edited January 20, 2011 by khaled
Xittenn Posted January 20, 2011 Author Posted January 20, 2011 (edited) I was referring to the optimization of performing the Cartesian product over a 4bit value set of 32 .... which would have been wrong but the idea was present ... ---------------------------------------------------------------------------------------- I think you are right I will make my own container, implemented in the same way as <unordered_map> but optimized for my purposes. I will follow these steps: 1) I will break the UUID into 4 32bit integers and form a set M 2) I will find the Cartesian Product of the set M with itself and this will be the number of buckets 295 147 905 179 352 825 856 3.a) each bucket will be capable of containing 1 152 921 504 606 846 976 entries 3.b) I will hash to a map of pointers each member when allocated will represent the first element of a bucket 4.a) No member will be allocated until there is a subsequent member allocated within a given bucket 4.b) Each bucket will contain a dynamically allocated linked list of pair<key,value> which will be sorted through using an equality function I think this algorithm is promising for my purposes ... pair<key,value> will not have to move when a new pair<key,value> is inserted or removed and memory can be managed rather quickly as well I am however very obviously wrong in assuming the uniqueness of any given product which is a good thing because I am going to want multiple entries per bucket in the case above and preferably with uniform distribution, but this is a problem because I am also not guaranteed to have buckets with more than one entry with the Cardinality of the bucket set as is :/ .... I'll stop editing my post now ... Ok an array of buckets produced by the Cartesian Product of the set of 32 4bit members with itself as originally stated sounds pretty ideal ..... I don't know what kind of performance gain I would have on this though. I feel I have totally missed my mark on why I started this thread. No group specific anything has been used yet and my entries completely lost all formal presentation. I guess I will try harder next time .... Edited January 21, 2011 by Xittenn
khaled Posted January 21, 2011 Posted January 21, 2011 (edited) this happen always with me, but we have the key "try something different" ! and since you are looking for Optimality, I think you should use a Binary Tree instead of a Linked List since Keys are numeric and can be ordered ! If you change your LinkedList into a BinaryTree, Insertion Time and Search Time will be reduced from O(n) to O(log n), it's like the Optimal solution in an NP-complete problem ... I usually give a course to my college's students called "Create My Own Data Structure", We show that creating your own data structure is fun, and makes you feel like building a Castle ! Anyway, the algorithm of your program is simple as you stated: 1. generate whatever you need to store 2. define a new MyHashTable, which will implicitly define new K MyBinaryTree(s) 3. Iterate over your buckets\array\matrix\..etc and add them to the MyHashTable using for(int i=0; i<bucket.length; i++){ MyHashTable.add(bucket); } Good luck .. one more thing, i found out what is your problem, you try to solve everything at once, that will make you lose everything, .. so take things one by one, and them take your time to organize your work, Edited January 21, 2011 by khaled
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now