Dak Posted July 7, 2007

I have a large list of words, with several duplicates, so I wrote a small python script to generate a list of all the dupes (dicnohash is the list, dups is the list of duplicates found in it):

    dups = []
    for a in range(0, len(dicnohash)):
        for b in range(a+1, len(dicnohash)):
            if dicnohash[a] == dicnohash[b]:
                dups.append(dicnohash[b])
                break

It just compares each item with each item after it; if they match, it logs the match in the dups list (which is later written to file) and then breaks to the next item in the list. However, there are nearly 100,000 items in the list, so I guess it's going to do a maximum of somewhere in the range of 5,000,000,000 comparisons, which is probably why it's taking so long.

My question is: how could this be done faster? I honestly can't think of any way of doing it that would take fewer comparisons than the way above. It only needs to be done once, so I'm using the above code plus patience, but I'm still interested.
the tree Posted July 7, 2007

OK, real computer science, whilst far more interesting, isn't exactly my strong point, but I'm thinking that if you ran a bubble sort algorithm (augmented to record duplicates) then fewer comparisons would be made, possibly. Also, you would get the list sorted as a bonus.
Sayonara Posted July 7, 2007

Look into hash tables: the time spent searching for the required data is independent of the number of items stored.
http://en.wikipedia.org/wiki/Hash_table#Time_complexity_and_common_uses_of_hash_tables
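[Editor's note: a minimal sketch of the hash-table idea in Python, using the built-in set type (which is hash-based) rather than a hand-rolled table; dicnohash and dups are the names from the original post.]

    # Sketch: hash-based duplicate detection in a single pass.
    # Uses Python's built-in set (a hash table underneath), so each
    # membership test is roughly O(1) and the whole scan is roughly O(n).
    seen = set()
    dups = []
    for word in dicnohash:
        if word in seen:        # hash lookup instead of comparing against every later item
            dups.append(word)   # note: a word appearing 3+ times is appended more than once
        else:
            seen.add(word)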
timo Posted July 7, 2007

Things you could try:

1) Eliminate words from the text that were already found as duplicates so they're not checked again later. That's little effort with probably little gain; in fact, whether you get a performance gain from that will depend strongly on how the text is stored and how it handles deletion of intermediate words.

2) Read the whole text into a better-suited structure (e.g. an alphabetically sorted binary tree where each node counts the number of occurrences of the word). That changes the complexity from O(n²) to O(n*X)+Y, where X is the complexity of writing to the structure (I think that's typically log(n) for trees) and Y is the complexity of reading out your data. A lot of effort with questionable outcome, but it could be worth investigating, even if it's just for getting a feeling for the advantages and disadvantages of the different approaches.
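[Editor's note: a rough illustration of the count-as-you-insert idea from option 2. Python has no sorted binary tree in the standard library, so this sketch uses a plain dict (a hash map) as the counting structure; the point is the single insertion pass plus one read-out pass, not the specific container.]

    # Sketch: count occurrences while inserting, then read out the duplicates.
    # A dict stands in for the counting tree described above.
    counts = {}
    for word in dicnohash:
        counts[word] = counts.get(word, 0) + 1   # one "insert" per word

    # read-out pass: anything seen more than once is a duplicate
    dups = [word for word, n in counts.items() if n > 1]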
theCPE Posted July 7, 2007

It's all about data structure. If your list is just a simple file or an array then, without restructuring, you have to do it just how you are: checking every word against every word after it. As for the suggestion of deleting duplicates, that would take just as long as finding them, since in order to delete you have to find first. I suppose the best method of attack would be, like someone else mentioned, to first alphabetize the list before removing duplicates; that way you significantly reduce the comparisons that must be done with the duplicate words.
timo Posted July 7, 2007

theCPE wrote: "As for the suggestion of deleting duplicates, that would take just as long as finding them, since in order to delete you have to find first."

If you take a close look at Dak's code, you'll see that he finds a duplicate ("if dicnohash[a] == dicnohash[b]:"), correctly copies it to the list of duplicates ("dups.append(dicnohash[b])"), and then later checks it again over the range [a'+1, len(list)] when a' = b.
theCPE Posted July 7, 2007

Oops. I assumed that by deleting duplicates you were referring to deleting them from within the search list. Yeah, adding already-found duplicates to the check list is a waste of time.
Dak Posted July 8, 2007

He was. E.g., if there's a duplicate at positions 3 and 7, then, once you've compared 3 with 7 and identified the duplicate, it's pointless to then compare 4, 5 and 6 with 7 (as you've already compared them with 3, which == 7), and it's pointless to compare 7 with everything past 7, as you already know the item at position 7 is a duplicate. If, once you've compared 3 with 7 and identified the duplication, you then delete item 7, you prevent the above. Essentially, you'd save len(list) - currentListItem comparisons (though I think what atheist was originally getting at was that deleting items from a list could take a lot of rearranging of data too, so it might not be worth it).

Oh, by the by, it took 3 hours in the end.

the tree wrote: "I'm thinking that if you ran a bubble sort algorithm (augmented to record duplicates) then fewer comparisons would be made, possibly. Also, you would get the list sorted as a bonus."

Ta. I didn't think of sorting them first, but a quick google suggests that bubble sort isn't very good:
http://en.wikipedia.org/wiki/Bubble_sort
http://en.wikipedia.org/wiki/Quicksort
http://www.cs.ubc.ca/spider/harrison/Java/sorting-demo.html

Sayonara wrote: "Look into hash tables"

Will do, but I don't think python has them. Guess this would be a good time to learn C...

===edit===

    dups = []
    dicnohash.sort()
    for a in range(0, len(dicnohash) - 1):
        if dicnohash[a] == dicnohash[a+1]:
            dups.append(dicnohash[a])

Took, like, a few seconds.
timo Posted July 8, 2007

There's a 4-line implementation of a quicksort for Python on Wikibooks. It might not be the best possible solution due to relatively large stack usage, but you could try that ... *looks it up* ... http://en.wikibooks.org/wiki/Algorithm_implementation/Sorting/Quicksort#Python

Considering that, and the relatively large programming overhead in creating a useful tree structure, I think doing a sort on the list would be option #1.
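[Editor's note: for reference, a short recursive quicksort along those lines. This is an illustrative sketch, not the exact Wikibooks listing; like most naive versions it recurses once per partition, so stack depth can be an issue on large or already-sorted inputs.]

    # Sketch: naive recursive quicksort using list comprehensions.
    def quicksort(items):
        if len(items) <= 1:
            return items
        pivot = items[0]
        smaller = [x for x in items[1:] if x < pivot]
        larger  = [x for x in items[1:] if x >= pivot]
        return quicksort(smaller) + [pivot] + quicksort(larger)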
the tree Posted July 8, 2007

Dak wrote: "Ta. I didn't think of sorting them first, but a quick google suggests that bubble sort isn't very good"

Oh, you're right that bubble sort is a piece of crap for actually sorting anything, but I was suggesting finding duplicates as you sort.
Aeternus Posted July 8, 2007

Assuming they are lists, the easiest way would be just to use the Python list sort() method (which I think uses a variation on quicksort called samplesort (ref)) and then go through linearly to find the duplicates. You will still get O(n log n) performance, although the constant involved will be slightly bigger due to the additional linear search. You'll still get much better performance than you are getting now, with fairly minimal effort.
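[Editor's note: a sketch of that sort-then-scan approach. itertools.groupby is used here so a word that occurs three or more times is still reported only once; dicnohash is the list name from the earlier posts.]

    # Sketch: sort with the built-in sort(), then scan adjacent groups for duplicates.
    from itertools import groupby

    dicnohash.sort()                      # O(n log n)
    dups = [word for word, group in groupby(dicnohash)
            if len(list(group)) > 1]      # linear scan over the sorted list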
Ndi Posted July 11, 2007

Not to say this is best, but if I were you and I needed a fast algo, I'd build 26 lists (call them LA, LB, LC, ...) and do one pass splitting the words by first letter. That would give you a distribution of 26 lists with an average weight of ~4,000 words each. Each list can be sorted (sounds odd, but a normal PC takes some 1-2 seconds to do that). Once that is done, scan each list from 0 to Size-2 and if LA[i] = LA[i+1] then delete LA[i] (in the real world this is best done counting backwards). This equates to 2 passes of the list and 26 QSorts; overall it should be quite fast. This eliminates the overhead of a more complex (yet effective) hash.

You can tweak it in real life by timing QSort versus the scanning of the 26 lists. If the sorting is slower, do a wider spread, say the first 2 letters. It all depends on the input data. Note that QSort performs VERY poorly if applied to small or sorted lists. If you expect semi-sorted lists then perhaps a double bubble would be better. Though the simplest approach would most likely be sorting then scanning.

String comparison is slower, so you could do some dirty(er) tricks, like sorting by word length first. That should speed things up a bit.
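[Editor's note: a sketch of that bucket-by-first-letter scheme, using a dict of lists keyed on the first character rather than 26 named variables, which amounts to the same thing. It assumes the words are non-empty; dicnohash and dups are the names from the earlier posts.]

    # Sketch: bucket words by first letter, sort each bucket, then scan
    # adjacent entries in each sorted bucket for duplicates.
    buckets = {}
    for word in dicnohash:
        buckets.setdefault(word[0], []).append(word)   # one pass to split by first letter

    dups = []
    for bucket in buckets.values():
        bucket.sort()                                  # one smaller sort per bucket
        for i in range(len(bucket) - 1):
            if bucket[i] == bucket[i + 1]:
                dups.append(bucket[i])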