Eliminating Duplicates

Multiplicity

Yesterday I ran into a programming problem and I didn’t like any of the solutions I came up with. I went to bed kind of grouchy about it.

Here’s the problem: Given a huge number of strings, return an array that has each string appear just once. Eliminate all duplicates.

This is a common enough problem in programming that there exists a large body of literature about it.

The solutions I found, though, were pretty ugly. They either involved nested loops and a lot of comparisons, or brain-busting regular expressions.

Now, I’m always willing to look at the regular expression solution to a problem, but for me a RegEx solution is a lot of work. It’s like trying to dredge up my high school Latin. You know the joke–there are 10 kinds of programmers, those who understand Regular Expressions, and those who don’t.

I wanted small, simple code. I couldn’t find anything that really felt right.

Going with the Flow

What kept bugging me? In the back of my head, I kept thinking that all these solutions are coming from other languages. They’re not taking advantage of JavaScript.

Isn’t there something in JavaScript that naturally removes duplicates? And finally, it hit me: object keys.

Objects in JavaScript are hashes: each entry maps a key (the left part) to a value (the right part).

{"left":right}

The key is unique, but of course the value can be duplicated.

Then I had it: The “key” is the key. All I have to do is loop through the strings and assign them to the keys of an object. Hashes are an automatic duplicate removal machine.

And it works. JavaScript does the work of eliminating the duplicates naturally. I’m not doing the heavy lifting in JavaScript. Instead, JavaScript is doing the heavy lifting internally in some kick-ass compiled C loop.

My code stays clean and minty fresh because I’m relying on heavily tested and optimized code inside the JavaScript interpreter of the browser.

function eliminateDuplicates(arr) {
  var i,
      len=arr.length,
      out=[],
      obj={};

  for (i=0;i<len;i++) {
    obj[arr[i]]=0;
  }
  for (i in obj) {
    out.push(i);
  }
  return out;
}

var a=[];
var b=[];

a[0]="fish";
a[1]="fishes";
a[2]="fish";
a[3]="1";
a[4]="fishes";

b=eliminateDuplicates(a);
console.log(a);
console.log(b);

---

["fish", "fishes", "fish", "1", "fishes"]
["fish", "fishes", "1"]

Further

Armed with this concept, I was able to simplify my code even more. My data didn't start out as an array. Instead, it was coming from a file, so I just grabbed the strings as they came in and stuffed them right into the object as keys.
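A sketch of that streaming version (the file handling itself is elided; assume handleLine is called once per incoming string):

```javascript
var seen = {};

// Called once for each string as it arrives from the file.
function handleLine(line) {
  seen[line] = 0; // duplicates just overwrite the same key
}

// Once the input is exhausted, the keys are the unique strings.
function uniqueStrings() {
  var key, out = [];
  for (key in seen) {
    out.push(key);
  }
  return out;
}

handleLine("fish");
handleLine("fishes");
handleLine("fish");
console.log(uniqueStrings()); // ["fish", "fishes"]
```

No intermediate array at all: the object does the deduplication as the data arrives.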

Every language has its own strengths and weaknesses. You get a nice feeling when you finally see the problem you're trying to solve in the context of the tool you're using to do the job.


17 Comments

  1. Nosredna said,

    August 22, 2008 at 8:11 pm

    The germ for this idea is here:

    http://dreaminginjavascript.wordpress.com/2008/07/08/you-could-lose-your-mind/

  2. alsanan said,

    August 25, 2008 at 2:02 pm

    Smart! …like always ;-)

  3. Nosredna said,

    August 25, 2008 at 3:56 pm

    >>Smart!

    Yeah, maybe. But now I’m thinking all the extra fuss I went to with the asterisk only applied to an earlier revision of the code, and that I can probably just do this…

    function eliminateDuplicates(arr) {
        var i;
        var len=arr.length;
        var out=[];
        var obj={};

        for (i=0;i<len;i++) {
            obj[arr[i]]=0;
        }
        for (i in obj) {
            out.push(i);
        }
        return out;
    }

  4. Nosredna said,

    August 25, 2008 at 4:19 pm

    Yeah, I tested the simplified code in all 4 major browsers. I’m going to change the article to reflect the simplified code.

  5. Nosredna said,

    August 28, 2008 at 11:55 pm

    My buddy Ryan suggested this:

    function eliminateDuplicates(arr) {
        var i;
        var len=arr.length;
        var out=[];
        var obj={};

        for (i=0;i<len;i++) {
            if (!obj[arr[i]]) {
                obj[arr[i]]={};
                out.push(arr[i]);
            }
        }
        return out;
    }

    This way you check for a key, and if the lookup evaluates to falsy (a missing key yields undefined, which is falsy) it adds the key and appends to the output array in one shot. It doesn't change the asymptotic complexity (both versions are roughly linear in the number of strings), but collapsing two passes into one should be about twice as fast, which may make a difference when huge arrays are passed in.
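One edge case with that truthiness check: an empty object still inherits properties such as "constructor" from Object.prototype, so obj["constructor"] is truthy and that string would be silently dropped. A hasOwnProperty guard (a sketch, with my own function name) sidesteps it:

```javascript
function eliminateDuplicatesSafe(arr) {
    var i, len = arr.length, out = [], obj = {};
    for (i = 0; i < len; i++) {
        // Own-property check avoids false hits on inherited keys.
        if (!Object.prototype.hasOwnProperty.call(obj, arr[i])) {
            obj[arr[i]] = true;
            out.push(arr[i]);
        }
    }
    return out;
}

console.log(eliminateDuplicatesSafe(["constructor", "fish", "constructor"]));
// ["constructor", "fish"]; the plain truthiness check would lose "constructor"
```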

  6. breton said,

    September 27, 2008 at 2:48 pm

    I find this works pretty damn well for eliminating duplicates:

    myarray.filter(function(o,i,a){ return a.indexOf(o) === i; });

    • Tomasz B. said,

      April 4, 2010 at 2:21 pm

      brilliant!
      That is exactly what JS was invented for!

  7. October 25, 2008 at 6:44 am

    Great solution. I like your thought process. It is always important, when searching for optimizations, to consider the features that the language you are working with has to offer.

    Also, thanks for your great comments.

  8. vjeux said,

    August 30, 2009 at 6:17 pm

    function eliminateDuplicates(input) {
        var i, curr, last, output = [], len = 0;

        input.sort();
        for (i in input) {
            curr = input[i];
            if (curr !== last) {
                output[len++] = curr;
            }
            last = curr;
        }
        return output;
    }

    Instead of using an object as a hash table, I made a version that first sorts all the strings and then walks the sorted list, keeping only strings that differ from the previous one.

    However, this is not as fast as your last version. And breton's may be the shortest one, but it performs really badly.

    I’ve made a little bench script : http://blog.vjeux.com/wp-content/uploads/2009/08/javascript.removeDuplicates.html

    Chrome 3
    Method #2 [Nosredna] (100 000) : 47ms
    Method #0 [vjeux] (100 000) : 463ms
    Method #1 [breton] (100 000) : 39022ms

    Firefox 3.5
    Method #2 [Nosredna] (100 000) : 112ms
    Method #0 [vjeux] (100 000) : 205ms
    Method #1 [breton] (100 000) : 27352ms

  9. unscriptable said,

    December 8, 2009 at 2:00 am

    Most modern browsers are likely to have pretty quick Array sort() algorithms. Therefore, I use the following method (copied right out of my forthcoming js project and tweaked to remove dojo dependencies). By pre-sorting the items, you only need one additional loop to remove adjacent duplicates. It works for any data type as long as you can supply a comparison function:

        function dedup (/* Array */ array, /* Function */ compareFunc, /* Object? */ context, /* Boolean? */ preemptive) {
            //  summary: removes duplicate items from an array. compareFunc should behave
            //      exactly like the lambda function used in array.sort(), returning:
            //       0 - if the two items to compare are equivalent
            //      -1 - if the first item should be sorted to an earlier position than the second
            //       1 - if the first item should be sorted to a later position than the second
            //      Returns an array with no duplicates.
            //      Note: items found later in the array are favored over earlier items. This means
            //      that if an earlier item is found to be a duplicate of a later item, it is not
            //      included in the returned array. You might ask, "Who cares which one is favored?"
            //      Glad you asked! It depends upon how you define "duplicate". If you wanted to
            //      remove all nodes with duplicate ids, you could supply a compareFunc that inspected
            //      node ids.  The nodes are not identical in this case, just the ids.
            //      If you wish to favor the earlier items, set preemptive = true;
            //      Note: undefined values are omitted from the output array.
            //  array: Array - the array to be deduped
            //  compareFunc: Function - see above
            //  context: Object - context on which to run compareFunc (i.e. the "this" reference from
            //      within compareFunc). defaults to null/window
            //  preemptive: Boolean - set to true to favor earlier occurring items (see above)
            var comparator = function () { return compareFunc.apply(context, arguments); },
                prev,
                keepers = [];
            // by first sorting the array, we know that dups will be adjacent to each other
            array.sort(comparator).forEach(function (item, pos) {
                if (typeof item != 'undefined') {
                    if (typeof prev == 'undefined') {
                        keepers.push(pos);
                    }
                    else {
                        // is the previous item different from this one?
                        var eq = comparator(prev, item) == 0;
                        if (preemptive) {
                            if (!eq)
                                keepers.push(pos); // add this one
                        }
                        else {
                            if (eq)
                                keepers.pop(); // remove prev one
                            keepers.push(pos); // always add current one (it may be deleted in the next iteration)
                        }
                    }
                    prev = item;
                }
            });
            return keepers.map(function (pos) { return array[pos]; });
        }
    

    I don’t know if it performs better or worse than your methods. I just thought I’d present it to see if it spurs any further ideas.

    What I like about this method is that – no matter what the data looks like — if you can sort it, you can dedup it, too.

    Note: this version of the algorithm uses Javascript 1.6 array iterators so only try it in the latest browsers.
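To make the "any data type" point concrete, here is a small usage sketch (the record data and compare function are mine, not from the comment above); with a stable sort, the later of two duplicates is the one kept:

```javascript
// The dedup function from the comment above, with its doc comment trimmed.
function dedup(array, compareFunc, context, preemptive) {
    var comparator = function () { return compareFunc.apply(context, arguments); },
        prev,
        keepers = [];
    // By sorting first, duplicates end up adjacent to each other.
    array.sort(comparator).forEach(function (item, pos) {
        if (typeof item != 'undefined') {
            if (typeof prev == 'undefined') {
                keepers.push(pos);
            } else {
                var eq = comparator(prev, item) == 0;
                if (preemptive) {
                    if (!eq) keepers.push(pos);
                } else {
                    if (eq) keepers.pop();
                    keepers.push(pos);
                }
            }
            prev = item;
        }
    });
    return keepers.map(function (pos) { return array[pos]; });
}

// Hypothetical records: dedup by id, not by object identity.
var nodes = [{id: "a", v: 1}, {id: "b", v: 2}, {id: "a", v: 3}];
var byId = function (x, y) { return x.id < y.id ? -1 : x.id > y.id ? 1 : 0; };
var unique = dedup(nodes, byId);
console.log(unique.map(function (n) { return n.id; })); // ["a", "b"]
```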

  10. December 8, 2009 at 5:05 am

    [...] I ran across this blog post that uses the hash-map method and works only on strings. (Actually, it works on any objects that [...]

  11. Sergei said,

    March 14, 2011 at 1:44 pm

    I found your solution through stackoverflow, its great, thank you.

  12. Nitin said,

    September 12, 2011 at 6:51 am

    Hi,

    I know this is a pretty old blog. But this would fail when case-sensitiveness comes into play.
    So if i have the input arr as: ["fish", "fishes", "fish", "1", "FISHES"]
    O/p would be of the form: ["1", "fish", "fishes", "FISHES"]

    Thanks,
    Nitin

  13. Andrea said,

    March 8, 2012 at 1:20 pm

    so long and thanks for all the fish

  14. deungeon said,

    April 7, 2012 at 6:43 am

    thanks, very useful

  15. Ankit said,

    April 16, 2012 at 2:13 am

    awesome man … nice one

