Eliminating Duplicates

Multiplicity

Yesterday I ran into a programming problem and I didn’t like any of the solutions I came up with. I went to bed kind of grouchy about it.

Here’s the problem: Given a huge number of strings, return an array that has each string appear just once. Eliminate all duplicates.

This is a common enough problem in programming that there exists a large body of literature about it.

The solutions I found, though, were pretty ugly. They either involved nested loops and a lot of comparisons, or brain-busting regular expressions.

Now, I’m always willing to look at the regular expression solution to a problem, but for me a RegEx solution is a lot of work. It’s like trying to dredge up my high school Latin. You know the joke–there are 10 kinds of programmers, those who understand Regular Expressions, and those who don’t.

I wanted small, simple code. I couldn’t find anything that really felt right.

Going with the Flow

What kept bugging me? In the back of my head, I kept thinking that all these solutions are coming from other languages. They’re not taking advantage of JavaScript.

Isn’t there something in JavaScript that naturally removes duplicates? And finally, it hit me: object keys.

Objects in JavaScript are hashes–they are made of two parts: the left part and the right part.

{"left":right}

The key is unique, but of course the value can be duplicated.

Then I had it: The “key” is the key. All I have to do is loop through the strings and assign them to the keys of an object. Hashes are an automatic duplicate removal machine.

And it works. JavaScript does the work of eliminating the duplicates naturally. I’m not doing the heavy lifting in JavaScript. Instead, JavaScript is doing the heavy lifting internally in some kick-ass compiled C loop.

My code stays clean and minty fresh because I’m relying on heavily tested and optimized code inside the JavaScript interpreter of the browser.

function eliminateDuplicates(arr) {
  var i,
      len=arr.length,
      out=[],
      obj={};

  for (i=0;i<len;i++) {
    obj[arr[i]]=0;
  }
  for (i in obj) {
    out.push(i);
  }
  return out;
}
var a=[];
var b=[];

a[0]="fish";
a[1]="fishes";
a[2]="fish";
a[3]="1";
a[4]="fishes";

b=eliminateDuplicates(a);
console.log(a);
console.log(b);

["fish", "fishes", "fish", "1", "fishes"]
["fish", "fishes", "1"]

Further

Armed with this concept, I was able to simplify my code even more. My data didn’t start out as an array. Instead, it was coming from a file, so I just grabbed the strings as they came in and stuffed them right into the object as keys.

Every language has its own strengths and weaknesses. You get a nice feeling when you finally see the problem you’re trying to solve in the context of the tool you’re using to do the job.

7 Comments

  1. Nosredna said,

    August 22, 2008 at 8:11 pm

    The germ for this idea is here:

    http://dreaminginjavascript.wordpress.com/2008/07/08/you-could-lose-your-mind/

  2. alsanan said,

    August 25, 2008 at 2:02 pm

    Smart! …like always ;-)

  3. Nosredna said,

    August 25, 2008 at 3:56 pm

    >>Smart!

    Yeah, maybe. But now I’m thinking all the extra fuss I went to with the asterisk only applied to an earlier revision of the code, and that I can probably just do this…

    function eliminateDuplicates(arr) {
    var i;
    var len=arr.length;
    var out=[];
    var obj={};

    for (i=0;i<len;i++) {
    obj[arr[i]]=0;
    }
    for (i in obj) {
    out.push(i);
    }
    return out;
    }

  4. Nosredna said,

    August 25, 2008 at 4:19 pm

    Yeah, I tested the simplified code in all 4 major browsers. I’m going to change the article to reflect the simplified code.

  5. Nosredna said,

    August 28, 2008 at 11:55 pm

    My buddy Ryan suggested this:

    function eliminateDuplicates(arr) {
    var i;
    var len=arr.length;
    var out=[];
    var obj={};

    for (i=0;i<len;i++) {

    if (!obj[arr[i]])

    {

    obj[arr[i]]={};

    out.push(arr[i]);

    }

    }
    return out;
    }

    This way you check for a key, and if it evaluates to falsey (because null is falsey) it will add the key and add to the return array in one shot. It doesn’t save any orders of magnitude in execution time (just takes it from 2nLogn to nlogn), but it should be twice as fast which may make a difference when huge arrays are passed into it.

  6. breton said,

    September 27, 2008 at 2:48 pm

    I find this works pretty damn well for eliminating duplicates:

    myarray.filter(function(o,i,a){ return a.indexOf(o) === i; });

  7. October 25, 2008 at 6:44 am

    Great solution. I like your thought process. It is always important, when searching for optimizations, to consider the features that the language you are working with has to offer.

    Also, thanks for your great comments.


Post a Comment