Eliminating Duplicates

Multiplicity

Yesterday I ran into a programming problem and I didn’t like any of the solutions I came up with. I went to bed kind of grouchy about it.

Here’s the problem: Given a huge number of strings, return an array that has each string appear just once. Eliminate all duplicates.

This is a common enough problem in programming that there exists a large body of literature about it.

The solutions I found, though, were pretty ugly. They either involved nested loops and a lot of comparisons, or brain-busting regular expressions.

Now, I’m always willing to look at the regular expression solution to a problem, but for me a RegEx solution is a lot of work. It’s like trying to dredge up my high school Latin. You know the joke–there are 10 kinds of programmers, those who understand Regular Expressions, and those who don’t.

I wanted small, simple code. I couldn’t find anything that really felt right.

Going with the Flow

What kept bugging me? In the back of my head, I kept thinking that all these solutions are coming from other languages. They’re not taking advantage of JavaScript.

Isn’t there something in JavaScript that naturally removes duplicates? And finally, it hit me: object keys.

Objects in JavaScript are hashes–they are made of two parts: the left part and the right part.

{"left":right}

The key is unique, but of course the value can be duplicated.

Then I had it: The “key” is the key. All I have to do is loop through the strings and assign them to the keys of an object. Hashes are an automatic duplicate removal machine.

And it works. JavaScript does the work of eliminating the duplicates naturally. I’m not doing the heavy lifting in JavaScript. Instead, JavaScript is doing the heavy lifting internally in some kick-ass compiled C loop.

My code stays clean and minty fresh because I’m relying on heavily tested and optimized code inside the JavaScript interpreter of the browser.

function eliminateDuplicates(arr) {
  var i,
      len=arr.length,
      out=[],
      obj={};

  for (i=0;i
var a=[];
var b=[];

a[0]="fish";
a[1]="fishes";
a[2]="fish";
a[3]="1";
a[4]="fishes";

b=eliminateDuplicates(a);
console.log(a);
console.log(b);

---

["fish", "fishes", "fish", "1", "fishes"]
["fish", "fishes", "1"]

Further

Armed with this concept, I was able to simplify my code even more. My data didn't start out as an array. Instead, it was coming from a file, so I just grabbed the strings as they came in and stuffed them right into the object as keys.

Every language has its own strengths and weaknesses. You get a nice feeling when you finally see the problem you're trying to solve in the context of the tool you're using to do the job.

Quickie: Nameless functions in arrays

In JavaScript, you can write functions that look like this:

    function add(a,b) {
        return a+b;
    }

    function multiply(a,b) {
        return a*b;
    }

It’s the C way of writing functions, so it should look pretty familiar and comfortable. But it’s really just a short cut. There’s another way to write named functions.

    add = function(a,b) {
        return a+b;
    }

    multiply = function(a,b) {
        return a*b;
    }

If we like, we can put functions into objects.

ops={add: function(a,b) {return a+b;},
     multiply: function(a,b) {return a*b;}};

and now we can call the functions ops.add(a,b) and ops.multiply(a,b). Or, you can use another notation and call the functions with ops[“add”](a,b) and ops[“multiply”](a,b);

In a similar manner, because arrays can hold objects, we can do this:

ops=[function(a,b) {return a+b;},
     function(a,b) {return a*b;}];

I’ve stripped out the names of the functions and now they look like the anonymous functions that are so common in JavaScript. In this case we still have a way to call the function–by indexing into the array. Normally we create anonymous functions on the fly and bind them to an event like like a button click and never call the function directly in our code.

To get to these almost-anonymous functions I’ve defined in the array, call ops[0](a,b) or ops[1](a,b).

How Weird are JavaScript Arrays?

How weird are JavaScript arrays? Really weird.

In most programming languages that I’ve worked with, arrays and objects are two different types. An array is a pie, and an object is a cake.

JavaScript is so object-oriented that it’s difficult to draw a line between the pie and the cake. In JavaScript, arrays are cheesecake. This can be disturbing, It can make some programmers angry. I’m going to ask you to try to remain calm. If I upset you, please, just walk away.

It all starts out normally (like the first scene of a Hitchcock film). No surprises yet. Let’s look at an object and an array.

var obj={"species":"dog","name":"buster"};
var arr=["dog","buster"];
alert(obj.species); //dog
alert(arr[0]); //dog

Pretty close. The object uses braces, rather than square brackets. While the object labels its elements, the array numbers them. You use dot notation to get at the info in an object, and the square brackets (array notation) to get at the info in an array.

Now it Starts to Get Interesting

We can also get at object values with this array-like syntax:

alert(obj["species"]); //dog

Now, our object looks almost like an array.

In most languages, arrays are optimized for speed. Each element is the same type and size and can be quickly accessed. In JavaScript, arrays are so loose that there’s really no speeding them up. Each array element can hold a different type.

arr[0]='pig';
arr[1]=true;
arr[2]=[5,2,7,3];
arr[3]={'a':1,'b':2};

An array element can even hold a function (or an array of functions)!

This is massively handy, if unusual.

Sit Down, I Have Something Scary to Tell You

I’m sorry, but it’s time for me to do something that may weird you out. Look at this…

arr["state"]=0;

That’s upsetting. Arrays can do more than just hold their numbered elements, they can also act just like objects, holding named values. If you want to kick something, or yell, I don’t blame you. If you need to take a few deep breaths, I understand.

You have a choice to make. You need to get all cold, steely, and rational and decide if it makes sense to take advantage of what JavaScript offers, or if you’re better off backing off and doing things the way you’d do them in other languages. That depends on your situation. Are you the only coder who will ever see the code? If you’re working with others, you may very well blow their minds if you treat an array like an object. Or maybe you’re working with some like-minded cheesecake aficionados.

Are You Insane?

But there are some very nice things you can do with this. Having all information that’s associated with the array (maybe even the array’s methods) live inside the array can make your code easier to deal with and understand, but only if you understand that an array is a cheesecake.

If you don’t take advantage of this, but you do want to keep everything associated with the array together, you’ll use an object to wrap everything up, and the array will be part of the object. This is more clear to the traditional thinker, but it does have a cost–the overhead of all the extra symbols you’ll be typing to access things.

Here are an array and an object, both designed to hold the same things.

var arr=[],obj={};
arr.state=0;
arr[0]='dog';
arr[1]='cat';
obj.state=0;
obj.data=[];
obj.data[0]='dog';
obj.data[1]='cat';

I can make a pretty good case that the array is “better” in this case, assuming you’re a big fan of cheesecake.

True Story.

I once took a cat into the vet. It was my first time there so I had to fill out a form. One of the things they wanted to know was the pet’s species.

Crap. I dug deep into the part of my brain that was supposed to be holding on to my high school Latin, but I just couldn’t quite get it. I made a stab–felis domesticatus–no, can’t be right.

I walked up to the lady at the desk.

“Do people really know this?”

“Know what?”

“The species of their pet.”

“Yeah, usually.”

I pointed at my cat. “What’s this?”

She blinked, then answered, “It’s a cat.”