Eliminating Duplicates

Multiplicity

Yesterday I ran into a programming problem and I didn’t like any of the solutions I came up with. I went to bed kind of grouchy about it.

Here’s the problem: Given a huge number of strings, return an array that has each string appear just once. Eliminate all duplicates.

This is a common enough problem in programming that there exists a large body of literature about it.

The solutions I found, though, were pretty ugly. They either involved nested loops and a lot of comparisons, or brain-busting regular expressions.

Now, I’m always willing to look at the regular expression solution to a problem, but for me a RegEx solution is a lot of work. It’s like trying to dredge up my high school Latin. You know the joke–there are 10 kinds of programmers, those who understand Regular Expressions, and those who don’t.

I wanted small, simple code. I couldn’t find anything that really felt right.

Going with the Flow

What kept bugging me? In the back of my head, I kept thinking that all these solutions are coming from other languages. They’re not taking advantage of JavaScript.

Isn’t there something in JavaScript that naturally removes duplicates? And finally, it hit me: object keys.

Objects in JavaScript are hashes–they are made of two parts: the left part and the right part.

{"left":right}

The key is unique, but of course the value can be duplicated.

Then I had it: The “key” is the key. All I have to do is loop through the strings and assign them to the keys of an object. Hashes are an automatic duplicate removal machine.

And it works. JavaScript does the work of eliminating the duplicates naturally. I’m not doing the heavy lifting in JavaScript. Instead, JavaScript is doing the heavy lifting internally in some kick-ass compiled C loop.

My code stays clean and minty fresh because I’m relying on heavily tested and optimized code inside the JavaScript interpreter of the browser.

function eliminateDuplicates(arr) {
  var i,
      len=arr.length,
      out=[],
      obj={};

  for (i=0;i
var a=[];
var b=[];

a[0]="fish";
a[1]="fishes";
a[2]="fish";
a[3]="1";
a[4]="fishes";

b=eliminateDuplicates(a);
console.log(a);
console.log(b);

---

["fish", "fishes", "fish", "1", "fishes"]
["fish", "fishes", "1"]

Further

Armed with this concept, I was able to simplify my code even more. My data didn't start out as an array. Instead, it was coming from a file, so I just grabbed the strings as they came in and stuffed them right into the object as keys.

Every language has its own strengths and weaknesses. You get a nice feeling when you finally see the problem you're trying to solve in the context of the tool you're using to do the job.

Advertisements

Default Parameters, Episode V

I keep a pretty close eye on the searches that bring people to this blog. A lot of you are searching for how to sort in JavaScript (I’ll have to go into more detail on sorting soon). And a whole mess of you want to know about missing and default parameters.

I’m going to take another shot at the default parameter problem. I gave a solution that adds just one line of code per function, but it requires you to pass all parameters in a single object. I’ll adapt that solution to better handle the normal way of passing JavaScript parameters.

Really quickly, let’s talk about JavaScript parameters.

  1. When you pass in fewer than the expected number of parameters, the missing ones come in as undefined.
  2. When you pass in more than the expected number of parameters, the excess ones can be slurped out of the arguments pseudo-array. This array-like variable has the length method, but is missing other array methods.
  3. A common way to implement default parameters is with a series of statements like this: a=a || 1; The problem with such a statement is that so many different values evaluate to false that it’s easy to make a mistake and prevent a valid number like 0 from coming through.
  4. A better way to implement default parameters is like this: if (a===undefined) a=1;
  5. Another common way is if (a==null) a=1; This works pretty well because null and undefined are loosely equal to each other, but other falsy values aren’t equal to either of them.

Let’s modify my defaultHandler from my previous post on the subject so that it works with arrays instead of objects.

    defaultHandler=function(defaults,params) {
        var i;
        for (i=0;i<defaults.length;i++) {
            if (params[i]!==undefined) {
                defaults[i]=params[i];
            }
        }
        return defaults;
    }

Choose Your Poison

You now have a couple choices on how to use it. After the default handler is called, the array params will hold all your parameters in order.

    function test(a,b,c) {
        var params=defaultHandler([1,"banana",3],arguments);

        console.log(params);
    }

If you really want to, you can stuff the values back into the original named parameters…

    function test(a,b,c) {
        var params=defaultHandler([1,"banana",3],arguments);
        a=params[0];
        b=params[1];
        c=params[2];

        console.log(a,b,c);
    }

It’s a shame there’s no way to automate that stuffing of the parameters back into the original variables, isn’t it?

Ah, but there is. In fact, the whole process can be cleaned up with a bit of metaprogramming and some introspection. But that’s for Episode VI.

Note: This is the second post on default parameters. I took a cue from George Lucas on its numbering to try to get some search hits from the Star Wars zealots.

JavaScript Functions are Weird, Too

Remember when I said that JavaScript arrays are weird because they are also objects? Aside from their numbered elements, they can have, in addition, hash elements (a key and a value).

Believe it or not, functions in JavaScript are just as weird. They also are objects. If you have a function named bar, it can have a property named foo.

Doesn’t seem very useful, does it? After all, functions can already have variables declared inside them.

The difference? These special function elements persist from one call of the function to the next. You can use them to replace global variables that might otherwise cause unsightly clutter.

For instance, if you want a function to keep track of how many times it’s been called, this does the trick:

function add(a,b) {
  if (add.called===undefined) {
    add.called=0;
  }
  add.called++;
  return a+b;
}

or, more tersely but confusingly,

function add(a,b) {
  add.called=add.called+1 || 1;
  return a+b;
}

You don’t have to call the function by its name when you’re in the function, you can use arguments.callee instead.

How about a real world example. [Warning: You probably won’t get your money’s worth unless you compare the before-and-after code. I suggest copying and pasting them into a diff program so you can see the changes highlighted. Or you could just print out this whole blog post and take it to the bathroom.]

Alert! Unresponsive Script!

Sometimes, you might have some code that takes so long to execute that the browser alerts the user that a script has become unresponsive. That tends to take a long time for new browsers (they assume you might be running a Rich Internet Application and that such messages may be worse than the trouble they’re trying to solve), but for old browsers (like my nemesis IE6), the time is short. Something like 3 seconds.

Yummy Morsels

To solve this problem we have to break up work into bite-sized pieces.

Here’s our before case, the brute force (slow) solution to finding prime numbers. Adapted from a C version here.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html lang="en">
<head>
    <title><!-- Insert your title here --></title>
    <script type="text/javascript">
        function findPrimes(topCandidate) {
            var primes=[];

            candidate = 2;
            while (candidate <= topCandidate) {
                trialDivisor = 2;
                       prime = true;
                while (trialDivisor * trialDivisor <= candidate) {
                    if (candidate % trialDivisor === 0) {
                        prime = false;
                        break;
                    }
                    trialDivisor++;
                }
                if (prime) {
                    primes.push(candidate);
                }
                candidate++;
            }
            return primes.length;
        }
    </script>
</head>
<body>
    <!-- Insert your content here -->
    <form action="">
        <input type="button"
               onclick='alert(findPrimes(100000))'
               value="Find Primes" />
    </form>
</body>
</html>

Now here’s the version where we only let 1000 numbers be checked at once. After 1000 have been done, we fall out of the code, but before we go we trigger a setTimeout with a delay time of 0. That’s all we need to do to let the browser catch its breath.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html lang="en">
<head>
<title><!-- Insert your title here --></title>
<script type="text/javascript">
    function findPrimes2(topCandidate) {
        var persist=arguments.callee;
        var tries=0;
        if (persist.candidate===undefined) {
            persist.candidate=2;
            persist.primes=[];
            persist.topCandidate=topCandidate;
        }
        keepTrying = true;
        while (persist.candidate <= persist.topCandidate && keepTrying) {
            trialDivisor = 2;
            prime = true;
            while (trialDivisor * trialDivisor <= persist.candidate) {
                if (persist.candidate % trialDivisor === 0) {
                    prime = false;
                    break;
                }
                trialDivisor++;
            }
            if (prime) {
                persist.primes.push(persist.candidate);
            }
            persist.candidate++;
            tries++;
            if (tries>1000) {
                keepTrying=false;
            }
        }
        if (keepTrying) {
            persist.candidate=undefined;
            alert(persist.primes.length);
        } else {
            setTimeout(persist,0);
        }
    }
    </script>
</head>
<body>
    <!-- Insert your content here -->
    <form action="">
        <input type="button"
               onclick='findPrimes2(100000)'
               value="Find Primes" />
    </form>
</body>
</html>

Compare the two versions. In the second, I’ve short-circuited the loop so that it only checks 1000 numbers to see if they are prime. To keep comparisons easy, I’ve tried to make the least number of changes I could. In real-world cases, you start with the idea that you’re going to work in chunks, and your code ends up cleaner.

Some Are Sticky, Some Aren’t

We’ve made some of the variables persistent. These are the ones that have to be kept across invocations of the function. All other variables dry up and blow away.

Since arguments.callee is such a long name, I’ve set the variable persist to hold the current function name.

This is not a recursive function. We let the function finish. It’s set up to be called again with setTimeout. You don’t want to wait around in busy loops, making the browser all gummy and unresponsive.

If you do have responsiveness problems, put a longer delay and check fewer candidates at a time. This will make the process take longer, of course. It’s a trade-off.

Any wait longer than a couple seconds is going to make the user wonder what’s going on. A progress bar is the best solution for long waits. For short waits, a throbberis OK.

Program flow is not obvious when using setTimeout to control flow. In larger programs when you’re using this technique, you may want some kind of handler that’s in charge of passing control from one section of the code to another, instead of hardcoded setTimeout chaining.