Natural Sorting in JavaScript

Problem

Natural sort ordering is mainly for alpha numeric strings, where multiple digits are treated and ordered as a single character, and this helps us to look at the URLs as the same way, how we manually sort it. This way is more human-friendly, natural than how machine sees as pure alphabetical order.

For example, in alphabetical sorting "z11" would be sorted before "z2" because "1" is sorted as smaller than "2", while in natural sorting "z2" is sorted before "z11" because "2" is sorted as smaller than "11".

Contents

  1. Problem
  2. Example
  3. Solution
    1. Code Pen
    2. Other Solution
    3. An IE Issue Appears
    4. A Faster Prototype Version
    5. HumanSort Version
  4. Summary

Example

A best example for this would be, consider the following data set:

[
  "123asd",
  "19asd",
  "12345asd",
  "asd123",
  "asd12"
]

When we do our normal JavaScript's array sort method, it gives this output:

[
  "12345asd",
  "123asd",
  "19asd",
  "asd12",
  "asd123"
]

This is because, the JavaScript's default array sort looks the string as a whole without considering the alphanumeric nature of the strings. There should be something that says that 19asd is smaller than 123asd and also, 123asd is smaller than 12345asd. The desired output will be:

[
  "19asd",
  "123asd",
  "12345asd",
  "asd12",
  "asd123"
]

The above example would have clearly shown what would be the expected output. I want the parser to read the whole 19 and 123 as a number and not the letter by letter comparison.

Solution

There are multiple ways to approach this problem. Most of them are either too complex in their algorithm or there's a lack of browser support for the code. There's a new possibility that's available in the modern browsers. It's localeCompare and by passing the numeric: true option, it will smartly recognize numbers. You can do case-insensitive using sensitivity: 'base'. This method has been tested in Chrome, Firefox, and IE11.

Since we are using a Node JS application, it uses V8 JavaScript engine and since it's by Chrome, we can blindly assume that the support exists. A best example, comparing "10" and "2", the normal JavaScript will do it based on the first "character", but the following method will try to parse the number as string and then does string comparison, if there are more characters to the item.

'10'.localeCompare('2', undefined, {  
  numeric: true,
  sensitivity: 'base'
})

When you are worried about performance, which we should definitely think of, when sorting large number of strings, the article says:

When comparing large numbers of strings, such as in sorting large arrays, it is better to create an Intl.Collator object and use the function provided by its compare property.

Code Pen

Here's a working snippet to demonstrate two different examples:

Other Solution

This one is a complex algorithm. Well, it's too complex to even understand for me. The reason not being complex, but this kind of algorithm is not cross browser compatible at all. For example, the main issue comes with Internet Explorer.

To reiterate, Natural sorting, or Lexicographical order, is sorting a list of strings which contain letters and numbers, in such a way that makes sense to humans rather than computers. Given a list of filenames, a computer would sort them this way:

z1.doc, z10.doc, z17.doc, z2.doc, z23.doc, z3.doc  

While a human would sort it like so:

z1.doc, z2.doc, z3.doc, z10.doc, z17.doc, z23.doc  

Obviously, the second list makes more sense, no? Anyway, I needed an algorithm that would sort any list of strings in that fashion.

Right off the bat, I found two promising candidates, but both failed quite spectacularly on my test array (which included bunches of numbers up to 1000). More searching only seemed to find blog posts which merely linked to these faulty examples.

So, giving up that angle, I decided the best way to get a natural sorting algorithm in javascript was to examine one from another language (preferably one which worked) and port it to javascript.

I took the Perl implementation and ported it to JavaScript resulting in the function below:

function alphanum(a, b) {  
  function chunkify(t) {
    var tz = [], x = 0, y = -1, n = 0, i, j;

    while (i = (j = t.charAt(x++)).charCodeAt(0)) {
      var m = (i == 46 || (i >=48 && i <= 57));
      if (m !== n) {
        tz[++y] = "";
        n = m;
      }
      tz[y] += j;
    }
    return tz;
  }

  var aa = chunkify(a);
  var bb = chunkify(b);

  for (x = 0; aa[x] && bb[x]; x++) {
    if (aa[x] !== bb[x]) {
      var c = Number(aa[x]), d = Number(bb[x]);
      if (c == aa[x] && d == bb[x]) {
        return c - d;
      } else return (aa[x] > bb[x]) ? 1 : -1;
    }
  }
  return aa.length - bb.length;
}

Note that this code is case sensitive, so "A" will sort before "a". You can turn it into a case insensitive sort by lowercasing the incoming a and b test strings. Just change the chunkify lines to:

var aa = chunkify(a.toLowerCase());  
var bb = chunkify(b.toLowerCase());  

An IE Issue Appears

After first posting this script, I came across an issue in IE which caused the script to fail. Don't worry! The lines of code above are the modified - and working - alphanum function. The problem occurred in the script as originally posted; specifically with the two lines:

a = chunkify(a);  
b = chunkify(b);  

Note the variable names, where the assignment tries to place the result back into the original variable. Actually, all that's required for the bug to manifest is to assign any array to the incoming a and b variables. For some strange reason this would fail in IE6 and IE7 while sorting arrays with larger than 23 or 24 elements. In these cases, at seemingly random times during the sort, "undefined" would be passed back to a, causing the next reference to it to trigger an exception.

The following is example code (from a testcase) which triggers the bug in IE. This sort will fail in IE6 and IE7.

function customsort(a, b) {  
  a = ['a']; // assign an array to a
  b = ['b']; // assign an array to b

  // IE will bail out when either a or b is undefined
  return a.length - b.length;
}
var arr = [  
  'a','b','c','d','e','f','g','h','i','j','k','l','m',
  'n','o','p','q','r','s','t','u','v','w','x'
];
arr.sort(customsort);  

To solve this, all it took was to assign the result to new variables (aa and bb) and change all further references to a and b to the new set. Definitely a JS bug in IE, and as it involves the creation and destruction of many arrays within a short time, it is likely some kind of memory issue.

A Faster Prototype Version

For those who value speed over flexibility, I also made this function into an Array method. Because it breaks up all the array strings first, then does the actual sorting, then joins all the sub-arrays back into strings, it is a great deal faster than the version above. That version breaks up two strings for each sort iteration rather than only once for each array element.

It's harder to modify this method to sort multi-dimensional arrays, or arrays of objects with string properties, but the speed gain is hard to ignore. As an added bonus, you can specify case sensitivity on a per-call basis with the caseInsensitive argument (true|false). If you're just going to sort plain arrays of strings, use this version.

Array.prototype.alphanumSort = function(caseInsensitive) {  
  for (var z = 0, t; t = this[z]; z++) {
    this[z] = [];
    var x = 0, y = -1, n = 0, i, j;

    while (i = (j = t.charAt(x++)).charCodeAt(0)) {
      var m = (i == 46 || (i >=48 && i <= 57));
      if (m !== n) {
        this[z][++y] = "";
        n = m;
      }
      this[z][y] += j;
    }
  }

  this.sort(function(a, b) {
    for (var x = 0, aa, bb; (aa = a[x]) && (bb = b[x]); x++) {
      if (caseInsensitive) {
        aa = aa.toLowerCase();
        bb = bb.toLowerCase();
      }
      if (aa !== bb) {
        var c = Number(aa), d = Number(bb);
        if (c == aa && d == bb) {
          return c - d;
        } else return (aa > bb) ? 1 : -1;
      }
    }
    return a.length - b.length;
  });

  for (var z = 0; z < this.length; z++)
    this[z] = this[z].join("");
}

HumanSort Version

Another version is by splitting the digits and comparing them first and then comparing the strings. Not sure if this works with alpha-numeric-alpha strings. But this is worth a try.

Array.prototype.humanSort = function() {  
  return this.sort(function(a, b) {
    aa = a.split(/(\d+)/);
    bb = b.split(/(\d+)/);

    for(var x = 0; x < Math.max(aa.length, bb.length); x++) {
      if(aa[x] != bb[x]) {
        var cmp1 = (isNaN(parseInt(aa[x],10)))? aa[x] : parseInt(aa[x],10);
        var cmp2 = (isNaN(parseInt(bb[x],10)))? bb[x] : parseInt(bb[x],10);
        if(cmp1 == undefined || cmp2 == undefined)
          return aa.length - bb.length;
        else
          return (cmp1 < cmp2) ? -1 : 1;
      }
    }
    return 0;
  });
}

Summary

We did have multiple options to choose from for a good natural sort algorithm. Since our code is interpreted by the V8 browser, it's always better to go for the highly performant localeCompare one and I did go with it for my HAR Reader project. Do have a look at the project on GitHub at praveenscience/har-reader and let me know what you think of the "product" in the comments below.



comments powered by Disqus