Creating my own HAR Reader

I generally use a complex rendering engine to build a few content-heavy websites locally. The source files that generate these sites mostly use a server-side language like PHP or Node JS (along with more complexity). What I often need, though, is just a static site with a few assets and HTML files that I can host on GitHub or similar platforms.

What I generally do is run the site locally at http://localhost/ and filter out only the assets that are actually requested while rendering the page. This is done by capturing all the network requests in an HTTP Archive, extracting the URLs that are needed, and rebuilding the target static website from them.

Contents

  1. HAR & HAR Reader
  2. Motivation & Current Issues
  3. Node JS to the rescue!
    1. Initialising your Node JS project
    2. Reading the HAR file
    3. Parsing JSON
    4. Structure of HAR
    5. Getting the URL Entries
    6. Join and transform the array entries
    7. Writing to a file
  4. Complete File
    1. GitHub Repository
  5. Future Improvements
    1. Sort the lines naturally
    2. Remove the duplicate URLs
    3. Create a cURL request to get it to the right path
    4. Remove the query strings for local file names
    5. Read & write the whole lot to a file of our choice
    6. Extra Runtime Options

HAR & HAR Reader

So what's this HAR? HAR (HTTP Archive) is a file format used by several HTTP session tools to export the captured data. The format is basically a JSON object with a particular field distribution. In any case, please note that not all the fields are mandatory, and many times some information won't be saved to the file.

Also beware, HAR files contain sensitive data!

  • content of the pages you downloaded while recording
  • your cookies, which will allow anyone with the HAR file to impersonate your account
  • all the information that you submitted while recording: personal details, passwords, credit card numbers...

I have been using Google HAR Analyzer, part of the G Suite Toolbox. It's a nice app by Google for analysing captured HAR files, but there's a big downside to it.

Motivation & Current Issues

The HAR Analyzer by Google runs entirely on the client side, so depending on your machine and browser, it may not handle huge files. I am using a MacBook Pro 15" Retina with a 2.8 GHz CPU and 16 GB of RAM, but unfortunately, even such a system couldn't cope with processing a JSON file larger than about 50 MB. That got me thinking: if it's just processing JSON, all I really need is a JavaScript engine!

Node JS to the rescue!

Google Chrome has been praised for having one of the best JavaScript engines, V8, and that's the same engine used in Node JS. All I need is the engine, not a whole browser. Now that I have a JavaScript engine, these are the steps I planned:

  1. Read the HAR file as a string and store it locally.
  2. Parse the string as a JavaScript object, as a HAR file is just one big JSON file.
  3. Get the URL of each entry and push it to an array.
  4. Join the contents of the array the way I want.
  5. Write it all to a new file, because just displaying it is too mainstream.

Initialising your Node JS project

If you haven't already, initialise the project using npm init and give some good details about your app. When asked for the entry point, set it to app.js or anything you want, and then start writing the code. A heads up: I didn't do it this way myself, but it will help you in the long run.

➜  har-reader git:(master) ✗ npm init
This utility will walk you through creating a package.json file.  
It only covers the most common items, and tries to guess sensible defaults.

See `npm help json` for definitive documentation on these fields  
and exactly what they do.

Use `npm install <pkg>` afterwards to install a package and  
save it as a dependency in the package.json file.

Press ^C at any time to quit.  
package name: (har-reader)  
version: (1.0.0)  
description: A quick & easy HAR Reader with Node JS.  
entry point: (index.js) app.js  
test command:  
git repository: (https://github.com/praveenscience/har-reader.git)  
keywords: HAR Reader, HAR, HTTP Archive, Node JS, JSON  
author: Praveen Kumar Purushothaman  
license: (ISC)  
About to write to /Users/praveen/Downloads/Works/HAR-Reader/har-reader/package.json:

{
  "name": "har-reader",
  "version": "1.0.0",
  "description": "A quick & easy HAR Reader with Node JS.",
  "main": "app.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "repository": {
    "type": "git",
    "url": "git+https://github.com/praveenscience/har-reader.git"
  },
  "keywords": [
    "HAR Reader",
    "HAR",
    "HTTP Archive",
    "Node JS",
    "JSON"
  ],
  "author": "Praveen Kumar Purushothaman",
  "license": "ISC",
  "bugs": {
    "url": "https://github.com/praveenscience/har-reader/issues"
  },
  "homepage": "https://github.com/praveenscience/har-reader#readme"
}

Is this OK? (yes) yes  

Once you do that, it generates your package.json file. I then need to create a new file called app.js, resting next to the package.json file, and start coding. Many developers might want to put the main file inside a src folder, but this is a very small app, so I planned to keep it simple. With the above configuration, npm start will most probably not work yet and will throw something like:

➜  har-reader git:(master) ✗ npm start
npm ERR! missing script: start

npm ERR! A complete log of this run can be found in:  
npm ERR!     ~/.npm/_logs/time-debug.log  

To mitigate that issue, add one item under the scripts object to match this:

"scripts": {
  "start": "node app",
  "test": "echo \"Error: no test specified\" && exit 1"
},

Now you will be able to run the application using npm start without any problems.

Reading the HAR file

Reading files using Node JS is very easy: you just need to import the fs module. To use fs in our project, we start by adding the following code:

const fs = require("fs");  

This gives us access to the file system library, which has some handy functions for reading and manipulating files. We will be using two of its methods: readFileSync() for reading and writeFileSync() for writing.

Strictly speaking, fs is a core module that ships with Node JS, so it doesn't need to be installed separately. If you'd still like it recorded in your package.json, you can add it with npm:

➜  har-reader git:(master) ✗ npm install fs --save
npm notice created a lockfile as package-lock.json. You should commit this file.  
+ [email protected]
added 1 package and audited 1 package in 0.694s  
found 0 vulnerabilities  

That is a pretty quick process, and you will then have access to the fs library via require("fs"). The function to read the file synchronously is fs.readFileSync(); we don't need the asynchronous variant, as we have nothing else to do while the file loads. Let's assume we have an arbitrary file around, say file.har, read it, and store its contents in a local constant called fileContents.

const fileContents = fs.readFileSync("file.har", "utf8");  

The above constant will hold the complete contents of the file as a string value (note the "utf8" encoding argument; without it, readFileSync() returns a Buffer rather than a string). The readFileSync() function reads data from a file synchronously: when it executes, it blocks the rest of the code from running until all the data has been read. This is particularly useful when our application has to load the data from the file before it can perform any other tasks.

Parsing JSON

To convert any string value containing valid JSON, we can reliably use the JSON library that's built into JavaScript (and V8, of course). The JSON library contains a lot of handy methods for handling JSON in JavaScript, fittingly, since JSON is JavaScript Object Notation. To parse the JSON, we'll be using the JSON.parse() method, which converts a string containing valid JSON into a JavaScript object.

const jsonContents = JSON.parse(fileContents);  

We have now created a new constant jsonContents holding the JavaScript object that we read and parsed from the HAR file.

Structure of HAR

According to HAR Spec, the structure of the file we just read looks like this:

{
  "log": {
    "version": "1.2",
    "creator": {
      "name": "WebInspector",
      "version": "537.36"
    },
    "pages": [
      {
        "startedDateTime": "2019-04-02T13:21:52.343Z",
        "id": "page_36",
        "title": "http://localhost/index.html",
        "pageTimings": {
          // ...
        }
      }
    ],
    "entries": [
      {
        // ...
        "request": {
          // ...
          "method": "GET",
          "url": "http://localhost/index.html"
          // ...
        }
        // ...
      },
      {
        // ...
        "request": {
          // ...
          "method": "GET",
          "url": "http://localhost/css/style.css"
          // ...
        }
        // ...
      }
    ]
  }
}

Let's say I have a main file named index.html and a CSS file at the path /css/style.css; the index.html pulls in the style.css, and a typical HAR file for those requests (leaving out the other unnecessary stuff) will look like the above. We are mainly interested in the entries array of log, specifically the URL that is called, which is given by log.entries[n].request.url.

We can show the user how many URLs to expect with a quick console.log():

console.log(`Entries: ${jsonContents.log.entries.length}`);  

Getting the URL Entries

If we do a .map() on jsonContents.log.entries, we can collect each entry.request.url into an array with a simple callback. To get the individual entry, we use the first parameter of the callback function passed to .map(). The nice thing about .map() is that we don't need to explicitly push() to an array, as it returns a new array with the required values. This gives us:

const urls = jsonContents.log.entries.map(entry => entry.request.url);  

Now urls will contain something similar to:

const urls = [  
  "http://localhost/index.html",
  "http://localhost/css/style.css"
];

Join and transform the array entries

This is the final part. We just need to output the URLs, one per line. That's easy with the built-in join() method, available on all arrays: if we specify a delimiter, it joins the individual entries with it. The delimiter can be anything from a space or a new line to something more complex. Since we just need a new line, I am going to use \n:

const urlLines = urls.join("\n");  

This will translate the above array into:

const urlLines = `http://localhost/index.html  
http://localhost/css/style.css`;  

Or, for the old-browser people without template literals, it's going to look like:

const urlLines = "http://localhost/index.html\nhttp://localhost/css/style.css";  

Writing to a file

The main reason for creating this app is for when the number of URLs is far bigger than we can comfortably eyeball. Last night I was working with a HAR file of 175 MB and it had about 8,500 URLs. Obviously, when the HAR file is that huge, it may contain many URLs that are duplicated, aren't from the domain we need, and so on; these need to be filtered out.

Again, the fs library comes to our rescue. Similar to the readFile() and readFileSync() functions, there are two functions for writing data to files: writeFile() and writeFileSync(). As the names suggest, writeFile() writes data to a file asynchronously, while writeFileSync() does so synchronously.

The writeFileSync() function accepts two to three parameters: the path of the file to write to, the data to write, and an optional options object (for things like the encoding). Note that if the file doesn't already exist, a new file is created for you.

fs.writeFileSync("file.json", urlLines);  

The above call writes the contents of urlLines to file.json. The reason I specified a JSON output is that, for some reason, I was unable to write the contents to any file type other than .js or .json. I'll need to verify the accuracy of this statement and will update it accordingly.

Complete File

With all the above being said, the complete app.js will look like:

const fs = require("fs");  
const fileContents = fs.readFileSync("file.har", "utf8");  
const jsonContents = JSON.parse(fileContents);  
const urls = jsonContents.log.entries.map(entry => entry.request.url);  
const urlLines = urls.join("\n");  
fs.writeFileSync("file.json", urlLines);  

This reads file.har and writes all the URLs found in it to file.json. I am planning to improve the script with the ideas given in the next section.

GitHub Repository

I have also created a GitHub repository for this and pushed the files there. Do check it out to play around with it, and feel free to contribute to the repository too! You can find the above code at praveenscience/har-reader.

Future Improvements

After making the first version, I found a few things that can be improved a lot. Here are a few thoughts.

Sort the lines naturally

Natural sort ordering is mainly for alphanumeric strings: runs of digits are treated and ordered as single numbers, which lets us look at URLs the same way we would sort them by hand. It is more human-friendly and natural than the pure alphabetical order a machine sees.

For example, in alphabetical sorting "z11" would be sorted before "z2" because "1" is sorted as smaller than "2", while in natural sorting "z2" is sorted before "z11" because "2" is sorted as smaller than "11".

I have written a detailed post on How I achieved Natural Sorting; it's worth a read, with various implementations compared.
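As a quick taste of one way to do it (not necessarily the approach from that post), JavaScript's built-in localeCompare() accepts a numeric option that compares runs of digits as numbers:

// A minimal natural-sort sketch: the numeric option makes
// localeCompare() treat digit runs as numbers, so "z2" < "z11".
const sortedUrls = [...urls].sort((a, b) =>
  a.localeCompare(b, undefined, { numeric: true })
);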

Remove the duplicate URLs

Something that can be done easily with the help of the new Set data structure.
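A minimal sketch of that idea: constructing a Set from the array keeps only unique values, and spreading it back out gives us an array again, preserving the order of first occurrence.

// Round-trip through a Set to drop duplicate URLs.
const uniqueUrls = [...new Set(urls)];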

Create a cURL request to get it to the right path

The best way would be to generate a local shell script that uses cURL to fetch the assets, as sketched below.
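Here's a rough sketch of the idea, assuming each asset should be saved under its URL path relative to the current directory (the output file name fetch.sh is my own choice here, not part of the app yet):

// Generate one cURL command per URL. The --create-dirs flag makes
// curl create the directories in the -o output path as needed.
const curlScript = urls
  .map(url => {
    const { pathname } = new URL(url);
    const localPath = pathname === "/" ? "index.html" : pathname.slice(1);
    return `curl --create-dirs -o "${localPath}" "${url}"`;
  })
  .join("\n");
fs.writeFileSync("fetch.sh", curlScript);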

Remove the query strings for local file names

Something hard here, but definitely possible: removing the ? and everything after it when saving to a local file name.
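A small sketch of how that could look, using the URL API to rebuild each URL from just its origin and path:

// Strip the query string (and hash) by keeping only origin + pathname.
const stripQuery = url => {
  const { origin, pathname } = new URL(url);
  return origin + pathname;
};
// stripQuery("http://localhost/css/style.css?v=1.2")
// => "http://localhost/css/style.css"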

Read & write the whole lot to a file of our choice

The ability to read from and write to files of our choice, rather than the hard-coded file.har and file.json.
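A minimal sketch, assuming the input and output paths arrive as plain command-line arguments (so you'd run it as npm start -- input.har urls.txt; the defaults below are just the current hard-coded names):

// process.argv[0] is node and [1] is the script path,
// so user-supplied arguments start at index 2.
const [input = "file.har", output = "file.json"] = process.argv.slice(2);
const data = JSON.parse(fs.readFileSync(input, "utf8"));
fs.writeFileSync(output, data.log.entries.map(e => e.request.url).join("\n"));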

Extra Runtime Options

We can think of extra runtime options on the command line, like specifying the root or common domain and excluding all others.
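One rough sketch of a hand-rolled --domain flag (the flag name is hypothetical; a fuller implementation might reach for a library such as commander or yargs):

// Look for a hypothetical "--domain example.com" pair in the arguments
// and keep only the URLs whose hostname matches it.
const args = process.argv.slice(2);
const domainIndex = args.indexOf("--domain");
const domain = domainIndex !== -1 ? args[domainIndex + 1] : null;
const filteredUrls = domain
  ? urls.filter(url => new URL(url).hostname === domain)
  : urls;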


