Crawl your website, including its login form, with PhantomJS

With PhantomJS, we start a headless WebKit and pilot it with our own scripts. Said differently, we write a script in JavaScript or CoffeeScript which controls a web browser and manipulates the webpage loaded inside. In the past, I’ve used a similar solution called Selenium. PhantomJS is much faster: it doesn’t start a graphical browser (that’s what headless stands for), and you can inject your own JavaScript inside the page (I don’t remember being able to do that with Selenium).

The project is now suspended, but multiple headless browser alternatives exist.

PhantomJS is commonly used for testing websites and HTML-based applications whose content is dynamically updated with JavaScript events and Ajax requests. It is also popular for generating screenshots of webpages and building website previews, a usage illustrated below.

The official website presents PhantomJS as:

Headless Website Testing: Run functional tests with frameworks such as Jasmine, QUnit, Mocha, Capybara, WebDriver, and many others.
Screen Capture: Programmatically capture web contents, including SVG and Canvas. Create web site screenshots with thumbnail preview.
Page Automation: Access and manipulate webpages with the standard DOM API, or with usual libraries like jQuery.
Network Monitoring: Monitor page loading and export as standard HAR files. Automate performance analysis using YSlow and Jenkins.

In my case, I’ve used it to simulate user behavior under high load, in order to create user logs and populate a system like Google Analytics. More specifically, I will introduce a project architecture composed of 3 components:

User-written PhantomJS scripts that I later call “actions”. An action simulates user interactions and can be chained with other actions. For example, a first action could log a user in and a second one could update their personal information.
A generic PhantomJS script to run sequentially multiple actions passed as arguments.
A Node.js script to pilot PhantomJS and simulate concurrent user loads.

To make things more interesting, the user-written scripts will show you how to simulate a user login, or any form submission. Please don’t use it as a basis to log in to your (boy|girl)friend’s Gmail account.

1. The user-written scripts

I will write 2 scripts for illustration purposes. The first logs the user in on a fake website and the second visits two user information pages. Those scripts are written in CoffeeScript and interact with the PhantomJS API, which borrows a lot from the CommonJS specification. Keep in mind that even if it looks a lot like Node.js (it’s JavaScript after all), it runs in a completely different environment.

The login action

webpage = require 'webpage'
module.exports = (callback) ->
  page = webpage.create()
  url = 'https://mywebsite.com/login'
  count = 0
  page.onLoadFinished = ->
    console.log '** login', count
    page.render "login_#{count}.png"
    if count is 0
      page.evaluate ->
        jQuery('#login').val('IDTMAASP15')
        jQuery('#pass').val('azerty1')
        jQuery('[name="loginForm"] [name="submit"]').click()
    else if count is 1
      callback()
    count++
  page.open url, (status) ->
    return callback new Error "Invalid webpage" if status isnt 'success'

The information action

webpage = require 'webpage'
module.exports = (callback) ->
  page = webpage.create()
  count = 0
  page.onLoadFinished = ->
    console.log 'info', count
    page.render "donnees_perso_#{count}.png"
    if count is 0
      page.evaluate ->
        window.location = jQuery('.boxSection [href*=info]')[0].href
    else if count is 1
      page.evaluate ->
        window.location = jQuery('.services [href*=info_perso]')[0].href
    else if count is 2
      page.goBack()
    else if count is 3
      page.evaluate ->
        window.location = jQuery('.services [href*=info_login]')[0].href
    else if count is 4
      callback()
    count++
  page.open 'https://domain/path/to/login', (status) ->
    return callback new Error "Invalid webpage" if status isnt 'success'

A few things in this code are worth commenting on.

The call to page.render generates a screenshot of the webpage as it looks at the time of the call. Generating website screen captures is a common use of PhantomJS.

The code runs inside the PhantomJS execution engine, with the exception of the functions passed to page.evaluate, which run inside the loaded webpage. This simplifies the writing of your PhantomJS script but is a little awkward, in the sense that you can’t share context between those two sections. It is as if the function passed to page.evaluate were serialized with toString and run inside a separate engine.
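
To make this isolation concrete, here is a minimal sketch that runs under plain Node.js rather than PhantomJS. The helper evaluateInSandbox is hypothetical: it only mimics the behavior described above, by serializing a function with toString and running it in a fresh context with Node's vm module. It is not a description of how PhantomJS is implemented.

```javascript
const vm = require('vm');

// Hypothetical stand-in for page.evaluate (an assumption, not the PhantomJS
// implementation): the function is turned into source text, so the closure
// it was defined in is lost, and only JSON-serializable arguments survive.
function evaluateInSandbox(fn, ...args) {
  const source = `(${fn.toString()}).apply(null, ${JSON.stringify(args)})`;
  return vm.runInNewContext(source, {});
}

const username = 'IDTMAASP15';

// 1. Relying on the outer scope fails: `username` does not exist over there.
let closureResult;
try {
  closureResult = evaluateInSandbox(function () { return username; });
} catch (err) {
  closureResult = err.name;
}

// 2. Passing the value explicitly as an argument works.
const argumentResult = evaluateInSandbox(function (name) {
  return 'login as ' + name;
}, username);

console.log(closureResult);
console.log(argumentResult);
```

The first call fails with a ReferenceError while the second one returns the expected string. PhantomJS's page.evaluate accepts extra arguments after the function for the same purpose, as long as they can be serialized to JSON.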

Finally, the page object represents all the pages we will load. It is more appropriate to think of it as a browser tab in which multiple pages are loaded one after the other. The function page.onLoadFinished is called every time a page finishes loading.
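
The counter-based if/else chain used in both actions can be factored into a small helper. The sketch below is written in JavaScript and uses a fake page object so that it runs under plain Node.js; the names onEachLoad and fakePage are hypothetical and not part of the PhantomJS API. It assumes, like the actions above, that each step triggers exactly one new page load.

```javascript
// Hypothetical helper: steps[n] runs when the n-th page load finishes and
// callback fires on the load that follows the last step.
function onEachLoad(page, steps, callback) {
  let count = 0;
  page.onLoadFinished = function () {
    if (count < steps.length) {
      steps[count](page);
    } else if (count === steps.length) {
      callback();
    }
    count++;
  };
}

// Fake page standing in for webpage.create(); it only records what happens.
const fakePage = { log: [] };
let finished = false;

onEachLoad(fakePage, [
  (page) => page.log.push('fill and submit the login form'),
  (page) => page.log.push('open the information page')
], () => { finished = true; });

// Simulate three successive page loads, as PhantomJS would report them.
fakePage.onLoadFinished();
fakePage.onLoadFinished();
fakePage.onLoadFinished();

console.log(fakePage.log, finished);
```

In a real action, each step would typically wrap a call to page.evaluate or page.goBack, as in the two scripts above.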

2. The action runner

This script is also run inside PhantomJS. Its purpose is to run multiple actions sequentially (one after the other) in a generic manner.

The action runner takes a list of actions provided as arguments, loads the JavaScript scripts named after the actions and runs those scripts sequentially.

# Grab arguments
args = require('system').args
# Convert to an array
args = Array.prototype.slice.call(args, 0)
# Remove the script filename
args.shift()
# Callback when all actions have been run
done = (err) ->
  phantom.exit if err then 1 else 0
# Run the next action
next = (err) ->
  n = args.shift()
  return done err if err or not n
  n = require "./#{n}"
  n next
next()
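
Here is the same chaining logic transposed to JavaScript, with in-memory actions instead of files loaded with require, so that it can be tried under plain Node.js. The names runActions and registry are hypothetical and only serve the illustration.

```javascript
// Runs the named actions one after the other and stops at the first error.
function runActions(names, registry, done) {
  const queue = names.slice();
  function next(err) {
    const name = queue.shift();
    if (err || !name) return done(err);
    registry[name](next);
  }
  next();
}

// In-memory actions with the same signature as the scripts above.
const trace = [];
const registry = {
  login: (callback) => { trace.push('login'); callback(); },
  information: (callback) => { trace.push('information'); callback(); },
  broken: (callback) => { trace.push('broken'); callback(new Error('Invalid webpage')); }
};

// Two successful actions: done is called without an error.
let successError = 'not called';
runActions(['login', 'information'], registry, (err) => { successError = err; });

// A failing action: the chain stops and 'login' is never run.
let failure = null;
runActions(['broken', 'login'], registry, (err) => { failure = err; });

console.log(trace, successError, failure && failure.message);
```

Each action receives next as its callback, which is why an action must call its callback exactly once: forgetting to do so stalls the chain, and PhantomJS never exits.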

3. The pilot

The pilot is a Node.js application responsible for managing and monitoring PhantomJS. It is able to simulate concurrent load by running multiple instances of PhantomJS in parallel. To achieve concurrency, I used the Node.js each module. The value passed to each.prototype.parallel indicates how many instances of PhantomJS run at the same time. The value passed to each.prototype.repeat indicates how many times each action will run.

fs = require 'fs'
phantomjs = require 'phantomjs'
each = require 'each'
child = require 'child_process'
cookies = "#{__dirname}/cookies.txt"

# Spawn one PhantomJS process which runs the given actions in sequence
# (process.stdout.write replaces util.print, removed from recent Node.js)
run = (actions, callback) ->
  args = [
    "--ignore-ssl-errors=yes"
    "--cookies-file=#{cookies}"
    "#{__dirname}/run.js"
  ]
  for action in actions then args.push action
  process.stdout.write "\x1b[36m..#{actions.join(' ')} start..\x1b[39m\n"
  web = child.spawn phantomjs.path, args
  # Relay PhantomJS output, stdout in cyan and stderr in magenta
  web.stdout.on 'data', (data) ->
    process.stdout.write "\x1b[36m#{data.toString()}\x1b[39m"
  web.stderr.on 'data', (data) ->
    process.stdout.write "\x1b[35m#{data.toString()}\x1b[39m"
  web.on 'close', (code) ->
    process.stdout.write "\x1b[36m..#{actions.join(' ')} done..\x1b[39m\n"
    if callback
      err = if code isnt 0 then new Error "Invalid exit code #{code}" else null
      callback err

each([
  ['login','information']
  ['login','another_action'] ])
.parallel(2)
.repeat(20)
.on 'item', (scripts, next) ->
  fs.unlink cookies, (err) ->
    run scripts, next
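
For readers curious about what the each module handles for the pilot, here is a minimal sketch of a bounded-concurrency queue in plain JavaScript. runPool is a hypothetical name, and its semantics (the whole list is repeated, at most `parallel` workers run at once, no new work starts after an error) are assumptions made for the illustration, not a description of how each is implemented.

```javascript
// Hypothetical bounded-concurrency runner, not the API of the each module.
function runPool(items, options, worker, done) {
  const queue = [];
  for (let i = 0; i < options.repeat; i++) queue.push(...items);
  let running = 0;
  let failure = null;
  let finished = false;

  function launch() {
    // Start workers until the concurrency limit is reached.
    while (!failure && running < options.parallel && queue.length > 0) {
      running++;
      worker(queue.shift(), (err) => {
        running--;
        if (err && !failure) failure = err;
        launch();
      });
    }
    // Report once, when nothing is running and nothing is left to start.
    if (!finished && running === 0 && (failure || queue.length === 0)) {
      finished = true;
      done(failure);
    }
  }
  launch();
}

// Demo: the worker stands in for the `run` function of the pilot and
// pretends that each PhantomJS process takes 10 ms.
const started = [];
let active = 0;
let peak = 0;

runPool([['login', 'information'], ['login', 'another_action']],
  { parallel: 2, repeat: 3 },
  (scripts, next) => {
    started.push(scripts.join(' '));
    active++;
    peak = Math.max(peak, active);
    setTimeout(() => { active--; next(); }, 10);
  },
  (err) => { console.log('runs:', started.length, 'peak:', peak, 'error:', err); });
```

In the pilot, the worker would be the run function, so that at most two PhantomJS processes are alive at any time.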

Put it all together

In the end, you might create a Node.js project (simply a directory with a package.json file inside), place all the files described above inside the new directory, declare your “phantomjs” and “each” module dependencies (inside the package.json file), install them with npm install and start the pilot script with node. Note that in the pilot above, “run.js” designates the action runner passed to PhantomJS, so the pilot itself needs to live in a different file.

Note about PhantomJS cookies

This is a personal section covering my experience with the cookies support. PhantomJS accepts a “cookies-file” argument with a file path as a value. Basically, a PhantomJS command looks like phantomjs --cookies-file=#{cookies} {more_arguments} {script_path} {script_arguments}.

After a few trials, I wasn’t able to use the cookies file reliably. Running a second script does not honor the persisted session. However, if I don’t exit PhantomJS with phantom.exit() and force quit the application instead, then the cookie file works as expected.

This is one of the two reasons why I came up with an architecture in which I can chain multiple actions. The other reason is speed, since the headless WebKit instance is started fewer times. I don’t blame PhantomJS; it could be something I overlooked in the documentation.
