Page Automation with PhantomJS

Because PhantomJS can load and manipulate a web page, it is perfect to carry out various page automation tasks.

DOM Manipulation

Since the script is executed as if it is running on a web browser, standard DOM scripting and CSS selectors work just fine.

The following useragent.js example demonstrates reading the textContent property of the element whose id is qua:

var page = require('webpage').create();
console.log('The default user agent is ' + page.settings.userAgent);
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.httpuseragent.org', function(status) {
  if (status !== 'success') {
    console.log('Unable to access network');
  } else {
    var ua = page.evaluate(function() {
      return document.getElementById('qua').textContent;
    });
    console.log(ua);
  }
  phantom.exit();
});

The above example also shows the approach to customize the User-Agent string seen by the remote web server.

Use jQuery and Other Libraries

As of version 1.6, you are also able to include jQuery into your page using page.includeJs as follows:

var page = require('webpage').create();
page.open('http://www.sample.com', function() {
  page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
    page.evaluate(function() {
      $("button").click();
    });
    phantom.exit()
  });
});

The above snippet will open up a web page, include the jQuery library into the page, and then click on all buttons using jQuery. It will then exit from the web page.

Make sure to put the phantom.exit() statement within the page.includeJs or else it may exit prematurely before the JavaScript library is included.

The Webpage instance

Suppose you have an instance of the webpage:

var page = require('webpage').create();

What can be extracted and executed on it?

Attributes

page.canGoForward -> boolean

If window.history.forward would be a valid action

page.canGoBack -> boolean

If window.history.back would be a valid action

page.clipRect -> object

Can be set to an object of the following form:

{ top: 0, left: 0, width: 1024, height: 768 }

It specifies which part of the screen will be taken in the screenshot

page.content -> string

The whole HTML content of the page

page.cookies -> object

The cookies. They have this form:

{
  'name' : 'Valid-Cookie-Name',
  'value' : 'Valid-Cookie-Value',
  'domain' : 'localhost',
  'path' : '/foo',
  'httponly' : true,
  'secure' : false
}

page.customHeaders -> object

TODO

page.event -> object

Contains modifiers and keys TODO

page.libraryPath -> string

The current library path, usually it’s the directory where the script is executed from

page.loading -> boolean

If the page is loading or not

page.loadingProgress -> number

The percentage that has been loaded. 100 means that the page is loaded.

TODO

page.offlineStoragePath -> string

Where the sqlite3 localstorage and other offline data are stored.

page.offlineStorageQuota, ‘number

The quota in bytes that can be stored offline

page.paperSize -> object

Similar to clipRect but takes real paper sizes such as A4. For an in depth example, check this example: printheaderfooter.js.

page.plainText -> string

The elements that are plain text in the page

page.scrollPosition -> object

The current scrolling position as an object of the following form:

{ left: 0, top: 0 }

page.settings -> object

The settings which currently only has the useragent string, e.g page.settings.userAgent = 'SpecialAgent';

page.title -> string

The page title

page.url -> string

The page url

page.viewportSize -> object

The browser size which is in the following form:

{ width: 1024, height: 768 }

page.windowName -> string

The name of the browser window that is assigned by the WM.

page.zoomFactor -> number

The zoom factor. 1 is the normal zoom.

Functions

Handlers/Callbacks

List of all the page events:

For more information check this in depth example: page_event.js.