Page Automation with PhantomJS

Because PhantomJS can load and manipulate a web page, it is well suited to a variety of page automation tasks.

DOM Manipulation

Because the script runs as if it were inside a web browser, standard DOM scripting and CSS selectors work as expected.

The following useragent.js example demonstrates reading the textContent property of the element whose id is qua:

var page = require('webpage').create();
console.log('The default user agent is ' + page.settings.userAgent);
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.httpuseragent.org', function(status) {
  if (status !== 'success') {
    console.log('Unable to access network');
  } else {
    var ua = page.evaluate(function() {
      return document.getElementById('qua').textContent;
    });
    console.log(ua);
  }
  phantom.exit();
});

The above example also shows the approach to customize the User-Agent string seen by the remote web server.
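Because CSS selectors work inside the page context, the same pattern extends to reading several elements at once. The following is a minimal sketch rather than one of the bundled examples: the helper name collectText and the 'h2' selector are illustrative assumptions, and it relies on page.evaluate forwarding extra arguments to the function that runs in the page.

```javascript
// Hypothetical helper: returns the trimmed text of every element
// that matches a CSS selector.
function collectText(page, selector) {
  // The callback runs inside the page and cannot see outer variables,
  // so the selector is passed as an extra argument to page.evaluate.
  return page.evaluate(function(sel) {
    var nodes = document.querySelectorAll(sel);
    return Array.prototype.map.call(nodes, function(node) {
      return node.textContent.trim();
    });
  }, selector);
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.open('http://www.httpuseragent.org', function(status) {
    if (status === 'success') {
      // 'h2' is an assumed selector; adjust it to the page being read.
      console.log(collectText(page, 'h2').join('\n'));
    } else {
      console.log('Unable to access network');
    }
    phantom.exit();
  });
}
```

Only serializable values (strings, numbers, plain objects and arrays) should be returned from the page, which is why the sketch returns an array of strings instead of DOM nodes.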

Use jQuery and Other Libraries

As of version 1.6, you are also able to include jQuery into your page using page.includeJs as follows:

var page = require('webpage').create();
page.open('http://www.sample.com', function() {
  page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
    page.evaluate(function() {
      $("button").click();
    });
    phantom.exit();
  });
});

The above snippet opens a web page, includes the jQuery library in the page, and then clicks every button using jQuery. It then exits PhantomJS.

Make sure to put the phantom.exit() statement inside the page.includeJs callback; otherwise the script may exit before the JavaScript library has been included.
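One way to make the ordering harder to get wrong is to wrap the open, include, and evaluate steps in a single function that reports back through a callback. This is only a sketch: runWithLibrary and its parameters are hypothetical names, not part of the PhantomJS API.

```javascript
// Hypothetical helper: opens a URL, includes a library, runs `action` in the
// page, and only then calls `done`, so the caller cannot exit too early.
function runWithLibrary(page, url, libraryUrl, action, done) {
  page.open(url, function(status) {
    if (status !== 'success') {
      done(new Error('Unable to load ' + url));
      return;
    }
    page.includeJs(libraryUrl, function() {
      // The library is available in the page from this point on.
      var result = page.evaluate(action);
      done(null, result);
    });
  });
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  runWithLibrary(
    page,
    'http://www.sample.com',
    'http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js',
    function() { return $('button').length; },
    function(err, count) {
      console.log(err ? err.message : 'Buttons found: ' + count);
      // Exit only after the library callback has run.
      phantom.exit(err ? 1 : 0);
    }
  );
}
```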

The Webpage instance

Suppose you have an instance of the webpage:

var page = require('webpage').create();

What can be extracted and executed on it?

Attributes

page.canGoForward -> boolean

Whether window.history.forward would be a valid action

page.canGoBack -> boolean

Whether window.history.back would be a valid action

page.clipRect -> object

Can be set to an object of the following form:

{ top: 0, left: 0, width: 1024, height: 768 }

It specifies which portion of the page is captured when the page is rendered to an image
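As an illustration, a clip rectangle can be set just before calling page.render. The sketch below assumes a hypothetical helper named captureRegion and an output file name of region.png; neither comes from the PhantomJS examples.

```javascript
// Hypothetical helper: renders only the given rectangle of the page to a file.
function captureRegion(page, rect, filename) {
  // Assign a complete object in the documented form.
  page.clipRect = {
    top: rect.top,
    left: rect.left,
    width: rect.width,
    height: rect.height
  };
  page.render(filename);
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.open('http://www.sample.com', function(status) {
    if (status === 'success') {
      captureRegion(page, { top: 0, left: 0, width: 400, height: 300 }, 'region.png');
    }
    phantom.exit();
  });
}
```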

page.content -> string

The whole HTML content of the page

page.cookies -> object

The cookies. They have this form:

{
  'name' : 'Valid-Cookie-Name',
  'value' : 'Valid-Cookie-Value',
  'domain' : 'localhost',
  'path' : '/foo',
  'httponly' : true,
  'secure' : false
}
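Cookies in this form can be passed to page.addCookie, which appears in the function list further down and returns a boolean indicating whether the cookie was accepted. The following sketch uses a hypothetical helper, addCookies, to add several cookies and collect the names of any that were rejected.

```javascript
// Hypothetical helper: adds each cookie to the page and returns the names
// of the cookies that page.addCookie did not accept.
function addCookies(page, cookies) {
  var rejected = [];
  cookies.forEach(function(cookie) {
    if (!page.addCookie(cookie)) {
      rejected.push(cookie.name);
    }
  });
  return rejected;
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var notAdded = addCookies(page, [{
    'name': 'Valid-Cookie-Name',
    'value': 'Valid-Cookie-Value',
    'domain': 'localhost',
    'path': '/foo',
    'httponly': true,
    'secure': false
  }]);
  console.log('Rejected cookies: ' + notAdded.length);
  phantom.exit();
}
```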

page.customHeaders -> object

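Additional HTTP request headers that are sent to the server with every request issued by the page, given as an object that maps header names to values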
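page.customHeaders is typically assigned an object of header names and values before the page is opened. The sketch below is an assumption-laden illustration: addHeaders is a hypothetical helper, the header names are arbitrary, and it replaces the whole object rather than mutating it in place, which is the form PhantomJS examples generally use.

```javascript
// Hypothetical helper: merges extra headers into page.customHeaders and
// assigns the result back as a new object.
function addHeaders(page, extra) {
  var merged = {};
  var current = page.customHeaders || {};
  Object.keys(current).forEach(function(name) {
    merged[name] = current[name];
  });
  Object.keys(extra).forEach(function(name) {
    merged[name] = extra[name];
  });
  page.customHeaders = merged;
  return merged;
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  // 'X-Test' and 'DNT' are example header names only.
  addHeaders(page, { 'X-Test': 'foo', 'DNT': '1' });
  page.open('http://www.sample.com', function(status) {
    console.log('Status: ' + status);
    phantom.exit();
  });
}
```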

page.event -> object

Contains modifiers and keys TODO

page.libraryPath -> string

The current library path, usually the directory from which the script is executed

page.loading -> boolean

Whether the page is currently loading

page.loadingProgress -> number

The percentage that has been loaded. 100 means that the page is loaded.

page.navigationLocked -> boolean

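Whether navigation away from the current page is blocked. When set to true, the page stays locked to its current URL.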

page.offlineStoragePath -> string

The path where the SQLite3 localStorage database and other offline data are stored.

page.offlineStorageQuota -> number

The quota in bytes that can be stored offline

page.paperSize -> object

Similar to clipRect, but takes real paper sizes such as A4 and applies when rendering to PDF. For an in-depth example, see printheaderfooter.js.
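A short sketch of the simplest case follows, setting an A4 page and rendering to a PDF file. The helper name renderPdf, the 1cm margin, and the output file name are assumptions chosen for illustration; printheaderfooter.js remains the fuller reference.

```javascript
// Hypothetical helper: sets a paper size and renders the page to a file.
// With a .pdf extension, page.render produces a PDF document.
function renderPdf(page, filename, format) {
  page.paperSize = {
    format: format || 'A4',
    orientation: 'portrait',
    margin: '1cm'
  };
  page.render(filename);
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.open('http://www.sample.com', function(status) {
    if (status === 'success') {
      renderPdf(page, 'sample.pdf');
    }
    phantom.exit();
  });
}
```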

page.plainText -> string

The content of the page as plain text, without markup

page.scrollPosition -> object

The current scrolling position as an object of the following form:

{ left: 0, top: 0 }

page.settings -> object

The page settings, which include the User-Agent string, e.g. page.settings.userAgent = 'SpecialAgent';

page.title -> string

The page title

page.url -> string

The page URL

page.viewportSize -> object

The browser size which is in the following form:

{ width: 1024, height: 768 }
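The viewport is usually set before the page is opened so that the layout is computed at the intended size. The following is a minimal sketch; openWithViewport is a hypothetical helper and the dimensions are arbitrary.

```javascript
// Hypothetical helper: sets the viewport size, then opens the URL.
function openWithViewport(page, url, size, callback) {
  // Assign a complete object in the documented form before loading.
  page.viewportSize = { width: size.width, height: size.height };
  page.open(url, callback);
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  openWithViewport(page, 'http://www.sample.com', { width: 1024, height: 768 }, function(status) {
    console.log('Loaded with status: ' + status);
    phantom.exit();
  });
}
```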

page.windowName -> string

The name of the browser window that is assigned by the WM.

page.zoomFactor -> number

The zoom factor. 1 is the normal zoom.

Functions

page.childFramesCount
page.childFramesName
page.close
page.currentFrameName
page.deleteLater
page.destroyed
page.evaluate
page.initialized
page.injectJs
page.javaScriptAlertSent
page.javaScriptConsoleMessageSent
page.loadFinished
page.loadStarted
page.openUrl
page.release
page.render
page.resourceError
page.resourceReceived
page.resourceRequested
page.uploadFile
page.sendEvent
page.setContent
page.switchToChildFrame
page.switchToMainFrame
page.switchToParentFrame
page.addCookie
page.deleteCookie
page.clearCookies
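Several of these functions are commonly combined. As a sketch, page.evaluate can locate an element and page.sendEvent can then deliver a click at its center. The helper name clickElement and the '#submit' selector are assumptions made for illustration.

```javascript
// Hypothetical helper: clicks the center of the first element matching
// the selector. Returns false when no element matches.
function clickElement(page, selector) {
  var rect = page.evaluate(function(sel) {
    var el = document.querySelector(sel);
    if (!el) {
      return null;
    }
    var r = el.getBoundingClientRect();
    // Copy into a plain object so the value can be passed back to the script.
    return { left: r.left, top: r.top, width: r.width, height: r.height };
  }, selector);
  if (!rect) {
    return false;
  }
  page.sendEvent('click', rect.left + rect.width / 2, rect.top + rect.height / 2);
  return true;
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.open('http://www.sample.com', function(status) {
    if (status === 'success') {
      // '#submit' is an assumed selector.
      console.log('Clicked: ' + clickElement(page, '#submit'));
    }
    phantom.exit();
  });
}
```

Unlike calling click() from inside page.evaluate, page.sendEvent delivers the event at page coordinates, so the element must be inside the visible viewport for the click to land on it.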

Handlers/Callbacks

List of all the page events:

onInitialized
onLoadStarted
onLoadFinished
onUrlChanged
onNavigationRequested
onRepaintRequested
onResourceRequested
onResourceReceived
onResourceError
onResourceTimeout
onAlert
onConsoleMessage
onClosing

For more information, see this in-depth example: page_event.js.
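As a brief illustration of how these callbacks are attached, the sketch below assigns a handful of them to log what the page is doing. attachLogging is a hypothetical helper, and only a subset of the events listed above is covered.

```javascript
// Hypothetical helper: attaches a few event handlers that report activity
// through the supplied `log` function.
function attachLogging(page, log) {
  page.onConsoleMessage = function(msg) {
    log('console: ' + msg);
  };
  page.onAlert = function(msg) {
    log('alert: ' + msg);
  };
  page.onResourceRequested = function(requestData) {
    log('request: ' + requestData.method + ' ' + requestData.url);
  };
  page.onUrlChanged = function(targetUrl) {
    log('url changed: ' + targetUrl);
  };
  page.onLoadStarted = function() {
    log('load started');
  };
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  attachLogging(page, function(line) { console.log(line); });
  page.open('http://www.sample.com', function(status) {
    console.log('Finished with status: ' + status);
    phantom.exit();
  });
}
```

Handlers are attached before page.open so that events fired during the initial load are not missed.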