Splash HTTP API

See Installation to get Splash up and running.

Splash is controlled via an HTTP API. For all endpoints below, parameters may be sent either as GET arguments or encoded as JSON and POSTed with a Content-Type: application/json header.
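
As a rough illustration of the two ways to pass parameters, the Python sketch below builds the same render.html request once with GET arguments and once with a JSON POST body. The helper names and the localhost:8050 address are assumptions made for this example; nothing is sent until a request object is passed to urllib.request.urlopen.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

SPLASH = "http://localhost:8050"  # assumed address of a local Splash instance


def get_request(endpoint, params):
    """Encode the parameters as GET query arguments."""
    return Request(f"{SPLASH}/{endpoint}?{urlencode(params)}")


def post_request(endpoint, params):
    """Encode the parameters as a JSON body with the matching Content-Type."""
    return Request(
        f"{SPLASH}/{endpoint}",
        data=json.dumps(params).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


params = {"url": "http://example.com", "wait": 0.5}
as_get = get_request("render.html", params)
as_post = post_request("render.html", params)
# urllib.request.urlopen(as_get) or urlopen(as_post) would send the request.
```

The JSON form is usually the safer choice for large values such as scripts, since it avoids URL length limits.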

The most versatile endpoints, which provide all Splash features, are execute and run; they allow executing arbitrary Lua rendering scripts.

Other endpoints may be easier to use in specific cases - for example, render.png returns a screenshot in PNG format that can be used as img src without any further processing, and render.json is convenient if you don’t need to interact with a page.

render.html

Return the HTML of the javascript-rendered page.

Arguments:

url : string : required: The url to render.

baseurl : string : optional

The base url to render the page with.

Base HTML content will be fetched from the URL given in the url argument, while relative resources referenced in the HTML used to render the page are fetched using the URL given in the baseurl argument as the base. See also: render.html result looks broken in a browser.

timeout : float : optional

A timeout (in seconds) for the render (defaults to 30).

By default, the maximum allowed value for the timeout is 90 seconds. To override it, start Splash with the --max-timeout command line option. For example, here Splash is configured to allow timeouts up to 5 minutes:

$ docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 300

resource_timeout : float : optional

A timeout (in seconds) for individual network requests.

See also: splash:on_request and its request:set_timeout(timeout) method; splash.resource_timeout attribute.

wait : float : optional

Time (in seconds) to wait for updates after page is loaded (defaults to 0). Increase this value if you expect pages to contain setInterval/setTimeout javascript calls, because with wait=0 callbacks of setInterval/setTimeout won’t be executed. Non-zero wait is also required for PNG and JPEG rendering when doing full-page rendering (see render_all).

Wait time must be less than timeout.

proxy : string : optional

Proxy profile name or proxy URL. See Proxy Profiles.

A proxy URL should have the following format: [protocol://][user:password@]proxyhost[:port]

Where protocol is either http or socks5. If port is not specified, the port 1080 is used by default.
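
To make the proxy URL format concrete, here is a small Python sketch that splits such a value into its parts and applies the default port of 1080. The regular expression and the parse_proxy helper are assumptions written for this example; Splash's own parsing may accept or reject inputs differently.

```python
import re

# Illustrative pattern for [protocol://][user:password@]proxyhost[:port]
PROXY_RE = re.compile(
    r"^(?:(?P<protocol>http|socks5)://)?"
    r"(?:(?P<user>[^:@]+):(?P<password>[^@]*)@)?"
    r"(?P<host>[^:/@]+)"
    r"(?::(?P<port>\d+))?$"
)


def parse_proxy(value):
    """Split a proxy URL into its parts; a missing port becomes 1080."""
    match = PROXY_RE.match(value)
    if match is None:
        raise ValueError(f"not a valid proxy URL: {value!r}")
    parts = match.groupdict()
    parts["port"] = int(parts["port"] or 1080)
    return parts


full = parse_proxy("socks5://user:pw@proxy.example.com:9050")
bare = parse_proxy("proxy.example.com")
```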

js : string : optional: Javascript profile name. See Javascript Profiles.

js_source : string : optional: JavaScript code to be executed in page context. See Executing custom Javascript code within page context.

filters : string : optional: Comma-separated list of request filter names. See Request Filters.

allowed_domains : string : optional: Comma-separated list of allowed domain names. If present, Splash won’t load anything from domains that are not in this list, nor from subdomains of domains that are not in this list.

allowed_content_types : string : optional: Comma-separated list of allowed content types. If present, Splash will abort any request if the response’s content type doesn’t match any of the content types in this list. Wildcards are supported using the fnmatch syntax.

forbidden_content_types : string : optional: Comma-separated list of forbidden content types. If present, Splash will abort any request if the response’s content type matches any of the content types in this list. Wildcards are supported using the fnmatch syntax.
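
The wildcard behaviour of allowed_content_types and forbidden_content_types can be explored locally with Python's fnmatch module. The is_allowed helper below is only an approximation written for this example: how Splash treats letter case, or parameters such as charset, is not covered here.

```python
from fnmatch import fnmatchcase


def is_allowed(content_type, allowed="*", forbidden=""):
    """Return True if a content type passes both comma-separated lists."""
    allowed_patterns = [p for p in allowed.split(",") if p]
    forbidden_patterns = [p for p in forbidden.split(",") if p]
    if any(fnmatchcase(content_type, p) for p in forbidden_patterns):
        return False
    return any(fnmatchcase(content_type, p) for p in allowed_patterns)


# "text/*" matches every text subtype; "image/*" blocks every image subtype.
html_ok = is_allowed("text/html", allowed="text/*,application/json")
png_ok = is_allowed("image/png", forbidden="image/*")
```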

viewport : string : optional

View width and height (in pixels) of the browser viewport to render the web page. Format is “<width>x<height>”, e.g. 800x600. Default value is 1024x768.

‘viewport’ parameter is more important for PNG and JPEG rendering; it is supported for all rendering endpoints because javascript code execution can depend on viewport size.

For backward compatibility reasons, it also accepts ‘full’ as value; viewport=full is semantically equivalent to render_all=1 (see render_all).

images : integer : optional

Whether to download images. Possible values are 1 (download images) and 0 (don’t download images). Default is 1.

Note that cached images may be displayed even if this parameter is 0. You can also use Request Filters to strip unwanted contents based on URL.

headers : JSON array or object : optional

HTTP headers to set for the first outgoing request.

This option is only supported for application/json POST requests. Value could be either a JSON array with (header_name, header_value) pairs or a JSON object with header names as keys and header values as values.

The “User-Agent” header is special: it is used for all outgoing requests, unlike other headers.

body : string : optional: Body of the HTTP POST request to be sent if the method is POST. The default content-type header for POST requests is application/x-www-form-urlencoded.

http_method : string : optional: HTTP method of outgoing Splash request. Default method is GET. Splash also supports POST.

save_args : JSON array or a comma-separated string : optional

A list of argument names to put in cache. Splash will store each argument value in an internal cache and return X-Splash-Saved-Arguments HTTP header with a list of SHA1 hashes for each argument (a semicolon-separated list of name=hash pairs):

name1=9a6747fc6259aa374ab4e1bb03074b6ec672cf99;name2=ba001160ef96fe2a3f938fea9e6762e204a562b3

Client can then use load_args parameter to pass these hashes instead of argument values. This is most useful when argument value is large and doesn’t change often (js_source or lua_source are often good candidates).

load_args : JSON object or a string : optional

Parameter values to load from cache. load_args should be either {"name": "<SHA1 hash>", ...} JSON object or a raw X-Splash-Saved-Arguments header value (a semicolon-separated list of name=hash pairs).

For each parameter in load_args Splash tries to fetch the value from the internal cache using a provided SHA1 hash as a key. If all values are in cache then Splash uses them as argument values and then handles the request as usual.

If at least one argument can’t be found, Splash returns HTTP 498 status code. In this case the client should repeat the request, but use save_args and send full argument values.

load_args and save_args save network traffic by not sending large arguments with each request (js_source and lua_source are often good candidates).

Splash uses LRU cache to store values; the number of entries is limited, and cache is cleared after each Splash restart. In other words, storage is not persistent; client should be ready to re-send the arguments.
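
The save_args / load_args round trip can be sketched on the client side as follows. Both helper names are hypothetical; the sketch only covers parsing the X-Splash-Saved-Arguments header and choosing which payload to send, and leaves the HTTP call itself to the caller.

```python
def parse_saved_arguments(header_value):
    """Turn 'name1=hash1;name2=hash2' into {'name1': 'hash1', ...}."""
    result = {}
    for item in header_value.split(";"):
        if item.strip():
            name, digest = item.split("=", 1)
            result[name.strip()] = digest.strip()
    return result


def next_payload(base_args, large_args, saved_hashes=None):
    """Send hashes when they are known, full values (to be cached) otherwise."""
    payload = dict(base_args)
    if saved_hashes:
        payload["load_args"] = saved_hashes
    else:
        payload.update(large_args)
        payload["save_args"] = list(large_args)
    return payload


base = {"url": "http://example.com"}
large = {"js_source": "document.title"}
first = next_payload(base, large)  # first request, or retry after HTTP 498
hashes = parse_saved_arguments("js_source=9a6747fc6259aa374ab4e1bb03074b6ec672cf99")
later = next_payload(base, large, hashes)  # later requests send only the hash
```

When a request built with hashes comes back with HTTP 498, the client would discard the stored hashes and call next_payload again without them.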

html5_media : integer : optional

Whether to enable HTML5 media (e.g. <video> tags playback). Possible values are 1 (enable) and 0 (disable). Default is 0.

HTML5 media is currently disabled by default because it may cause instability. Splash may enable it by default in future, so pass html5_media=0 explicitly if you don’t want HTML5 media.

Examples

Curl example:

curl 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'

The result is always encoded to utf-8. Always decode HTML data returned by render.html endpoint from utf-8 even if there are tags like

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

in the result.
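
A short Python sketch of this rule: the bytes returned by render.html are decoded as utf-8 regardless of what the page's own meta tag declares. The sample bytes are hand-made for the example and stand in for a real response body.

```python
def decode_render_html(body: bytes) -> str:
    """Decode a render.html response body; the result is always utf-8."""
    return body.decode("utf-8")


# Hand-made stand-in for a response whose meta tag names another charset.
raw = (
    '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
    "<p>caf\u00e9</p>"
).encode("utf-8")

html = decode_render_html(raw)
# Trusting the meta tag instead would garble the non-ASCII character.
garbled = raw.decode("iso-8859-1")
```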

render.png

Return an image (in PNG format) of the javascript-rendered page.

Arguments:

Same as render.html plus the following ones:

width : integer : optional: Resize the rendered image to the given width (in pixels) keeping the aspect ratio.

height : integer : optional: Crop the rendered image to the given height (in pixels). Often used in conjunction with the width argument to generate fixed-size thumbnails.

render_all : int : optional: Possible values are 1 and 0. When render_all=1, extend the viewport to include the whole webpage (possibly very tall) before rendering. Default is render_all=0.

Note

render_all=1 requires non-zero wait parameter. This is an unfortunate restriction, but it seems that this is the only way to make rendering work reliably with render_all=1.

scale_method : string : optional: Possible values are raster (default) and vector. If scale_method=raster, rescaling operation performed via width parameter is pixel-wise. If scale_method=vector, rescaling is done element-wise during rendering.

Note

Vector-based rescaling is more performant and results in crisper fonts and sharper element boundaries, however there may be rendering issues, so use it with caution.

Examples

Curl examples:

# render with timeout
curl 'http://localhost:8050/render.png?url=http://domain.com/page-with-javascript.html&timeout=10'

# 320x240 thumbnail
curl 'http://localhost:8050/render.png?url=http://domain.com/page-with-javascript.html&width=320&height=240'

render.jpeg

Return an image (in JPEG format) of the javascript-rendered page.

Arguments:

Same as render.png plus the following ones:

quality : integer : optional: JPEG quality parameter in range from 0 to 100. Default is quality=75.

Note

quality values above 95 should be avoided; quality=100 disables portions of the JPEG compression algorithm, and results in large files with hardly any gain in image quality.

Examples

Curl examples:

# render with default quality
curl 'http://localhost:8050/render.jpeg?url=http://domain.com/'

# render with low quality
curl 'http://localhost:8050/render.jpeg?url=http://domain.com/&quality=30'

render.har

Return information about Splash interaction with a website in HAR format. It includes information about requests made, responses received, timings, headers, etc.

You can use online HAR viewer to visualize information returned from this endpoint; it will be very similar to “Network” tabs in Firefox and Chrome developer tools.

Currently this endpoint doesn’t expose raw request contents; only meta-information like headers and timings is available. Response content is included when the ‘response_body’ option is set to 1.

Arguments for this endpoint are the same as for render.html, plus the following:

response_body : int : optional: Possible values are 1 and 0. When response_body=1, response content is included in HAR records. Default is response_body=0.
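
Once the HAR data has been parsed from JSON, it can be summarized with a few lines of Python. The sketch below assumes the standard HAR 1.2 layout (a top-level log object containing entries); the sample dictionary is hand-made for illustration and is not real Splash output.

```python
def summarize_har(har):
    """Return (method, url, status) for every entry in a HAR dictionary."""
    return [
        (entry["request"]["method"], entry["request"]["url"], entry["response"]["status"])
        for entry in har["log"]["entries"]
    ]


# Hand-made sample that follows the HAR 1.2 structure.
sample_har = {
    "log": {
        "entries": [
            {"request": {"method": "GET", "url": "http://example.com/"},
             "response": {"status": 200}},
            {"request": {"method": "GET", "url": "http://example.com/missing.css"},
             "response": {"status": 404}},
        ]
    }
}

summary = summarize_har(sample_har)
failed = [url for _, url, status in summary if status >= 400]
```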

render.json

Return a json-encoded dictionary with information about javascript-rendered webpage. It can include HTML, PNG and other information, based on arguments passed.

Arguments:

Same as render.jpeg plus the following ones:

html : integer : optional: Whether to include HTML in output. Possible values are 1 (include) and 0 (exclude). Default is 0.

png : integer : optional: Whether to include PNG in output. Possible values are 1 (include) and 0 (exclude). Default is 0.

jpeg : integer : optional: Whether to include JPEG in output. Possible values are 1 (include) and 0 (exclude). Default is 0.

iframes : integer : optional: Whether to include information about child frames in output. Possible values are 1 (include) and 0 (exclude). Default is 0.

script : integer : optional: Whether to include the result of the executed javascript final statement in output (see Executing custom Javascript code within page context). Possible values are 1 (include) and 0 (exclude). Default is 0.

console : integer : optional: Whether to include the executed javascript console messages in output. Possible values are 1 (include) and 0 (exclude). Default is 0.

history : integer : optional

Whether to include the history of requests/responses for webpage main frame. Possible values are 1 (include) and 0 (exclude). Default is 0.

Use it to get HTTP status codes and headers. Only information about “main” requests/responses is returned (i.e. information about related resources like images and AJAX queries is not returned). To get information about all requests and responses use ‘har’ argument.

har : integer : optional

Whether to include HAR in output. Possible values are 1 (include) and 0 (exclude). Default is 0. If this option is ON the result will contain the same data as render.har provides under ‘har’ key.

By default, response content is not included. To enable it use ‘response_body’ option.

response_body : int : optional: Possible values are 1 and 0. When response_body=1, response content is included in HAR records. Default is response_body=0. This option has no effect when both ‘har’ and ‘history’ are 0.

Examples

By default, the URL, requested URL, page title and frame geometry are returned:

{
    "url": "http://crawlera.com/",
    "geometry": [0, 0, 640, 480],
    "requestedUrl": "http://crawlera.com/",
    "title": "Crawlera"
}

Add ‘html=1’ to request to add HTML to the result:

{
    "url": "http://crawlera.com/",
    "geometry": [0, 0, 640, 480],
    "requestedUrl": "http://crawlera.com/",
    "html": "<!DOCTYPE html><!--[if IE 8]>....",
    "title": "Crawlera"
}

Add ‘png=1’ to request to add base64-encoded PNG screenshot to the result:

{
    "url": "http://crawlera.com/",
    "geometry": [0, 0, 640, 480],
    "requestedUrl": "http://crawlera.com/",
    "png": "iVBORw0KGgoAAAAN...",
    "title": "Crawlera"
}

Setting both ‘html=1’ and ‘png=1’ returns the HTML and a screenshot at the same time; this guarantees that the screenshot matches the HTML.
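
Because the screenshot arrives base64-encoded inside the JSON result, a client has to decode it before saving or displaying it. A minimal Python sketch, using a hand-made result whose png value contains only the 8-byte PNG signature:

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def extract_png(result):
    """Decode the base64 'png' field of a render.json result into bytes."""
    data = base64.b64decode(result["png"])
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data is not a PNG image")
    return data


# Hand-made sample: the 'png' value encodes just the PNG signature.
sample = {"url": "http://example.com/", "png": "iVBORw0KGgo="}
png_bytes = extract_png(sample)
# With a real result, open("page.png", "wb").write(png_bytes) would save it.
```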

Add “iframes=1” to the request to get information about iframes:

{
    "geometry": [0, 0, 640, 480],
    "frameName": "",
    "title": "Scrapinghub | Autoscraping",
    "url": "http://scrapinghub.com/autoscraping.html",
    "childFrames": [
        {
            "title": "Tutorial: Scrapinghub's autoscraping tool - YouTube",
            "url": "",
            "geometry": [235, 502, 497, 310],
            "frameName": "<!--framePath //<!--frame0-->-->",
            "requestedUrl": "http://www.youtube.com/embed/lSJvVqDLOOs?version=3&rel=1&fs=1&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent",
            "childFrames": []
        }
    ],
    "requestedUrl": "http://scrapinghub.com/autoscraping.html"
}

Note that iframes can be nested.

Pass both ‘html=1’ and ‘iframes=1’ to get HTML for all iframes as well as for the main page:

{
    "geometry": [0, 0, 640, 480],
    "frameName": "",
    "html": "<!DOCTYPE html...",
    "title": "Scrapinghub | Autoscraping",
    "url": "http://scrapinghub.com/autoscraping.html",
    "childFrames": [
        {
            "title": "Tutorial: Scrapinghub's autoscraping tool - YouTube",
            "url": "",
            "html": "<!DOCTYPE html>...",
            "geometry": [235, 502, 497, 310],
            "frameName": "<!--framePath //<!--frame0-->-->",
            "requestedUrl": "http://www.youtube.com/embed/lSJvVqDLOOs?version=3&rel=1&fs=1&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent",
            "childFrames": []
        }
    ],
    "requestedUrl": "http://scrapinghub.com/autoscraping.html"
}

Unlike ‘html=1’, ‘png=1’ does not affect data in childFrames.
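
Since iframes can be nested, the childFrames lists form a tree. One way to visit every frame is a small recursive generator; the iter_frames name is hypothetical and the sample below is an abbreviated, hand-edited version of the result shown above.

```python
def iter_frames(frame, depth=0):
    """Yield (depth, frame) for a frame and all of its nested child frames."""
    yield depth, frame
    for child in frame.get("childFrames", []):
        yield from iter_frames(child, depth + 1)


# Abbreviated sample of a render.json result requested with iframes=1.
page = {
    "requestedUrl": "http://scrapinghub.com/autoscraping.html",
    "childFrames": [
        {
            "requestedUrl": "http://www.youtube.com/embed/lSJvVqDLOOs",
            "childFrames": [],
        }
    ],
}

frames = [(depth, frame["requestedUrl"]) for depth, frame in iter_frames(page)]
```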

When executing JavaScript code (see Executing custom Javascript code within page context) add the parameter ‘script=1’ to the request to include the code output in the result:

{
    "url": "http://crawlera.com/",
    "geometry": [0, 0, 640, 480],
    "requestedUrl": "http://crawlera.com/",
    "title": "Crawlera",
    "script": "result of script..."
}

The JavaScript code supports the console.log() function to log messages. Add ‘console=1’ to the request to include the console output in the result:

{
    "url": "http://crawlera.com/",
    "geometry": [0, 0, 640, 480],
    "requestedUrl": "http://crawlera.com/",
    "title": "Crawlera",
    "script": "result of script...",
    "console": ["first log message", "second log message", ...]
}

Curl examples:

# full information
curl 'http://localhost:8050/render.json?url=http://domain.com/page-with-iframes.html&png=1&html=1&iframes=1'

# HTML and meta information of page itself and all its iframes
curl 'http://localhost:8050/render.json?url=http://domain.com/page-with-iframes.html&html=1&iframes=1'

# only meta information (like page/iframes titles and urls)
curl 'http://localhost:8050/render.json?url=http://domain.com/page-with-iframes.html&iframes=1'

# render html and 320x240 thumbnail at once; do not return info about iframes
curl 'http://localhost:8050/render.json?url=http://domain.com/page-with-iframes.html&html=1&png=1&width=320&height=240'

# Render page and execute simple Javascript function, display the js output
curl -X POST -H 'content-type: application/javascript' \
    -d 'function getAd(x){ return x; } getAd("abc");' \
    'http://localhost:8050/render.json?url=http://domain.com&script=1'

# Render page and execute simple Javascript function, display the js output and the console output
curl -X POST -H 'content-type: application/javascript' \
    -d 'function getAd(x){ return x; }; console.log("some log"); console.log("another log"); getAd("abc");' \
    'http://localhost:8050/render.json?url=http://domain.com&script=1&console=1'

execute

Execute a custom rendering script and return a result.

render.html, render.png, render.jpeg, render.har and render.json endpoints cover many common use cases, but sometimes they are not enough. This endpoint allows writing custom Splash Scripts.

Arguments:

lua_source : string : required: Browser automation script. See Splash Scripts Tutorial for more info.

timeout : float : optional: Same as ‘timeout’ argument for render.html.
allowed_domains : string : optional: Same as ‘allowed_domains’ argument for render.html.
proxy : string : optional: Same as ‘proxy’ argument for render.html.
filters : string : optional: Same as ‘filters’ argument for render.html.
save_args : JSON array or a comma-separated string : optional: Same as ‘save_args’ argument for render.html. Note that you can save not only default Splash arguments, but any other parameters as well.
load_args : JSON object or a string : optional: Same as ‘load_args’ argument for render.html. Note that you can load not only default Splash arguments, but any other parameters as well.

You can pass any other arguments. All arguments passed to the execute endpoint are available to the script in the splash.args table.
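
The Python sketch below builds a JSON body for the execute endpoint that carries a script together with extra arguments. The pause argument is a made-up name used only to show that arbitrary parameters reach the script through splash.args; the payload would be POSTed to /execute with Content-Type: application/json.

```python
import json

# The script reads its inputs from splash.args; 'pause' is a hypothetical
# custom argument, not a built-in Splash parameter.
LUA_SOURCE = """
function main(splash)
    assert(splash:go(splash.args.url))
    assert(splash:wait(splash.args.pause))
    return splash:html()
end
"""


def execute_payload(lua_source, **script_args):
    """Combine the script and any extra arguments into one JSON body."""
    return json.dumps({"lua_source": lua_source, **script_args})


body = execute_payload(LUA_SOURCE, url="http://example.com", pause=0.5, timeout=10)
```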

run

This endpoint is the same as execute, but it wraps lua_source in function main(splash, args) ... end automatically. For example, if you’re sending this script to execute:

function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(1.0))
    return splash:html()
end

the equivalent script for the run endpoint would be:

assert(splash:go(args.url))
assert(splash:wait(1.0))
return splash:html()
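
The relationship between the two endpoints can be mimicked in a few lines of Python. This wrap_for_execute helper is purely illustrative and is not how Splash implements run; it just produces an execute-style script from a run-style body.

```python
def wrap_for_execute(run_script):
    """Wrap a run-style script body in 'function main(splash, args) ... end'."""
    body = "\n".join("    " + line for line in run_script.strip().splitlines())
    return f"function main(splash, args)\n{body}\nend"


run_script = (
    "assert(splash:go(args.url))\n"
    "assert(splash:wait(1.0))\n"
    "return splash:html()"
)
execute_script = wrap_for_execute(run_script)
```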

Executing custom Javascript code within page context

Splash supports executing JavaScript code within the context of the page. The JavaScript code is executed after the page has finished loading (including any delay defined by ‘wait’) but before the page is rendered. This allows the JavaScript code to modify the page being rendered.

To execute JavaScript code use js_source parameter. It should contain JavaScript code to be executed.

Note that browsers and proxies limit the amount of data that can be sent using GET, so it is a good idea to use content-type: application/json POST request.

Curl example:

# Render page and modify its title dynamically
curl -X POST -H 'content-type: application/json' \
    -d '{"js_source": "document.title=\"My Title\";", "url": "http://example.com"}' \
    'http://localhost:8050/render.html'

Another way to do it is to use a POST request with the content-type set to ‘application/javascript’. The body of the request should contain the code to be executed.

Curl example:

# Render page and modify its title dynamically
curl -X POST -H 'content-type: application/javascript' \
    -d 'document.title="My Title";' \
    'http://localhost:8050/render.html?url=http://domain.com'

To get the result of a javascript function executed within page context, use the render.json endpoint with the script=1 parameter.
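
The same kind of request can be prepared from Python. The sketch below builds an application/javascript POST for render.json with script=1; the js_request helper and the localhost:8050 address are assumptions for this example, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def js_request(page_url, js_code, splash="http://localhost:8050"):
    """Build a POST whose body is the JavaScript to run in the page."""
    query = urlencode({"url": page_url, "script": 1})
    return Request(
        f"{splash}/render.json?{query}",
        data=js_code.encode("utf-8"),
        headers={"Content-Type": "application/javascript"},
        method="POST",
    )


req = js_request("http://example.com", 'document.title="My Title"; document.title;')
# urllib.request.urlopen(req) would send it; per the render.json description,
# the value of the final statement is returned under the "script" key.
```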

Javascript Profiles

Splash supports “javascript profiles” that allow preloading javascript files. Javascript files defined in a profile are executed after the page is loaded and before any javascript code defined in the request.

The preloaded files can be used in the user’s POST’ed code.

To enable javascript profiles support, run splash server with the --js-profiles-path=<path to a folder with js profiles> option:

python3 -m splash.server --js-profiles-path=/etc/splash/js-profiles

Javascript Security

If Splash is started with --js-cross-domain-access option

$ docker run -it -p 8050:8050 scrapinghub/splash --js-cross-domain-access

then javascript code is allowed to access the content of iframes loaded from a security origin different from that of the original page (browsers usually disallow that). This feature is useful for scraping, e.g. to extract the html of an iframe page. An example of its usage:

curl -X POST -H 'content-type: application/javascript' \
    -d 'function getContents(){ var f = document.getElementById("external"); return f.contentDocument.getElementsByTagName("body")[0].innerHTML; }; getContents();' \
    'http://localhost:8050/render.html?url=http://domain.com'

The javascript function ‘getContents’ will look for an iframe with the id ‘external’ and extract its html contents.

Note that allowing cross-origin javascript calls is a potential security issue, since secret information (e.g. cookies) may be exposed when this support is enabled; also, some websites don’t load when cross-domain security is disabled, so this feature is OFF by default.

Request Filters

Splash supports filtering requests based on Adblock Plus rules. You can use filters from EasyList to remove ads and tracking codes (and thus speed up page loading), and/or write filters manually to block some of the requests (e.g. to prevent rendering of images, mp3 files, custom fonts, etc.).

To activate request filtering support start splash with --filters-path option:

python3 -m splash.server --filters-path=/etc/splash/filters

Proxy Profiles

Splash supports “proxy profiles” that allow setting proxy handling rules per request using the proxy parameter.

To enable proxy profiles support, run splash server with --proxy-profiles-path=<path to a folder with proxy profiles> option:

python3 -m splash.server --proxy-profiles-path=/etc/splash/proxy-profiles

Note

If you run Splash using Docker, check Folders Sharing.

Then create an INI file with “proxy profile” config inside the specified folder, e.g. /etc/splash/proxy-profiles/mywebsite.ini. Example contents of this file:

[proxy]

; required
host=proxy.crawlera.com
port=8010

; optional, default is no auth
username=username
password=password

; optional, default is HTTP. Allowed values are HTTP and SOCKS5
type=HTTP

[rules]
; optional, default ".*"
whitelist=
    .*mywebsite\.com.*

; optional, default is no blacklist
blacklist=
    .*\.js.*
    .*\.css.*
    .*\.png

whitelist and blacklist are newline-separated lists of regexes. If URL matches one of whitelist patterns and matches none of blacklist patterns, proxy specified in [proxy] section is used; no proxy is used otherwise.
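
The whitelist/blacklist rule can be expressed in a few lines of Python. This is a sketch of the logic as described above, not Splash's actual code; in particular, matching with re.match (anchored at the start of the URL) is an assumption, chosen because the example patterns all begin with ".*".

```python
import re


def use_proxy(url, whitelist=(".*",), blacklist=()):
    """Return True if the proxy from the [proxy] section applies to this URL."""
    in_whitelist = any(re.match(pattern, url) for pattern in whitelist)
    in_blacklist = any(re.match(pattern, url) for pattern in blacklist)
    return in_whitelist and not in_blacklist


# Patterns copied from the example profile above.
WHITELIST = [r".*mywebsite\.com.*"]
BLACKLIST = [r".*\.js.*", r".*\.css.*", r".*\.png"]

page = use_proxy("http://mywebsite.com/page.html", WHITELIST, BLACKLIST)
script = use_proxy("http://mywebsite.com/static/app.js", WHITELIST, BLACKLIST)
other = use_proxy("http://other.com/page.html", WHITELIST, BLACKLIST)
```

With these patterns the page itself goes through the proxy, while its scripts, stylesheets and PNG images, as well as requests to other sites, are fetched directly.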

Then, to apply proxy rules according to this profile, add proxy=mywebsite parameter to request:

curl 'http://localhost:8050/render.html?url=http://mywebsite.com/page-with-javascript.html&proxy=mywebsite'

If a default.ini profile is present, it will be used when the proxy argument is not specified. If you have a default.ini profile but don’t want to apply it, pass none as the proxy value.

Other Endpoints

_gc

To reclaim some RAM, send a POST request to the /_gc endpoint:

curl -X POST http://localhost:8050/_gc

It runs the Python garbage collector and clears internal WebKit caches.

_debug

To get debug information about a Splash instance (max RSS used, number of used file descriptors, active requests, request queue length, counts of alive objects), send a GET request to the /_debug endpoint:

curl http://localhost:8050/_debug

_ping

To ping a Splash instance, send a GET request to the /_ping endpoint:

curl http://localhost:8050/_ping

It returns an “ok” status and the max RSS used if the instance is alive.