`` elements. First, you would get all ``<div>`` elements::

    >>> divs = response.xpath('//div')

At first, you may be tempted to use the following approach, which is wrong, as
it actually extracts all ``<p>`` elements from the document, not only those
inside ``<div>`` elements::

    >>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
    ...     print p.extract()

This is the proper way to do it (note the dot prefixing the ``.//p`` XPath)::

    >>> for p in divs.xpath('.//p'):  # extracts all <p> inside
    ...     print p.extract()

Another common case would be to extract all direct ``<p>`` children::

    >>> for p in divs.xpath('p'):
    ...     print p.extract()

For more details about relative XPaths see the `Location Paths`_ section in the
XPath specification.

.. _Location Paths: https://www.w3.org/TR/xpath#location-paths

.. _topics-selectors-xpath-variables:

Variables in XPath expressions
------------------------------

XPath allows you to reference variables in your XPath expressions, using
the ``$somevariable`` syntax. This is somewhat similar to parameterized
queries or prepared statements in the SQL world, where you replace
some arguments in your queries with placeholders like ``?``,
which are then substituted with values passed with the query.

Here's an example to match an element based on its "id" attribute value,
without hard-coding it (as was done previously)::

    >>> # `$val` used in the expression, a `val` argument needs to be passed
    >>> response.xpath('//div[@id=$val]/a/text()', val='images').extract_first()
    u'Name: My image 1 '

Here's another example, to find the "id" attribute of a ``<div>`` tag containing
five ``<a>`` children (here we pass the value ``5`` as an integer)::

    >>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).extract_first()
    u'images'

All variable references must have a binding value when calling ``.xpath()``
(otherwise you'll get a ``ValueError: XPath error:`` exception).
This is done by passing as many named arguments as necessary.

`parsel`_, the library powering Scrapy selectors, has more details and examples
on `XPath variables`_.

.. _parsel: https://parsel.readthedocs.io/
.. _XPath variables: https://parsel.readthedocs.io/en/latest/usage.html#variables-in-xpath-expressions

Using EXSLT extensions
----------------------

Being built atop `lxml`_, Scrapy selectors also support some `EXSLT`_ extensions
and come with these pre-registered namespaces to use in XPath expressions:

======  =====================================  =======================
prefix  namespace                              usage
======  =====================================  =======================
re      \http://exslt.org/regular-expressions  `regular expressions`_
set     \http://exslt.org/sets                 `set manipulation`_
======  =====================================  =======================

Regular expressions
~~~~~~~~~~~~~~~~~~~

The ``test()`` function, for example, can prove quite useful when XPath's
``starts-with()`` or ``contains()`` are not sufficient.

Example selecting links in list items with a "class" attribute ending with a digit::

    >>> from scrapy import Selector
    >>> doc = """
    ... <div>
    ...     <ul>
    ...         <li class="item-0"><a href="link1.html">first item</a></li>
    ...         <li class="item-1"><a href="link2.html">second item</a></li>
    ...         <li class="item-inactive"><a href="link3.html">third item</a></li>
    ...         <li class="item-1"><a href="link4.html">fourth item</a></li>
    ...         <li class="item-0"><a href="link5.html">fifth item</a></li>
    ...     </ul>
    ... </div>
    ... """
    >>> sel = Selector(text=doc, type="html")
    >>> sel.xpath('//li//@href').extract()
    [u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']
    >>> sel.xpath('//li[re:test(@class, "item-\d$")]//@href').extract()
    [u'link1.html', u'link2.html', u'link4.html', u'link5.html']
    >>>

.. warning:: The C library ``libxslt`` doesn't natively support EXSLT regular
    expressions, so `lxml`_'s implementation uses hooks to Python's ``re`` module.
    Thus, using regexp functions in your XPath expressions may add a small
    performance penalty.

Set operations
~~~~~~~~~~~~~~

These can be handy for excluding parts of a document tree before
extracting text elements, for example.

Example extracting microdata (sample content taken from http://schema.org/Product)
with groups of itemscopes and corresponding itemprops::

    >>> doc = """
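    ... <div itemscope itemtype="http://schema.org/Product">
    ...   <span itemprop="name">Kenmore White 17" Microwave</span>
    ...   <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
    ...   <div itemprop="aggregateRating"
    ...     itemscope itemtype="http://schema.org/AggregateRating">
    ...    Rated <span itemprop="ratingValue">3.5</span>/5
    ...    based on <span itemprop="reviewCount">11</span> customer reviews
    ...   </div>
    ...
    ...   <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    ...     <span itemprop="price">$55.00</span>
    ...     <link itemprop="availability" href="http://schema.org/InStock" />In stock
    ...   </div>
    ...
    ...   Product description:
    ...   <span itemprop="description">0.7 cubic feet countertop microwave.
    ...   Has six preset cooking categories and convenience features like
    ...   Add-A-Minute and Child Lock.</span>
    ...
    ...   Customer reviews:
    ...
    ...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
    ...     <span itemprop="name">Not a happy camper</span> -
    ...     by <span itemprop="author">Ellie</span>,
    ...     <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
    ...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
    ...       <meta itemprop="worstRating" content = "1">
    ...       <span itemprop="ratingValue">1</span>/
    ...       <span itemprop="bestRating">5</span>stars
    ...     </div>
    ...     <span itemprop="description">The lamp burned out and now I have to replace
    ...     it. </span>
    ...   </div>
    ...
    ...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
    ...     <span itemprop="name">Value purchase</span> -
    ...     by <span itemprop="author">Lucas</span>,
    ...     <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
    ...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
    ...       <meta itemprop="worstRating" content = "1"/>
    ...       <span itemprop="ratingValue">4</span>/
    ...       <span itemprop="bestRating">5</span>stars
    ...     </div>
    ...     <span itemprop="description">Great microwave for the price. It is small and
    ...     fits in my apartment.</span>
    ...   </div>
    ...   ...
    ... </div>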
    ... """
    >>> sel = Selector(text=doc, type="html")
    >>> for scope in sel.xpath('//div[@itemscope]'):
    ...     print "current scope:", scope.xpath('@itemtype').extract()
    ...     props = scope.xpath('''
    ...                 set:difference(./descendant::*/@itemprop,
    ...                                .//*[@itemscope]/*/@itemprop)''')
    ...     print "    properties:", props.extract()
    ...     print

    current scope: [u'http://schema.org/Product']
        properties: [u'name', u'aggregateRating', u'offers', u'description', u'review', u'review']

    current scope: [u'http://schema.org/AggregateRating']
        properties: [u'ratingValue', u'reviewCount']

    current scope: [u'http://schema.org/Offer']
        properties: [u'price', u'availability']

    current scope: [u'http://schema.org/Review']
        properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']

    current scope: [u'http://schema.org/Rating']
        properties: [u'worstRating', u'ratingValue', u'bestRating']

    current scope: [u'http://schema.org/Review']
        properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']

    current scope: [u'http://schema.org/Rating']
        properties: [u'worstRating', u'ratingValue', u'bestRating']

    >>>

Here we first iterate over ``itemscope`` elements, and for each one,
we look for all ``itemprops`` elements and exclude those that are themselves
inside another ``itemscope``.

.. _EXSLT: http://exslt.org/
.. _regular expressions: http://exslt.org/regexp/index.html
.. _set manipulation: http://exslt.org/set/index.html

Some XPath tips
---------------

Here are some tips that you may find useful when using XPath
with Scrapy selectors, based on `this post from ScrapingHub's blog`_.
If you are not yet very familiar with XPath,
you may want to take a look first at this `XPath tutorial`_.

.. _`XPath tutorial`: http://www.zvon.org/comp/r/tut-XPath_1.html
.. _`this post from ScrapingHub's blog`: https://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/

Using text nodes in a condition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When you need to use the text content as an argument to an `XPath string function`_,
avoid using ``.//text()`` and use just ``.`` instead.

This is because the expression ``.//text()`` yields a collection of text elements,
that is, a *node-set*. When a node-set is converted to a string, which happens when
it is passed as an argument to a string function like ``contains()`` or
``starts-with()``, the result is the text of the first element only.

Example::

    >>> from scrapy import Selector
    >>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')

Converting a *node-set* to string::

    >>> sel.xpath('//a//text()').extract()  # take a peek at the node-set
    [u'Click here to go to the ', u'Next Page']
    >>> sel.xpath("string(//a[1]//text())").extract()  # convert it to string
    [u'Click here to go to the ']

A *node* converted to a string, however, puts together its own text plus that of
all its descendants::

    >>> sel.xpath("//a[1]").extract()  # select the first node
    [u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
    >>> sel.xpath("string(//a[1])").extract()  # convert it to string
    [u'Click here to go to the Next Page']

So, using the ``.//text()`` node-set won't select anything in this case::

    >>> sel.xpath("//a[contains(.//text(), 'Next Page')]").extract()
    []

But using ``.`` to mean the node works::

    >>> sel.xpath("//a[contains(., 'Next Page')]").extract()
    [u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

.. _`XPath string function`: https://www.w3.org/TR/xpath/#section-String-Functions

Beware of the difference between //node[1] and (//node)[1]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``//node[1]`` selects all the nodes occurring first under their respective parents.

``(//node)[1]`` selects all the nodes in the document, and then gets only the first of them.

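The per-parent behaviour of a positional predicate can also be checked without
Scrapy. What follows is only a minimal sketch using the standard library's
``xml.etree.ElementTree`` in place of Scrapy selectors. Its XPath subset has no
parenthesised expressions, so plain Python indexing stands in for
``(//node)[1]``. The sample markup and the variable names are assumptions made
for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup: two lists wrapped in a single root element,
# because ElementTree needs well-formed XML.
DOC = """
<root>
  <ul><li>1</li><li>2</li><li>3</li></ul>
  <ul><li>4</li><li>5</li><li>6</li></ul>
</root>
"""

root = ET.fromstring(DOC)

# Counterpart of //li[1]: the predicate is evaluated per parent, so every
# <li> that is the first <li> child of its own parent matches.
first_under_each_parent = [li.text for li in root.findall(".//li[1]")]

# ElementTree cannot express (//li)[1]; indexing the full result list in
# Python plays the same role: collect every <li>, then keep the first one.
first_in_document = root.findall(".//li")[0].text

print(first_under_each_parent)  # ['1', '4']
print(first_in_document)        # 1
```

The first query returns one element per list, while the second returns a single
element for the whole document.
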
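
Example::

    >>> from scrapy import Selector
    >>> sel = Selector(text="""
    ...     <ul class="list">
    ...         <li>1</li>
    ...         <li>2</li>
    ...         <li>3</li>
    ...     </ul>
    ...     <ul class="list">
    ...         <li>4</li>
    ...         <li>5</li>
    ...         <li>6</li>
    ...     </ul>""")
    >>> xp = lambda x: sel.xpath(x).extract()

This gets all first ``<li>`` elements under their respective parents::

    >>> xp("//li[1]")
    [u'<li>1</li>', u'<li>4</li>']

And this gets the first ``<li>`` element in the whole document::

    >>> xp("(//li)[1]")
    [u'<li>1</li>']

This gets all first ``<li>`` elements under a ``<ul>`` parent::

    >>> xp("//ul/li[1]")
    [u'<li>1</li>', u'<li>4</li>']

And this gets the first ``<li>`` element under a ``<ul>`` parent in the whole document::

    >>> xp("(//ul/li)[1]")
    [u'<li>1</li>']

1. Select all ``<h1>`` elements from an HTML response body, returning a list of
   :class:`Selector` objects (i.e. a :class:`SelectorList` object)::

      sel.xpath("//h1")

2. Extract the text of all ``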
`` elements from an HTML response body, returning a list of :class:`Selector` objects (ie. a :class:`SelectorList` object):: sel.xpath("//h1") 2. Extract the text of all ``