How to Use XPath in Python

Can we use XPath with BeautifulSoup?

Nope, BeautifulSoup, by itself, does not support XPath expressions.

An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it'll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.

Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.

try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    # Python 3
    from urllib.request import urlopen
from lxml import etree

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)

There is also a dedicated lxml.html module with additional functionality.
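As a minimal sketch of the .xpath() call on an already-parsed tree, the snippet below parses a small HTML string instead of a live URL. The markup, the table id and the cell contents are made up for illustration; only lxml.html.fromstring() and .xpath() come from lxml itself.

```python
import lxml.html

# Hypothetical sample markup; in practice this would come from the response.
html_text = """
<html><body>
  <table id="foobar">
    <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
    <tr><td class="empformbody">Bob</td><td class="other">y</td></tr>
  </table>
</body></html>
"""

doc = lxml.html.fromstring(html_text)

# .xpath() returns a list; selecting text() yields plain strings.
names = doc.xpath('//table[@id="foobar"]//td[@class="empformbody"]/text()')
print(names)
```

Because .xpath() always returns a list, an expression that matches nothing gives an empty list rather than None.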

Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:

import lxml.html
import requests

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)

Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:

from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    pass

Coming full circle: BeautifulSoup itself does have very complete CSS selector support:

for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
    pass

Using XPath in ElementTree

Your code has two problems.

1) element contains only the root element, not recursively the whole document. It is of type Element not ElementTree.

2) Your search string needs to use namespaces if you keep the namespace in the XML.

To fix problem #1:

You need to change:

element = ET.parse(fp).getroot()

to:

element = ET.parse(fp)

To fix problem #2:

You can take off the xmlns from the XML document so it looks like this:

<?xml version="1.0"?>
<ItemSearchResponse>
  <Items>
    <Item>
      <ItemAttributes>
        <ListPrice>
          <Amount>2260</Amount>
        </ListPrice>
      </ItemAttributes>
      <Offers>
        <Offer>
          <OfferListing>
            <Price>
              <Amount>1853</Amount>
            </Price>
          </OfferListing>
        </Offer>
      </Offers>
    </Item>
  </Items>
</ItemSearchResponse>

With this document you can use the following search string:

e = element.findall('Items/Item/ItemAttributes/ListPrice/Amount')

The full code:

from xml.etree import ElementTree as ET

with open("output.xml", "r") as fp:
    element = ET.parse(fp)
e = element.findall('Items/Item/ItemAttributes/ListPrice/Amount')
for i in e:
    print(i.text)

Alternate fix to problem #2:

Otherwise you need to specify the xmlns inside the search string for each element.

The full code:

from xml.etree import ElementTree as ET

with open("output.xml", "r") as fp:
    element = ET.parse(fp)

namespace = "{http://webservices.amazon.com/AWSECommerceService/2008-08-19}"
e = element.findall('{0}Items/{0}Item/{0}ItemAttributes/{0}ListPrice/{0}Amount'.format(namespace))
for i in e:
    print(i.text)

Both print:

2260
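A third option, sketched here under the assumption that the XML keeps its xmlns declaration, is to pass a prefix-to-URI mapping as the second argument of findall(). The prefix name aws is arbitrary and chosen only for this example, and the document is held in a string rather than read from output.xml.

```python
from xml.etree import ElementTree as ET

# Trimmed copy of the sample document, with the default namespace kept.
xml_text = """<ItemSearchResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2008-08-19">
  <Items>
    <Item>
      <ItemAttributes>
        <ListPrice><Amount>2260</Amount></ListPrice>
      </ItemAttributes>
    </Item>
  </Items>
</ItemSearchResponse>"""

root = ET.fromstring(xml_text)

# Map a prefix of our choosing to the namespace URI used in the document.
ns = {'aws': 'http://webservices.amazon.com/AWSECommerceService/2008-08-19'}

found = root.findall(
    'aws:Items/aws:Item/aws:ItemAttributes/aws:ListPrice/aws:Amount', ns)
amounts = [e.text for e in found]
print(amounts)
```

This keeps the search string readable compared with repeating the full {uri} in front of every step.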

Iterate on XML tags and get elements' xpath in Python

Here's how to do it with Python's ElementTree class.

It uses a simple list to track an iterator's current path through the XML. Whenever you want the XPath for an element, call gen_xpath() to turn that list into the XPath for that element, with logic for dealing with same-named siblings (absolute position).

from xml.etree import ElementTree as ET

# A list of elements pushed and popped by the iterator's start and end events
path = []

def gen_xpath():
    '''Start at the root of `path` and figure out if the next child is alone,
    or is one of many siblings named the same. If the next child is one of
    many same-named siblings, determine its position (1-based, as in XPath).

    Returns the full XPath up to the element the iterator is currently on.
    '''
    full_path = '/' + path[0].tag

    for i, parent_elem in enumerate(path[:-1]):
        next_elem = path[i+1]

        pos = 0  # acts as counter for all children named the same as next_elem
        next_pos = None  # the position we care about

        for child_elem in parent_elem:
            if child_elem.tag == next_elem.tag:
                pos += 1

            # Compare etree.Element identity
            if child_elem is next_elem:
                next_pos = pos

            if next_pos is not None and pos > 1:
                # We know where next_elem is, and that there are many same-named siblings, no need to count others
                break

        # Use next_elem's pos only if there are other same-named siblings
        if pos > 1:
            full_path += f'/{next_elem.tag}[{next_pos}]'
        else:
            full_path += f'/{next_elem.tag}'

    return full_path

# Iterate the XML
for event, elem in ET.iterparse('input.xml', ['start', 'end']):
    if event == 'start':
        path.append(elem)
        if elem.tag == 'p':
            print(gen_xpath())

    if event == 'end':
        path.pop()

When I run that on this modified sample XML, input.xml:

<?xml version="1.0" encoding="UTF-8"?>
<article>
  <title>Sample document</title>
  <body>
    <p>This is a <b>sample document.</b></p>
    <p>And there is another paragraph.</p>
    <section>
      <p>Parafoo</p>
    </section>
  </body>
</article>

I get:

/article/body/p[1]
/article/body/p[2]
/article/body/section/p
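XPath positional predicates count from 1, so a generated path is only useful if its indices follow that convention. As a quick sanity check, here is a sketch that holds the same sample in a string and looks up paragraphs by position with ElementTree's limited XPath support. ElementTree rejects absolute paths on an element, so the paths are written relative to the root; the variable names are just for this example.

```python
from xml.etree import ElementTree as ET

xml_text = """<article>
  <title>Sample document</title>
  <body>
    <p>This is a <b>sample document.</b></p>
    <p>And there is another paragraph.</p>
    <section>
      <p>Parafoo</p>
    </section>
  </body>
</article>"""

root = ET.fromstring(xml_text)

# Positions are 1-based: p[1] is the first <p> child of <body>.
first = root.findall('./body/p[1]')
second = root.findall('./body/p[2]')
nested = root.findall('./body/section/p')

for match in first + second + nested:
    print(repr(match.text))
```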

How to select element using XPATH syntax on Selenium for Python?

HTML

<div id='a'>
  <div>
    <a class='click'>abc</a>
  </div>
</div>

You could use this XPath:

//div[@id='a']//a[@class='click']

It matches:

<a class="click">abc</a>

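In Python, the call looks like this. Selenium 4.3 and later removed the find_element_by_xpath helper, so use find_element with By.XPATH:

from selenium.webdriver.common.by import By

driver.find_element(By.XPATH, "//div[@id='a']//a[@class='click']")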

How to use preceding sibling for XML with xPath in Python?

The reason your code failed is that the axis name concerning preceding siblings
is preceding-sibling (not preceding).

But here you don't need an XPath expression at all, as lxml has a native
method for getting the (first) preceding sibling: getprevious().

To access the previous text element, try the following loop:

for x in tree.xpath('//text'):
    bb = x.attrib.get('bbox')
    if bb is not None:
        bb = bb.split(',')
    print('This: ', bb)
    xPrev = x.getprevious()
    bb = None
    if xPrev is not None:
        bb = xPrev.attrib.get('bbox')
        if bb is not None:
            bb = bb.split(',')
    if bb is not None:
        print(' Previous: ', bb)
    else:
        print(' No previous bbox')

It prints bbox for the current text element and for the
immediately preceding sibling if any.

Edit

If you want, you can also directly access bbox attribute in the preceding
text element, calling x.xpath('preceding-sibling::text[1]/@bbox').

But remember that this function returns a list of found nodes and if nothing
has been found, this list is empty (not None).

So before you make any use of this result, you must:

  • check the length of the returned list (should be > 0),
  • retrieve the first element from this list (the text content of bbox attribute,
    in this case this list should contain only 1 element),
  • split it by , (getting a list of fragments),
  • check whether the first element of this result is not empty,
  • convert it to float.

After that you can use it, e.g. compare with the corresponding value from the current bbox.
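The checks listed above can be sketched as a small helper. The sample XML, the bbox values and the function name previous_bbox_x0 are all made up for illustration; only the preceding-sibling expression comes from the answer.

```python
from lxml import etree

# Hypothetical sample; the real document layout is an assumption.
xml_text = """<page>
  <text bbox="10.0,20.0,30.0,40.0">first</text>
  <text bbox="15.5,60.0,35.0,80.0">second</text>
</page>"""
tree = etree.fromstring(xml_text)

def previous_bbox_x0(elem):
    """Return the first bbox coordinate of the preceding <text> sibling, or None."""
    found = elem.xpath('preceding-sibling::text[1]/@bbox')
    if len(found) == 0:
        # No preceding <text> sibling, or it has no bbox attribute.
        return None
    parts = found[0].split(',')
    if parts[0] == '':
        # The attribute exists but its first fragment is empty.
        return None
    return float(parts[0])

results = [previous_bbox_x0(x) for x in tree.xpath('//text')]
print(results)
```

Returning None for every "nothing usable" case keeps the caller's comparison against the current bbox down to a single check.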


