How to output CDATA using ElementTree

After a bit of work, I found the answer myself. Looking at the ElementTree.py source code, I found special handling for XML comments and processing instructions. The library creates a factory function for each special element type and uses a special (non-string) tag value, the factory function itself, to differentiate it from regular elements.

def Comment(text=None):
    element = Element(Comment)
    element.text = text
    return element

Then in the _write function of ElementTree that actually outputs the XML, there's a special case handling for comments:

if tag is Comment:
    file.write("<!-- %s -->" % _escape_cdata(node.text, encoding))
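
The same mechanism can be seen in Python 3's standard library xml.etree.ElementTree. This is a minimal sketch; the element name and comment text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ET.Comment is a factory function; the element it returns uses the
# function object itself as its tag instead of a string.
root = ET.Element("data")
comment = ET.Comment(" generated file ")
root.append(comment)

is_special_tag = comment.tag is ET.Comment
serialized = ET.tostring(root, encoding="unicode")

print(is_special_tag)
print(serialized)
```

The serializer recognizes the special tag and writes comment syntax instead of an ordinary element.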

To support CDATA sections, I created a factory function called CDATA, extended the ElementTree class, and overrode the _write function to handle CDATA elements.

This still doesn't help if you want to parse XML containing CDATA sections and then output it again with the CDATA sections intact, but it at least lets you create XML with CDATA sections programmatically, which is what I needed to do.

The implementation seems to work with both ElementTree and cElementTree.

# Written for the older standalone elementtree package. It relies on the
# private _write method, which Python 3's xml.etree.ElementTree no longer has.
import elementtree.ElementTree as etree
#~ import cElementTree as etree

def CDATA(text=None):
    # Use the factory function itself as the tag, as Comment does.
    element = etree.Element(CDATA)
    element.text = text
    return element

class ElementTreeCDATA(etree.ElementTree):
    def _write(self, file, node, encoding, namespaces):
        if node.tag is CDATA:
            # Write the text unescaped inside CDATA delimiters.
            text = node.text.encode(encoding)
            file.write("\n<![CDATA[%s]]>\n" % text)
        else:
            etree.ElementTree._write(self, file, node, encoding, namespaces)

if __name__ == "__main__":
    import sys

    text = """
<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>
"""

    e = etree.Element("data")
    cdata = CDATA(text)
    e.append(cdata)
    et = ElementTreeCDATA(e)
    et.write(sys.stdout, "utf-8")
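
For Python 3, where the _write hook is not available, one possible workaround uses only the public API: serialize with a placeholder and swap in the CDATA section afterwards. This is a rough sketch, not the original answer's method, and the helper names cdata_section and tostring_with_cdata are hypothetical.

```python
import uuid
import xml.etree.ElementTree as ET


def cdata_section(text):
    """Return text wrapped in a CDATA section.

    A CDATA section cannot contain ']]>', so any occurrence is split
    across two adjacent sections.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def tostring_with_cdata(root, cdata_texts):
    """Serialize root, writing the mapped elements' text as CDATA.

    cdata_texts maps an Element to the raw text it should contain.
    """
    replacements = {}
    for element, text in cdata_texts.items():
        # Random placeholder that is unlikely to clash with real content.
        token = "cdata-" + uuid.uuid4().hex
        element.text = token
        replacements[token] = cdata_section(text)

    xml = ET.tostring(root, encoding="unicode")
    for token, section in replacements.items():
        xml = xml.replace(token, section)

    # Leave the tree holding the real text rather than the placeholders.
    for element, text in cdata_texts.items():
        element.text = text
    return xml


data = ET.Element("data")
result = tostring_with_cdata(data, {data: "<text>sample & more</text>"})
print(result)
```

The trade-off is that the CDATA marker lives only in the serialized string; the tree itself still holds plain text.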

Keeping CDATA sections while parsing through XML

If you use lxml, you can specify a parser that keeps CDATA:

import lxml.etree

file_name = r'inputData.xml'
parser = lxml.etree.XMLParser(strip_cdata=False)
tree = lxml.etree.parse(file_name, parser)
root = tree.getroot()
c = lxml.etree.Element("c")
c.text = "3"
root.insert(1, c)
tree.write("outputData.xml")
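
For comparison, here is a small sketch of what happens with the standard library's xml.etree.ElementTree: the parser folds CDATA into ordinary text, so the section markers are gone when the tree is written back. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

SOURCE = "<root><a><![CDATA[1 < 2 & 3]]></a></root>"

root = ET.fromstring(SOURCE)

# The content survives, but only as plain text on the element.
parsed_text = root.find("a").text

# On output the text is escaped instead of being wrapped in CDATA again.
round_tripped = ET.tostring(root, encoding="unicode")

print(parsed_text)
print(round_tripped)
```

The data is equivalent either way; only the CDATA notation is lost.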

Get CDATA using xml.etree.ElementTree

The parser exposes CDATA content as the element's text, so you can do:

ertag = child.find("error")
cdatatext = ertag.text
print(cdatatext)

This would print:

"HERES THE DETAILED ERROR"

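A self-contained version of the snippet above might look like this. The surrounding XML is a hypothetical input; only the error element name comes from the original snippet.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; only the <error> element name comes from the snippet.
SAMPLE = """<testsuite>
  <testcase name="example">
    <error><![CDATA["HERES THE DETAILED ERROR"]]></error>
  </testcase>
</testsuite>"""

root = ET.fromstring(SAMPLE)
for child in root:
    ertag = child.find("error")
    if ertag is not None:  # find() returns None when there is no <error>
        cdatatext = ertag.text
        print(cdatatext)
```

The None check matters in practice, since calling .text on a missing element raises AttributeError.
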
Adding CDATA to XML fields

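The answer is to wrap the value in CDATA. This assumes et is lxml.etree, which provides CDATA; the standard library's xml.etree.ElementTree does not. For example: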

Column_heading_1.text = et.CDATA(str(row[1]['sku']))

Parsing an XML CDATA section and converting it to CSV using ElementTree in Python

My problem is that I don't know how to access the CDATA content, because TEXT in some DOCs has an IMAGE child.

The code below seems to work. It handles both cases: TEXT with an IMAGE child, and TEXT without one.

import xml.etree.ElementTree as ET

xml = '''<?xml version="1.0" encoding="UTF-8"?>
<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
<TEXT>
<![CDATA[more text]]>
</TEXT>
</DOC></root>'''

root = ET.fromstring(xml)
texts = root.findall('.//TEXT')
for idx, text in enumerate(texts, start=1):
    # With an IMAGE child, the CDATA is that child's tail; otherwise it is TEXT's own text.
    data = list(text)[0].tail.strip() if list(text) else text.text.strip()
    print(f'{idx}) {data}')

output

1) The section I want to access to
2) more text
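
The question also asks about CSV, which the code above does not cover. A minimal sketch of that last step, assuming one row per TEXT element and hypothetical column names image and text, could look like this:

```python
import csv
import io
import xml.etree.ElementTree as ET

XML = '''<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
<TEXT>
<![CDATA[more text]]>
</TEXT>
</DOC></root>'''


def text_rows(root):
    """Yield (image, text) pairs, one per TEXT element."""
    for text in root.iter("TEXT"):
        image = text.find("IMAGE")
        if image is not None:
            # CDATA after the IMAGE child is stored in that child's tail.
            yield image.text, (image.tail or "").strip()
        else:
            yield "", (text.text or "").strip()


def to_csv(root):
    """Return the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["image", "text"])  # hypothetical column names
    writer.writerows(text_rows(root))
    return buffer.getvalue()


print(to_csv(ET.fromstring(XML)))
```

Writing to a file instead of a string only requires passing an open file object (opened with newline='') to csv.writer.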

Python XML parsing with CDATA

The code below parses the XML held in the CDATA section, changes a value, and writes it back:

import xml.etree.ElementTree as ET

xml = '''<config>
<subconfig>
<a>First Cell</a>
<b>Second Cell</b>
<vsDataContainer>
<id>0</id>
<vsData><![CDATA[
<g>
<f>
<f1>10</f1>
<f2>20</f2>
<f3>30</f3>
</f>
</g>
]]></vsData>
</vsDataContainer>
</subconfig>
</config>'''

f1_new_value = '999'
root = ET.fromstring(xml)
vs_data = root.find('.//vsData')
inner_xml = vs_data.text.strip()
inner_root = ET.fromstring(inner_xml)
inner_root.find('.//f1').text = f1_new_value
# Re-wrap the edited XML in CDATA delimiters; tostring() will escape them.
vs_data.text = '<![CDATA[' + ET.tostring(inner_root).decode('utf-8') + ']]>'
root_str = ET.tostring(root).decode('utf-8')
# Undo the escaping. Note that this unescapes every &lt; and &gt; in the
# document, not only those inside the CDATA section.
root_str = root_str.replace('&lt;', '<').replace('&gt;', '>')
print(root_str)

output

<config>
<subconfig>
<a>First Cell</a>
<b>Second Cell</b>
<vsDataContainer>
<id>0</id>
<vsData><![CDATA[<g>
<f>
<f1>999</f1>
<f2>20</f2>
<f3>30</f3>
</f>
</g>]]></vsData>
</vsDataContainer>
</subconfig>
</config>
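
Post-processing the serialized string affects the whole document, not just the CDATA section. An alternative sketch, assuming the standard library's xml.dom.minidom is acceptable for the outer document, keeps the CDATA section as a real node and edits only its contents. The input is a shortened, hypothetical version of the XML above.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

XML = """<config>
<vsDataContainer>
<id>0</id>
<vsData><![CDATA[
<g>
<f>
<f1>10</f1>
<f2>20</f2>
</f>
</g>
]]></vsData>
</vsDataContainer>
</config>"""

doc = minidom.parseString(XML)
vs_data = doc.getElementsByTagName("vsData")[0]

# minidom represents the CDATA section as its own node type.
cdata = next(
    node for node in vs_data.childNodes
    if node.nodeType == node.CDATA_SECTION_NODE
)

# Edit the embedded XML with ElementTree, as in the answer above.
inner_root = ET.fromstring(cdata.data.strip())
inner_root.find(".//f1").text = "999"
cdata.data = ET.tostring(inner_root, encoding="unicode")

result = doc.documentElement.toxml()
print(result)
```

Because the CDATA node is written back as CDATA, no unescaping step is needed and the rest of the document is left untouched.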

