
Parse XML ampersand in Java

The ampersand is a special character in XML (O'Reilly XML: Entities: Handling Special Content) and needs to be encoded. Replace it with &amp; before sending it.
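
The question is about Java, but the idea is language-neutral. As a minimal sketch in Python, the standard library's xml.sax.saxutils.escape does the replacement for &, < and >. The build_message helper and the <message> element name are hypothetical, chosen only for illustration.

```python
from xml.sax.saxutils import escape


def build_message(text: str) -> str:
    """Wrap text in a hypothetical <message> element, escaping &, < and >."""
    return "<message>" + escape(text) + "</message>"


payload = build_message("Tom & Jerry")
print(payload)  # <message>Tom &amp; Jerry</message>
```

Escaping at the point where the XML is built, rather than afterwards, avoids having to guess later which ampersands were already encoded.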

How do I escape ampersands in XML so they are rendered as entities in HTML?

When your XML contains &amp;amp;, this will result in the text &amp;.

When you use that in HTML, that will be rendered as &.
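
The two decoding steps can be checked with a small Python sketch. It assumes a made-up <p> element, uses ElementTree for the XML step, and uses html.unescape to stand in for what an HTML renderer would display.

```python
import html
from xml.etree import ElementTree

# The XML source carries a doubly escaped ampersand
doc = "<p>Fish &amp;amp; chips</p>"

# Step 1: the XML parser decodes one level of escaping
parsed_text = ElementTree.fromstring(doc).text
print(parsed_text)  # Fish &amp; chips

# Step 2: an HTML renderer decodes the remaining entity
rendered = html.unescape(parsed_text)
print(rendered)  # Fish & chips
```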

How to parse XML with unescaped ampersand

Is it possible to parse XML with an unescaped & in the value of a node?

No. An unescaped & means the file is not well-formed XML, so strictly speaking it is not an XML file at all, and a conforming XML parser must reject it.

However, you can pre-process the data before passing it to an XML parser and fix the issue yourself (& -> &amp;).
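
One possible way to do that pre-processing, sketched in Python: a regular expression that escapes only the ampersands that do not already start an entity or character reference, so existing &amp; sequences are not escaped twice. The BARE_AMP pattern and the fix_bare_ampersands helper are illustrative names, and the approach is a heuristic: it does not know about CDATA sections or comments, and it leaves undeclared entities such as &nbsp; untouched.

```python
import re
from xml.etree import ElementTree

# An ampersand NOT followed by "name;", "#123;" or "#x1F;"
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def fix_bare_ampersands(raw: str) -> str:
    """Escape stray ampersands while leaving existing references alone."""
    return BARE_AMP.sub("&amp;", raw)


raw = "<node>AT&T &amp; R&D</node>"
fixed = fix_bare_ampersands(raw)
print(fixed)  # <node>AT&amp;T &amp; R&amp;D</node>

# The repaired string is now well-formed and can be parsed
print(ElementTree.fromstring(fixed).text)  # AT&T & R&D
```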

How can I include an ampersand (&) character in an XML document?

Use the predefined entity reference &amp; (or a numeric character reference such as &#38;) to represent it.

See the specification:

The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section.

If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and MUST, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

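In practice, an XML library applies these rules when it serializes a document, so hand-escaping is rarely needed. A small Python sketch with the standard library's ElementTree, using a made-up <note> element:

```python
from xml.etree import ElementTree

# Assign the raw text; the serializer does the escaping
node = ElementTree.Element("note")
node.text = "if a < b && b > c"

serialized = ElementTree.tostring(node, encoding="unicode")
print(serialized)  # <note>if a &lt; b &amp;&amp; b &gt; c</note>

# Parsing reverses the escaping and recovers the original text
round_tripped = ElementTree.fromstring(serialized).text

# A numeric character reference is decoded the same way
from_numeric = ElementTree.fromstring("<note>&#38;</note>").text
print(from_numeric)  # &
```
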
Escaping ampersand and other special characters in XML using Python

Try this:

import os
from xml.etree import ElementTree
from xml.sax.saxutils import escape

fileNum = 0
# Create (or truncate) the output file
saveFile = open('NewYork_1.txt', 'w')
saveFile.close()
for path, dirs, files in os.walk("NewYork_1"):
    for f in files:
        fileName = os.path.join(path, f)
        # Open for reading; mode 'a' is append-only and cannot be read
        with open(fileName, 'r', encoding='utf-8') as myFile:
            text = myFile.read()
        if "&" in text:
            # Note: this also re-escapes ampersands that are already entities
            text = text.replace("&", "&amp;")

Personally, I would build a list of the files to read and iterate over that list rather than use os.walk. If the list comes from a previous function or a separate script, you could write it to a text file with one path per line and iterate over the lines instead of holding the whole list in a variable, which saves RAM. But to each their own.
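
A rough sketch of that alternative, assuming a listing file with one path per line. The iter_listed_files name is hypothetical, and the generator yields one path at a time so the full list never has to sit in memory.

```python
from pathlib import Path


def iter_listed_files(list_path):
    """Yield one Path per non-empty line of a listing file."""
    with open(list_path, encoding="utf-8") as listing:
        for line in listing:
            name = line.strip()
            if name:  # skip blank lines
                yield Path(name)
```

It would be used as `for file_path in iter_listed_files("files.txt"): ...`, where files.txt stands in for whatever listing the earlier step produced.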

As I said, though, I'd discard the whole idea of replacing special characters and use bs4 to open the files, search for the elements you're looking for, and grab the data from there.

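import bs4

list_of_USER_IDs = []
# fileName is the path of one file, e.g. from the os.walk loop above
with open(fileName, 'r', encoding='utf-8') as myFile:
    a = myFile.read()
# html.parser tolerates bare ampersands; it also lowercases tag names
b = bs4.BeautifulSoup(a, 'html.parser')
for elem in b.find_all('userid'):
    # Keep the text between the tags rather than the Tag object
    list_of_USER_IDs.append(elem.get_text())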

That returns the data between the USERID tags, and it works the same way for whatever tag you want data from. There's no need to parse the XML directly. It's really not much different from HTML, and BeautifulSoup is made for that, so why reinvent the wheel?


