HTML Tag Parsing

Parse an HTML string with JS

Create a dummy DOM element and add the string to it. Then, you can manipulate it like any DOM element.

var el = document.createElement( 'html' );
el.innerHTML = "<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>";

el.getElementsByTagName( 'a' ); // Live HTMLCollection of your anchor elements

Edit: adding a jQuery answer to please the fans!

var el = $( '<div></div>' );
el.html("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>");

$('a', el) // All the anchor elements

Parsing an html tag with JavaScript

Parse the HTML and get the value from the tag.

One method is to use DOMParser:

var x = "<a>1234</a>";
var parser = new DOMParser();
var doc = parser.parseFromString(x, "text/html");
console.log(doc.querySelector('a').innerHTML);

Parse HTML as a plain text via JavaScript

A pretty elegant solution is to use DOMParser.

const parser = new DOMParser()
const virtualDoc = parser.parseFromString(htmlString, 'text/html')

Then treat virtualDoc like any other Document object:

virtualDoc.getElementById('someid').value

How to parse html tag properties using bash and sed

Use sed:

sed -e 's/.*NAME-\([^"]*\)" value="\([^"]*\)".*/\1\t\2/' -e 's/\.//g' INPUT.HTML
  • .* matches any character, zero or more times
  • [^"]* matches any character except ", zero or more times
  • \(...\) captures the enclosed part; here, the substring after NAME- up to the double quote is remembered in \1 and the value is remembered in \2
  • s/PATTERN/REPLACEMENT/ substitutes the pattern with the replacement; here, it replaces the whole line with just the two captured parts, separated by a tab (\t, which GNU sed understands in the replacement)
  • s/\.//g deletes all dots (the /g means "global", i.e. all of them)
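
For comparison, the two substitutions above can be sketched in Python with the standard re module. This is only an illustration: the function name extract and the sample input line are assumptions, since the original input file is not shown.

```python
import re

# Same pattern as the first sed expression, in Python regex syntax:
# everything up to NAME-, capture the name, then capture the value.
PATTERN = re.compile(r'.*NAME-([^"]*)" value="([^"]*)".*')

def extract(line: str) -> str:
    """Mimic the two sed substitutions on a single line (hypothetical helper)."""
    line = line.rstrip("\n")
    # First substitution: replace the whole line with "name<TAB>value".
    line = PATTERN.sub(lambda m: m.group(1) + "\t" + m.group(2), line, count=1)
    # Second substitution: delete every dot.
    return line.replace(".", "")

# Assumed sample line, not taken from the original question.
print(extract('<input name="NAME-a.b" value="1.2.3">'))
```

As with sed, a line that does not match the first pattern is left as it is, apart from losing its dots.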

Is there a way to parse html tags with Python?

You can use a recursive generator function with BeautifulSoup:

import bs4
from bs4 import BeautifulSoup as soup
s = """
<div class="title">
<h1> Hello World </h1>
</div>
"""
def get_tags(d):
    ats = " ".join(a+"="+f'"{(b if not isinstance(b, list) else " ".join(b))}"' for a, b in d.attrs.items())
    h = f'<{d.name} {ats}>' if ats else f'<{d.name}>'
    if (k:=[i for i in d.contents if isinstance(i, bs4.element.Tag)]):
        yield h
        yield from [j for l in k for j in get_tags(l)]
        yield f'</{d.name}>'
    else:
        yield f'{h}{d.text}</{d.name}>'

print(list(get_tags(soup(s, 'html.parser').contents[1])))

Output:

['<div class="title">', '<h1> Hello World </h1>', '</div>']
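
If installing BeautifulSoup is not an option, a rough alternative is the standard library's html.parser module. The sketch below is not the approach from the answer above: it only records start and end tags in document order and does not merge leaf tags with their text. The names TagCollector and list_tags are made up for this example.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collects start and end tags in the order they appear (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        ats = " ".join(
            name if value is None else f'{name}="{value}"'
            for name, value in attrs
        )
        self.tags.append(f"<{tag} {ats}>" if ats else f"<{tag}>")

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")

def list_tags(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags

print(list_tags('<div class="title">\n<h1> Hello World </h1>\n</div>'))
```

Text content is ignored here; overriding handle_data would be the place to capture it if needed.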

How to parse text with html tags into plain text?

It is because tags in the form of <\/foo> are not HTML tags.

HTML tags have the form of <foo> as start tag and </foo> as end tag.

Jsoup, being an HTML parser, only parses HTML tags. So, Jsoup only parses <foo> and </foo>.

You have 2 options:

  1. Find and replace e.g. <\/ by </ in your HTML string before feeding it to Jsoup.
  2. Or, if this HTML string does not come from a hardcoded source, figure out why it is corrupted like this and fix it at the source. That is, step back and work out how exactly this HTML string was entered, modified, and/or saved in the place you obtained it from. For example, if that process unintentionally changes </foo> to <\/foo> because of a badly written HTML sanitizer, then that process needs to be fixed. Or, if the end user is literally typing <\/foo> instead of </foo> during input, then you need to reject the input with a validation error and let the end user fix it.
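
The first option is a plain string replacement, done before the string reaches the parser. Here is a minimal sketch of the idea in Python (the same one-line replace works on a Java String before calling Jsoup); the function name unescape_end_tags is made up for this example.

```python
def unescape_end_tags(html: str) -> str:
    """Turn escaped end tags such as <\\/foo> back into </foo> (illustrative helper)."""
    # "<\\/" in Python source is the three characters <, \ and /.
    return html.replace("<\\/", "</")

print(unescape_end_tags("<p>Hello<\\/p>"))
```

This treats the symptom only; if the string is being corrupted upstream, the second option is the more durable fix.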

NSAttributedString: how to parse html tags and add attributes

OK, I found a solution; maybe it will help someone else.

Note that my strings are strictly multiline, so it's easiest to split them first, then apply the font and size to each part, and finally add the paragraph styling.

I experimented with the order of parsing tags, styling fonts, and styling the paragraph, and in every case something was lost. If you don't need to preserve the line breaks exactly, you can skip the per-line splitting. Otherwise, without it the line breaks can be lost while styling or parsing tags.

descriptionLabel.attributedText = getAttributedDescriptionText(for: "Register and get <b>all</b>\n<b>rewards cards</b> of our partners\n<b>in one</b> universal <b>card</b>", fontDescription: "ProximaNova-Regular", fontSize: 15)

func getAttributedDescriptionText(for descriptionString: String, fontDescription: String, fontSize: Int) -> NSAttributedString? {
    let paragraphStyle = NSMutableParagraphStyle()
    paragraphStyle.lineSpacing = 1.0
    paragraphStyle.alignment = .center
    paragraphStyle.minimumLineHeight = 18.0

    let attributedString = NSMutableAttributedString()
    let splits = descriptionString.components(separatedBy: "\n")
    // Convert each line separately so the line breaks survive HTML parsing.
    for (index, string) in splits.enumerated() {
        let modifiedFont = String(format: "<span style=\"font-family: '\(fontDescription)'; font-size: \(fontSize)\">%@</span>", string)
        // Encode as UTF-8 so the data matches the .characterEncoding option below.
        let data = Data(modifiedFont.utf8)
        let attr = try? NSMutableAttributedString(
            data: data,
            options: [
                .documentType: NSAttributedString.DocumentType.html,
                .characterEncoding: String.Encoding.utf8.rawValue
            ],
            documentAttributes: nil
        )
        attributedString.append(attr ?? NSMutableAttributedString())
        // Re-insert the newline between lines, but not after the last one.
        if index < splits.count - 1 {
            attributedString.append(NSAttributedString(string: "\n"))
        }
    }
    attributedString.addAttribute(.paragraphStyle, value: paragraphStyle, range: NSRange(location: 0, length: attributedString.length))
    return attributedString
}

