What Is the Best Practice to Parse HTML in Swift

What is the best practice to parse html in swift?

There are several good libraries for parsing HTML from Swift and Objective-C, including the following:

  • hpple
  • NDHpple
  • Kanna (formerly Swift-HTML-Parser)
  • Fuzi
  • SwiftSoup
  • Ji

Here are examples for the libraries listed above, most of them using an XPath query with the ends-with function:

hpple:

let data = NSData(contentsOfFile: path)
let doc = TFHpple(htmlData: data)

if let elements = doc.searchWithXPathQuery("//a/@href[ends-with(.,'.txt')]") as? [TFHppleElement] {
    for element in elements {
        print(element.content)
    }
}

NDHpple:

let data = NSData(contentsOfFile: path)!
let html = NSString(data: data, encoding: NSUTF8StringEncoding)!
let doc = NDHpple(HTMLData: html)
if let elements = doc.searchWithXPathQuery("//a/@href[ends-with(.,'.txt')]") {
    for element in elements {
        print(element.children?.first?.content)
    }
}

Kanna (XPath and CSS Selectors):

let html = "<html><head></head><body><ul><li><input type='image' name='input1' value='string1value' class='abc' /></li><li><input type='image' name='input2' value='string2value' class='def' /></li></ul><span class='spantext'><b>Hello World 1</b></span><span class='spantext'><b>Hello World 2</b></span><a href='example.com'>example(English)</a><a href='example.co.jp'>example(JP)</a></body>"

if let doc = Kanna.HTML(html: html, encoding: NSUTF8StringEncoding) {
    let bodyNode = doc.body

    if let inputNodes = bodyNode?.xpath("//a/@href[ends-with(.,'.txt')]") {
        for node in inputNodes {
            print(node.contents)
        }
    }
}

Fuzi (XPath and CSS Selectors):

let html = "<html><head></head><body><ul><li><input type='image' name='input1' value='string1value' class='abc' /></li><li><input type='image' name='input2' value='string2value' class='def' /></li></ul><span class='spantext'><b>Hello World 1</b></span><span class='spantext'><b>Hello World 2</b></span><a href='example.com'>example(English)</a><a href='example.co.jp'>example(JP)</a></body>"

do {
    // if encoding is omitted, it defaults to NSUTF8StringEncoding
    let doc = try HTMLDocument(string: html, encoding: NSUTF8StringEncoding)

    // XPath queries
    for anchor in doc.xpath("//a/@href[ends-with(.,'.txt')]") {
        print(anchor.stringValue)
    }
} catch let error {
    print(error)
}

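The ends-with function is part of XPath 2.0. hpple, NDHpple, Kanna, Fuzi and Ji are built on libxml2, which implements only XPath 1.0, so the ends-with queries above may fail there. An XPath 1.0 equivalent of the predicate is substring(., string-length(.) - string-length('.txt') + 1) = '.txt'.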
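For parsers that only understand XPath 1.0, the suffix test can be rewritten with substring() and string-length(). The sketch below only builds such a query string. The helper name endsWithPredicate is hypothetical and not part of any library listed above, and it assumes the suffix contains no single quote.

```swift
/// Builds an XPath 1.0 predicate that mimics XPath 2.0's ends-with(., suffix).
/// Hypothetical helper; assumes `suffix` contains no single quote.
func endsWithPredicate(_ suffix: String) -> String {
    // substring() is 1-based, so the suffix starts at
    // string-length(.) - string-length(suffix) + 1.
    return "substring(., string-length(.) - string-length('\(suffix)') + 1) = '\(suffix)'"
}

// Query for href attributes whose value ends in ".txt".
let query = "//a/@href[\(endsWithPredicate(".txt"))]"
print(query)
```

The resulting string can be passed wherever the examples above pass the ends-with query.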

SwiftSoup (CSS Selectors):

do {
    let doc: Document = try SwiftSoup.parse("...")
    let links: Elements = try doc.select("a[href]") // a with href
    let pngs: Elements = try doc.select("img[src$=.png]") // img with src ending .png
    let masthead: Element? = try doc.select("div.masthead").first() // div with class=masthead
    let resultLinks: Elements? = try doc.select("h3.r > a") // direct a after h3
} catch Exception.Error(let type, let message) {
    print(message)
} catch {
    print("error")
}

Ji (XPath):

let jiDoc = Ji(htmlURL: URL(string: "http://www.apple.com/support")!)
let titleNode = jiDoc?.xPath("//head/title")?.first
print("title: \(titleNode?.content)") // title: Optional("Official Apple Support")

I hope this helps you.

Is there an easy solution for parsing html in swift to get individual elements into their own variable?

Without a third-party library, the simplest way to parse HTML is to load it into a web view and query it with JavaScript, for example by calling getElementsByTagName, like this:

1- Define the js code:

let js = "document.getElementsByTagName('title')[0].innerHTML"

2- Import WebKit and load the html into a webView

import UIKit
import WebKit

class MyViewController: UIViewController {

    let html = """
    <#the HTML code, can be loaded from anywhere#>
    """

    override func loadView() {
        let webView = WKWebView()
        webView.navigationDelegate = self // Here is the Delegate
        webView.loadHTMLString(html, baseURL: nil)

        self.view = webView
    }
}

3- Take the delegation and implement this method:

extension MyViewController: WKNavigationDelegate {
    func webView(_ webView: WKWebView, didFinish navigation: WKNavigation!) {
        webView.evaluateJavaScript(js) { (result, error) in
            guard error == nil else {
                print(error!)
                return
            }

            print(String(describing: result))
        }
    }
}

Note 1: remember that getElementsByTagName returns an array-like collection, so you must pass the index of the element you want, like [0].

Note 2: the script is evaluated by the web view's own JavaScript engine, so this approach cannot work without a web view, and evaluateJavaScript must be called on the main thread.

Note 3: you must wait for the delegate's didFinish callback before evaluating the script, even if you pass the HTML statically.

Note 4: you can also use a third-party framework like SwiftSoup to do this.
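As a small illustration of step 1 and Note 1, the script string can be built from a tag name and an index. This is only a sketch: innerHTMLScript is a hypothetical helper name, and it assumes the tag name is a plain identifier that contains no quotes.

```swift
/// Builds the JavaScript from step 1 for any tag name and index.
/// Hypothetical helper; assumes `tag` is a plain tag name without quotes.
func innerHTMLScript(forTag tag: String, at index: Int = 0) -> String {
    // Single quotes for the JS string avoid escaping inside the Swift literal.
    return "document.getElementsByTagName('\(tag)')[\(index)].innerHTML"
}

let titleScript = innerHTMLScript(forTag: "title")
let secondParagraphScript = innerHTMLScript(forTag: "p", at: 1)
print(titleScript)
print(secondParagraphScript)
```

Either string can then be handed to evaluateJavaScript inside the didFinish callback, as in step 3.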

Trying to parse HTML in Swift 4 using only the Standard Library

You can use a regular expression to find every substring that lies between two given strings (check this SO answer), and use the extension method ranges(of:) from this answer to collect all ranges matching that pattern. You just need to pass the .regularExpression option to that method.


import Foundation

extension String {
    // Returns every range where `string` (or, with .regularExpression, the pattern) occurs.
    func ranges(of string: String, options: CompareOptions = .literal) -> [Range<Index>] {
        var result: [Range<Index>] = []
        var start = startIndex
        while let range = range(of: string, options: options, range: start..<endIndex) {
            result.append(range)
            // Step past empty matches so the loop always advances.
            start = range.lowerBound < range.upperBound ? range.upperBound : index(range.lowerBound, offsetBy: 1, limitedBy: endIndex) ?? endIndex
        }
        return result
    }

    // Returns every substring found between `from` and `to` (both are used as regex fragments).
    func slices(from: String, to: String) -> [Substring] {
        let pattern = "(?<=" + from + ").*?(?=" + to + ")"
        return ranges(of: pattern, options: .regularExpression)
            .map { self[$0] }
    }
}

Testing playground

let itemListURL = URL(string: "http://steamcommunity.com/market/search?appid=252490")!
let itemListHTML = try! String(contentsOf: itemListURL, encoding: .utf8)
let result = itemListHTML.slices(from: "market_listing_row_link\" href=\"", to: "\"")
result.forEach({print($0)})

Result

http://steamcommunity.com/market/listings/252490/Night%20Howler%20AK47
http://steamcommunity.com/market/listings/252490/Hellcat%20SAR
http://steamcommunity.com/market/listings/252490/Metal
http://steamcommunity.com/market/listings/252490/Volcanic%20Stone%20Hatchet
http://steamcommunity.com/market/listings/252490/Box
http://steamcommunity.com/market/listings/252490/High%20Quality%20Bag
http://steamcommunity.com/market/listings/252490/Utilizer%20Pants
http://steamcommunity.com/market/listings/252490/Lizard%20Skull
http://steamcommunity.com/market/listings/252490/Frost%20Wolf
http://steamcommunity.com/market/listings/252490/Cloth
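Because slices(from:to:) pastes both delimiters straight into the pattern, a delimiter containing regex metacharacters such as ( or ? would change the pattern's meaning. The sketch below is a variation, not the code from the answer above: literalSlices is a hypothetical name, and it uses NSRegularExpression with escaped delimiters on a made-up sample string instead of a live page.

```swift
import Foundation

/// Hypothetical variant of slices(from:to:) that matches both delimiters literally.
func literalSlices(of text: String, from: String, to: String) -> [String] {
    // Escape the delimiters so characters like "(" or "?" are not treated as regex syntax.
    let start = NSRegularExpression.escapedPattern(for: from)
    let end = NSRegularExpression.escapedPattern(for: to)
    let pattern = "(?<=\(start)).*?(?=\(end))"
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }

    // Convert between String indices and the NSRange values the regex API uses.
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, range: whole).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

// Made-up markup, used only for illustration.
let sample = "<a class=\"row\" href=\"/item/1?x=1\">One</a><a class=\"row\" href=\"/item/2\">Two</a>"
let links = literalSlices(of: sample, from: "class=\"row\" href=\"", to: "\"")
print(links)
```

Like the original, this only works while the markup around the target text stays exactly the same.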

parsing HTML in swift

A couple of thoughts:

  1. The use of // says "find this anywhere in the HTML". If you want to control what level you want to consider, just use / and follow this from the root of the document. For example, to get the second level, but not the first or third levels, you'd do something like:

    let tutorialsParser = TFHpple(HTMLData: data)
    let tutorialsXPathString = "/html/body/ul/li/ul/li"
    if let tutorialNodes = tutorialsParser.searchWithXPathQuery(tutorialsXPathString) as? [TFHppleElement] {
        for element in tutorialNodes {
            let content = element.firstChild.content.stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
            let identifier = element.attributes["id"] as! String
            print("id = \(identifier); content = \(content)")
        }
    }

  2. Note, I'm not sure why you were using the scanner, but if you want the attributes of an element, you can use the attributes method.

  3. I also defined the tutorialNodes to be an array of TFHppleElement objects, which simplifies the for loop a bit.

  4. If you wanted the top level /ul/li followed by the second level, but not the third level, you could do something like:

    let tutorialsParser = TFHpple(HTMLData: data)
    let tutorialsXPathString = "/html/body/ul/li"
    if let tutorialNodes = tutorialsParser.searchWithXPathQuery(tutorialsXPathString) as? [TFHppleElement] {
        for element in tutorialNodes {
            let content = element.firstChild.content.stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
            let identifier = element.attributes["id"] as! String
            print("id = \(identifier); content = \(content)")

            if let ul = element.childrenWithTagName("ul") as? [TFHppleElement] {
                if let li = ul.first?.childrenWithTagName("li") as? [TFHppleElement] {
                    for element in li {
                        let content = element.firstChild.content.stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
                        let identifier = element.attributes["id"] as! String
                        print(" child id = \(identifier); content = \(content)")
                    }
                }
            }
        }
    }

    Or you could do something like:

    let tutorialsParser = TFHpple(HTMLData: data)
    let tutorialsXPathString = "/html/body/ul/li"
    if let tutorialNodes = tutorialsParser.searchWithXPathQuery(tutorialsXPathString) as? [TFHppleElement] {
        for element in tutorialNodes {
            let content = element.firstChild.content.stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
            let identifier = element.attributes["id"] as! String
            print("id = \(identifier); content = \(content)")

            if let children = element.searchWithXPathQuery("/html/body/li/ul/li") as? [TFHppleElement] {
                for element in children {
                    let content = element.firstChild.content.stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
                    let identifier = element.attributes["id"] as! String
                    print(" child id = \(identifier); content = \(content)")
                }
            }
        }
    }
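To see the difference between // and an absolute path without installing hpple, here is a minimal sketch that uses Foundation's XMLDocument instead. That is an assumption of this example, not what the answer above uses: XMLDocument is available on macOS (and through FoundationXML on Linux) but not on iOS, and it needs well-formed markup. The sample list and its ids are made up.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLDocument lives in this module on Linux
#endif

// Made-up, well-formed sample with a two-level list.
let markup = """
<html><body><ul>
<li id="a">A<ul><li id="a1">A1</li><li id="a2">A2</li></ul></li>
<li id="b">B</li>
</ul></body></html>
"""

let document = try XMLDocument(xmlString: markup, options: [])

// "//li" matches li elements at any depth.
let anywhere = try document.nodes(forXPath: "//li")
// An absolute path matches only the level it spells out.
let topLevel = try document.nodes(forXPath: "/html/body/ul/li")
let secondLevel = try document.nodes(forXPath: "/html/body/ul/li/ul/li")

// Collect the id attribute of each second-level item.
let secondLevelIDs = secondLevel.compactMap {
    ($0 as? XMLElement)?.attribute(forName: "id")?.stringValue
}
print(anywhere.count, topLevel.count, secondLevelIDs)
```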

Swift - Parsing a Web Page

I found the solution:

import UIKit
import Alamofire
import SwiftSoup

class ViewController: UIViewController {

    override func viewDidLoad() {
        super.viewDidLoad()

        let diyanetURL = "https://namazvakitleri.diyanet.gov.tr/tr-TR/8648"

        // let params = ["ulkeId" : 2, "ilId" : 500,"ilceId" : 9146]
        Alamofire.request(diyanetURL, method: .post, parameters: nil, encoding: URLEncoding.default).validate(contentType: ["application/x-www-form-urlencoded"]).response { (response) in

            if let data = response.data, let utf8Text = String(data: data, encoding: .utf8) {
                do {
                    let html: String = utf8Text
                    let doc: Document = try SwiftSoup.parse(html)
                    // Plain `try` lets selector errors reach the catch block instead of crashing.
                    for row in try doc.select("tr") {
                        print("------------------")
                        for col in try row.select("td") {
                            print(try col.text())
                        }
                    }
                } catch let error {
                    print(error.localizedDescription)
                }
            }
        }
    }
}

