What Is the Fastest Way to Scrape HTML Webpage in Android

What is the fastest way to scrape HTML webpage in Android?

In this case it makes little sense to look for a fast way to extract the information: there is virtually no performance difference between the methods already suggested in other answers when compared with the time it takes to download the HTML.

So, assuming that by "fastest" you mean the most convenient, readable, and maintainable code, I suggest using a DocumentBuilder to parse the relevant HTML and extracting the data with an XPathExpression:

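// Uses java.io.StringReader, javax.xml.parsers.*, javax.xml.xpath.*, org.w3c.dom.Document, org.xml.sax.InputSource.
// Parse the downloaded HTML string (it must be well-formed XML) into a DOM tree.
Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(new InputSource(new StringReader(html)));

// Select the second <td> that follows the cell whose text is "Description".
XPathExpression xpath = XPathFactory.newInstance()
        .newXPath().compile("//td[text()=\"Description\"]/following-sibling::td[2]");

// Evaluate the expression against the document and read the result as a string.
String result = (String) xpath.evaluate(doc, XPathConstants.STRING);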

If you happen to retrieve invalid HTML, I recommend isolating the relevant portion (e.g. using substring(indexOf("<table")..)) and, if necessary, correcting the remaining HTML errors with String operations before parsing. If this gets too complex, however (i.e. the HTML is very bad), just go with the hacky pattern-matching approach suggested in other answers.
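The isolate-then-parse idea can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name TableFragmentExtractor, both method names, and the sample page are hypothetical. It assumes the page contains a single, non-nested table whose markup is well-formed even though the rest of the page is not, that the value sits in the cell directly after the label (hence td[1] rather than td[2]), and that the label contains no apostrophe.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class TableFragmentExtractor {

    /** Cuts the first <table>...</table> block out of a page; returns null if none is found. */
    static String isolateTable(String html) {
        int start = html.indexOf("<table");
        if (start < 0) return null;
        // Assumption: no nested tables, so the first closing tag ends the block.
        int end = html.indexOf("</table>", start);
        if (end < 0) return null;
        return html.substring(start, end + "</table>".length());
    }

    /** Parses a well-formed fragment and returns the text of the cell right after the label cell. */
    static String valueAfterLabel(String fragment, String label) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(fragment)));
            // Assumption: label contains no apostrophe, otherwise the expression breaks.
            XPathExpression xpath = XPathFactory.newInstance().newXPath()
                    .compile("//td[text()='" + label + "']/following-sibling::td[1]");
            return (String) xpath.evaluate(doc, XPathConstants.STRING);
        } catch (Exception e) {
            throw new IllegalStateException("Fragment is not well-formed enough to parse", e);
        }
    }

    public static void main(String[] args) {
        // The unclosed <br> would make the whole page unparseable as XML;
        // the isolated table on its own is fine.
        String page = "<html><body><p>unclosed paragraph<br>"
                + "<table><tr><td>Description</td><td>Blue widget</td></tr></table>"
                + "</body></html>";
        System.out.println(valueAfterLabel(isolateTable(page), "Description"));
    }
}
```

Wrapping the checked parser exceptions in one unchecked exception keeps the calling code short; whether that is appropriate depends on how your app reports errors.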

Remarks

  • XPath has been available since API Level 8 (Android 2.2). If you develop for lower API levels, you can use DOM methods and conditionals to navigate to the node you want to extract.
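For the DOM-only route mentioned in the remark, a rough sketch could look like this. The class name DomCellFinder and its methods are hypothetical, and the input is assumed to be well-formed. As a precaution it reads text through child text nodes instead of getTextContent(), since DOM Level 3 methods may not be present on very old API levels.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomCellFinder {

    /** Returns the text of the first <td> following the <td> whose text equals label, or null. */
    static String valueAfterLabel(String wellFormedHtml, String label) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedHtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Input could not be parsed", e);
        }
        NodeList cells = doc.getElementsByTagName("td");
        for (int i = 0; i < cells.getLength(); i++) {
            Node cell = cells.item(i);
            if (!label.equals(textOf(cell))) continue;
            // Walk the siblings, skipping whitespace text nodes, until the next <td>.
            for (Node n = cell.getNextSibling(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE && "td".equals(n.getNodeName())) {
                    return textOf(n);
                }
            }
        }
        return null;
    }

    /** Concatenates the direct text children of a node and trims surrounding whitespace. */
    static String textOf(Node node) {
        StringBuilder sb = new StringBuilder();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) sb.append(c.getNodeValue());
        }
        return sb.toString().trim();
    }
}
```

The inner loop is the "conditionals" part: whitespace between cells shows up as text nodes, so each sibling has to be checked before it is treated as the next cell.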

Web Scraping Android Application

Try jsoup for extracting and manipulating HTML data.

Scraping data from a website that could change

A much better approach is to have a server do the actual scraping of the website. Your app then talks to your server and receives only the data it needs, so the app does not break every time the website changes.

On the server, you will still need to update your scraping code every time the website structure changes. You will know it has changed when your scraping code breaks or returns garbage results.

You can detect whether the website data has changed by scraping it and comparing the results with the previous results. If the results are new, you let the app fetch the new data.
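One way the server might compare a scrape with the previous one is to keep a fingerprint of the last result instead of the full data. The sketch below is only an illustration: the class ScrapeChangeDetector and its methods are hypothetical, the fingerprint is held in memory (a real server would probably persist it), and SHA-256 is simply one reasonable choice of hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ScrapeChangeDetector {
    // Fingerprint of the previous scrape; null before the first run.
    private String lastFingerprint;

    /** Returns true when the scraped data differs from the previous call (always true on the first call). */
    boolean hasChanged(String scrapedData) {
        String fingerprint = sha256Hex(scrapedData);
        boolean changed = !fingerprint.equals(lastFingerprint);
        lastFingerprint = fingerprint;
        return changed;
    }

    /** Hex-encoded SHA-256 digest of the UTF-8 bytes of the text. */
    static String sha256Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }
}
```

The app could then ask the server for the current fingerprint, a few dozen bytes, and download the full data only when it differs from the one it already has.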

If you do the scraping in the app, you will consume a lot of data, because you have to download the site every time you want to check for changes.
The app will also break, and may even crash, when the site structure changes. That frustrates users, and it takes a long time for an app update to reach them; some will never update at all.

How can I parse webpage content in Android

Try HtmlCleaner or TagSoup. For more information, see this example: http://blog.andrewpearson.org/2010/07/android-html-parsing.html

Also check out this Stack Overflow question: What is the fastest way to scrape HTML webpage in Android?

Scrape a dynamically-produced page on Android

Selenium (https://www.selenium.dev/) would be a good option for web scraping, since it has access to the website's DOM. In my experience, a dynamically generated web page can be difficult to scrape. Regular expressions will be your friend (https://regexone.com/).
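As a rough idea of the regular-expression side, here is a small sketch that pulls values out of HTML that has already been rendered and fetched as a string. The markup (a span with class "price"), the class RegexScraper, and the method name are all hypothetical; a real page will need its own pattern, and patterns like this tend to break when the markup changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexScraper {
    // Hypothetical markup: <span class="price">19.99</span>
    private static final Pattern PRICE = Pattern.compile(
            "<span\\s+class=\"price\"\\s*>\\s*([^<]+?)\\s*</span>",
            Pattern.CASE_INSENSITIVE);

    /** Returns the trimmed text of every price span found in the HTML, in document order. */
    static List<String> extractPrices(String html) {
        List<String> prices = new ArrayList<>();
        Matcher m = PRICE.matcher(html);
        while (m.find()) {
            prices.add(m.group(1)); // group 1 is the text between the tags
        }
        return prices;
    }
}
```

Calling RegexScraper.extractPrices(renderedHtml) returns an empty list when nothing matches, which is a cheap signal that the page structure may have changed.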


