What is the fastest way to scrape HTML webpage in Android?
I think in this case it makes no sense to look for a fast way to extract the information, as there is virtually no performance difference between the methods already suggested in the answers once you compare it to the time it takes to download the HTML.
So assuming that by fastest you mean most convenient, readable and maintainable code, I suggest you use a DocumentBuilder to parse the relevant HTML and extract data using XPathExpressions:
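// Parse the (well-formed) HTML string into a DOM document.
Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(new InputSource(new StringReader(html)));
// Select the second <td> after the cell whose text is "Description".
XPathExpression xpath = XPathFactory.newInstance()
        .newXPath().compile("//td[text()=\"Description\"]/following-sibling::td[2]");
// Evaluate the expression and read the matched node's text.
String result = (String) xpath.evaluate(doc, XPathConstants.STRING);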
If you happen to retrieve invalid HTML, I recommend isolating the relevant portion (e.g. using substring(indexOf("<table")..) and, if necessary, correcting the remaining HTML errors with String operations before parsing. If this gets too complex, however (i.e. very bad HTML), just go with the hacky pattern-matching approach suggested in other answers.
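To make the isolation step concrete, here is a minimal sketch that cuts the first table out of an otherwise broken page and then runs the XPath lookup on that fragment only. The class name TableScraper, both method names and the sample markup are made up for illustration. It also assumes the value cell directly follows the label cell (hence td[1] rather than td[2]) and that the label contains no single quote.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class TableScraper {

    /** Cuts out the first <table>...</table> block so the XML parser never sees the rest of the page. */
    static String isolateTable(String html) {
        int start = html.indexOf("<table");
        int end = html.indexOf("</table>", start);
        if (start < 0 || end < 0) {
            throw new IllegalArgumentException("no <table> element found");
        }
        return html.substring(start, end + "</table>".length());
    }

    /** Returns the text of the cell right after the cell whose text equals the label. */
    static String valueFor(String html, String label) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(isolateTable(html))));
            // Assumption: label contains no single quote, otherwise the expression breaks.
            XPathExpression xpath = XPathFactory.newInstance().newXPath()
                    .compile("//td[text()='" + label + "']/following-sibling::td[1]");
            return (String) xpath.evaluate(doc, XPathConstants.STRING);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse table", e);
        }
    }
}
```

Calling TableScraper.valueFor(page, "Description") on a page whose body contains an unclosed tag outside the table still works, because only the table fragment reaches the parser. If the table itself is malformed, this sketch will throw, and that is the point where the String fixes or pattern matching come in.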
Remarks
- XPath has been available since API Level 8 (Android 2.2). If you develop for lower API levels, you can use DOM methods and conditionals to navigate to the node you want to extract.
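As a rough sketch of what the DOM-only route could look like: the code below walks the td elements by hand instead of using XPath. DomLookup, valueFor and textOf are names invented for this example, and it deliberately sticks to basic Node calls (getFirstChild, getNextSibling, getNodeValue) on the assumption that those are the safest bet on old API levels.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class DomLookup {

    /** Finds the <td> whose text equals label and returns the text of the next <td> sibling, or null. */
    static String valueFor(String xml, String label) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
        NodeList cells = doc.getElementsByTagName("td");
        for (int i = 0; i < cells.getLength(); i++) {
            Node cell = cells.item(i);
            if (!label.equals(textOf(cell))) {
                continue;
            }
            // Walk forward, skipping whitespace text nodes, until the next <td> element.
            for (Node n = cell.getNextSibling(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE && "td".equals(n.getNodeName())) {
                    return textOf(n);
                }
            }
        }
        return null;
    }

    /** Concatenates the direct text children of a node (nested elements are ignored). */
    static String textOf(Node node) {
        StringBuilder sb = new StringBuilder();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                sb.append(c.getNodeValue());
            }
        }
        return sb.toString();
    }
}
```

The conditionals are more verbose than the XPath one-liner, but the logic is the same: find the label cell, then step to its following td sibling.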
Web Scraping Android Application
Try jsoup for extracting and manipulating HTML data.
scraping data from website that could change
A much better way to do this would be to have a server do the actual scraping of the website. Your app then talks to your server and receives only the data it needs, so the app will not break every time the website changes.
On the server, you will still need to update your scraping code every time the website structure changes. You will know it has changed when your scraping code breaks or returns garbage results.
You can tell whether the website data has changed by scraping it and comparing the results to the previous results. If the results are new, you allow the app to fetch the new data.
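One possible way to do that comparison on the server, sketched under a few assumptions: the scraped result is available as a String, and keeping a SHA-256 fingerprint of the last result in memory is good enough. ChangeDetector and its methods are invented names, and a real server would probably persist the fingerprint instead.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of any library.
final class ChangeDetector {
    private String lastFingerprint; // null until the first scrape

    /** Returns true when the scraped data differs from the previous call (always true on the first call). */
    synchronized boolean hasChanged(String scrapedData) {
        String current = fingerprint(scrapedData);
        boolean changed = !current.equals(lastFingerprint);
        lastFingerprint = current;
        return changed;
    }

    /** Hex-encoded SHA-256 of the data, so only a short string has to be kept between scrapes. */
    static String fingerprint(String data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(data.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

It is better to fingerprint the extracted data than the raw page, since raw HTML often contains timestamps or ad markup that change on every request.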
If you do it in the app, you will consume a ton of data, because you have to download the site every time you want to check for changes.
Also, your app will break, and maybe even crash, when the site structure changes. That will frustrate users, and it takes a long time for users to receive an app update; some of them will not update at all.
How can I parse webpage content in Android
Try HtmlCleaner or TagSoup. For more information, please check this example: http://blog.andrewpearson.org/2010/07/android-html-parsing.html
Also check out this Stack Overflow question: What is the fastest way to scrape HTML webpage in Android?
Scrape a dynamically-produced page on Android
Selenium (https://www.selenium.dev/) would be a good option for web scraping. It drives a real browser, so it has access to the website's DOM. In my experience, a dynamically generated web page can be difficult to scrape. Regular expressions (https://regexone.com/) will be your friend.
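For the regular expression part, a small sketch of what that might look like once you have the rendered HTML as a String. The pattern, the "Description" label and the RegexExtract class are illustrative assumptions. It only copes with a simple label cell followed by a value cell, and it will break as soon as the markup nests further.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class RegexExtract {
    // Matches a "Description" cell and captures the contents of the cell after it.
    private static final Pattern CELL_AFTER_LABEL = Pattern.compile(
            "<td[^>]*>\\s*Description\\s*</td>\\s*<td[^>]*>(.*?)</td>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed contents of the cell after the label cell, or null if nothing matches. */
    static String description(String html) {
        Matcher m = CELL_AFTER_LABEL.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }
}
```

Unlike an XML parser, this does not care about unclosed tags elsewhere on the page, which is why it is a reasonable fallback for very messy markup.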