Pulling Data from a Webpage, Parsing It for Specific Pieces, and Displaying It

This small example uses HtmlAgilityPack and XPath selectors to reach the desired elements.

using HtmlAgilityPack;

protected void Page_Load(object sender, EventArgs e)
{
    string url = "http://www.metacritic.com/game/pc/halo-spartan-assault";
    var web = new HtmlWeb();
    HtmlDocument doc = web.Load(url);

    // Each XPath was copied from the browser's Developer Tools; [0] takes the first match.
    string metascore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
    string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
    string summary = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText;
}

An easy way to obtain the XPath for a given element is to use your browser's Developer Tools (I use Chrome):

  • Open the Developer Tools (F12 or Ctrl + Shift + C on Windows, Command + Shift + C on Mac).
  • Select the element in the page that you want the XPath for.
  • Right-click the element in the "Elements" tab.
  • Click "Copy as XPath" (in newer Chrome versions, "Copy" and then "Copy XPath").

You can paste it into C# exactly as copied (as shown in my code), but make sure to escape the quotes.

Make sure you add error handling, because scraping breaks whenever the site changes its HTML structure: SelectNodes returns null when the XPath no longer matches anything, so indexing the result with [0] throws.
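
As a minimal sketch of that kind of error handling (an illustration, not part of the original answer): the hypothetical helper TryGetText below returns null instead of throwing when an XPath matches nothing. The inline HTML and both XPath expressions are made up for the example; LoadHtml, SelectSingleNode and HtmlEntity.DeEntitize are HtmlAgilityPack calls.

```csharp
using System;
using HtmlAgilityPack;

public static class ScrapeHelpers
{
    // Hypothetical helper: returns the trimmed, entity-decoded text of the
    // first node matching xpath, or null when nothing matches.
    public static string TryGetText(HtmlDocument doc, string xpath)
    {
        // SelectSingleNode returns null (it does not throw) when no node matches.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up markup standing in for a downloaded page.
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id=\"main\"><span class=\"score\"> 53 </span></div>");

        string score = ScrapeHelpers.TryGetText(doc, "//*[@id=\"main\"]/span[@class=\"score\"]");
        string missing = ScrapeHelpers.TryGetText(doc, "//*[@id=\"main\"]/div[3]/span");

        Console.WriteLine(score ?? "(score not found)");
        Console.WriteLine(missing ?? "(element not found, the page layout may have changed)");
    }
}
```

Network failures raised by HtmlWeb.Load are a separate concern and would need their own try/catch around the download.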

Edit

Per @knocte's suggestion, here is the link to the NuGet package for HtmlAgilityPack:

https://www.nuget.org/packages/HtmlAgilityPack/

How to extract data from a website by specifying search criteria?

You can use HtmlAgilityPack for this. I wrote a small test and ran it against the second page you want to scrape; the search criteria are set through the variables below.

HtmlWeb web = new HtmlWeb();
//string InitialUrl = "https://www.hudhomestore.com/Home/Index.aspx";
// Set these variables to whatever the user inputs,
// then append them to the search URL as query parameters.
string zipCode = "", city = "", county = "", street = "", sState = "AK", fromPrice = "0", toPrice = "0", fcaseNumber = "",
    bed = "0", bath = "0", buyerType = "0", Status = "0", indoorAmenities = "", outdoorAmenities = "", housingType = "",
    stories = "", parking = "", propertyAge = "", sLanguage = "ENGLISH";
HtmlAgilityPack.HtmlDocument document = web.Load("https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?" +
    "zipCode=" + zipCode + "&city=" + city + "&country=" + county + "&street=" + street + "&sState=" + sState +
    "&fromPrice=" + fromPrice + "&toPrice=" + toPrice +
    "&fcaseNumber=" + fcaseNumber + "&bed=" + bed + "&bath=" + bath +
    "&buyerType=" + buyerType + "&Status=" + Status + "&indoorAmenities=" + indoorAmenities +
    "&outdoorAmenities=" + outdoorAmenities + "&housingType=" + housingType + "&stories=" + stories +
    "&parking=" + parking + "&propertyAge=" + propertyAge + "&sLanguage=" + sLanguage);
// Select every <td> in every row below the element with id="dgPropertyList".
HtmlNodeCollection tdNodeCollection = document
    .DocumentNode
    .SelectNodes("//*[@id=\"dgPropertyList\"]//tr//td");

Count them again and check your expression: there are exactly 121 td elements inside the rows of the element with id="dgPropertyList".
Next, inspect the td cells manually, work out which ones hold what you need, and fetch that data.

foreach (HtmlAgilityPack.HtmlNode node in tdNodeCollection)
{
    // To reach an <h2> or <p> inside the cell, you can do:
    HtmlNode h2Node = node.SelectSingleNode("./h2"); // first <h2> that is a direct child of the cell
    HtmlNodeCollection allH2Nodes = node.SelectNodes(".//h2"); // every <h2> at any depth below the cell

    // You can also walk the children like a tree, without XPath:
    HtmlNode h2Node_ = node.ChildNodes["h2"];
}

I've tested the code and it works: it parses the whole document to reach the required table and returns all the cells in that table's rows inside the div. From there you can dig into the rows, find your td and get what you need.
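
The search URL in the code above is built by plain string concatenation, so user input containing a space or an & would produce a broken query string. As a hedged sketch of one way around that, the hypothetical helper SearchUrl.Build below percent-encodes each name and value with Uri.EscapeDataString. Only a few of the parameters are shown, and the city value is made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SearchUrl
{
    // Hypothetical helper: joins name/value pairs into a query string,
    // percent-encoding both sides so user input cannot break the URL.
    public static string Build(string baseUrl, IEnumerable<KeyValuePair<string, string>> parameters)
    {
        var pairs = parameters.Select(p =>
            Uri.EscapeDataString(p.Key) + "=" + Uri.EscapeDataString(p.Value ?? ""));
        return baseUrl + "?" + string.Join("&", pairs);
    }
}

public static class Program
{
    public static void Main()
    {
        // A subset of the parameters from the answer; the city value is invented.
        var parameters = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("zipCode", ""),
            new KeyValuePair<string, string>("city", "Fort Yukon & area"),
            new KeyValuePair<string, string>("sState", "AK"),
            new KeyValuePair<string, string>("sLanguage", "ENGLISH"),
        };

        Console.WriteLine(SearchUrl.Build(
            "https://www.hudhomestore.com/Listing/PropertySearchResult.aspx", parameters));
    }
}
```

A list of pairs is used rather than a dictionary so the parameters keep the order in which they were added.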

Another option is Selenium WebDriver.

If you don't want a visible browser window but still want Selenium-like functionality, you can use PhantomJS.

Hope it helps.

Xamarin: how to pull data from an ASPX web page

See this answer:

Pulling data from a webpage, parsing it for specific pieces, and displaying it

Hope that resolves your problem.

Scrape data from a web page with HtmlAgilityPack in C#

Try this:

using System;
using System.IO;
using System.Net;
using System.Text;

public static string Download(string search)
{
    var request = (HttpWebRequest)WebRequest.Create("https://webportal.thpa.gr/ctreport/container/track");

    // Form body in the encoded shape the site expects; the search value is escaped as well.
    var postData = string.Format("report_container%5Bcontainerno%5D={0}&report_container%5Bsearch%5D=", Uri.EscapeDataString(search));
    var data = Encoding.ASCII.GetBytes(postData);

    request.Method = "POST";
    request.ContentType = "application/x-www-form-urlencoded";
    request.ContentLength = data.Length;

    // Write the body, then read the whole response as a string.
    using (var stream = request.GetRequestStream())
    {
        stream.Write(data, 0, data.Length);
    }

    using (var response = (HttpWebResponse)request.GetResponse())
    using (var stream = new StreamReader(response.GetResponseStream()))
    {
        return stream.ReadToEnd();
    }
}

Usage:

var html = Download("ARKU2215462");

UPDATE

To find the POST parameters to use, press F12 in the browser to open the dev tools, then select the Network tab. Now fill the search input with your ARKU2215462 and press the button.

That sends a request to the server and gets the response back, and you can inspect both in the Network tab. There are lots of requests (styles, scripts, images...), but you want the one for the HTML page. Select it and look at its form data (the original answer showed a screenshot here, captioned "Analyze request").

This is the form data that was sent. If you click "view source", you get the data encoded like "report_container%5Bcontainerno%5D=ARKU2215462&report_container%5Bsearch%5D=", which is the form you need in your code.
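
The encoded string does not have to be written by hand. As a small sketch (not the original answer's approach, which uses HttpWebRequest), FormUrlEncodedContent from System.Net.Http produces the same encoding from the decoded field names that the browser shows before you click "view source". The FormBody class and its Encode method are hypothetical names for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;

public static class FormBody
{
    // Hypothetical helper: returns the application/x-www-form-urlencoded
    // body for the given fields, in the order they were supplied.
    public static string Encode(IEnumerable<KeyValuePair<string, string>> fields)
    {
        using (var content = new FormUrlEncodedContent(fields))
        {
            // The content is already in memory, so blocking on the read is harmless here.
            return content.ReadAsStringAsync().GetAwaiter().GetResult();
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // Decoded field names, as listed in the browser's form data view.
        var fields = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("report_container[containerno]", "ARKU2215462"),
            new KeyValuePair<string, string>("report_container[search]", ""),
        };

        Console.WriteLine(FormBody.Encode(fields));
        // prints: report_container%5Bcontainerno%5D=ARKU2215462&report_container%5Bsearch%5D=
    }
}
```

The same FormUrlEncodedContent object could instead be passed to HttpClient.PostAsync to send the request.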

Web page (HTML) scraping using C#

You may take a look at SgmlReader or Html Agility Pack which are HTML parsing libraries for .NET.

Problems scraping data from a webpage that takes time to load

You could use a library like Selenium to achieve this.

For example:

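from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup as bs

driver = webdriver.Firefox()
driver.get("https://www.cbn.gov.ng/rates/ExchRateByCurrency.asp")

# Wait for the target element to appear (10 seconds is an arbitrary timeout).
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "ContentTextinner"))
)

# page_source is a plain string, so parse it with BeautifulSoup before searching.
soup = bs(driver.page_source, "html.parser")
print(soup.find("div", id="ContentTextinner"))

driver.quit()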

Web scraping - how to identify main content on a webpage

There's no way to do this that's guaranteed to work, but one strategy you might use is to try to find the element with the most visible text inside of it.
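
A rough sketch of that strategy with HtmlAgilityPack follows. Everything in it is an assumption made for illustration: the list of container tags, the scoring rule (text held directly by an element or by its direct <p> children, so that <body> does not win simply by containing everything), and the sample markup. It measures text present in the markup and does not check whether CSS hides it.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class MainContent
{
    // Container tags considered as candidates; this list is an assumption.
    private static readonly string[] Containers = { "div", "article", "section", "main", "td" };

    // Score = characters of text in the element's own text nodes plus its direct <p> children.
    private static int Score(HtmlNode element)
    {
        return element.ChildNodes
            .Where(c => c.NodeType == HtmlNodeType.Text || c.Name == "p")
            .Sum(c => HtmlEntity.DeEntitize(c.InnerText).Trim().Length);
    }

    // Returns the highest-scoring container, or null if the page has none.
    public static HtmlNode Find(HtmlDocument doc)
    {
        return doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element && Containers.Contains(n.Name))
            .OrderByDescending(n => Score(n))
            .FirstOrDefault();
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up page: a link-only nav, an article body, and a short footer.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<body><div id=\"nav\"><a href=\"/\">Home</a><a href=\"/about\">About</a></div>" +
            "<div id=\"post\"><p>First paragraph of the article body.</p>" +
            "<p>Second paragraph with more text.</p></div>" +
            "<div id=\"footer\">Contact</div></body>");

        HtmlNode best = MainContent.Find(doc);
        Console.WriteLine(best == null ? "(none)" : best.GetAttributeValue("id", "(no id)"));
    }
}
```

On real pages this heuristic will misfire often enough that the result should be treated as a guess, in line with the caveat above.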


