Pulling data from a webpage, parsing it for specific pieces, and displaying it
This small example uses HtmlAgilityPack, and using XPath
selectors to get to the desired elements.
protected void Page_Load(object sender, EventArgs e)
{
string url = "http://www.metacritic.com/game/pc/halo-spartan-assault";
var web = new HtmlAgilityPack.HtmlWeb();
HtmlDocument doc = web.Load(url);
string metascore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
string summary = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText;
}
An easy way to obtain the XPath
for a given element is by using your web browser (I use Chrome) Developer Tools:
- Open the Developer Tools (F12 or Ctrl + Shift + C on Windows or Command + Shift + C for Mac).
- Select the element in the page that you want the XPath for.
- Right click the element in the "Elements" tab.
- Click on "Copy as XPath".
You can paste it exactly like that in c# (as shown in my code), but make sure to escape the quotes.
You have to make sure you use some error handling techniques because Web scraping can cause errors if they change the HTML formatting of the page.
Edit
Per @knocte's suggestion, here is the link to the Nuget package for HTMLAgilityPack:
https://www.nuget.org/packages/HtmlAgilityPack/
How to extract data from a website with specifying a search criteria?
You can make use of HTMLAgilityPack for this purpose. I've made a small testing code and tested with the second page you wish to scrap based on the search criteria which you can set.
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
HtmlWeb web = new HtmlWeb();
//string InitialUrl = "https://www.hudhomestore.com/Home/Index.aspx";
//Here you need to set the values of these variable to whatever user inputs
//after setting these values, add them to initial URL
string zipCode = "", city = "", county = "", street = "", sState = "AK", fromPrice = "0", toPrice = "0", fcaseNumber = "",
bed = "0", bath = "0", buyerType = "0", Status = "0", indoorAmenities = "", outdoorAmenities = "", housingType = "",
stories = "", parking = "", propertyAge = "", sLanguage = "ENGLISH";
HtmlAgilityPack.HtmlDocument document = web.Load("https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?" +
"zipCode=" + zipCode + "&city=" + city + "&country=" + county + "&street=" + street + "&sState=" + sState +
"&fromPrice=" + fromPrice + "&toPrice=" + toPrice +
"&fcaseNumber=" + fcaseNumber + "&bed=" + bed + "&bath=" + bath +
"&buyerType=" + buyerType + "&Status=" + Status + "&indoorAmenities=" + indoorAmenities +
"&outdoorAmenities=" +outdoorAmenities + "&housingType=" + housingType + "&stories=" + stories +
"&parking=" + parking + "&propertyAge=" + propertyAge + "&sLanguage=" + sLanguage);
HtmlNodeCollection tdNodeCollection = document
.DocumentNode
.SelectNodes("//*[@id=\"dgPropertyList\"]//tr//td");
Count them again and look at your expression, there are exactly 121 td's
within tr
with id="dgPropertyList"
Next, check your td
manually and trace what you need from that td
and fetch that data.
foreach (HtmlAgilityPack.HtmlNode node in tdNodeCollection)
{
//Do you say you want to access to <h2>, <p> here?
//You can do:
HtmlNode h2Node = node.SelectSingleNode("./h2"); //That will get the first <h2> node
HtmlNodeCollection allH2Nodes = node.SelectNodes(".//h2"); //That will search in depth too
//And you can also take a look at the children, without using XPath (like in a tree):
HtmlNode h2Node_ = node.ChildNodes["h2"];
}
I've tested the code, it works and parse the whole document to reach the required table. It will get you all the rows within that table inside div. So, you can further dig into these rows, find your td and get what you need.
Another option could be using Selenium webdriver
, Get your hands on Selenium
If you don't want the browser to be visible and still want to use Selenium like functionality then you can make use of PhantomJS
Hope it helps.
Xamarin how to pull data from a aspx web page
Go through this link....
Pulling data from a webpage, parsing it for specific pieces, and displaying it
hope you resolve problem.........
Scrape data from web page with HtmlAgilityPack c#
Try this:
public static string Download(string search)
{
var request = (HttpWebRequest)WebRequest.Create("https://webportal.thpa.gr/ctreport/container/track");
var postData = string.Format("report_container%5Bcontainerno%5D={0}&report_container%5Bsearch%5D=", search);
var data = Encoding.ASCII.GetBytes(postData);
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";
request.ContentLength = data.Length;
using (var stream = request.GetRequestStream())
{
stream.Write(data, 0, data.Length);
}
using (var response = (HttpWebResponse)request.GetResponse())
using (var stream = new StreamReader(response.GetResponseStream()))
{
return stream.ReadToEnd();
}
}
Usage:
var html = Download("ARKU2215462");
UPDATE
To find the post parameters to use, press F12 in the browser to show dev tools, then select Network tab. Now, fill the search input with your ARKU2215462 and press the button.
That do a request to the server to get the response. In that request, you can inspect both request and response. There are lots of request (styles, scripts, iamges...) but you want the html pages. In this case, look this:
This is the Form data requested. If you click in "view source", you get the data encoded like "report_container%5Bcontainerno%5D=ARKU2215462&report_container%5Bsearch%5D=", as you need in your code.
Web page(html) scraping using C#
You may take a look at SgmlReader or Html Agility Pack which are HTML parsing libraries for .NET.
Problems scraping data from a webpage that takes time to load
You could use a library like selenium
to achieve this.
For example:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
driver = webdriver.Firefox()
driver.get("https://www.cbn.gov.ng/rates/ExchRateByCurrency.asp")
html = driver.page_source
print(html.find("div",id="ContentTextinner"))
driver.quit()
Web scraping - how to identify main content on a webpage
There's no way to do this that's guaranteed to work, but one strategy you might use is to try to find the element with the most visible text inside of it.
Related Topics
How Should Anonymous Types Be Used in C#
How Accurate Is Thread.Sleep(Timespan)
Shellexecute Equivalent in .Net
Identityserver4 Register Userservice and Get Users from Database in ASP.NET Core
How to Return a File Using Web API
Programmatically Get the Version Number of a Dll
How Do C# Events Work Behind the Scenes
Copy Rows from One Datatable to Another Datatable
Is Endinvoke() Optional, Sort-Of Optional, or Definitely Not Optional
Create Xml Nodes Based on Xpath
Newtonsoft Add JSONignore at Runtime
Why Isn't My Public Property Serialized by the Xmlserializer
Differencebetween for and Foreach
Webbrowser Documentcompleted Event Fired More Than Once
Is It Ok to Use a String as a Lock Object