How to Find the Text Within a Div in the Source of a Web Page Using C#

How do I find the text within a div in the source of a web page using C#

Getting HTML code from a website. You can use code like this:

string urlAddress = "http://google.com";

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

if (response.StatusCode == HttpStatusCode.OK)
{
Stream receiveStream = response.GetResponseStream();
StreamReader readStream = null;
if (String.IsNullOrWhiteSpace(response.CharacterSet))
readStream = new StreamReader(receiveStream);
else
readStream = new StreamReader(receiveStream,
Encoding.GetEncoding(response.CharacterSet));
string data = readStream.ReadToEnd();
response.Close();
readStream.Close();
}

This will give you the returned HTML from the website. But find text via LINQ is not that easy.
Perhaps it is better to use regular expression but that does not play well with HTML.

Get specific content from website via C#

One of the solutions

var htmlWeb = new HtmlWeb();
var documentNode = htmlWeb.Load(source).DocumentNode;

var findclasses = documentNode.Descendants("p")
.Where(d => d.Attributes["class"]?.Value.Contains("mxm-lyrics__content") == true);
//or
var findclasses = documentNode.SelectNodes("//p[contains(@class,'mxm-lyrics__content')]")
var text = string.Join(Environment.NewLine, findclasses.Select(x => x.InnerText));

Reading Html Code and finding div text. C#

try using HTMLAgilityPack for this sort of thing...

for select item from a div see for example Select only items in a specific DIV using HtmlAgilityPack

C# count paragraphs in div from a website's html source code

How about this way :

//get the maximum number of paragraph
int maxNumberOfParagraph =
doc.DocumentNode
.SelectNodes("//div[.//p]")
.Max(o => o.SelectNodes(".//p").Count);

//get divs having number of containing paragraph equals maxNumberOfParagraph
var divs = doc.DocumentNode
.SelectNodes("//div[.//p]")
.Where(o => o.SelectNodes(".//p").Count == maxNumberOfParagraph);

get whole html source of webpage with c# or get data of jquery code with c#

Short answer - You can't. Evaluating the entire page including all client side HTML changes would require you to develop a full web browser.

How to select a div element in the code-behind page?

If you want to find the control from code behind you have to use runat="server" attribute on control. And then you can use Control.FindControl.

<div class="tab-pane active" id="portlet_tab1" runat="server">

Control myControl1 = FindControl("portlet_tab1");
if(myControl1!=null)
{
//do stuff
}

If you use runat server and your control is inside the ContentPlaceHolder you have to know the ctrl name would not be portlet_tab1 anymore. It will render with the ctrl00 format.

Something like: #ctl00_ContentPlaceHolderMain_portlet_tab1. You will have to modify name if you use jquery.

You can also do it using jQuery on client side without using the runat-server attribute:

<script type='text/javascript'>

$("#portlet_tab1").removeClass("Active");

</script>

Pulling data from a webpage, parsing it for specific pieces, and displaying it

This small example uses HtmlAgilityPack, and using XPath selectors to get to the desired elements.

protected void Page_Load(object sender, EventArgs e)
{
string url = "http://www.metacritic.com/game/pc/halo-spartan-assault";
var web = new HtmlAgilityPack.HtmlWeb();
HtmlDocument doc = web.Load(url);

string metascore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
string summary = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText;
}

An easy way to obtain the XPath for a given element is by using your web browser (I use Chrome) Developer Tools:

  • Open the Developer Tools (F12 or Ctrl + Shift + C on Windows or Command + Shift + C for Mac).
  • Select the element in the page that you want the XPath for.
  • Right click the element in the "Elements" tab.
  • Click on "Copy as XPath".

You can paste it exactly like that in c# (as shown in my code), but make sure to escape the quotes.

You have to make sure you use some error handling techniques because Web scraping can cause errors if they change the HTML formatting of the page.

Edit

Per @knocte's suggestion, here is the link to the Nuget package for HTMLAgilityPack:

https://www.nuget.org/packages/HtmlAgilityPack/



Related Topics



Leave a reply



Submit