Grab All Text from HTML with HTML Agility Pack

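var root = doc.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
    // Only leaf nodes carry their own text; skip comment nodes.
    if (!node.HasChildNodes && node.NodeType != HtmlNodeType.Comment)
    {
        string text = node.InnerText;
        // Skip whitespace-only nodes so no blank lines are appended.
        if (!string.IsNullOrWhiteSpace(text))
            sb.AppendLine(text.Trim());
    }
}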

This does what you need, but it may not be the most efficient approach: DescendantNodesAndSelf visits every node in the document. Selecting only the text nodes, for example with the XPath //text(), might perform better.
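
As one possible alternative to walking every node, the sketch below selects the text nodes directly with XPath. The names TextExtractor and ExtractText are hypothetical, and skipping script and style content is an assumption about what "all text" should mean here. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper, not part of HTML Agility Pack itself.
public static class TextExtractor
{
    // Returns the trimmed, non-empty text of every text node,
    // ignoring anything inside <script> and <style> elements.
    public static List<string> ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes(
            "//text()[not(ancestor::script) and not(ancestor::style)]");
        if (textNodes == null)
            return new List<string>();

        return textNodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()) // decode &amp; etc.
            .Where(t => t.Length > 0)
            .ToList();
    }
}
```

Calling string.Join(Environment.NewLine, TextExtractor.ExtractText(html)) gives roughly the same result as the StringBuilder loop, one text fragment per line.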

HTML Agility Pack - Grab Text after a node

You can use the XPath following-sibling::text()[1] to get the text node located directly after each <strong> element. Here is a minimal but complete example:

var raw = @"<div>
<strong>Title</strong>: Mr<br>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(raw);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//strong"))
{
    var val = node.SelectSingleNode("following-sibling::text()[1]");
    Console.WriteLine(node.InnerText + ", " + val.InnerText);
}

Output:

Title, : Mr
First name, : Fake
Surname, : Guy

If needed, you can remove the leading ":" with simple string manipulation, for example by trimming it from the start of the text.
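
The string manipulation mentioned above could look like the following sketch. It reuses the following-sibling::text()[1] XPath from the example, guards against missing nodes, and strips the leading colon. LabelValueExtractor and Extract are hypothetical names introduced only for illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper, not part of HTML Agility Pack itself.
public static class LabelValueExtractor
{
    // Pairs each <strong> label with the text node that directly follows it.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<KeyValuePair<string, string>>();

        // SelectNodes returns null when the document has no <strong> elements.
        var labels = doc.DocumentNode.SelectNodes("//strong");
        if (labels == null)
            return result;

        foreach (var label in labels)
        {
            var valueNode = label.SelectSingleNode("following-sibling::text()[1]");

            // ": Mr" -> "Mr"; a label with no following text yields an empty value.
            var value = valueNode == null
                ? ""
                : valueNode.InnerText.Trim().TrimStart(':').Trim();

            result.Add(new KeyValuePair<string, string>(label.InnerText.Trim(), value));
        }
        return result;
    }
}
```

With the sample HTML from the example, Extract should return the pairs (Title, Mr), (First name, Fake) and (Surname, Guy).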

Using Html Agility Pack to select all paragraphs that start with a certain text value

You can get the first 5 paragraphs where the inner text starts with "Version" like this:

var nodesParagraph = nodeRevHist
    .Elements("p")
    .Where(p => p.InnerText.Trim().StartsWith("Version"))
    .Take(5);

Working demo here: https://dotnetfiddle.net/uvwcUN
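
The snippet above relies on a nodeRevHist variable that is not shown. As a self-contained sketch, the version below assumes the container is an element located by its id; ParagraphFilter, FirstParagraphsStartingWith and the containerId parameter are hypothetical. Note that Elements("p") only looks at direct children of the container, not at nested paragraphs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper, not part of HTML Agility Pack itself.
public static class ParagraphFilter
{
    // Returns the text of the first `count` direct-child <p> elements of the
    // container whose trimmed inner text starts with `prefix`.
    public static List<string> FirstParagraphsStartingWith(
        string html, string containerId, string prefix, int count)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // GetElementbyId returns null if no element has the given id.
        var container = doc.GetElementbyId(containerId);
        if (container == null)
            return new List<string>();

        return container
            .Elements("p")
            .Select(p => p.InnerText.Trim())
            // Ordinal comparison avoids culture-specific matching surprises.
            .Where(t => t.StartsWith(prefix, StringComparison.Ordinal))
            .Take(count)
            .ToList();
    }
}
```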

Get href tag inner text from html (html agility pack)

You're effectively just collecting the inner text of the nodes. Do this:

var texts = doc.DocumentNode
    .SelectNodes("//a[@href]")
    .Select(n => n.InnerText)
    .Distinct()
    .ToList();
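
One caveat with the snippet above is that SelectNodes returns null when the page has no matching links, so the chained Select call would throw. A possible variant, sketched here with the hypothetical names LinkTextExtractor and GetLinkTexts, uses Descendants instead, which yields an empty sequence in that case.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper, not part of HTML Agility Pack itself.
public static class LinkTextExtractor
{
    // Returns the distinct, non-empty inner texts of all <a> elements
    // that carry an href attribute.
    public static List<string> GetLinkTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("a")                           // never null, unlike SelectNodes
            .Where(a => a.Attributes.Contains("href"))  // same filter as //a[@href]
            .Select(a => HtmlEntity.DeEntitize(a.InnerText).Trim())
            .Where(t => t.Length > 0)
            .Distinct()
            .ToList();
    }
}
```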

c# htmlagilitypack - how to extract specific text from web page

You need to loop through all siblings from the first .heading-size-3 header up to the next .heading-size-3 header:

HtmlAgilityPack.HtmlDocument html = new HtmlAgilityPack.HtmlDocument();
html.LoadHtml(new WebClient().DownloadString("http://www.wowhead.com/quest=35151"));
var root = html.DocumentNode;
// Find the first <h2 class="heading-size-3"> header.
var descriptionHeader = root.Descendants("h2")
    .Where(n => n.GetAttributeValue("class", "")
        .Equals("heading-size-3"))
    .FirstOrDefault();
// ?. leaves current null (so the loop is skipped) if no header was found.
var current = descriptionHeader?.NextSibling;
var result = "";
// Collect sibling text until the next header with the same class.
while (current != null && !current.GetAttributeValue("class", "")
    .Equals("heading-size-3"))
{
    if (!string.IsNullOrEmpty(current.InnerText))
    {
        result += " " + current.InnerText;
    }
    current = current.NextSibling;
}
richTextBox1.Text = result;

The result is:

You have already constructed an impressive garrison in Frostfire. I believe I should defer this next choice to you.
One region of Gorgrond is rich in resources. A lumber mill could help us make the most of them.
Another region harbors hardened gladiators. A sparring arena would help persuade them to fight for our cause.
Either path will strengthen us as we seek to find and weaken the Iron Horde.
Which do you choose, Commander?
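
The same sibling walk can be sketched as a helper that works on an HTML string rather than a live download, which makes it easier to try out. SectionExtractor and TextAfterFirstHeading are hypothetical names; using a StringBuilder and trimming each fragment are design choices of this sketch, not part of the answer above.

```csharp
using System.Linq;
using System.Text;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper, not part of HTML Agility Pack itself.
public static class SectionExtractor
{
    // Joins the text of all siblings that follow the first <h2> with the given
    // class, stopping at the next sibling that carries the same class.
    public static string TextAfterFirstHeading(string html, string headingClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var header = doc.DocumentNode
            .Descendants("h2")
            .FirstOrDefault(n => n.GetAttributeValue("class", "") == headingClass);
        if (header == null)
            return "";

        var sb = new StringBuilder();
        for (var current = header.NextSibling;
             current != null && current.GetAttributeValue("class", "") != headingClass;
             current = current.NextSibling)
        {
            var text = current.InnerText.Trim();
            if (text.Length == 0)
                continue; // skip whitespace-only nodes and empty elements

            if (sb.Length > 0)
                sb.Append(' ');
            sb.Append(text);
        }
        return sb.ToString();
    }
}
```

Like the original loop, this compares the whole class attribute, so a header with several classes (for example class="heading-size-3 other") would not match.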

HtmlAgilityPack how to extract html between some tag

You can use the OuterHtml property of each <p> element to get the desired HTML:

string s = "<p>firt paragraph</p>some <br />text<p>another paragraph</p><span>some text between span</span><p>hellow word</p>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var nodes = doc.DocumentNode.SelectNodes("//p");
foreach (var item in nodes)
{
    Console.WriteLine(item.OuterHtml);
}

Output:

<p>firt paragraph</p>
<p>another paragraph</p>
<p>hellow word</p>

Or, if you want everything between the first and the last <p> elements, inclusive, you can use the following XPath:

var query = "//node()[preceding-sibling::p or self::p][following-sibling::p or self::p]";

The XPath selects every node (element or text node) that satisfies both predicates: it is a <p> element or has a preceding sibling <p>, and it is a <p> element or has a following sibling <p>.

var nodes = doc.DocumentNode.SelectNodes(query);
foreach (var item in nodes)
{
    Console.WriteLine(item.OuterHtml);
}

Output:

<p>firt paragraph</p>
some
<br />
text
<p>another paragraph</p>
<span>some text between span</span>
<p>hellow word</p>

