HtmlAgilityPack HtmlWeb.Load returning empty Document
It seems this website requires cookies to be enabled, so creating a cookie container for your web request should solve the issue:
var url = "http://www.prettygreen.com/";
var htmlWeb = new HtmlWeb();
htmlWeb.PreRequest += request =>
{
request.CookieContainer = new System.Net.CookieContainer();
return true;
};
var htmlDoc = htmlWeb.Load(url);
var outerHtml = htmlDoc.DocumentNode.OuterHtml;
Assert.AreNotEqual("", outerHtml);
HtmlAgilityPack: Web page doesn't return complete HTML
Check the Network tab in Chrome on that page. There are AJAX requests to https://www.verkkokauppa.com/resp-api/product?pids=467610, so the products are loaded using JavaScript.
You can't trigger JavaScript here: HtmlAgilityPack is an HTML parser, not a browser. To work with dynamic content you need a browser engine. I think you should check Selenium and PhantomJS.
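Since the product data arrives through a separate request, one alternative to driving a browser is to call that endpoint directly. The following is only a minimal sketch, assuming the endpoint answers a plain GET without extra headers or cookies (not verified here). The names `ProductApi`, `BuildProductUri`, and `FetchProductBodyAsync` are hypothetical, and the response format is not shown in the original, so parsing is left out.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class ProductApi
{
    // Base URL taken from the AJAX request visible in the Network tab.
    private const string Endpoint = "https://www.verkkokauppa.com/resp-api/product";

    // Builds the request URI for a single product id (hypothetical helper).
    public static Uri BuildProductUri(string pid)
    {
        return new Uri(Endpoint + "?pids=" + Uri.EscapeDataString(pid));
    }

    // Downloads the raw response body as a string. Assumption: the endpoint
    // accepts an unauthenticated GET; the schema of the body is unknown here.
    public static async Task<string> FetchProductBodyAsync(HttpClient client, string pid)
    {
        return await client.GetStringAsync(BuildProductUri(pid));
    }
}
```

If the endpoint turns out to need cookies or special headers, a browser engine such as Selenium remains the more reliable route.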
HtmlAgilityPack doesn't get XPath in C#
I get the same error using HtmlWeb.Load(), but the issue is easy to solve with HttpWebRequest (TL;DR: see Step 3 for the working code).
Step 1) Using the following code:
HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
using (Stream s = hwr.GetResponse().GetResponseStream())
{ }
You see that you actually get a 403 Forbidden error (WebException).
Step 2) Catching the exception and loading the body of the error response:
HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
HtmlDocument doc = new HtmlDocument();
try
{
using (Stream s = hwr.GetResponse().GetResponseStream())
{ }
}
catch (WebException wx)
{
doc.LoadHtml(new StreamReader(wx.Response.GetResponseStream()).ReadToEnd());
}
Inspecting doc.DocumentNode.OuterHtml, you see the HTML of the 403 error page, including the JavaScript that sets a cookie in your browser and then refreshes the page.
Step 3) So in order to load the page without a browser, you have to set that cookie yourself and request the page again:
string cookie = string.Empty;
HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
try
{
using (Stream s = hwr.GetResponse().GetResponseStream())
{ }
}
catch (WebException wx)
{
// Extract the cookie string that the error page's JavaScript assigns to document.cookie.
cookie = Regex.Match(new StreamReader(wx.Response.GetResponseStream()).ReadToEnd(), @"document\.cookie = '(.*?)';").Groups[1].Value;
}
hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
hwr.Headers.Add("Cookie", cookie);
HtmlDocument doc = new HtmlDocument();
using (Stream s = hwr.GetResponse().GetResponseStream())
using (StreamReader sr = new StreamReader(s))
{
doc.LoadHtml(sr.ReadToEnd());
}
You get the page :)
Moral of the story: if your browser can do it, so can you.
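The cookie extraction in Step 3 can be tried in isolation, without any network access. This is a minimal sketch: `ChallengeCookie` and `Extract` are hypothetical names, and the sample markup in the usage note is made up for illustration, not copied from the site. It escapes the dot in the pattern and returns an empty string when no assignment is found.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ChallengeCookie
{
    // Returns the string assigned to document.cookie in the given HTML,
    // or an empty string if no such assignment is present.
    public static string Extract(string html)
    {
        // The dot is escaped so it matches literally; whitespace around
        // '=' is optional; (.*?) lazily captures up to the first "';".
        Match m = Regex.Match(html, @"document\.cookie\s*=\s*'(.*?)';");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

For example, `ChallengeCookie.Extract("<script>document.cookie = 'token=abc123';</script>")` yields `token=abc123`, which is the value that would then go into the Cookie header of the second request.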
WebRequest not returning HTML
You need a CookieCollection to capture the cookies, and you need to set UseCookies to true in HtmlWeb.
CookieCollection cookieCollection = null;
var web = new HtmlWeb
{
//AutoDetectEncoding = true,
UseCookies = true,
CacheOnly = false,
PreRequest = request =>
{
if (cookieCollection != null && cookieCollection.Count > 0)
request.CookieContainer.Add(cookieCollection);
return true;
},
PostResponse = (request, response) => { cookieCollection = response.Cookies; }
};
var doc = web.Load("https://www.google.com");
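To see what a cookie container contributes between requests, the sketch below stores one cookie for a site and asks the container for the Cookie header it would send on a later request to the same site. It uses only `System.Net`; `CookieDemo`, `HeaderFor`, and the example host in the usage note are hypothetical and chosen for illustration.

```csharp
using System;
using System.Net;

public static class CookieDemo
{
    // Stores one cookie for the given site and returns the Cookie header
    // value the container would attach to a later request to that site.
    public static string HeaderFor(Uri site, string name, string value)
    {
        var container = new CookieContainer();
        container.Add(site, new Cookie(name, value));
        return container.GetCookieHeader(site);
    }
}
```

Calling `CookieDemo.HeaderFor(new Uri("https://example.com/"), "session", "abc")` returns `session=abc`. A server that sets a cookie on the first response and expects it back on the next request is satisfied by this kind of round trip.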
Why is this simple webcrawl failing?
It looks like the web page is trying to set cookies, so enable them in HtmlWeb:
var loader = new HtmlWeb{ UseCookies = true };
var doc = loader.Load(@"http://www.monki.com/en_sek/newin/view-all-new.html");
var node2 = doc.DocumentNode.SelectSingleNode("//head/title");
Console.WriteLine("\n\n\n\n");
Console.WriteLine("Node Name2: " + node2.Name + "\n" + node2.OuterHtml + "\n" + node2.InnerText);
Can't download HTML data from https URL using htmlagilitypack
HtmlWeb fails to download this HTTPS page. Instead, you can use WebClient with a small modification to automatically decompress GZip:
class MyWebClient : WebClient
{
protected override WebRequest GetWebRequest(Uri address)
{
HttpWebRequest request = base.GetWebRequest(address) as HttpWebRequest;
request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
return request;
}
}
Then use HtmlDocument.LoadHtml() to populate your HtmlDocument instance from the HTML string:
var url = "https://kat.cr/";
var data = new MyWebClient().DownloadString(url);
var doc = new HtmlDocument();
doc.LoadHtml(data);
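The same decompression setting is also available on HttpClientHandler, so HttpClient can stand in for the WebClient subclass. This swaps in a different API from the one used in the answer and is only a sketch; `Downloader`, `CreateHandler`, and `DownloadAsync` are hypothetical names. The returned string can be passed to HtmlDocument.LoadHtml() in the same way.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class Downloader
{
    // Creates a handler that transparently decompresses GZip and Deflate responses.
    public static HttpClientHandler CreateHandler()
    {
        return new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };
    }

    // Downloads the page body as a string, already decompressed by the handler.
    public static async Task<string> DownloadAsync(string url)
    {
        using (var handler = CreateHandler())
        using (var client = new HttpClient(handler))
        {
            return await client.GetStringAsync(url);
        }
    }
}
```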