HtmlAgilityPack HtmlWeb.Load Returning Empty Document

HtmlAgilityPack HtmlWeb.Load returning empty Document

It seems this website requires cookies to be enabled, so creating a cookie container for your web request should solve the issue:

var url = "http://www.prettygreen.com/";
var htmlWeb = new HtmlWeb();
htmlWeb.PreRequest += request =>
{
    // Give the request a cookie container so the site's cookies are accepted
    request.CookieContainer = new System.Net.CookieContainer();
    return true;
};
var htmlDoc = htmlWeb.Load(url);
var outerHtml = htmlDoc.DocumentNode.OuterHtml;
Assert.AreNotEqual("", outerHtml);
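If you would rather not go through HtmlWeb, the same fix can be sketched with HttpClient: an HttpClientHandler with UseCookies enabled and a CookieContainer makes the client accept and resend the site's cookies automatically. This is a minimal alternative sketch, not part of the original answer; feed the returned string to HtmlDocument.LoadHtml().

```csharp
using System.Net;
using System.Net.Http;

static class CookieAwareDownloader
{
    // Handler that stores cookies between requests, mirroring the
    // CookieContainer trick used with HtmlWeb.PreRequest above.
    public static HttpClientHandler CreateHandler()
    {
        return new HttpClientHandler
        {
            UseCookies = true,
            CookieContainer = new CookieContainer()
        };
    }

    public static string Download(string url)
    {
        using (var client = new HttpClient(CreateHandler()))
        {
            return client.GetStringAsync(url).GetAwaiter().GetResult();
        }
    }
}
```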

HtmlAgilityPack: Web page doesn't return complete HTML

Check the Network tab in Chrome's developer tools on that page. There are AJAX requests to https://www.verkkokauppa.com/resp-api/product?pids=467610, so the products are loaded with JavaScript.

You can't just trigger that JavaScript here: HtmlAgilityPack is only an HTML parser. If you want to work with dynamic content, you need a browser engine; check out Selenium and PhantomJS.
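If all you need is the product data, one option (an assumption on my part, not something the answer above demonstrates) is to call that resp-api endpoint directly and parse its JSON response instead of scraping the rendered page. A minimal sketch of building the request URL from product ids:

```csharp
using System;

static class VerkkokauppaApi
{
    // The endpoint observed in the Network tab; pids is a comma-separated
    // list of product ids (467610 is the id from the example request).
    public static string BuildProductUrl(params int[] pids)
    {
        return "https://www.verkkokauppa.com/resp-api/product?pids="
               + string.Join(",", pids);
    }
}
```

You could then download that URL with WebClient or HttpClient and parse the JSON, assuming the endpoint accepts requests without extra headers.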

HtmlAgilityPack doesn't get XPath in C#

I get the same error using HtmlWeb.Load(), but your issue can be solved with HttpWebRequest (TL;DR: see Step 3 for the working code).

Step 1) Using the following code:

HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
// GetResponse() throws here because the server answers with 403 Forbidden
using (Stream s = hwr.GetResponse().GetResponseStream())
{ }

you see that you actually get a 403 Forbidden error (a WebException).

Step 2) Reading the body of that error response:

HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
HtmlDocument doc = new HtmlDocument();
try
{
    using (Stream s = hwr.GetResponse().GetResponseStream())
    { }
}
catch (WebException wx)
{
    // The 403 response still has a body; load it into the document
    doc.LoadHtml(new StreamReader(wx.Response.GetResponseStream()).ReadToEnd());
}

Inspecting doc.DocumentNode.OuterHtml, you see the HTML of the forbidden error, including the JavaScript that sets the cookie in your browser and then refreshes the page.

Step 3) So in order to load the page outside of a browser, you have to set that cookie manually and request the page again. That is, with:

string cookie = string.Empty;
HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
try
{
    using (Stream s = hwr.GetResponse().GetResponseStream())
    { }
}
catch (WebException wx)
{
    // Extract the cookie value from the inline script on the 403 page
    cookie = Regex.Match(new StreamReader(wx.Response.GetResponseStream()).ReadToEnd(), "document.cookie = '(.*?)';").Groups[1].Value;
}

// Retry the request, sending the cookie the page wanted to set
hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
hwr.Headers.Add("Cookie", cookie);
HtmlDocument doc = new HtmlDocument();
using (Stream s = hwr.GetResponse().GetResponseStream())
using (StreamReader sr = new StreamReader(s))
{
    doc.LoadHtml(sr.ReadToEnd());
}

You get the page :)
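The cookie-extraction step above can be isolated into a small helper and sanity-checked against a sample of the 403 page. Note that the sample script line in the test is invented for illustration; the real page's markup may differ.

```csharp
using System.Text.RegularExpressions;

static class CookieExtractor
{
    // Pull the cookie assignment out of the inline script on the 403 page,
    // using the same pattern as the catch block above. Returns an empty
    // string when no assignment is found.
    public static string Extract(string html)
    {
        var m = Regex.Match(html, "document.cookie = '(.*?)';");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```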

Moral of the story: if your browser can do it, so can you.

WebRequest not returning HTML

You need a CookieCollection to capture the cookies, and you need to set UseCookies to true on HtmlWeb.

CookieCollection cookieCollection = null;
var web = new HtmlWeb
{
    //AutoDetectEncoding = true,
    UseCookies = true,
    CacheOnly = false,
    PreRequest = request =>
    {
        // Replay any cookies captured from the previous response
        if (cookieCollection != null && cookieCollection.Count > 0)
            request.CookieContainer.Add(cookieCollection);

        return true;
    },
    // Capture the cookies the server sets
    PostResponse = (request, response) => { cookieCollection = response.Cookies; }
};

var doc = web.Load("https://www.google.com");

Why is this simple web crawl failing?

It looks like the web page is trying to set cookies. See also this answer to a question with the same problem:

var loader = new HtmlWeb{ UseCookies = true };
var doc = loader.Load(@"http://www.monki.com/en_sek/newin/view-all-new.html");

var node2 = doc.DocumentNode.SelectSingleNode("//head/title");
Console.WriteLine("\n\n\n\n");
Console.WriteLine("Node Name2: " + node2.Name + "\n" + node2.OuterHtml + "\n" + node2.InnerText);

Can't download HTML data from https URL using htmlagilitypack

HtmlWeb doesn't support downloading from HTTPS here, so instead you can use WebClient with a small modification to automatically decompress GZip responses:

class MyWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        // Let the framework transparently decompress gzip/deflate responses
        HttpWebRequest request = base.GetWebRequest(address) as HttpWebRequest;
        request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
        return request;
    }
}

Then use HtmlDocument.LoadHtml() to populate your HtmlDocument instance from the downloaded HTML string:

var url = "https://kat.cr/";
var data = new MyWebClient().DownloadString(url);
var doc = new HtmlDocument();
doc.LoadHtml(data);
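On newer runtimes, the same decompression setup can be sketched with HttpClientHandler instead of subclassing WebClient. This is an alternative sketch, not part of the original answer.

```csharp
using System.Net;
using System.Net.Http;

static class DecompressingClient
{
    // Handler configured to transparently inflate gzip/deflate responses,
    // matching the WebClient override above.
    public static HttpClientHandler CreateHandler()
    {
        return new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip
                                     | DecompressionMethods.Deflate
        };
    }
}
```

Usage would look like `new HttpClient(DecompressingClient.CreateHandler()).GetStringAsync(url)`, with the result passed to HtmlDocument.LoadHtml() as before.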

