Scraping Data Dynamically Generated by JavaScript in HTML Document Using C#

When you make the WebRequest, you are asking the server to give you the page file. That file's content has not yet been parsed or executed by a web browser, so the JavaScript on it has not done anything yet.
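To make this concrete, here is a minimal sketch of what a plain request sees. The page source, the `price` id and the `RawHtmlDemo` class are hypothetical, made up for illustration; the point is only that the element the script would fill in is still empty in the raw markup.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RawHtmlDemo
{
    // Hypothetical page source, exactly as a plain HTTP request would return it.
    public const string PageSource =
        "<html><body>" +
        "<div id=\"price\"></div>" +
        "<script>document.getElementById('price').textContent = '42';</script>" +
        "</body></html>";

    // Returns the text between <div id="..."> and </div> in the raw markup,
    // or null if no such div exists. A regex is enough for this tiny fixed
    // string; use a real HTML parser for real pages.
    public static string ReadDivText(string html, string id)
    {
        var match = Regex.Match(
            html,
            "<div id=\"" + Regex.Escape(id) + "\">(.*?)</div>",
            RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Nothing has executed the script, so the div is still empty.
        string text = ReadDivText(PageSource, "price");
        Console.WriteLine("div text in raw HTML: '" + text + "'");
    }
}
```

The script that would write the value is present in the string, but nothing has run it, so the div text comes back empty.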

You need a tool that executes the JavaScript on the page if you want to see what the page looks like after a browser has processed it. One option is the built-in .NET web browser control: http://msdn.microsoft.com/en-au/library/aa752040(v=vs.85).aspx

The web browser control can navigate to and load the page, and then you can query its DOM, which will have been altered by the JavaScript on the page.

EDIT (example):

Uri uri = new Uri("http://www.somewebsite.com/somepage.htm");

webBrowserControl.AllowNavigation = true;
// Optional, but I use this because it stops JavaScript errors from breaking your scraper
webBrowserControl.ScriptErrorsSuppressed = true;
// Start scraping only after the document has finished loading, so do it in the handler you register here
webBrowserControl.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowserControl_DocumentCompleted);
webBrowserControl.Navigate(uri);

private void webBrowserControl_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    HtmlElementCollection divs = webBrowserControl.Document.GetElementsByTagName("div");

    foreach (HtmlElement div in divs)
    {
        // do something
    }
}

C# .NET: Scraping dynamic (JS) websites

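If you need to scrape a website, you can use the ScrapySharp scraping framework. You can add it to a project as a NuGet package. As far as I know, it downloads and parses the HTML but does not run the page's JavaScript, so content that a script adds later will not be in the result.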
https://www.nuget.org/packages/ScrapySharp/

Install-Package ScrapySharp -Version 2.6.2

It has many useful properties for accessing different elements on the page. For example, to access the entire HTML of the page you can use the following:

ScrapingBrowser Browser = new ScrapingBrowser();
WebPage PageResult = Browser.NavigateToPage(new Uri("http://www.example-site.com"));
HtmlNode rawHTML = PageResult.Html;
Console.WriteLine(rawHTML.InnerHtml);
Console.ReadLine();

Scraping webpage generated by JavaScript with C#

The problem is that a browser executes the JavaScript, which results in an updated DOM. Unless you can analyze the JavaScript or intercept the data it uses, you will need to execute the code as a browser would. I ran into the same issue in the past and used Selenium and PhantomJS to render the page. After the page is rendered, I use the WebDriver client to navigate the DOM and retrieve the content I need, post-AJAX.

At a high-level, these are the steps:

  1. Installed Selenium: http://docs.seleniumhq.org/
  2. Started the Selenium hub as a service
  3. Downloaded PhantomJS (a headless browser that can execute the JavaScript): http://phantomjs.org/
  4. Started PhantomJS in WebDriver mode, pointing to the Selenium hub
  5. In my scraping application, installed the WebDriver client NuGet package: Install-Package Selenium.WebDriver

Here is an example usage of the phantomjs webdriver:

var options = new PhantomJSOptions();
options.AddAdditionalCapability("IsJavaScriptEnabled", true);

var driver = new RemoteWebDriver(
    new Uri(Configuration.SeleniumServerHub),
    options.ToCapabilities(),
    TimeSpan.FromSeconds(3)
);
// Setting Url starts the navigation; no separate Navigate() call is needed
driver.Url = "http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083";
// the driver can now provide you with what you need (it will execute the script)
// get the source of the page
var source = driver.PageSource;
// fully navigate the DOM
var pathElement = driver.FindElementById("some-id");

More info on selenium, phantomjs and webdriver can be found at the following links:

http://docs.seleniumhq.org/

http://docs.seleniumhq.org/projects/webdriver/

http://phantomjs.org/

EDIT: Easier Method

It appears there is a NuGet package for PhantomJS, so you don't need the hub (I used a cluster to do massive scraping in this manner):

Install web driver:

Install-Package Selenium.WebDriver

Install embedded exe:

Install-Package phantomjs.exe

Updated code:

var driver = new PhantomJSDriver();
// Setting Url starts the navigation; no separate Navigate() call is needed
driver.Url = "http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083";
//the driver can now provide you with what you need (it will execute the script)
//get the source of the page
var source = driver.PageSource;
//fully navigate the dom
var pathElement = driver.FindElementById("some-id");
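
One thing the snippets above do not show is waiting for the AJAX content to arrive before reading the DOM. Below is a minimal polling sketch using only the standard library; the `Poll` class and its `Until` method are hypothetical names, not part of Selenium. If you install the Selenium.Support package, its WebDriverWait class serves the same purpose.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Poll
{
    // Calls condition() repeatedly, sleeping pollInterval between calls,
    // until it returns true or the timeout elapses.
    // Returns true if the condition was met, false on timeout.
    public static bool Until(Func<bool> condition, TimeSpan timeout, TimeSpan pollInterval)
    {
        var clock = Stopwatch.StartNew();
        while (true)
        {
            if (condition()) return true;
            if (clock.Elapsed >= timeout) return false;
            Thread.Sleep(pollInterval);
        }
    }

    public static void Main()
    {
        // Stand-in for a page whose content only shows up on the third check.
        int checks = 0;
        bool ready = Until(() => ++checks >= 3,
                           TimeSpan.FromSeconds(2),
                           TimeSpan.FromMilliseconds(10));
        Console.WriteLine(ready + " after " + checks + " checks");
    }
}
```

With a real driver, the condition could be something like () => driver.PageSource.Contains("some-id"), assuming that string only appears once the script has finished.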

Scrape data from web page with HtmlAgilityPack c#

Try this:

// Requires: using System; using System.IO; using System.Net; using System.Text;
public static string Download(string search)
{
    var request = (HttpWebRequest)WebRequest.Create("https://webportal.thpa.gr/ctreport/container/track");

    // URL-encode the search value so special characters cannot corrupt the form body
    var postData = string.Format("report_container%5Bcontainerno%5D={0}&report_container%5Bsearch%5D=", Uri.EscapeDataString(search));
    var data = Encoding.ASCII.GetBytes(postData);

    request.Method = "POST";
    request.ContentType = "application/x-www-form-urlencoded";
    request.ContentLength = data.Length;

    // Write the form body to the request
    using (var stream = request.GetRequestStream())
    {
        stream.Write(data, 0, data.Length);
    }

    // Read the whole response body as a string
    using (var response = (HttpWebResponse)request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        return reader.ReadToEnd();
    }
}

Usage:

var html = Download("ARKU2215462");

UPDATE

To find the POST parameters to use, press F12 in the browser to open the developer tools, then select the Network tab. Now fill the search input with your container number (ARKU2215462) and press the button.

That sends a request to the server. You can inspect both the request and the response. There are lots of requests (styles, scripts, images...) but you want the HTML page. In this case, look at this one:

[Image: Analyze request]

This is the form data sent with the request. If you click "view source", you get the data encoded like "report_container%5Bcontainerno%5D=ARKU2215462&report_container%5Bsearch%5D=", which is what you need in your code.
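
If you would rather build that encoded string from the plain field names than copy it from the dev tools, here is a small sketch. The `FormBody` class and its `Encode` method are hypothetical names for illustration; the field names and the container number are the ones from the request above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FormBody
{
    // Percent-encodes each key and value and joins the pairs with '&',
    // producing a body for application/x-www-form-urlencoded requests.
    public static string Encode(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }

    public static void Main()
    {
        var fields = new[]
        {
            new KeyValuePair<string, string>("report_container[containerno]", "ARKU2215462"),
            new KeyValuePair<string, string>("report_container[search]", ""),
        };
        // The brackets come out as %5B and %5D, as in the "view source" form data.
        Console.WriteLine(Encode(fields));
    }
}
```

One difference to be aware of: Uri.EscapeDataString writes a space as %20, where browsers usually send +. Servers normally accept both.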

Html Agility Pack how to get dynamically generated content after page loads

The content you want is generated after the page loads, using JavaScript and Ajax. Html Agility Pack (HAP) cannot get it unless a browser runs in the background and executes the scripts on the page.

.NET Core 2.0

Prerequisites: you need the Chrome web browser installed on your PC.

  1. Create a console application

  2. Install the NuGet packages
    Install-Package HtmlAgilityPack
    Install-Package Selenium.WebDriver
    Install-Package Selenium.Chrome.WebDriver

  3. Replace the Main method with the following

Code:

// Requires: using System; using HtmlAgilityPack; using OpenQA.Selenium.Chrome;
static void Main(string[] args)
{
    string url = "https://www.sideshow.com/collectibles?manufacturer=Hot+Toys";
    var browser = new ChromeDriver(Environment.CurrentDirectory);
    // Let element lookups wait up to 30 seconds for the scripts to produce the content
    browser.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(30);
    browser.Navigate().GoToUrl(url);

    var results = browser.FindElementByClassName("ss-results");
    var doc = new HtmlDocument();
    doc.LoadHtml(results.GetAttribute("innerHTML"));
    // Close Chrome and the chromedriver process once the HTML is captured
    browser.Quit();

    // Show results
    var list = doc.DocumentNode.SelectSingleNode("//div[@class='c-ProductList row ss-targeted']");
    foreach (var title in list.SelectNodes(".//h2[@class='c-ProductListItem__title ng-binding']"))
    {
        Console.WriteLine(title.InnerText);
    }
    Console.ReadLine();
}

.NET 4.6

  1. Create a console application

  2. Install the NuGet package: Install-Package HtmlAgilityPack

  3. In Solution Explorer, add a reference to System.Windows.Forms

  4. Add using statements as required

  5. Replace the Main method with the following

Code:

[STAThread]
static void Main(string[] args)
{
    string url = "https://www.sideshow.com/collectibles?manufacturer=Hot+Toys";

    var web = new HtmlWeb();
    web.BrowserTimeout = TimeSpan.FromSeconds(30);

    var doc = web.LoadFromBrowser(url, o =>
    {
        var webBrowser = (WebBrowser)o;

        // Wait until the list shows up
        return webBrowser.Document.Body.InnerHtml.Contains("c-ProductList");
    });

    // Show results
    var list = doc.DocumentNode.SelectSingleNode("//div[@class='c-ProductList row ss-targeted']");
    foreach (var title in list.SelectNodes(".//h2[@class='c-ProductListItem__title ng-binding']"))
    {
        Console.WriteLine(title.InnerText);
    }
    Console.ReadLine();
}

Displays a list starting with:

Iron Man Mark L

John Wick

The Punisher War Machine Armor

Wonder Woman Deluxe Version


