Scraping data dynamically generated by JavaScript in html document using C#
When you make the WebRequest, you're asking the server for the page file. That file's content hasn't yet been parsed or executed by a web browser, so the JavaScript on it hasn't done anything.
You need a tool that executes the JavaScript on the page if you want to see what the page looks like after a browser has processed it. One option is the built-in .NET WebBrowser control: http://msdn.microsoft.com/en-au/library/aa752040(v=vs.85).aspx
The WebBrowser control can navigate to and load the page, and you can then query its DOM, which will have been altered by the JavaScript on the page.
EDIT (example):
Uri uri = new Uri("http://www.somewebsite.com/somepage.htm");
webBrowserControl.AllowNavigation = true;
// optional but I use this because it stops javascript errors breaking your scraper
webBrowserControl.ScriptErrorsSuppressed = true;
// you want to start scraping after the document is finished loading so do it in the function you pass to this handler
webBrowserControl.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowserControl_DocumentCompleted);
webBrowserControl.Navigate(uri);

private void webBrowserControl_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    HtmlElementCollection divs = webBrowserControl.Document.GetElementsByTagName("div");
    foreach (HtmlElement div in divs)
    {
        //do something
    }
}
C# .NET: Scraping dynamic (JS) websites
If you need to scrape a website, you can use the ScrapySharp scraping framework. You can add it to a project as a NuGet package:
https://www.nuget.org/packages/ScrapySharp/
Install-Package ScrapySharp -Version 2.6.2
It has many useful properties for accessing different elements on the page. For example, to access the entire HTML of the page you can use the following:
ScrapingBrowser Browser = new ScrapingBrowser();
WebPage PageResult = Browser.NavigateToPage(new Uri("http://www.example-site.com"));
HtmlNode rawHTML = PageResult.Html;
Console.WriteLine(rawHTML.InnerHtml);
Console.ReadLine();
Scraping webpage generated by JavaScript with C#
The problem is that the browser usually executes the JavaScript, which results in an updated DOM. Unless you can analyze the JavaScript or intercept the data it uses, you will need to execute the code as a browser would. I ran into the same issue in the past and used Selenium and PhantomJS to render the page. After the page rendered, I used the WebDriver client to navigate the DOM and retrieve the content I needed, post AJAX.
At a high-level, these are the steps:
- Installed Selenium: http://docs.seleniumhq.org/
- Started the Selenium hub as a service
- Downloaded PhantomJS (a headless browser that can execute the JavaScript): http://phantomjs.org/
- Started PhantomJS in WebDriver mode, pointing to the Selenium hub
- In my scraping application, installed the WebDriver client NuGet package:
Install-Package Selenium.WebDriver
Here is an example usage of the phantomjs webdriver:
var options = new PhantomJSOptions();
options.AddAdditionalCapability("IsJavaScriptEnabled", true);
var driver = new RemoteWebDriver(
    new Uri(Configuration.SeleniumServerHub),
    options.ToCapabilities(),
    TimeSpan.FromSeconds(3));
//GoToUrl performs the navigation; Navigate() on its own only returns the navigation object
driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");
//the driver can now provide you with what you need (it will execute the script)
//get the source of the page
var source = driver.PageSource;
//fully navigate the dom
var pathElement = driver.FindElementById("some-id");
More info on selenium, phantomjs and webdriver can be found at the following links:
http://docs.seleniumhq.org/
http://docs.seleniumhq.org/projects/webdriver/
http://phantomjs.org/
EDIT: Easier Method
It appears there is a NuGet package for PhantomJS, so you don't need the hub (I used a cluster to do massive scraping in this manner):
Install web driver:
Install-Package Selenium.WebDriver
Install embedded exe:
Install-Package phantomjs.exe
Updated code:
var driver = new PhantomJSDriver();
//GoToUrl performs the navigation; Navigate() on its own only returns the navigation object
driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");
//the driver can now provide you with what you need (it will execute the script)
//get the source of the page
var source = driver.PageSource;
//fully navigate the dom
var pathElement = driver.FindElementById("some-id");
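Since the content arrives post AJAX, an element may not exist yet at the moment navigation returns. One way to handle that is to poll until it shows up or a timeout expires. The following is a minimal sketch of that idea using only the base class library; `Poll.WaitUntil` is a hypothetical helper name, not part of Selenium, and the counter in `Main` is just a stand-in for "is the element there yet?".

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Poll
{
    // Hypothetical helper (not a Selenium API): re-evaluates `condition`
    // every `interval` until it returns true or `timeout` has elapsed.
    public static bool WaitUntil(Func<bool> condition, TimeSpan timeout, TimeSpan interval)
    {
        var clock = Stopwatch.StartNew();
        while (true)
        {
            if (condition()) return true;               // content is ready
            if (clock.Elapsed >= timeout) return false; // give up
            Thread.Sleep(interval);                     // wait before the next check
        }
    }
}

public static class PollDemo
{
    public static void Main()
    {
        // Stand-in for a DOM check: becomes true on the third evaluation.
        int checks = 0;
        bool found = Poll.WaitUntil(() => ++checks >= 3,
                                    TimeSpan.FromSeconds(2),
                                    TimeSpan.FromMilliseconds(50));
        Console.WriteLine(found ? "content ready" : "timed out");
    }
}
```

With a driver, the condition could be something like `() => driver.FindElements(By.Id("some-id")).Count > 0`. The Selenium.Support package also ships a `WebDriverWait` class built for this purpose, which is probably the better choice if you already depend on it.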
Scrape data from web page with HtmlAgilityPack c#
Try this:
// requires: using System; using System.IO; using System.Net; using System.Text;
public static string Download(string search)
{
    var request = (HttpWebRequest)WebRequest.Create("https://webportal.thpa.gr/ctreport/container/track");
    // escape the value so special characters cannot break the form encoding
    var postData = string.Format("report_container%5Bcontainerno%5D={0}&report_container%5Bsearch%5D=", Uri.EscapeDataString(search));
    var data = Encoding.ASCII.GetBytes(postData);
    request.Method = "POST";
    request.ContentType = "application/x-www-form-urlencoded";
    request.ContentLength = data.Length;
    using (var stream = request.GetRequestStream())
    {
        stream.Write(data, 0, data.Length);
    }
    using (var response = (HttpWebResponse)request.GetResponse())
    using (var stream = new StreamReader(response.GetResponseStream()))
    {
        return stream.ReadToEnd();
    }
}
Usage:
var html = Download("ARKU2215462");
UPDATE
To find the POST parameters to use, press F12 in the browser to open the dev tools, then select the Network tab. Now fill the search input with your ARKU2215462 and press the button.
That sends a request to the server and gets the response, and you can inspect both. There are lots of requests (styles, scripts, images...), but you want the one for the HTML page. In this case, look at the form data sent with that request.
If you click "view source", you get the data encoded like "report_container%5Bcontainerno%5D=ARKU2215462&report_container%5Bsearch%5D=", which is what you need in your code.
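The encoded string is just the form's field names and values after percent-encoding: `%5B` and `%5D` are `[` and `]`, so the fields are `report_container[containerno]` and `report_container[search]`. As a rough sketch, you could build the body from the readable names instead of copying the encoded text. `FormBody.Build` is a hypothetical helper name; `Uri.EscapeDataString` is the standard .NET call doing the encoding.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FormBody
{
    // Percent-encodes every key and value, then joins the pairs as key=value&key=value.
    public static string Build(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }
}

public static class FormBodyDemo
{
    public static void Main()
    {
        var fields = new[]
        {
            new KeyValuePair<string, string>("report_container[containerno]", "ARKU2215462"),
            new KeyValuePair<string, string>("report_container[search]", ""),
        };
        // prints report_container%5Bcontainerno%5D=ARKU2215462&report_container%5Bsearch%5D=
        Console.WriteLine(FormBody.Build(fields));
    }
}
```

Note that `Uri.EscapeDataString` encodes a space as `%20`, while browsers send `+` in form bodies; servers normally accept both.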
Html Agility Pack how to get dynamically generated content after page loads
The content you want is generated after the page loads, using JavaScript and Ajax. HAP cannot get it unless it runs a browser in the background and executes the scripts on the page.
.Net Core 2.0
Prerequisites: you need the Chrome web browser installed on your PC.
Create a console application
Install Nuget packages
Install-Package HtmlAgilityPack
Install-Package Selenium.WebDriver
Install-Package Selenium.Chrome.WebDriver
Replace the Main method with the following code:
static void Main(string[] args)
{
    string url = "https://www.sideshow.com/collectibles?manufacturer=Hot+Toys";
    var browser = new ChromeDriver(Environment.CurrentDirectory);
    browser.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(30);
    browser.Navigate().GoToUrl(url);
    var results = browser.FindElementByClassName("ss-results");
    var doc = new HtmlDocument();
    doc.LoadHtml(results.GetAttribute("innerHTML"));
    // Show results
    var list = doc.DocumentNode.SelectSingleNode("//div[@class='c-ProductList row ss-targeted']");
    foreach (var title in list.SelectNodes(".//h2[@class='c-ProductListItem__title ng-binding']"))
    {
        Console.WriteLine(title.InnerText);
    }
    Console.ReadLine();
}
.Net 4.6
Create a console application
Install Nuget package
Install-Package HtmlAgilityPack
In Solution Explorer, add a reference to System.Windows.Forms
Add using statements as required
Replace the Main method with the following code:
[STAThread]
static void Main(string[] args)
{
    string url = "https://www.sideshow.com/collectibles?manufacturer=Hot+Toys";
    var web = new HtmlWeb();
    web.BrowserTimeout = TimeSpan.FromSeconds(30);
    var doc = web.LoadFromBrowser(url, o =>
    {
        var webBrowser = (WebBrowser)o;
        // Wait until the list shows up
        return webBrowser.Document.Body.InnerHtml.Contains("c-ProductList");
    });
    // Show results
    var list = doc.DocumentNode.SelectSingleNode("//div[@class='c-ProductList row ss-targeted']");
    foreach (var title in list.SelectNodes(".//h2[@class='c-ProductListItem__title ng-binding']"))
    {
        Console.WriteLine(title.InnerText);
    }
    Console.ReadLine();
}
Displays a list starting with:
Iron Man Mark L
John Wick
The Punisher War Machine Armor
Wonder Woman Deluxe Version
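One caveat about the XPath used in both versions: `@class='...'` compares the whole attribute string, so it stops matching if the page adds, removes, or reorders a class. A common XPath 1.0 idiom is to match a single class token instead. The sketch below shows the idiom with the base class library's `System.Xml.XPath` on a small, hypothetical, well-formed stand-in for the rendered markup. `HasClass` is an illustrative helper name; the expression it builds is plain XPath 1.0, so the same string should also be usable with HAP's `SelectNodes`.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

public static class ClassTokenXPath
{
    // Builds an XPath 1.0 predicate matching one class token, regardless of
    // which other classes share the attribute or in what order they appear.
    public static string HasClass(string token)
    {
        return "contains(concat(' ', normalize-space(@class), ' '), ' " + token + " ')";
    }

    public static void Main()
    {
        // Hypothetical stand-in for the rendered product list.
        var doc = XDocument.Parse(
            "<div class='row c-ProductList'>" +
            "<h2 class='c-ProductListItem__title ng-binding'>Iron Man Mark L</h2>" +
            "<h2 class='ng-scope c-ProductListItem__title'>John Wick</h2>" +
            "</div>");

        var xpath = "//div[" + HasClass("c-ProductList") + "]" +
                    "//h2[" + HasClass("c-ProductListItem__title") + "]";

        // Both titles match even though their class attributes differ.
        foreach (var title in doc.XPathSelectElements(xpath))
        {
            Console.WriteLine(title.Value);
        }
    }
}
```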