Scraping webpage generated by JavaScript with C#
The problem is the browser usually executes the javascript and it results with an updated DOM. Unless you can analyze the javascript or intercept the data it uses, you will need to execute the code as a browser would. In the past I ran into the same issue, I utilized selenium and PhantomJS to render the page. After it renders the page, I would use the WebDriver client to navigate the DOM and retrieve the content I needed, post AJAX.
At a high-level, these are the steps:
- Installed selenium: http://docs.seleniumhq.org/
- Started the selenium hub as a service
- Downloaded phantomjs (a headless browser, that can execute the javascript): http://phantomjs.org/
- Started phantomjs in webdriver mode pointing to the selenium hub
- In my scraping application installed the webdriver client nuget package:
Install-Package Selenium.WebDriver
Here is an example usage of the phantomjs webdriver:
var options = new PhantomJSOptions();
options.AddAdditionalCapability("IsJavaScriptEnabled",true);
var driver = new RemoteWebDriver( new URI(Configuration.SeleniumServerHub),
options.ToCapabilities(),
TimeSpan.FromSeconds(3)
);
driver.Url = "http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083";
driver.Navigate();
//the driver can now provide you with what you need (it will execute the script)
//get the source of the page
var source = driver.PageSource;
//fully navigate the dom
var pathElement = driver.FindElementById("some-id");
More info on selenium, phantomjs and webdriver can be found at the following links:
http://docs.seleniumhq.org/
http://docs.seleniumhq.org/projects/webdriver/
http://phantomjs.org/
EDIT: Easier Method
It appears there is a nuget package for the phantomjs, such that you don't need the hub (I used a cluster to do massive scrapping in this manner):
Install web driver:
Install-Package Selenium.WebDriver
Install embedded exe:
Install-Package phantomjs.exe
Updated code:
var driver = new PhantomJSDriver();
driver.Url = "http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083";
driver.Navigate();
//the driver can now provide you with what you need (it will execute the script)
//get the source of the page
var source = driver.PageSource;
//fully navigate the dom
var pathElement = driver.FindElementById("some-id");
scraping website that is generated by javascript in C#
So the whole "magic" of the test is made in app-ea56f7.js file.
This file is sending request and receiving chunks of data from netflix. Unfortunately, as referenced in Running Scripts in HtmlAgilityPack there is no direct way of having this without using a headless browser.
Either use https://www.npmjs.com/package/speedtest-net
How can I scrape a table that is created with JavaScript in c#
F12 is your friend in any browser.
Select the Network tab and you'll notice that all of the info is in this file :
https://www.belastingdienst.nl/data/douane_wisselkoersen/wks.douane.wisselkoersen.dd201806.xml
(I suppose that the data for july 2018 will be held in a url named *.dd201807.xml)
Using C# you will need to do a GET for that URL and parse it as XML, no need to use HtmlAgilityPack. You will need to construct the current year concatenated with the current month to pick the right URL.
Leuker kan ik het niet maken!
C# .NET: Scraping dynamic (JS) websites
if you need to scrape a website you can use ScrapySharp scraping framework. You can add it to a project as a nuget.
https://www.nuget.org/packages/ScrapySharp/
Install-Package ScrapySharp -Version 2.6.2
It has many useful properties to access different elements on the page.For example to access the entire HTML of the page you can use the following:
ScrapingBrowser Browser = new ScrapingBrowser();
WebPage PageResult = Browser.NavigateToPage(new Uri("http://www.example-site.com"));
HtmlNode rawHTML = PageResult.Html;
Console.WriteLine(rawHTML.InnerHtml);
Console.ReadLine();
Scraping data dynamically generated by JavaScript in html document using C#
When you make the WebRequest you're asking the server to give you the page file, this file's content hasn't yet been parsed/executed by a web browser and so the javascript on it hasn't yet done anything.
You need to use a tool to execute the JavaScript on the page if you want to see what the page looks like after being parsed by a browser. One option you have is using the built in .net web browser control: http://msdn.microsoft.com/en-au/library/aa752040(v=vs.85).aspx
The web browser control can navigate to and load the page and then you can query it's DOM which will have been altered by the JavaScript on the page.
EDIT (example):
Uri uri = new Uri("http://www.somewebsite.com/somepage.htm");
webBrowserControl.AllowNavigation = true;
// optional but I use this because it stops javascript errors breaking your scraper
webBrowserControl.ScriptErrorsSuppressed = true;
// you want to start scraping after the document is finished loading so do it in the function you pass to this handler
webBrowserControl.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowserControl_DocumentCompleted);
webBrowserControl.Navigate(uri);
private void webBrowserControl_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
HtmlElementCollection divs = webBrowserControl.Document.GetElementsByTagName("div");
foreach (HtmlElement div in divs)
{
//do something
}
}
Scrape data from web page with HtmlAgilityPack c#
Try this:
public static string Download(string search)
{
var request = (HttpWebRequest)WebRequest.Create("https://webportal.thpa.gr/ctreport/container/track");
var postData = string.Format("report_container%5Bcontainerno%5D={0}&report_container%5Bsearch%5D=", search);
var data = Encoding.ASCII.GetBytes(postData);
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";
request.ContentLength = data.Length;
using (var stream = request.GetRequestStream())
{
stream.Write(data, 0, data.Length);
}
using (var response = (HttpWebResponse)request.GetResponse())
using (var stream = new StreamReader(response.GetResponseStream()))
{
return stream.ReadToEnd();
}
}
Usage:
var html = Download("ARKU2215462");
UPDATE
To find the post parameters to use, press F12 in the browser to show dev tools, then select Network tab. Now, fill the search input with your ARKU2215462 and press the button.
That do a request to the server to get the response. In that request, you can inspect both request and response. There are lots of request (styles, scripts, iamges...) but you want the html pages. In this case, look this:
This is the Form data requested. If you click in "view source", you get the data encoded like "report_container%5Bcontainerno%5D=ARKU2215462&report_container%5Bsearch%5D=", as you need in your code.
Scraping data dynamically generated by JavaScript
I would recommend to directly hit this url:
http://members.tsetmc.com/tsev2/data/InstTradeHistory.aspx?i=67126881188552864&Top=999999&A=0
And then parse the values as this contains all the values basically on all the 209 pages so it will save you also the trouble of moving from page 1 -209.
And it looks easy split by ; gives you all the rows and then split by @ gives you the values per column.
Wait for javascript to load page in c#
I hope the following code will help you
Create a function method to wait for a few seconds
public void Wait(int sec)
{
System.Windows.Forms.Timer timer1 = new System.Windows.Forms.Timer();
if (sec == 0 || sec < 0) return;
timer1.Interval = sec * 1000;
timer1.Enabled = true;
timer1.Start();
timer1.Tick += (s, e) =>
{
timer1.Enabled = false;
timer1.Stop();
};
while (timer1.Enabled)
{
Application.DoEvents();
}
}
Write the following code in the DocumentCompleted
event. Check the element has value or null if null wait for 2 sec and continue this process 30 times, nearly one minute. If it is not loaded display a message like a page not loaded
int cnt = 0;
HtmlElement htmlElement = WebBrowser1.Document.GetElementById("elementID")
do
{
Wait(2);
cnt++;
htmlElement = WebBrowser1.Document.GetElementById("elementID")
if (cnt > 30)
{
throw new Exception();
}
} while (htmlElement == null);
Related Topics
Marshal C++ Struct Array into C#
How to Calculate the Angle Between a Line and the Horizontal Axis
Does C# Optimize the Concatenation of String Literals
Deserialize Xml to Object Using Dynamic
Is There Any Benefit of Using an Object Initializer
Localization of Displaynameattribute
How to Change the File Location Programmatically
How to Limit the Maximum Number of Parallel Tasks in C#
How Does the Ternary Operator Work
How to Get the Computer Name in .Net
Backgroundworker Runworkercompleted Event
Are There Any Fuzzy Search or String Similarity Functions Libraries Written for C#
Converting Dd/Mm/Yyyy Formatted String to Datetime
Using a Self-Signed Certificate with .Net's Httpwebrequest/Response