Scraping Webpage Generated by JavaScript with C#

Scraping webpage generated by JavaScript with C#

The problem is that the browser normally executes the JavaScript, which results in an updated DOM. Unless you can analyze the JavaScript or intercept the data it uses, you will need to execute the code as a browser would. In the past I ran into the same issue; I used Selenium and PhantomJS to render the page. After it rendered the page, I would use the WebDriver client to navigate the DOM and retrieve the content I needed, post-AJAX.

At a high-level, these are the steps:

  1. Installed Selenium: http://docs.seleniumhq.org/
  2. Started the Selenium hub as a service
  3. Downloaded PhantomJS (a headless browser that can execute the JavaScript): http://phantomjs.org/
  4. Started PhantomJS in WebDriver mode, pointing to the Selenium hub
  5. In my scraping application, installed the WebDriver client NuGet package: Install-Package Selenium.WebDriver

Here is an example usage of the PhantomJS WebDriver:

using System;
using OpenQA.Selenium.PhantomJS;
using OpenQA.Selenium.Remote;

var options = new PhantomJSOptions();
options.AddAdditionalCapability("IsJavaScriptEnabled", true);

// Configuration.SeleniumServerHub holds the URL of your Selenium hub.
var driver = new RemoteWebDriver(
    new Uri(Configuration.SeleniumServerHub),
    options.ToCapabilities(),
    TimeSpan.FromSeconds(3));

// Setting Url navigates to the page and executes its JavaScript.
driver.Url = "http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083";

// The driver can now provide you with what you need.
// Get the source of the page:
var source = driver.PageSource;
// Fully navigate the DOM:
var pathElement = driver.FindElementById("some-id");

More info on Selenium, PhantomJS, and WebDriver can be found at the following links:

http://docs.seleniumhq.org/

http://docs.seleniumhq.org/projects/webdriver/

http://phantomjs.org/

EDIT: Easier Method

It appears there is a NuGet package for PhantomJS, so you don't need the hub (I used a cluster to do massive scraping in this manner):

Install web driver:

Install-Package Selenium.WebDriver

Install embedded exe:

Install-Package phantomjs.exe

Updated code:

// PhantomJSDriver finds the embedded phantomjs.exe on its own; no hub needed.
var driver = new PhantomJSDriver();

// Setting Url navigates to the page and executes its JavaScript.
driver.Url = "http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083";

// Get the source of the page:
var source = driver.PageSource;
// Fully navigate the DOM:
var pathElement = driver.FindElementById("some-id");
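
Because the content is loaded via AJAX, the element you want may not exist the instant navigation returns. Here is a minimal sketch of an explicit wait, assuming the Selenium.Support NuGet package is installed; "some-id" is a placeholder:

using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

// Poll for up to 10 seconds until the AJAX-rendered element appears.
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
var element = wait.Until(d => d.FindElement(By.Id("some-id")));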

Scraping website that is generated by JavaScript in C#

So the whole "magic" of the test is made in app-ea56f7.js file.
This file is sending request and receiving chunks of data from netflix. Unfortunately, as referenced in Running Scripts in HtmlAgilityPack there is no direct way of having this without using a headless browser.

Either use https://www.npmjs.com/package/speedtest-net

How can I scrape a table that is created with JavaScript in C#

F12 is your friend in any browser.

Select the Network tab and you'll notice that all of the info is in this file:

https://www.belastingdienst.nl/data/douane_wisselkoersen/wks.douane.wisselkoersen.dd201806.xml

(I suppose the data for July 2018 will be held in a URL named *.dd201807.xml.)

Using C#, you will need to do a GET for that URL and parse it as XML; there's no need for HtmlAgilityPack. You will need to concatenate the current year and month to pick the right URL.
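
Here is a minimal sketch of that, assuming the file name always follows the dd<yyyyMM>.xml pattern; the element names inside the feed aren't shown here, so dump the document and inspect it before writing queries:

using System;
using System.Xml.Linq;

// Build the URL for the current year and month (e.g. dd201806.xml for June 2018).
string period = DateTime.Now.ToString("yyyyMM");
string url = "https://www.belastingdienst.nl/data/douane_wisselkoersen/"
           + "wks.douane.wisselkoersen.dd" + period + ".xml";

// XDocument.Load accepts a URL and performs the GET for you.
XDocument doc = XDocument.Load(url);

// Print the document to discover its structure before querying specific nodes.
Console.WriteLine(doc.Root);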

Leuker kan ik het niet maken! ("I can't make it any more fun!", the Belastingdienst's slogan.)

C# .NET: Scraping dynamic (JS) websites

If you need to scrape a website, you can use the ScrapySharp scraping framework. You can add it to a project as a NuGet package:
https://www.nuget.org/packages/ScrapySharp/

Install-Package ScrapySharp -Version 2.6.2

It has many useful properties for accessing different elements on the page. For example, to access the entire HTML of the page you can use the following:

using System;
using HtmlAgilityPack;
using ScrapySharp.Network;

ScrapingBrowser browser = new ScrapingBrowser();
WebPage pageResult = browser.NavigateToPage(new Uri("http://www.example-site.com"));
HtmlNode rawHtml = pageResult.Html;
Console.WriteLine(rawHtml.InnerHtml);
Console.ReadLine();
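
ScrapySharp also ships CSS-selector extensions on top of HtmlAgilityPack nodes, which are handy once you have the page. A small sketch; the ".headline" selector is a placeholder for whatever you are targeting:

using ScrapySharp.Extensions; // provides the CssSelect extension method

// Query nodes with a CSS selector instead of walking the DOM manually.
foreach (HtmlNode node in rawHtml.CssSelect(".headline"))
{
    Console.WriteLine(node.InnerText.Trim());
}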

Scraping data dynamically generated by JavaScript in html document using C#

When you make the WebRequest, you're asking the server to give you the page file; this file's content hasn't yet been parsed/executed by a web browser, so the JavaScript on it hasn't done anything yet.

You need to use a tool that executes the JavaScript on the page if you want to see what the page looks like after being parsed by a browser. One option is the built-in .NET web browser control: http://msdn.microsoft.com/en-au/library/aa752040(v=vs.85).aspx

The web browser control can navigate to and load the page, and then you can query its DOM, which will have been altered by the JavaScript on the page.

EDIT (example):

// Assumes a Windows Forms form that contains a WebBrowser control named
// webBrowserControl.
Uri uri = new Uri("http://www.somewebsite.com/somepage.htm");

webBrowserControl.AllowNavigation = true;
// Optional, but it stops JavaScript errors from breaking your scraper.
webBrowserControl.ScriptErrorsSuppressed = true;
// Start scraping only after the document has finished loading, so do it in
// the handler you pass here.
webBrowserControl.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowserControl_DocumentCompleted);
webBrowserControl.Navigate(uri);

private void webBrowserControl_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    HtmlElementCollection divs = webBrowserControl.Document.GetElementsByTagName("div");

    foreach (HtmlElement div in divs)
    {
        // Do something with each div.
    }
}

Scrape data from web page with HtmlAgilityPack c#

Try this:

using System.IO;
using System.Net;
using System.Text;

public static string Download(string search)
{
    var request = (HttpWebRequest)WebRequest.Create("https://webportal.thpa.gr/ctreport/container/track");

    // The field names are URL-encoded: report_container[containerno] and report_container[search].
    var postData = string.Format("report_container%5Bcontainerno%5D={0}&report_container%5Bsearch%5D=", search);
    var data = Encoding.ASCII.GetBytes(postData);

    request.Method = "POST";
    request.ContentType = "application/x-www-form-urlencoded";
    request.ContentLength = data.Length;

    using (var stream = request.GetRequestStream())
    {
        stream.Write(data, 0, data.Length);
    }

    using (var response = (HttpWebResponse)request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        return reader.ReadToEnd();
    }
}

Usage:

var html = Download("ARKU2215462");

UPDATE

To find the POST parameters to use, press F12 in the browser to show the dev tools, then select the Network tab. Now fill the search input with your ARKU2215462 and press the button.

That makes a request to the server to get the response. You can inspect both the request and the response. There are lots of requests (styles, scripts, images...), but you want the HTML pages. In this case, look at this:

[Screenshot: the POST request's form data in the Network tab]

This is the form data of the request. If you click "view source", you get the data encoded as "report_container%5Bcontainerno%5D=ARKU2215462&report_container%5Bsearch%5D=", just as you need it in your code.
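
If you'd rather not hard-code the percent-encoding, the same body can be built from the raw field names; a small sketch using Uri.EscapeDataString:

// %5B and %5D are simply the URL-encoded brackets [ and ].
string postData = string.Format("{0}={1}&{2}=",
    Uri.EscapeDataString("report_container[containerno]"),
    Uri.EscapeDataString("ARKU2215462"),
    Uri.EscapeDataString("report_container[search]"));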

Scraping data dynamically generated by JavaScript

I would recommend hitting this URL directly:

http://members.tsetmc.com/tsev2/data/InstTradeHistory.aspx?i=67126881188552864&Top=999999&A=0

Then parse the values: this single response contains the values from all 209 pages, so it also saves you the trouble of paging from 1 through 209.

Parsing looks easy: splitting by ; gives you the rows, and then splitting by @ gives you the values per column, as in the sketch below.
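
A minimal sketch; the row/column layout (rows separated by ';', columns by '@') is assumed from the description above:

using System;
using System.Net;

string url = "http://members.tsetmc.com/tsev2/data/InstTradeHistory.aspx"
           + "?i=67126881188552864&Top=999999&A=0";

// Download the raw payload.
string payload;
using (var client = new WebClient())
{
    payload = client.DownloadString(url);
}

// Split into rows, then into columns.
foreach (string row in payload.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
{
    string[] columns = row.Split('@');
    Console.WriteLine(string.Join(" | ", columns));
}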

Wait for javascript to load page in c#

I hope the following code helps you.

Create a helper method that waits for a given number of seconds:

public void Wait(int sec)
{
    if (sec <= 0) return;

    var timer1 = new System.Windows.Forms.Timer();
    timer1.Interval = sec * 1000;
    timer1.Enabled = true;
    timer1.Start();
    timer1.Tick += (s, e) =>
    {
        timer1.Enabled = false;
        timer1.Stop();
    };

    // Pump the message loop until the timer fires, so the UI stays responsive.
    while (timer1.Enabled)
    {
        Application.DoEvents();
    }
}

Write the following code in the DocumentCompleted event. Check whether the element exists; if it is null, wait 2 seconds and try again, up to 30 times (roughly one minute). If the element still has not loaded, throw an exception or display a message such as "page not loaded".

int cnt = 0;
HtmlElement htmlElement = WebBrowser1.Document.GetElementById("elementID");
while (htmlElement == null)
{
    Wait(2);
    cnt++;
    htmlElement = WebBrowser1.Document.GetElementById("elementID");
    if (cnt > 30)
    {
        // Roughly one minute has passed; give up.
        throw new Exception("Page not loaded");
    }
}

