How to dynamically generate HTML code using .NET's WebBrowser or mshtml.HTMLDocument?
I'd like to contribute some code to Alexei's answer. A few points:
Strictly speaking, it may not always be possible to determine with 100% certainty when the page has finished rendering. Some pages are quite complex and use continuous AJAX updates. But we can get quite close by polling the page's current HTML snapshot for changes and checking the WebBrowser.IsBusy property. That's what LoadDynamicPage does below.
Some time-out logic has to be present on top of the above, in case the page rendering never ends (note CancellationTokenSource).
Async/await is a great tool for coding this, as it gives a linear code flow to our asynchronous polling logic, which greatly simplifies it.
It's important to enable HTML5 rendering using Browser Feature Control, as WebBrowser runs in IE7 emulation mode by default. That's what SetFeatureBrowserEmulation does below.
This is a WinForms app, but the concept can be easily converted into a console app.
This logic works well on the URL you've specifically mentioned: https://www.google.com/#q=where+am+i.
using Microsoft.Win32;
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace WbFetchPage
{
    public partial class MainForm : Form
    {
        public MainForm()
        {
            // must run before the WebBrowser control is created
            SetFeatureBrowserEmulation();
            InitializeComponent();
            this.Load += MainForm_Load;
        }

        // start the task
        async void MainForm_Load(object sender, EventArgs e)
        {
            try
            {
                var cts = new CancellationTokenSource(10000); // cancel in 10s
                var html = await LoadDynamicPage("https://www.google.com/#q=where+am+i", cts.Token);
                // it's too long, show only the first 1024 characters (or fewer if the page is shorter)
                MessageBox.Show(html.Substring(0, Math.Min(html.Length, 1024)) + "...");
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message);
            }
        }

        // navigate and download
        async Task<string> LoadDynamicPage(string url, CancellationToken token)
        {
            // navigate and await DocumentCompleted
            var tcs = new TaskCompletionSource<bool>();
            WebBrowserDocumentCompletedEventHandler handler = (s, arg) =>
                tcs.TrySetResult(true);

            using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
            {
                this.webBrowser.DocumentCompleted += handler;
                try
                {
                    this.webBrowser.Navigate(url);
                    await tcs.Task; // wait for DocumentCompleted
                }
                finally
                {
                    this.webBrowser.DocumentCompleted -= handler;
                }
            }

            // get the root element
            var documentElement = this.webBrowser.Document.GetElementsByTagName("html")[0];

            // poll the current HTML for changes asynchronously
            var html = documentElement.OuterHtml;
            while (true)
            {
                // wait asynchronously, this will throw if cancellation requested
                await Task.Delay(500, token);

                // continue polling if the WebBrowser is still busy
                if (this.webBrowser.IsBusy)
                    continue;

                var htmlNow = documentElement.OuterHtml;
                if (html == htmlNow)
                    break; // no changes detected, end the poll loop

                html = htmlNow;
            }

            // consider the page fully rendered
            token.ThrowIfCancellationRequested();
            return html;
        }

        // enable HTML5 (assuming we're running IE10+)
        // more info: https://stackoverflow.com/a/18333982/1768303
        static void SetFeatureBrowserEmulation()
        {
            // do nothing when running inside the Visual Studio designer
            if (LicenseManager.UsageMode != LicenseUsageMode.Runtime)
                return;
            var appName = System.IO.Path.GetFileName(System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName);
            // 10000 selects IE10 rendering mode for this executable
            Registry.SetValue(@"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
                appName, 10000, RegistryValueKind.DWord);
        }
    }
}
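The poll-until-stable loop inside LoadDynamicPage can also be factored out so that it can be exercised without a browser. What follows is only a minimal sketch and not part of the answer's code: StablePoller, WaitUntilStableAsync, and its parameter names are hypothetical, and the snapshot and busy checks are passed in as delegates.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class StablePoller
{
    // Hypothetical helper: polls getSnapshot until two consecutive snapshots
    // are equal while isBusy() reports false, then returns that snapshot.
    public static async Task<string> WaitUntilStableAsync(
        Func<string> getSnapshot,
        Func<bool> isBusy,
        TimeSpan interval,
        CancellationToken token)
    {
        var last = getSnapshot();
        while (true)
        {
            // Throws an OperationCanceledException once the token is cancelled,
            // which is what ends a never-ending page.
            await Task.Delay(interval, token);

            // Skip the comparison while the source is still busy.
            if (isBusy())
                continue;

            var current = getSnapshot();
            if (current == last)
                return current; // no change since the previous poll

            last = current;
        }
    }
}
```

Inside the form, a call might look like StablePoller.WaitUntilStableAsync(() => documentElement.OuterHtml, () => this.webBrowser.IsBusy, TimeSpan.FromMilliseconds(500), token). When awaited from the UI thread, the continuations resume on that thread, so the delegates can safely touch the WebBrowser control.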
How to get fully loaded HTML page's code
Have you heard about http://webkitdotnet.sourceforge.net/?
Moreover, .NET has a WebBrowser component that can be used for this.
C# HtmlDocument full HTML
Changing
    Microsoft.Win32.Registry.SetValue(@"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
        appName, 9000, Microsoft.Win32.RegistryValueKind.DWord);
to
    Microsoft.Win32.Registry.SetValue(@"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
        appName, 11000, Microsoft.Win32.RegistryValueKind.DWord);
actually solved the problem (I started getting the style tags). It probably has to do with my running IE 11: the value 11000 selects IE11 rendering mode, whereas 9000 selects IE9 mode.
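Since the right FEATURE_BROWSER_EMULATION value depends on the installed Internet Explorer version, the choice can be made in code rather than hard-coded. Here is a minimal sketch of that mapping; BrowserEmulation and GetEmulationValue are hypothetical names, and the IE major version is assumed to be supplied by the caller (detecting it is outside this sketch).

```csharp
using System;

public static class BrowserEmulation
{
    // Hypothetical helper: maps an Internet Explorer major version to the
    // FEATURE_BROWSER_EMULATION DWORD that selects the matching standards mode.
    public static int GetEmulationValue(int ieMajorVersion)
    {
        if (ieMajorVersion >= 11)
            return 11000; // IE11 mode

        switch (ieMajorVersion)
        {
            case 10: return 10000; // IE10 mode
            case 9:  return 9000;  // IE9 mode
            case 8:  return 8000;  // IE8 mode
            default: return 7000;  // IE7 mode, the WebBrowser default
        }
    }
}
```

The returned value would then be passed to Registry.SetValue in place of the literal 9000 or 11000.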
Web scraping using WebBrowser in a class library
Here is what I tested in a web application, and it worked properly.
It runs a WebBrowser control on a separate STA thread and returns a Task<string> that completes once the browser content has loaded completely:
using System;
using System.Threading.Tasks;
using System.Threading;
using System.Windows.Forms;

public class BrowserBasedWebScraper
{
    public static Task<string> LoadUrl(string url)
    {
        var tcs = new TaskCompletionSource<string>();
        Thread thread = new Thread(() => {
            try {
                Func<string> f = () => {
                    using (WebBrowser browser = new WebBrowser())
                    {
                        browser.ScriptErrorsSuppressed = true;
                        browser.Navigate(url);
                        // pump messages until the document is fully loaded
                        while (browser.ReadyState != WebBrowserReadyState.Complete)
                        {
                            System.Windows.Forms.Application.DoEvents();
                        }
                        return browser.DocumentText;
                    }
                };
                tcs.SetResult(f());
            }
            catch (Exception e) {
                tcs.SetException(e);
            }
        });
        // the WebBrowser control requires a single-threaded apartment
        thread.SetApartmentState(ApartmentState.STA);
        thread.IsBackground = true;
        thread.Start();
        return tcs.Task;
    }
}
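LoadUrl as written has no time-out, so a page that never reaches ReadyState.Complete leaves the returned task pending forever. One possible way to bound the wait on the caller's side is sketched below; TaskTimeoutExtensions and WithTimeout are hypothetical names, not part of the answer above.

```csharp
using System;
using System.Threading.Tasks;

public static class TaskTimeoutExtensions
{
    // Hypothetical helper: returns the task's result, or throws a
    // TimeoutException if the task has not finished within the timeout.
    public static async Task<T> WithTimeout<T>(this Task<T> task, TimeSpan timeout)
    {
        var delay = Task.Delay(timeout);
        var finished = await Task.WhenAny(task, delay);

        if (finished != task)
            throw new TimeoutException("The operation did not complete within " + timeout + ".");

        // The task has already completed; awaiting it yields the result
        // or rethrows the exception it failed with.
        return await task;
    }
}
```

A call could then read: string html = await BrowserBasedWebScraper.LoadUrl(url).WithTimeout(TimeSpan.FromSeconds(10)). Note that this only stops the caller from waiting; the background thread keeps running until the page loads or the process exits.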