How to Dynamically Generate HTML Code Using .NET's WebBrowser or mshtml.HTMLDocument

How to dynamically generate HTML code using .NET's WebBrowser or mshtml.HTMLDocument?

I'd like to contribute some code to Alexei's answer. A few points:

  • Strictly speaking, it may not always be possible to determine with
    certainty when the page has finished rendering. Some pages are quite
    complex and use continuous AJAX updates. But we can get quite close by
    polling the page's current HTML snapshot for changes and checking the
    WebBrowser.IsBusy property. That's what LoadDynamicPage does below.

  • Some time-out logic has to be present on top of the above, in case the
    page rendering never ends (note the CancellationTokenSource).

  • Async/await is a great tool for coding this, as it gives our
    asynchronous polling logic a linear code flow, which greatly
    simplifies it.

  • It's important to enable HTML5 rendering using Browser Feature
    Control, as WebBrowser runs in IE7 emulation mode by default.
    That's what SetFeatureBrowserEmulation does below.

  • This is a WinForms app, but the concept can be easily converted into
    a console app.

  • This logic works well on the URL you've specifically mentioned:
    https://www.google.com/#q=where+am+i.

using Microsoft.Win32;
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace WbFetchPage
{
    public partial class MainForm : Form
    {
        public MainForm()
        {
            // must run before the WebBrowser control is created
            SetFeatureBrowserEmulation();
            InitializeComponent();
            this.Load += MainForm_Load;
        }

        // start the task
        async void MainForm_Load(object sender, EventArgs e)
        {
            try
            {
                using (var cts = new CancellationTokenSource(10000)) // cancel in 10s
                {
                    var html = await LoadDynamicPage("https://www.google.com/#q=where+am+i", cts.Token);
                    // it's too long, so show at most the first 1024 characters
                    MessageBox.Show(html.Substring(0, Math.Min(html.Length, 1024)) + "...");
                }
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message);
            }
        }

        // navigate and download
        async Task<string> LoadDynamicPage(string url, CancellationToken token)
        {
            // navigate and await DocumentCompleted
            var tcs = new TaskCompletionSource<bool>();
            WebBrowserDocumentCompletedEventHandler handler = (s, arg) =>
                tcs.TrySetResult(true);

            using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
            {
                this.webBrowser.DocumentCompleted += handler;
                try
                {
                    this.webBrowser.Navigate(url);
                    await tcs.Task; // wait for DocumentCompleted
                }
                finally
                {
                    this.webBrowser.DocumentCompleted -= handler;
                }
            }

            // get the root element
            var documentElement = this.webBrowser.Document.GetElementsByTagName("html")[0];

            // poll the current HTML for changes asynchronously
            var html = documentElement.OuterHtml;
            while (true)
            {
                // wait asynchronously, this will throw if cancellation requested
                await Task.Delay(500, token);

                // continue polling if the WebBrowser is still busy
                if (this.webBrowser.IsBusy)
                    continue;

                var htmlNow = documentElement.OuterHtml;
                if (html == htmlNow)
                    break; // no changes detected, end the poll loop

                html = htmlNow;
            }

            // consider the page fully rendered
            token.ThrowIfCancellationRequested();
            return html;
        }

        // enable HTML5 (assuming we're running IE10+)
        // more info: https://stackoverflow.com/a/18333982/1768303
        static void SetFeatureBrowserEmulation()
        {
            // do nothing when the form is opened in the Visual Studio designer
            if (LicenseManager.UsageMode != LicenseUsageMode.Runtime)
                return;
            var appName = System.IO.Path.GetFileName(System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName);
            // 10000 selects IE10 mode for this executable
            Registry.SetValue(@"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
                appName, 10000, RegistryValueKind.DWord);
        }
    }
}
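
The polling loop in LoadDynamicPage does not depend on WebBrowser itself, so it can be pulled out and exercised on its own. The following is only a minimal sketch of that idea: SnapshotPoller and WaitUntilStableAsync are hypothetical names, not part of the original answer, and the snapshot and busy checks are passed in as delegates.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of the original answer.
public static class SnapshotPoller
{
    // Polls getSnapshot every `interval` until two consecutive snapshots are
    // equal while isBusy() returns false. Throws OperationCanceledException
    // (via Task.Delay) when the token is cancelled.
    public static async Task<string> WaitUntilStableAsync(
        Func<string> getSnapshot,
        Func<bool> isBusy,
        TimeSpan interval,
        CancellationToken token)
    {
        var last = getSnapshot();
        while (true)
        {
            // wait asynchronously; this throws if cancellation is requested
            await Task.Delay(interval, token);

            if (isBusy())
                continue; // still loading, do not compare yet

            var current = getSnapshot();
            if (current == last)
                return current; // nothing changed between two polls

            last = current;
        }
    }
}
```

Inside the form, it could be called as SnapshotPoller.WaitUntilStableAsync(() => documentElement.OuterHtml, () => this.webBrowser.IsBusy, TimeSpan.FromMilliseconds(500), token). When awaited on the UI thread, the continuations resume on that thread, so the delegates can touch the WebBrowser safely.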

How to get fully loaded HTML page's code

Have you heard about http://webkitdotnet.sourceforge.net/?
Moreover, .NET has a WebBrowser component that can be used for

C# HtmlDocument full HTML

Changing

Microsoft.Win32.Registry.SetValue(@"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
    appName, 9000, Microsoft.Win32.RegistryValueKind.DWord);

to

Microsoft.Win32.Registry.SetValue(@"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
    appName, 11000, Microsoft.Win32.RegistryValueKind.DWord);

actually solved the problem (I started getting the style tags). This is probably because I am running IE 11: the value 11000 selects IE11 mode, whereas 9000 selects IE9 mode.
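
Instead of hard-coding 9000 or 11000, the value could be derived from the Internet Explorer version that is actually installed. The sketch below only shows the mapping; BrowserEmulation and ValueForIeMajorVersion are hypothetical names, and how the major version is obtained (for example from the WebBrowser.Version property) is left to the caller as an assumption.

```csharp
using System;

// Hypothetical helper for illustration; not part of the original answers.
public static class BrowserEmulation
{
    // Maps an Internet Explorer major version to the matching
    // FEATURE_BROWSER_EMULATION DWORD value.
    public static int ValueForIeMajorVersion(int ieMajorVersion)
    {
        if (ieMajorVersion >= 11)
            return 11000; // IE11 mode

        switch (ieMajorVersion)
        {
            case 10: return 10000; // IE10 mode
            case 9: return 9000;   // IE9 mode
            case 8: return 8000;   // IE8 mode
            default: return 7000;  // IE7 mode, the WebBrowser default
        }
    }
}
```

The result would then be passed to Registry.SetValue in place of the literal, e.g. BrowserEmulation.ValueForIeMajorVersion(11) in place of 11000.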

Web scraping using WebBrowser in a class library

Here is what I tested in a web application; it worked properly.

It runs a WebBrowser control on another thread and returns a Task<string> that completes when the browser content has loaded completely:

using System;
using System.Threading.Tasks;
using System.Threading;
using System.Windows.Forms;

public class BrowserBasedWebScraper
{
    public static Task<string> LoadUrl(string url)
    {
        var tcs = new TaskCompletionSource<string>();
        Thread thread = new Thread(() =>
        {
            try
            {
                string html;
                using (WebBrowser browser = new WebBrowser())
                {
                    browser.ScriptErrorsSuppressed = true;
                    browser.Navigate(url);
                    // pump window messages until the document has loaded
                    while (browser.ReadyState != WebBrowserReadyState.Complete)
                    {
                        System.Windows.Forms.Application.DoEvents();
                    }
                    html = browser.DocumentText;
                }
                tcs.SetResult(html);
            }
            catch (Exception e)
            {
                // surface any failure through the returned task
                tcs.SetException(e);
            }
        });
        // WebBrowser is an ActiveX wrapper and requires an STA thread
        thread.SetApartmentState(ApartmentState.STA);
        thread.IsBackground = true;
        thread.Start();
        return tcs.Task;
    }
}
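
LoadUrl has no time-out, so a page that never reaches the Complete state leaves the returned task pending forever. One possible remedy is a generic wrapper around the task, sketched below. TaskTimeout and WithTimeout are hypothetical names, not part of the original answer, and the wrapper only stops waiting; it does not stop the background browser thread.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of the original answer.
public static class TaskTimeout
{
    // Returns the task's result, or throws TimeoutException if the task does
    // not complete within `timeout`. The underlying work keeps running.
    public static async Task<T> WithTimeout<T>(Task<T> task, TimeSpan timeout)
    {
        var delay = Task.Delay(timeout);
        var first = await Task.WhenAny(task, delay);
        if (first != task)
            throw new TimeoutException("The operation did not complete within " + timeout + ".");

        return await task; // propagates the result or the original exception
    }
}
```

A call might look like: var html = await TaskTimeout.WithTimeout(BrowserBasedWebScraper.LoadUrl(url), TimeSpan.FromSeconds(10));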

