Requesting HTML over HTTPS with C# WebClient

How can I use WebClient.DownloadString from a secure URL (https)?

If you look at the headers in Fiddler, the response is GZip-encoded (compressed). See this answer for how to deal with this, since there's no "quick and easy" way with the WebClient class.
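One common way to handle a compressed response, which may or may not be what the linked answer does, is to subclass WebClient and switch on automatic decompression for the underlying request. The sketch below assumes the classic System.Net stack; the class name DecompressingWebClient and the helper EnableDecompression are made up for illustration.

```csharp
using System;
using System.Net;

// Hypothetical helper class; the name is an assumption, not a framework type.
public class DecompressingWebClient : WebClient
{
    // Turns on transparent GZip/Deflate decoding for HTTP(S) requests.
    // Non-HTTP requests (e.g. file://) are left untouched.
    public static void EnableDecompression(WebRequest request)
    {
        var http = request as HttpWebRequest;
        if (http != null)
        {
            http.AutomaticDecompression =
                DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }
    }

    // WebClient calls this for every request it creates.
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        EnableDecompression(request);
        return request;
    }
}
```

With this in place, new DecompressingWebClient().DownloadString(url) should return the decoded HTML rather than raw compressed bytes, for an https URL as well as an http one.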

Can't download webpage via C# WebClient and via request/response

Some pages load in stages. First they load the core of the page, and only then do they run the JavaScript inside it, which loads further content via AJAX. To scrape these pages you will need a more advanced scraping library than a simple HTTP request sender.

EDIT:
Here is a question on SO about the same problem that you are having now:
Jquery Ajax Web page scraping using c#

Get html that is generated via AJAX in webclient

The general approach is this:

  1. Using a tool like Fiddler, find out which HTTP requests the browser makes in order to fetch the data you're looking for.
  2. Use WebClient to make the same HTTP request(s) yourself.

Take a look at my answer to this question for more details about HTML screen scraping and how to work around various issues you may run across.

For #1 above, here's how to use Fiddler to understand how a specific request is being made:

First, find the request you care about (the request which contains the data you want in its response). You can do this by double-clicking each request in the left pane in Fiddler and looking inside the "TextView" tab in the lower-right pane. You can also use CTRL+F to find content across multiple requests, but some responses are compressed, so make sure the "autodecode" button is selected in the toolbar before making your requests if you want to be able to text-search across all of them.

Once you've found the request you want, double-click it in Fiddler and select the "headers" tab in the upper-right pane. Those are the headers being sent. If your client sends exactly these headers to the server, you should get back the same data. Usually not all of the headers are needed, though, so you'll want to figure out which ones are. You do this using Fiddler's Request Builder tab in the upper-right pane. Select that tab and drag your data request over from the left pane onto the request builder. Then submit the request to validate that it returns the correct results. Next, start deleting headers, one at a time. If the request stops working after you delete a header, you know that header was required, so put it back. Repeat this for each header until you know which ones are required.
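The one-header-at-a-time elimination described above can also be expressed in code. This is only a sketch of the idea: HeaderMinimizer, FindRequiredHeaders and the requestWorks callback are hypothetical names, and the callback stands in for "send the request with these headers and check whether the response still contains the data you want".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; not part of Fiddler or the .NET framework.
public static class HeaderMinimizer
{
    // Tries removing each header in turn. A header is kept only if the
    // request stops working without it.
    public static Dictionary<string, string> FindRequiredHeaders(
        IDictionary<string, string> headers,
        Func<IDictionary<string, string>, bool> requestWorks)
    {
        // Work on a copy so the caller's dictionary is not modified.
        var remaining = new Dictionary<string, string>(headers);

        // Iterate over a snapshot of the names, since 'remaining' changes.
        foreach (string name in new List<string>(headers.Keys))
        {
            string value = remaining[name];
            remaining.Remove(name);

            if (!requestWorks(remaining))
            {
                // The request broke without this header, so put it back.
                remaining[name] = value;
            }
        }
        return remaining;
    }
}
```

Like the manual process, this assumes each header matters independently of the others; if two headers are only required in combination, the result depends on the order in which they are tried.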

Then, you'll need to write code to generate the right headers. Don't worry about the Host: header; that's generated automatically for you. For the Cookie: header, you'll need to generate it using the CookieContainer class. For the other headers (e.g. User-Agent:, Accept:, etc.) you can generally copy them and add them to your request as-is.
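As a rough sketch of what that code might look like, the example below uses HttpWebRequest directly, because that is the class that exposes a CookieContainer property. The class name AjaxFetcher, the header values and the X-Requested-With header are assumptions for illustration; substitute whatever Fiddler showed to be required for your request.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper; names and header values are illustrative only.
public static class AjaxFetcher
{
    // Builds the request without sending it. Host: is filled in automatically.
    public static HttpWebRequest BuildRequest(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);

        // Cookies collected from earlier responses are sent as the Cookie: header.
        request.CookieContainer = cookies;

        // User-Agent and Accept are set through dedicated properties.
        request.UserAgent = "Mozilla/5.0 (compatible; ExampleScraper/1.0)";
        request.Accept = "application/json, text/javascript, */*; q=0.01";

        // Other headers copied from Fiddler can go into the Headers collection.
        request.Headers["X-Requested-With"] = "XMLHttpRequest";
        return request;
    }

    // Sends the request and returns the response body as text.
    public static string Fetch(HttpWebRequest request)
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Reusing the same CookieContainer instance across requests lets cookies set by one response (a login page, for example) flow into the next request.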


