How to Parse a URL in C++ Cross Platform

Easy way to parse a URL in C++ cross platform?

There is a library, proposed for Boost inclusion, that lets you parse HTTP URIs easily. It uses Boost.Spirit and is released under the Boost Software License. The library is cpp-netlib; its documentation is at http://cpp-netlib.github.com/ and the latest release can be downloaded from http://github.com/cpp-netlib/cpp-netlib/downloads .

The relevant type you'll want to use is boost::network::http::uri, which is described in the cpp-netlib documentation.

Parse URL query part

"?" starts the query part in any URI (not only http(s)).

If you need a literal "?" in the path, you'll need to percent-escape it as %3F.
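
As a rough illustration of the rule above, here is a minimal sketch that pulls the query part out of a URI string using only the standard library. The function name query_of is made up for this example and does not come from any library. It assumes the generic URI syntax, where the fragment starts at the first "#" and the query starts at the first "?" before that.

```cpp
#include <string>

// Hypothetical helper (not from any library): returns the query part of a
// URI without the leading '?', or an empty string if there is no query.
std::string query_of(const std::string& uri) {
    // The fragment begins at the first '#'; nothing after it is query.
    const std::string::size_type hash = uri.find('#');
    const std::string before_fragment = uri.substr(0, hash);

    // The query begins at the first '?' that precedes the fragment.
    const std::string::size_type qmark = before_fragment.find('?');
    if (qmark == std::string::npos) return std::string();
    return before_fragment.substr(qmark + 1);
}
```

Note that a percent-escaped "%3F" in the path is left alone, because only a literal "?" is treated as the delimiter.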

Cross-language HTTP parser?

Several reasons:

  • C is a product of the early 1970s, when systems tended to be monolithic and network-centric architectures were somewhat rare. It was created primarily to implement the Unix operating system. And it has precious little language-level support for much of anything - no native networking, graphics, sound, or much else. That’s why it’s as portable as it is - the language definition makes relatively few demands of the underlying platform. The group that maintains the C standard tends to be conservative about adding features.

  • HTTP is one protocol of many - telnet, SMTP, NNTP, FTP, SSH, etc., all of which are or have been as widely used as HTTP at some point. 30 years ago a good case could have been made for making telnet or FTP support native (which would have required a native TCP/IP stack as well). Now it’s HTTP and HTTPS, which would require a native SSL implementation.

Paradigms (and protocols) come and go, but legacy code is forever. Making protocols part of the language makes the language bigger and harder to maintain. New protocols get created, old protocols fall out of favor or are deprecated, leading to more maintenance issues. Each time a protocol is updated you’d need a compiler update (or at least a standard library update).

Life is just easier if all of that is kept separate from the language itself.

As for why there are so many different implementations...

  • Different platforms have different APIs - at some point you have to have a system-specific implementation;

  • Different people have different requirements for usability, capability, scalability, and security. A lightweight implementation that may work just fine for individual use may fall down under load;

  • Somebody may just not be aware of an existing implementation and roll their own;

  • And, finally, there’s no referee; standards exist, and groups that maintain and enforce those standards exist, but there’s no one who officially blesses a particular implementation.

How to parse a URI elegantly in Casablanca

It seems that there are static helper functions in the URI class, e.g. uri::split_query and uri::split_path, that do exactly what was requested.

I found references to them in a gist that uses:

auto http_get_vars = uri::split_query(request.request_uri().query());
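
To give an idea of the kind of result such a call produces, here is a standard-library-only sketch that splits a query string into key/value pairs. It is a stand-in and not the Casablanca implementation: the name split_query_sketch is hypothetical, no percent-decoding is done, and the real library may treat edge cases (such as keys without a value) differently.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in, not the Casablanca code: splits "a=1&b=2" into
// a map {"a" -> "1", "b" -> "2"}. No percent-decoding is performed.
std::map<std::string, std::string> split_query_sketch(const std::string& query) {
    std::map<std::string, std::string> vars;
    std::string::size_type start = 0;
    while (start <= query.size()) {
        // Each pair ends at the next '&' or at the end of the string.
        std::string::size_type end = query.find('&', start);
        if (end == std::string::npos) end = query.size();
        const std::string pair = query.substr(start, end - start);
        if (!pair.empty()) {
            // In this sketch, a pair without '=' becomes a key with an empty value.
            const std::string::size_type eq = pair.find('=');
            if (eq == std::string::npos) vars[pair] = "";
            else vars[pair.substr(0, eq)] = pair.substr(eq + 1);
        }
        start = end + 1;
    }
    return vars;
}
```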

Read XML from URL with C not C++ in Win32 application

Libcurl has a pure C API. Expat and libxml are written in pure C too.

How can I use a regular expression to parse a generic, complex URL?

Parsing URLs with Regular Expressions

You can parse a URL/URI with regular expressions.

An advanced example URL looks like:

http://login:password@www.example.org:80/demo/example.cgi?lang=de&foo=bar&empty#position

A regular expression for parsing that advanced URL is something like:

([^ :]*):\/\/(?:([^:]*):([^@]*)@|)([^/:]{1,}):?(\d*)?(\/[^? ]*)\??((?:[^=&# ]*=?[^&# ]*&?)*)#?([^ ]*)?

Yep, it's crazy. But you can obtain the following fields from it (groups):

#1 Protocol, #2 Login, #3 Password, #4 Host name, #5 Port, #6 Path, #7 Query, #8 Fragment
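
In C++ the same pattern could be applied with std::regex, roughly as sketched below. The UrlParts struct and the parse_url function are names invented for this example. The "\/" escapes are dropped because std::regex patterns have no delimiter, so a plain "/" is already a literal; the pattern is otherwise unchanged.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for this sketch; fields follow groups #1 to #8.
struct UrlParts {
    std::string protocol, login, password, host, port, path, query, fragment;
};

// Returns true and fills `out` if the whole string matches the pattern.
bool parse_url(const std::string& url, UrlParts& out) {
    // Same pattern as above, written as a raw string without "\/" escapes.
    static const std::regex pattern(
        R"re(([^ :]*)://(?:([^:]*):([^@]*)@|)([^/:]{1,}):?(\d*)?(/[^? ]*)\??((?:[^=&# ]*=?[^&# ]*&?)*)#?([^ ]*)?)re");
    std::smatch m;
    if (!std::regex_match(url, m, pattern)) return false;
    // Groups that did not participate in the match yield empty strings.
    out = UrlParts{m[1].str(), m[2].str(), m[3].str(), m[4].str(),
                   m[5].str(), m[6].str(), m[7].str(), m[8].str()};
    return true;
}
```

Called on the example URL above, this should fill host with "www.example.org" and port with "80". Because the pattern requires a path starting with "/", a URL such as "http://example.org" with no trailing slash is not accepted by this sketch.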

Let's say you have some URL and want to know only the host name: