Encode/Decode URLs in C++

Encode/Decode URL in C++

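The two functions below rely on lookup tables that the original answer defines separately: SAFE (which characters may pass through unescaped) and HEX2DEC (hex digit to value). The original hard-codes them as 256-entry arrays; a more compact sketch of my own, using the RFC 3986 unreserved set for SAFE, could look like this:

#include <array>
#include <cctype>

// SAFE[c] is true for bytes that never need escaping
// (RFC 3986 "unreserved": ALPHA / DIGIT / '-' / '.' / '_' / '~').
static const std::array<bool, 256> SAFE = [] {
    std::array<bool, 256> t{};
    for (int c = 0; c < 256; ++c)
        t[c] = std::isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~';
    return t;
}();

// HEX2DEC[c] is the numeric value of hex digit c, or -1 if c is not a hex digit.
static const std::array<signed char, 256> HEX2DEC = [] {
    std::array<signed char, 256> t{};
    t.fill(-1);
    for (int i = 0; i < 10; ++i)
        t['0' + i] = static_cast<signed char>(i);
    for (int i = 0; i < 6; ++i)
        t['A' + i] = t['a' + i] = static_cast<signed char>(10 + i);
    return t;
}();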

Encode:

std::string UriEncode(const std::string & sSrc)
{
    const char DEC2HEX[16 + 1] = "0123456789ABCDEF";
    const unsigned char * pSrc = (const unsigned char *)sSrc.c_str();
    const int SRC_LEN = sSrc.length();
    // worst case: every input byte expands to "%XX" (3 bytes)
    unsigned char * const pStart = new unsigned char[SRC_LEN * 3];
    unsigned char * pEnd = pStart;
    const unsigned char * const SRC_END = pSrc + SRC_LEN;

    for (; pSrc < SRC_END; ++pSrc)
    {
        if (SAFE[*pSrc])
            *pEnd++ = *pSrc;            // copy unreserved characters as-is
        else
        {
            // escape this char as %XX with uppercase hex digits
            *pEnd++ = '%';
            *pEnd++ = DEC2HEX[*pSrc >> 4];
            *pEnd++ = DEC2HEX[*pSrc & 0x0F];
        }
    }

    std::string sResult((char *)pStart, (char *)pEnd);
    delete [] pStart;
    return sResult;
}

Decode:

std::string UriDecode(const std::string & sSrc)
{
    // Note from RFC1630: "Sequences which start with a percent
    // sign but are not followed by two hexadecimal characters
    // (0-9, A-F) are reserved for future extension"

    const unsigned char * pSrc = (const unsigned char *)sSrc.c_str();
    const int SRC_LEN = sSrc.length();
    const unsigned char * const SRC_END = pSrc + SRC_LEN;
    // last position at which a '%' can still be followed by two hex digits
    const unsigned char * const SRC_LAST_DEC = SRC_END - 2;

    char * const pStart = new char[SRC_LEN];
    char * pEnd = pStart;

    while (pSrc < SRC_LAST_DEC)
    {
        if (*pSrc == '%')
        {
            // signed, so the -1 "not a hex digit" marker compares correctly
            signed char dec1, dec2;
            if (-1 != (dec1 = HEX2DEC[*(pSrc + 1)])
                && -1 != (dec2 = HEX2DEC[*(pSrc + 2)]))
            {
                // valid %XX escape: emit the decoded byte and skip all three chars
                *pEnd++ = (dec1 << 4) + dec2;
                pSrc += 3;
                continue;
            }
        }

        *pEnd++ = *pSrc++;              // ordinary character, copy through
    }

    // copy the last one or two characters, which cannot start a full escape
    while (pSrc < SRC_END)
        *pEnd++ = *pSrc++;

    std::string sResult(pStart, pEnd);
    delete [] pStart;
    return sResult;
}

Encoding decoded URLs in C++

On POSIX you can print a UTF-8 string directly:

std::string utf8 = "\xc3\xb6"; // or just u8"ö"
printf("%s\n", utf8.c_str()); // printf needs a C string, not a std::string

On Windows, you have to convert to UTF-16. Use wchar_t instead of char16_t, even though char16_t is nominally the right type; both are 2 bytes per character on Windows.

You want convert.from_bytes to convert from UTF-8, not convert.to_bytes, which converts to UTF-8.

Printing Unicode in the Windows console is another headache in its own right.

Note that std::wstring_convert is deprecated (since C++17) and has no standard replacement as of now.

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
#include <windows.h>

int main()
{
    std::string utf8 = "\xc3\xb6";   // UTF-8 encoding of "ö"

    // deprecated since C++17, but still the only standard-library option
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
    std::wstring utf16 = convert.from_bytes(utf8);

    // use the explicit wide-character (W) APIs so this compiles
    // whether or not UNICODE is defined
    MessageBoxW(0, utf16.c_str(), 0, 0);
    DWORD count;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), utf16.c_str(),
                  static_cast<DWORD>(utf16.size()), &count, 0);

    return 0;
}
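
Since std::wstring_convert is deprecated, on Windows you could also call the Win32 conversion API directly. A minimal sketch (the helper name utf8_to_utf16 is my own):

#include <string>
#include <windows.h>

// Convert UTF-8 to UTF-16 via MultiByteToWideChar instead of the
// deprecated std::wstring_convert. Returns an empty string on failure.
std::wstring utf8_to_utf16(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), (int)utf8.size(), nullptr, 0);
    if (len <= 0)
        return std::wstring();       // invalid UTF-8 or other error

    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), (int)utf8.size(), &utf16[0], len);
    return utf16;
}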

Encoding/Decoding URL

"URL safe characters" don't need encoding. All other characters, including non-ASCII characters, should be encoded. Example:

#include <iomanip>
#include <sstream>
#include <string>

std::string encode_url(const std::string& s)
{
    // RFC 3986 "unreserved" characters are passed through untouched
    const std::string safe_characters =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";
    std::ostringstream oss;
    for (auto c : s) {
        if (safe_characters.find(c) != std::string::npos)
            oss << c;
        else
            // percent-encode the byte as two uppercase hex digits
            oss << '%' << std::setfill('0') << std::setw(2)
                << std::uppercase << std::hex << (0xff & c);
    }
    return oss.str();
}

std::string decode_url(const std::string& s)
{
    std::string result;
    for (std::size_t i = 0; i < s.size(); i++) {
        if (s[i] == '%' && i + 2 < s.size()) {
            try {
                // parse the two hex digits following '%'
                auto v = std::stoi(s.substr(i + 1, 2), nullptr, 16);
                result.push_back(static_cast<char>(0xff & v));
            } catch (...) { }        // ignore malformed escapes
            i += 2;
        } else {
            result.push_back(s[i]);
        }
    }
    return result;
}
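
A quick round-trip check of these two helpers (an example of my own; "ñ" is the two UTF-8 bytes C3 B1, which matches the "cariño" discussion below):

#include <cassert>

int main()
{
    // each non-safe byte becomes its own %XX escape
    std::string encoded = encode_url("cari\xc3\xb1o");   // "cari%C3%B1o"
    std::string decoded = decode_url(encoded);           // original UTF-8 bytes again
    assert(encoded == "cari%C3%B1o");
    assert(decoded == "cari\xc3\xb1o");
    return 0;
}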

How to properly decode a URL with Unicode in C

%81 and %8A are perfectly valid %-escapes, but the result is not a UTF-8 string. URLs are not required to be UTF-8 strings, but these days they usually are.

It looks to me like some very strange double encoding has happened. There is no convention I know of which uses three-digit %-encodings, but that's what it looks like you have in that URL. On the assumption that the intention was to encode the Spanish word "cariño" (care, affection, fondness), it should have been cari%C3%B1o in UTF-8, or cari%F1o in ISO-8859-1/Windows-1252 (which usually show up in URLs by accident).

The rules for valid UTF-8 sequences are simple enough that you can check for a valid sequence using a regular expression. Not all valid sequences are mapped to characters, and 66 of them are mapped explicitly as "not characters", but all valid sequences should be accepted by a conforming decoder even if it later rejects the decoded character as semantically incorrect.

A UTF-8 sequence is a one-to-four byte sequence corresponding to one of the following patterns: (taken from the Unicode standard, table 3.7)

    Byte 1      Byte 2      Byte 3      Byte 4
    ------      ------      ------      ------
    00..7F      --          --          --
    C2..DF      80..BF      --          --
    E0          A0..BF      80..BF      --
    E1..EC      80..BF      80..BF      --
    ED          80..9F      80..BF      --
    EE..EF      80..BF      80..BF      --
    F0          90..BF      80..BF      80..BF
    F1..F3      80..BF      80..BF      80..BF
    F4          80..8F      80..BF      80..BF

Anything else is illegal. (So codes C0, C1 and F5 through FF cannot appear at all.) In particular, the hex codes 81 and 8A can never start a UTF-8 sequence.

Since there is no good way to know what might be meant by an invalid sequence, the simplest thing is just to strip them out.
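
As an illustration, here is a sketch of my own that applies the table above with explicit range checks (the original answer suggests a regular expression, but the logic is the same) and strips anything that does not match:

#include <string>

// Returns the length (1-4) of a valid UTF-8 sequence starting at s[i],
// or 0 if the bytes do not match any row of the table above.
static int utf8_sequence_length(const std::string& s, std::size_t i)
{
    auto in = [&](std::size_t k, unsigned char lo, unsigned char hi) {
        return k < s.size() &&
               (unsigned char)s[k] >= lo && (unsigned char)s[k] <= hi;
    };
    unsigned char b = (unsigned char)s[i];
    if (b <= 0x7F)              return 1;
    if (b >= 0xC2 && b <= 0xDF) return in(i+1, 0x80, 0xBF) ? 2 : 0;
    if (b == 0xE0)              return in(i+1, 0xA0, 0xBF) && in(i+2, 0x80, 0xBF) ? 3 : 0;
    if ((b >= 0xE1 && b <= 0xEC) || b == 0xEE || b == 0xEF)
                                return in(i+1, 0x80, 0xBF) && in(i+2, 0x80, 0xBF) ? 3 : 0;
    if (b == 0xED)              return in(i+1, 0x80, 0x9F) && in(i+2, 0x80, 0xBF) ? 3 : 0;
    if (b == 0xF0)              return in(i+1, 0x90, 0xBF) && in(i+2, 0x80, 0xBF) && in(i+3, 0x80, 0xBF) ? 4 : 0;
    if (b >= 0xF1 && b <= 0xF3) return in(i+1, 0x80, 0xBF) && in(i+2, 0x80, 0xBF) && in(i+3, 0x80, 0xBF) ? 4 : 0;
    if (b == 0xF4)              return in(i+1, 0x80, 0x8F) && in(i+2, 0x80, 0xBF) && in(i+3, 0x80, 0xBF) ? 4 : 0;
    return 0;                   // C0, C1, F5..FF and stray continuation bytes
}

// Drops bytes that cannot be part of a valid UTF-8 sequence.
std::string strip_invalid_utf8(const std::string& s)
{
    std::string out;
    for (std::size_t i = 0; i < s.size(); ) {
        int len = utf8_sequence_length(s, i);
        if (len > 0) { out.append(s, i, len); i += (std::size_t)len; }
        else         { ++i; }   // skip the offending byte
    }
    return out;
}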

C - URL encoding

curl_escape

which apparently has been superseded by

curl_easy_escape
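
For example, a thin C++ wrapper over curl_easy_escape might look like this (a sketch; the wrapper name is my own, and it assumes libcurl 7.15.4 or later):

#include <curl/curl.h>
#include <string>

// URL-encode a string using libcurl. Returns an empty string on failure.
std::string curl_url_encode(const std::string& s)
{
    std::string result;
    CURL* curl = curl_easy_init();
    if (curl) {
        char* escaped = curl_easy_escape(curl, s.c_str(), (int)s.size());
        if (escaped) {
            result = escaped;
            curl_free(escaped);      // the buffer is owned by libcurl
        }
        curl_easy_cleanup(curl);
    }
    return result;
}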

How to encode or decode a URL in Objective-C

It's natural that Chinese and Japanese characters don't work with an ASCII string encoding. If you escape the string with Apple's methods, which you definitely should do to avoid code duplication, store the result as a Unicode string using one of the following encodings:

NSUTF8StringEncoding
NSUTF16StringEncoding
NSShiftJISStringEncoding (not Unicode, Japanese-specific)

