How to detect encoding in Data based on a String?
You can extend Data and create a stringEncoding property to try to detect the string encoding. Try like this:
import Foundation

extension Data {
    var stringEncoding: String.Encoding? {
        var nsString: NSString?
        guard case let rawValue = NSString.stringEncoding(for: self, encodingOptions: nil, convertedString: &nsString, usedLossyConversion: nil), rawValue != 0 else { return nil }
        return .init(rawValue: rawValue)
    }
}
Then you can pass data.stringEncoding to the String initializer, unwrapping the optional first (String(data:encoding:) takes a non-optional encoding):
if let encoding = data.stringEncoding, let string = String(data: data, encoding: encoding) {
    print(string)
}
Detect encoding and make everything UTF-8
If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.
I made a function that addresses all these issues. It's called Encoding::toUTF8().
You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.
I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
Download:
https://github.com/neitanod/forceutf8
I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÂ©dÃƒÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÆ’Ã‚Â©dÃƒÆ’Ã‚Â©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÆ’Ã†â€™Ãƒâ€šÃ‚Â©dÃƒÆ’Ã†â€™Ãƒâ€šÃ‚Â©ration Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
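The core trick can be sketched in Python. This is a rough equivalent of the idea behind Encoding::fixUTF8(), not the library's actual implementation (which also handles strings that mix encodings, character by character): mojibake is UTF-8 bytes that were mistakenly decoded as cp1252, so re-encoding and decoding reverses one layer of it.

```python
def fix_utf8(garbled: str) -> str:
    """Undo one layer of mojibake: UTF-8 bytes misread as cp1252.
    Rough sketch only; the real fixUTF8() is more tolerant."""
    try:
        return garbled.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled  # not this kind of damage; leave it alone

# Manufacture a garbled string, then repair it:
garbled = "Fédération".encode("utf-8").decode("cp1252")  # "FÃ©dÃ©ration"
assert fix_utf8(garbled) == "Fédération"
# A clean string round-trips unchanged:
assert fix_utf8("Fédération") == "Fédération"
```

Doubly-encoded strings need the function applied once per layer of damage.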
I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().
How to determine the encoding of text
EDIT: chardet seems to be unmaintained, but most of this answer still applies. See https://pypi.org/project/charset-normalizer/ for an alternative.
Correctly detecting the encoding every time is impossible.
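A few lines of Python show why: the same bytes can decode "successfully" under more than one encoding, so the mere absence of a decode error proves nothing about the source encoding.

```python
data = "é".encode("utf-8")              # b'\xc3\xa9'
assert data.decode("utf-8") == "é"      # valid UTF-8 ...
assert data.decode("cp1252") == "Ã©"    # ... and equally valid cp1252
# Worse, latin1 maps every byte to some character, so *any* byte
# string decodes without error -- even invalid UTF-8:
assert bytes([0xC3, 0x28]).decode("latin1") == "Ã("
```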
(From the chardet FAQ:)
However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
There is the chardet library that uses that study to try to detect encoding. chardet is a port of the auto-detection code in Mozilla.
You can also use UnicodeDammit. It will try the following methods:
- An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
- An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
- An encoding sniffed by the chardet library, if you have it installed.
- UTF-8
- Windows-1252
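That fallback order can be sketched with the standard library alone (the chardet step is omitted here, and the XML-declaration regex is a simplification of what Beautiful Soup actually does):

```python
import re

def dammit_decode(data: bytes) -> str:
    """Stdlib sketch of UnicodeDammit's fallback order (chardet step skipped)."""
    # 1. An encoding declared in the document itself (XML declaration).
    m = re.match(rb'<\?xml[^>]*encoding=["\']([\w-]+)["\']', data)
    if m:
        return data.decode(m.group(1).decode("ascii"))
    # 2. A byte order mark at the start of the data.
    if data.startswith(b"\xef\xbb\xbf"):
        return data.decode("utf-8-sig")
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return data.decode("utf-16")
    # 3. (chardet would be consulted here, if installed.)
    # 4./5. UTF-8 first, then Windows-1252 as a last resort.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252")

assert dammit_decode("héllo".encode("utf-8")) == "héllo"
assert dammit_decode("héllo".encode("cp1252")) == "héllo"
assert dammit_decode("héllo".encode("utf-16")) == "héllo"
```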
How can I find out which encoding a certain string in my database uses?
The latin1 encoding for â€™ is (in hex) E28099.
The utf8 encoding for ’ is E28099.
But you pasted in C3A2E282ACE284A2
, which is the "double encoding" of that apostrophe.
What apparently happened is that you had ’ in the client; the client was generating utf8 encodings. But your connection parameters to MySQL said "latin1". So, your INSERT statement dutifully treated it as 3 latin1 characters E2 80 99 (visually â€™), and converted each one to utf8, hex C3A2 E282AC E284A2.
Read about "double encoding" in Trouble with UTF-8 characters; what I see is not what I stored
Meanwhile, browsers tend to be forgiving about double-encoding, or else it might have shown â€™.
latin1 characters are each 1 byte (2 hex digits). utf8/utf8mb4 characters are 1-to-4 bytes; some 2-byte and 3-byte encodings showed up in your exercise.
As for Â ... Go to http://mysql.rjweb.org/doc.php/charcoll#8_bit_encodings and look at the second table there. Notice how the first two columns have lots of things starting with Â. In latin1, that is hex C2. In utf8, many punctuation marks are encoded as 2 bytes: C2xx. For example, the copyright symbol, ©, is utf8 hex C2A9, which is misinterpreted as Â©.
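The byte arithmetic above can be verified in Python. The sketch uses cp1252, on the assumption that MySQL's "latin1" behaves like cp1252 for these code points:

```python
s = "’"                                    # U+2019 RIGHT SINGLE QUOTATION MARK
utf8 = s.encode("utf-8")
assert utf8.hex() == "e28099"              # the correct utf8 bytes
# Reinterpret those bytes as cp1252 (what the "latin1" connection did),
# then encode each resulting character to utf8 again:
double = utf8.decode("cp1252").encode("utf-8")
assert double.hex() == "c3a2e282ace284a2"  # the "double encoding"
assert double.decode("utf-8") == "â€™"     # what an unforgiving browser shows
```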
Detect encoding of a string in C/C++
Assuming you know the length of the input array, you can make the following guesses:
- First, check to see if the first few bytes match any well-known byte order marks (BOMs) for Unicode. If they do, you're done!
- Next, search for '\0' before the last byte. If you find one, you might be dealing with UTF-16 or UTF-32. If you find multiple consecutive '\0's, it's probably UTF-32.
- If any character is from 0x80 to 0xff, it's certainly not ASCII or UTF-7. If you are restricting your input to some variant of Unicode, you can assume it's UTF-8. Otherwise, you have to do some guessing to determine which multi-byte character set it is. That will not be fun.
- At this point it is either: ASCII, UTF-7, Base64, or ranges of UTF-16 or UTF-32 that just happen to not use the top bit and do not have any null characters.
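A minimal sketch of these guesses, written in Python for brevity (the same checks port directly to C). It only returns a rough label, with "?" marking the uncertain cases; a real detector such as chardet does much more:

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Rough heuristic from the steps above; '?' marks uncertain guesses."""
    # 1. Well-known BOMs. Check UTF-32 before UTF-16: the UTF-32-LE BOM
    # (ff fe 00 00) starts with the UTF-16-LE BOM (ff fe).
    boms = [
        (codecs.BOM_UTF32_LE, "utf-32-le"), (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"), (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    # 2. Embedded NULs suggest a wide encoding; runs of NULs suggest UTF-32.
    if b"\x00\x00\x00" in data:
        return "utf-32?"
    if b"\x00" in data:
        return "utf-16?"
    # 3. Bytes >= 0x80 rule out ASCII and UTF-7; assume UTF-8 for Unicode input.
    if any(b >= 0x80 for b in data):
        return "utf-8?"
    # 4. Pure 7-bit data: ASCII (or UTF-7, Base64, ...).
    return "ascii"
```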
How to detect string byte encoding?
If your files are either in cp1252 or utf-8, then there is an easy way.
import logging
import os

def force_decode(data, codecs=('utf-8', 'cp1252')):
    for codec in codecs:
        try:
            return data.decode(codec)
        except UnicodeDecodeError:
            pass
    logging.warning("cannot decode %r", data)

for item in os.listdir(rootPath):
    # Convert to Unicode (os.listdir returns bytes when rootPath is bytes)
    if isinstance(item, bytes):
        item = force_decode(item)
    print(item)
Otherwise, there is a charset-detection library.
Python - detect charset and convert to utf-8
https://pypi.python.org/pypi/chardet