Remove ÿþ from String

ÿþ is how the byte pair 0xFF 0xFE is displayed when read as ISO-8859-1 (Latin-1); those two bytes are the byte order mark (BOM) of little-endian UTF-16.
You can convert your string to UTF-8 with iconv or mb_convert_encoding():

$trackartist1 = iconv('UTF-16LE', 'UTF-8', $trackartist1);

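# Alternative to the line above (use one or the other), via the mbstring extension.
# Argument order is: string, target encoding, source encoding.
$trackartist1 = mb_convert_encoding($trackartist1, 'UTF-8', 'UTF-16LE');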

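# Once the string is UTF-8, the BOM (U+FEFF) is the byte sequence EF BB BF,
# so that is what str_replace() has to remove
$trackartist1 = str_replace("\xEF\xBB\xBF", '', $trackartist1);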

This assumes $trackartist1 is always in UTF-16LE; check the documentation of your ID3 tag library on how to get the encoding of the tags, since this may be different for different files. You usually want to convert everything to UTF-8, since this is what PHP uses by default.
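
When the library does not report the encoding, a leading BOM can sometimes be used to infer it instead of assuming UTF-16LE. The following is a minimal sketch in Python rather than PHP. The helper name `decode_with_bom`, the lookup table, and the UTF-8 fallback are assumptions made for illustration and are not part of any ID3 library:

```python
import codecs

# (BOM bytes, codec name) pairs. Hypothetical table for illustration;
# UTF-32 is deliberately not handled here.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_with_bom(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode raw bytes, using a leading BOM (if present) to pick the codec."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Drop the BOM bytes before decoding so no U+FEFF is left over.
            return raw[len(bom):].decode(encoding)
    # No BOM found: fall back to an assumed encoding.
    return raw.decode(fallback)
```

A tag stored as `b"\xff\xfeA\x00B\x00"` would decode to `"AB"` with this helper, with no BOM left to strip afterwards.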

How do I remove ï»¿ from the beginning of a file?

Three words for you:

Byte Order Mark (BOM)

That's the representation for the UTF-8 BOM in ISO-8859-1. You have to tell your editor to not use BOMs or use a different editor to strip them out.

To automate the BOM's removal, you can use awk, as shown in this question.
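
If awk is not available, the same clean-up can be sketched in Python. This is only an illustration: the function name `strip_utf8_bom` is made up here, and the sketch assumes the file is small enough to read into memory in one go:

```python
import codecs
from pathlib import Path

def strip_utf8_bom(path: str) -> bool:
    """Remove a leading UTF-8 BOM from the file in place.

    Returns True if a BOM was found and removed, False otherwise.
    """
    p = Path(path)
    data = p.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        # Rewrite the file without the first three bytes (EF BB BF).
        p.write_bytes(data[len(codecs.BOM_UTF8):])
        return True
    return False
```

Working on raw bytes avoids any guess about the rest of the file's encoding; only the three BOM bytes at the very start are touched.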

As another answer says, it would be best for PHP to interpret the BOM correctly. For that you can use mb_internal_encoding(), like this:

<?php
//Storing the previous encoding in case you have some other piece
//of code sensitive to encoding and counting on the default value.
$previous_encoding = mb_internal_encoding();

//Set the encoding to UTF-8, so when reading files it ignores the BOM
mb_internal_encoding('UTF-8');

//Process the CSS files...

//Finally, return to the previous encoding
mb_internal_encoding($previous_encoding);

//Rest of the code...
?>

How to clean up byte character like ÿþ?

These two characters are the bytes 0xFF 0xFE, the byte order mark of little-endian UTF-16, shown as Latin-1 text.

You could use the tools from Apache Commons IO, such as BOMInputStream.

Python - Remove Square symbol from text string

You need to specify the right encoding when opening the file. Try

open(path+fn, 'r', encoding="utf-16")

(I'm guessing utf-16 because ASCII characters seem to be encoded in two bytes in the sample string)
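
The effect of the encoding choice can be seen on an in-memory byte string. This is a small sketch with made-up sample data (`raw` is not taken from the question), showing the same bytes decoded with the wrong and the right codec:

```python
# Sample data: a UTF-16LE BOM followed by "hello" in UTF-16LE.
raw = b"\xff\xfe" + "hello".encode("utf-16-le")

# Decoded as Latin-1, the BOM shows up as "ÿþ" and every ASCII
# character is followed by a NUL byte.
text_wrong = raw.decode("latin-1")

# The "utf-16" codec reads the BOM to pick the byte order and drops it.
text_right = raw.decode("utf-16")
```

Here `text_right` is just `"hello"`, while `text_wrong` starts with `ÿþ` and contains the stray NUL characters that are often rendered as squares.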

Remove Unicode characters in a String

Would a RegEx solution be of interest to you?

There are plenty of examples for different languages on this site - here's a C# one: How can you strip non-ASCII characters from a string? (in C#).

Try this for VBA:

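Private Function GetStrippedText(txt As String) As String
    Dim regEx As Object

    Set regEx = CreateObject("vbscript.regexp")
    ' Match any character outside the 7-bit ASCII range
    regEx.Pattern = "[^\u0000-\u007F]"
    ' Replace every match, not just the first one
    regEx.Global = True
    GetStrippedText = regEx.Replace(txt, "")

End Function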
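
For comparison, the same character-class approach can be sketched in Python with the standard `re` module. The function name `strip_non_ascii` is made up for this example:

```python
import re

# Matches any character outside the 7-bit ASCII range.
_NON_ASCII = re.compile(r"[^\x00-\x7F]")

def strip_non_ascii(text: str) -> str:
    """Return text with every non-ASCII character removed."""
    return _NON_ASCII.sub("", text)
```

Note that this removes legitimate accented letters as well as stray BOM characters, so `strip_non_ascii("ÿþnäme")` gives `"nme"`. It is only appropriate when the output really must be pure ASCII.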

Remove BOM from string in Java

You're replacing the BOM with U+0000, rather than with an empty string. You should replace the BOM with the empty string, e.g.

out.write(l.replace("\uFEFF", "") + "\n");

Perl regex to catch spam pattern ÿþ?

U+FEFF becomes FF FE when encoded using UTF-16le.

At the start of a text, U+FEFF is the UTF-16le BOM. Elsewhere, it's a zero-width non-breaking space (which is to say an invisible, function-less character).

I can think of two offensive uses. Both involve situations where HTML is checked for malicious content by one program before being used by another.

  • If the checker is fooled into switching to UTF-16le when it encounters FF FE (because it incorrectly believes it to be a BOM), the following < would appear as something other than < to it, thus bypassing checks for <. This would allow \xFF\xFE<script>...</script> (for example) to bypass the checks for those tags.

  • The checker could correctly determine that <\x{FEFF}script (decoded from UTF-16le) is not an HTML element and allow <\x{FEFF}script>...</script> through to a buggy browser that filters out all instances of U+FEFF. This browser would see <script>...</script> where there isn't one.
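
Both scenarios can be reproduced on small strings. The following is a sketch in Python rather than Perl, using made-up payloads; it illustrates the two decoding effects described above and is not a model of any real checker or browser:

```python
# Scenario 1: a checker that honours a fake BOM decodes the rest as UTF-16LE.
payload = b"\xff\xfe<script>"
as_utf16 = payload.decode("utf-16")
# Each pair of ASCII bytes has been merged into one 16-bit code unit,
# so a search for "<" no longer finds anything.
checker_sees_bracket = "<" in as_utf16

# Scenario 2: U+FEFF hidden inside a tag name.
markup = "<\ufeffscript>alert(1)</script>"
# The checker correctly sees no <script element...
looks_like_script_before = markup.startswith("<script")
# ...but a consumer that deletes every U+FEFF creates one.
filtered = markup.replace("\ufeff", "")
looks_like_script_after = filtered.startswith("<script")
```

In this sketch `checker_sees_bracket` and `looks_like_script_before` are both false, while `looks_like_script_after` is true, which is why silently deleting U+FEFF can turn harmless text into markup.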


You probably plan on removing the characters, but that's a bad idea. Removing would introduce the second security problem I mentioned above. Instead, you should leave them be or change them to U+FFFD.

s/[\xFE\xFF]/\x{FFFD}/g

Why does my Python code print the extra characters ï»¿ when reading from a text file?

I can't find a duplicate of this for Python 3, which handles encodings differently from Python 2. So here's the answer: instead of opening the file with the default encoding (which depends on your platform and locale), use 'utf-8-sig', which expects and strips off the UTF-8 Byte Order Mark, which is what shows up as ï»¿.

That is, instead of

data = open('info.txt')

Do

data = open('info.txt', encoding='utf-8-sig')
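
The difference between the codecs can be checked without a file. This is a small sketch on a made-up byte string (`raw` is sample data, not the contents of info.txt):

```python
# Sample data: a UTF-8 BOM followed by ordinary ASCII text.
raw = b"\xef\xbb\xbfname,value\n"

# A single-byte Windows code page shows the three BOM bytes as "ï»¿".
as_cp1252 = raw.decode("cp1252")

# Plain UTF-8 decodes the BOM to an invisible U+FEFF that stays in the string.
as_utf8 = raw.decode("utf-8")

# utf-8-sig strips the BOM if present and is harmless if it is absent.
as_utf8_sig = raw.decode("utf-8-sig")
no_bom = b"name,value\n".decode("utf-8-sig")
```

Because 'utf-8-sig' also accepts input without a BOM, it is a reasonable choice for reading files whose origin is unknown.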

Note that if you're on Python 2, you should see e.g. "Python, Encoding output to UTF-8" and "Convert UTF-8 with BOM to UTF-8 with no BOM in Python". You'll need to do some shenanigans with codecs or with str.decode for this to work right in Python 2. But in Python 3, all you need to do is set the encoding= parameter when you open the file.


