Why does sed fail with International characters and how to fix?
I think the error occurs if the input encoding of the file is different from the preferred encoding of your environment.
Example: the file "in" is encoded as UTF-8
$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Y
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X
Y
UTF-8 can safely be interpreted as ISO-8859-1: you'll get strange characters, but apart from that everything is fine.
Example: the file "in" is encoded as ISO-8859-1
$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Gras Och Stenar Trad - From MöY
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X
Y
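ISO-8859-1 cannot be interpreted as UTF-8: the byte that encodes ö is not a valid UTF-8 sequence, so decoding the input fails at that point. The strange match is probably because sed tries to recover rather than fail completely: . does not match the invalid byte, so .* can only match the text after it.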
The answer is based on Debian Lenny/Sid and sed 4.1.5.
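Two common ways to make the encodings agree can be sketched as follows. This is a minimal sketch, not part of the original answer: the sample line and the variable names (sample, raw, converted) are made up for illustration, and it assumes iconv is installed and that your normal locale is UTF-8 (or plain C).

```shell
# Hypothetical sample: one ISO-8859-1 line, "M<0xF6>j | Y" (0xF6 is "ö" in ISO-8859-1).
sample=$(mktemp)
printf 'M\366j | Y\n' > "$sample"

# Option 1: force the C locale so sed works on raw bytes and "." matches any byte.
raw=$(LC_ALL=C sed 's/.*| //' < "$sample")

# Option 2: convert the input to UTF-8 first, then run sed on the converted text.
converted=$(iconv -f ISO-8859-1 -t UTF-8 < "$sample" | sed 's/.*| //')

printf '%s\n%s\n' "$raw" "$converted"
rm -f "$sample"
```

Option 1 is the quicker fix, but in the C locale sed sees bytes rather than characters, so converting the file is usually the safer choice when the script has to match individual non-ASCII characters.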
Why don't escape characters and regex work well with the sed command?
sed uses Basic Regular Expressions (BREs) by default, while the meta-character + is from Extended Regular Expressions (EREs). The shorthand \s for the POSIX character class [[:space:]] only works in some seds (e.g. GNU sed) as an extension. Similarly, \n only means "newline" in some seds, while in any sed you can use a backslash followed by a literal newline character. Your use of double quotes (") instead of single quotes (') around your script exposes it to the shell and so requires extra backslash escapes. Always use single quotes around strings or scripts unless you have a very specific need for double quotes (e.g. to let a variable expand), and use double quotes unless you have a very specific need for no quotes at all (e.g. to allow globbing wildcard expansion).
To do what you want in any POSIX sed:
$ echo 'abc def gks dps' | sed 's/[[:space:]][[:space:]]*/\
/g'
abc
def
gks
dps
but this will work with GNU sed (note the -E to enable EREs for +; -E is supported in GNU sed and OSX/BSD sed, but of those two seds only GNU sed supports \s and \n):
$ echo 'abc def gks dps' | sed -E 's/\s+/\n/g'
abc
def
gks
dps
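The quoting advice above can be illustrated with a small sketch. The variable names (word, single, double) are hypothetical and only serve the example; both commands end up passing the same script to sed.

```shell
word='def'   # hypothetical shell variable to interpolate into the script

# Single quotes: the shell passes the script to sed unchanged.
single=$(echo 'abc def' | sed 's/ def$/ X/')

# Double quotes: the shell expands $word first, so the "$" anchor
# needs a backslash to reach sed as a literal "$".
double=$(echo 'abc def' | sed "s/ $word\$/ X/")

printf '%s\n%s\n' "$single" "$double"
```

The extra backslash in the second command is exactly the kind of escaping that single quotes avoid.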
Replacing special characters using sed within Windows cmder... and getting a strange error with URL parameter
You need to
- use double quotation marks around the sed command
- escape & chars in the replacement only (an unescaped & in the replacement stands for the whole matched text).
So the command becomes
sed -i "s|category=2fLShbTEL0cKrSR7J9S2hk&emailRedirect=Y|utm_source=marketingcloud\&utm_medium=email\&utm_campaign=p8%202021%20donutshop%20lifestyle\&utm_content=primary%20cta\&brand=Donut%20Shop\&emailRedirect=Y|gi" *.html
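Why the & needs escaping can be seen in a small sketch with made-up input (the strings and variable names are illustrative only). It uses single quotes because it is written for a POSIX shell; in the Windows setup described above, the script would be wrapped in double quotes instead.

```shell
# Unescaped "&" in the replacement is replaced by the whole matched text ("a=1").
unescaped=$(echo 'a=1' | sed 's|a=1|x=1&y=2|')

# Escaped "\&" inserts a literal ampersand.
escaped=$(echo 'a=1' | sed 's|a=1|x=1\&y=2|')

printf '%s\n%s\n' "$unescaped" "$escaped"
```

The first command pastes the match into the middle of the replacement, which is what corrupts a URL with several parameters when the ampersands are left unescaped.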
Does . really match any character?
It works for me. It's probably a character encoding problem.
This might help:
- Why does sed fail with International characters and how to fix?
- http://www.barregren.se/blog/how-use-sed-together-utf8