Why Does Sed Fail with International Characters and How to Fix

Why does sed fail with International characters and how to fix?

I think the error occurs when the encoding of the input file differs from the encoding that your environment's locale (LANG, LC_ALL or LC_CTYPE) tells sed to expect.

Example: the file named in is encoded as UTF-8

$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Y
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X
Y

UTF-8 can safely be interpreted as ISO-8859-1, because every byte value decodes to some character there. You'll get strange characters in place of multibyte sequences, but apart from that everything is fine.

Example: the file named in is encoded as ISO-8859-1

$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Gras Och Stenar Trad - From MöY
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X
Y

ISO-8859-1 cannot be interpreted as UTF-8: decoding the input file fails, because a lone high byte such as 0xF6 (ö in ISO-8859-1) is not a valid UTF-8 sequence. The strange match is probably due to sed trying to recover rather than failing completely.

The answer is based on Debian Lenny/Sid and sed 4.1.5.
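Two common ways around the mismatch are sketched below: either make sed work on raw bytes by forcing the C locale, or convert the input to UTF-8 before sed sees it. The sample line and the function name sample are made up for illustration, and the hex dump through od is only there so the result can be checked byte by byte. The second option assumes iconv is installed.

```shell
# Hypothetical sample line "a | MöY" in ISO-8859-1 (ö is the single byte 0xF6,
# written as octal \366 for printf).
sample() { printf 'a | M\366Y\n'; }

# Option 1: force the C locale so sed works on raw bytes and never has to decode.
c_locale_hex=$(sample | LC_ALL=C sed 's/.*| //' | od -An -tx1 | tr -d ' \n')
echo "$c_locale_hex"   # 4df6590a  (M, 0xF6, Y, newline)

# Option 2: convert the input to UTF-8 first (skipped if iconv is not installed).
utf8_hex=
if command -v iconv >/dev/null 2>&1; then
  utf8_hex=$(sample | iconv -f ISO-8859-1 -t UTF-8 | sed 's/.*| //' | od -An -tx1 | tr -d ' \n')
  echo "$utf8_hex"     # 4dc3b6590a  (M, ö as 0xC3 0xB6, Y, newline)
fi
```

Option 1 leaves the file's encoding untouched, which is usually what you want when the output must stay ISO-8859-1. Option 2 is the better fit when the rest of the pipeline expects UTF-8.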

Why don't escape characters and regex work well with the sed command?

sed matches on Basic Regular Expressions (BREs), while the meta-character + is from Extended Regular Expressions (EREs). The shorthand \s for the POSIX character class [[:space:]] only works in some seds (e.g. GNU sed) as an extension. Similarly, \n only means "newline" in some seds, while in any sed you can use a backslash followed by a literal newline character.

Your use of double quotes (") instead of single quotes (') around your script exposes it to the shell and so requires extra backslash escapes. Always use single quotes around strings or scripts unless you have a very specific need for double quotes (e.g. to let a variable expand), and only leave them unquoted if you have a very specific need for that (e.g. to allow globbing wildcard expansion).
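The quoting rule can be seen in a small sketch. The variable name word and the input text are invented for the demonstration:

```shell
word=sed

# Single quotes: the shell passes the script to sed untouched, so $word stays literal.
single=$(echo 'use TOOL' | sed 's/TOOL/$word/')

# Double quotes: the shell expands $word before sed ever sees the script.
double=$(echo 'use TOOL' | sed "s/TOOL/$word/")

echo "$single"   # use $word
echo "$double"   # use sed
```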

To do what you want in any POSIX sed:

$ echo 'abc  def    gks       dps' | sed 's/[[:space:]][[:space:]]*/\
/g'
abc
def
gks
dps

With GNU sed you can write this more compactly. Note the -E to enable EREs for +: -E is supported by both GNU sed and OSX/BSD sed, but of those two only GNU sed supports \s and \n:

$ echo 'abc  def    gks       dps' | sed -E 's/\s+/\n/g'
abc
def
gks
dps
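If -E is not available, the BRE interval \{1,\} is another portable way to say "one or more". A minimal sketch, using a comma as the replacement so that the portability of \n does not come into it:

```shell
# POSIX BRE: \{1,\} means "one or more", the portable stand-in for +.
bre=$(echo 'abc  def    gks       dps' | sed 's/[[:space:]]\{1,\}/,/g')

# ERE via -E (GNU sed and BSD sed): + works directly.
ere=$(echo 'abc  def    gks       dps' | sed -E 's/[[:space:]]+/,/g')

echo "$bre"   # abc,def,gks,dps
echo "$ere"   # abc,def,gks,dps
```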

Replacing special characters using sed within Windows Cmder, and getting a strange error with a URL parameter

You need to:

  • Use double quotation marks around the sed command
  • Escape the & characters in the replacement only (an unescaped & there stands for the whole matched text)

So the command becomes:

sed -i "s|category=2fLShbTEL0cKrSR7J9S2hk&emailRedirect=Y|utm_source=marketingcloud\&utm_medium=email\&utm_campaign=p8%202021%20donutshop%20lifestyle\&utm_content=primary%20cta\&brand=Donut%20Shop\&emailRedirect=Y|gi" *.html
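Why the & has to be escaped is easier to see on a tiny input. The URL fragment below is invented for illustration, and the sketch uses single quotes because it assumes a POSIX shell rather than the Windows prompt from the question:

```shell
url='page?a=1'

# Unescaped & in the replacement is replaced by the whole matched text ("a=1").
unescaped=$(echo "$url" | sed 's|a=1|b=2&c=3|')

# \& inserts a literal ampersand instead.
escaped=$(echo "$url" | sed 's|a=1|b=2\&c=3|')

echo "$unescaped"   # page?b=2a=1c=3
echo "$escaped"     # page?b=2&c=3
```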

Does . really match any character?

It works for me. It's probably a character encoding problem.

This might help:

  • Why does sed fail with International characters and how to fix?
  • http://www.barregren.se/blog/how-use-sed-together-utf8
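What . counts as "one character" depends on the locale. In the C locale every byte is a character, so a UTF-8 encoded ö is matched as two characters. A small sketch of that effect (the input is written in octal so the example does not depend on the encoding of the script file):

```shell
# "ö" in UTF-8 is the two bytes 0xC3 0xB6 (octal \303 \266).
c_result=$(printf '\303\266\n' | LC_ALL=C sed 's/./X/g')

echo "$c_result"   # XX: each byte matched . separately
```

In a UTF-8 locale the same two bytes would normally be decoded as a single character, which is why the same script can behave differently on different machines.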

