How to Write Non-ASCII Characters Using Echo

How do I write non-ASCII characters using echo?

Use echo -e, which interprets backslash escapes such as octal codes:

echo -e "\012"

ASCII-preserving binary-to-text conversion for `echo -e`

Does this achieve what you wanted?

#!/usr/bin/env bash

python -c 'import sys;print(str(sys.argv[1].encode("utf-8"))[2:-1])' "$1"

Calling it with:

$ test.sh $'ʃBC\n'
\xca\x83BC\n

This requires Python 3.
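The same conversion can also be written as a small Python 3 function that does not rely on the repr of a bytes object. This is only a sketch: the name to_echo_escapes is hypothetical, and unlike the one-liner above it emits every non-printable byte (including the newline) as \xNN and escapes literal backslashes.

```python
def to_echo_escapes(text: str) -> str:
    """Return text as printable ASCII plus \\xNN escapes for `echo -e`."""
    out = []
    for byte in text.encode("utf-8"):
        # Keep printable ASCII as-is, except the backslash, which is escaped
        # so that echo -e does not read it as the start of an escape sequence.
        if 0x20 <= byte <= 0x7E and byte != 0x5C:
            out.append(chr(byte))
        else:
            out.append(f"\\x{byte:02x}")
    return "".join(out)


if __name__ == "__main__":
    # The result is pure ASCII and can be passed to: echo -e "<result>"
    print(to_echo_escapes("ʃBC\n"))
```

The output contains only ASCII characters, so it can be stored or transmitted safely and turned back into the original bytes by echo -e.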

How can I find non-ASCII content in a file in a batch script?

The following defines a variable with valid ASCII characters (excluding ", which is handled by substitution) for character-by-character comparison.

Edit: Changes made to improve performance and ensure any possible ASCII input is handled correctly.

@Echo off

For /f "tokens=4 delims=: " %%G in ('CHCP')Do Set "Restore_Codepage=CHCP %%G > nul"
Set "Return[Len]=" & Set "Return[String]=" & Set "{input}="

Setlocal DISABLEDelayedExpansion
REM the label marker ":#" is used within this script to delimit help output.

:#
:# ========================= ASCII string filter v3.1 by T3RRY ======================
Rem - This script iterates over an input string character by character and tests
Rem each character against a whitelist of printable ASCII characters, with
Rem successful matches used to build a new string containing only printable
Rem ASCII characters.
Rem - Switch /R turns this script into a testing tool
Rem to check if a string contains any NonASCII or nonprintable ASCII characters.
Rem - Errorlevel 0 indicates the string contains only printable ASCII characters
Rem - A Positive errorlevel is returned containing the 1 indexed position of the
Rem first NonASCII or nonprintable ASCII character found.
Rem - Execution time increases as string length increases. Each character in the
Rem string is tested against a whitelist containing 95 printable ASCII characters.
:#
:# Usage: Filepath <"String"> [ /P | /R ] | [ -? | /? | -help ]
:#
:# Rem to use from another batch file:
:# For /f delims^= %%G in ('FilePath "string"')Do Echo(%%G
:#
:# Accepts input String via doublequoted argument - reads %* and trims " /P" or " /R"
:# switches if present
:# - No escaping of characters in the argument is required
:# - If unbalanced doublequotes exist in the string all doublequotes will be Removed.
:#
:# Use Switch /P to preserve original spaces
:# - Default behaviour is to Remove all double spaces from the string.
:#
:# Use Switch /R to reject input containing NonASCII characters
:# - If non ASCII character encountered, returns a positive errorlevel
:# ( the 1 indexed position of first non ASCII character encountered )
:#
Rem Version changes 09/Dec/2021 :
Rem - Changed input method to handle cases where quoted args contain
Rem standard delims within quotes IE: "string "substring=text""
Rem Version changes 08/Dec/2021 :
Rem - Added Help Switches -? /? and -help
Rem - Added switch: /R
Rem - Reject strings containing non ASCII characters. Default: Strip NonASCII
Rem characters from the string.
Rem Note: this switch does not define Return[Len] or Return[String]
Rem Version changes 07/Dec/2021 :
Rem - Rewritten for much faster performance - NOTE:
Rem - Added Switch: /P
Rem - Preserve all whitespace. Default: multiple spaces truncated to single.
Rem - Renamed variable for returning String : Return[String]
Rem - Added variable Return[Len] to return 0 indexed string length.
Rem - Corrected handling of completely non ASCII strings to return empty / 0 Len
Rem ** Utilize alternate data stream to store variable containing printable ASCII
Rem characters so the variable only needs to be generated on first execution.
Rem ** Requires this batch file to be run from an NTFS drive.
:# =================================================================================

Set "ASCII= !"
2> nul (
more < "%~f0:ASCII.dat" > nul || (
Setlocal EnableDelayedExpansion
For /l %%i in (34 1 126) Do (
Cmd /c Exit %%i
Set "ASCII=!ASCII!!=ExitCodeAscii!"
)
>"%~f0:ASCII.dat" (Echo(Set ^^"ASCII=!ASCII!")
ENDLOCAL
))

Set "ASCII="
For /f "delims=" %%G in ('More ^< "%~f0:ASCII.dat"')Do %%G
If not Defined ASCII (
2> nul (
Powershell.exe -c "Remove-item -path '%~nx0' -Stream '*'"
)
1>&2 Echo(An error has occurred. Ensure "%~nx0" is located on an NTFS drive.
Pause
ENDLOCAL
Exit /b 1
)

Rem Maximum stringlength to support. Modify here to propagate to RemoveChar loop and Return[Len]
REM maximum 1015 chars due to input reading method.
Set "SupportLength=1015"
Set "{input}="

::====================================================================================================
rem :: input capture method by Dave Benham : https://www.dostips.com/forum/viewtopic.php?t=4288#p23980
setlocal enableDelayedExpansion
>"%temp%\getArg.txt" <"%temp%\getArg.txt" (
setlocal disableExtensions
set prompt=#
echo on
for %%a in (%%a) do rem . %*.
echo off
endlocal
set /p "args="
set /p "args="
set "{input}=!args:~7,-2!"
set "count=!args:~7,-2!"
)

del "%temp%\getArg.txt"

::====================================================================================================

Rem the below line can be used to Remove the alternate data stream this file creates.
Rem Powershell -c "Remove-item -path '%~nx0' -Stream '*'"

CHCP 65001 > nul
If not defined {input} (
Echo(Demo:
Rem escaped for definition in DelayedExpansion environment
Set "{input}=this is a demo) * ^! & ☺ ^= ¶ | ^! <. ~ ^^ & %% ▒ ╔ § ♣ This"
Set {input}
)

REM handle help switches

Set {input} | %SystemRoot%\System32\Findstr.exe /Xli "{input}=\/? {input}=-? {input}=-help" > nul && (
Setlocal EnableDelayedExpansion
For /f "tokens=2* delims=#" %%G in ('%SystemRoot%\System32\Findstr.exe /blic:":# " "%~f0"')Do (
Set "Usage=%%G"
Echo(!Usage:Filepath=%~f0!
)
ENDLOCAL & ENDLOCAL
Exit /b 0
)

Set Div="is=#", "1/(is<<9)"

Set "{DQ}=1"
Set ^"count=!count:"={DQ}!"

2> nul Set "null=%count:{DQ}=" & Set /A {DQ}+=1& set "null=%"

Set /A !Div:#=%{DQ}% %% 2! 2> nul || Set ^"{input}=!{input}:"=!"

REM handle nonhelp switches

Set "ASCIISwitch[R]="
Set "ASCIISwitch[P]="
If defined {input} (
Set {input} | %SystemRoot%\System32\findstr.exe /Elic:" /P" > nul && (
Set "{input}=!{input}:~0,-3!"
Set "ASCIISwitch[P]=true"
)
Set {input} | %SystemRoot%\System32\findstr.exe /Elic:" /R" > nul && (
Set "{input}=!{input}:~0,-3!"
Set "ASCIISwitch[R]=true"
))

Rem Remove outer doublequotes from input argument if not already removed due to unbalanced quoting.

If .^%{input}:~0,1%^%{input}:~-1%. == ."". Set "{input}=!{input}:~1,-1!"

Rem RemoveChar loop - iterate over input character by character; Compare against each character in whitelist
Rem Appends ASCII Whitelist characters to New string unless /R switch used, in which case NonASCII characters
Rem trigger an exit of the script with a positive errorlevel indicating the string is not ASCII.
Rem the return value is the 1 indexed position of the first non ascii character encountered.

Set "end=" & Set "New="
For /l %%i in (0 1 %SupportLength%)Do If not "!{input}:~%%i,1!"=="" (
Set "Char=!{input}:~%%i,1!"
Set "ISAscii="
For /l %%c in (0 1 94)Do If not "!ASCII:~%%c,1!" == "" (
Set "C_Char=!ASCII:~%%c,1!"
if "!Char!"=="!C_Char!" (
Set "New=!New!!Char!"
Set "ISAscii=true"
))
If Defined ASCIISwitch[R] (
If Not Defined ISAscii (
Endlocal & Endlocal & %Restore_Codepage%
For /f "delims=" %%G in ('Set /A %%i+1')Do Exit /b %%G
)))

Set "{Input}=!New!"
If not Defined ASCIISwitch[P] (
For /l %%i in (0 1 9)Do if defined {Input} Set "{Input}=!{Input}: = !"
)

If defined {input} (
Echo(!{input}!
For /l %%i in (0 1 %SupportLength%)Do If not defined Return[Len] If "!{input}:~%%i,1!"=="" Set "Return[Len]=%%i"
) Else (
Set "Return[Len]=0"
Set "Return[String]="
)

ENDLOCAL & ENDLOCAL & Set "Return[Len]=%Return[Len]%" & Set "Return[string]=%{input}%" )
%Restore_Codepage%

Exit /B 0
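For readers who find the batch logic hard to follow, the core idea (a whitelist of the 95 printable ASCII characters, used either to strip other characters or to report the 1-indexed position of the first one) can be sketched in Python. This is not a port of the script: the function names are hypothetical, the input is an ordinary Python string rather than a command-line argument, and spaces are collapsed until no double space remains instead of in a fixed number of passes.

```python
# The 95 printable ASCII characters (space through tilde), i.e. the same
# range the batch script builds its whitelist from (codes 32..126).
PRINTABLE_ASCII = frozenset(chr(code) for code in range(32, 127))


def strip_non_ascii(text: str, preserve_spaces: bool = False) -> str:
    """Keep only printable ASCII; collapse double spaces unless asked not to."""
    kept = "".join(ch for ch in text if ch in PRINTABLE_ASCII)
    if not preserve_spaces:  # roughly the default (no /P) behaviour
        while "  " in kept:
            kept = kept.replace("  ", " ")
    return kept


def first_rejected_position(text: str) -> int:
    """Return the 1-indexed position of the first character that is not
    printable ASCII, or 0 if there is none (roughly the /R behaviour)."""
    for position, ch in enumerate(text, start=1):
        if ch not in PRINTABLE_ASCII:
            return position
    return 0
```

Calling first_rejected_position on a string that is entirely printable ASCII returns 0, mirroring the errorlevel 0 convention described in the script's help text.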

Using Findstr to find some string with non-ASCII character by batch

The exact behavior you see depends a bit on what code page your text file is using. Assuming your file uses code page 1252 - Latin (Western European), then ã is 0xE3 (decimal 227).

The reason your FINDSTR fails is explained in "What are the undocumented features and limitations of the Windows FINDSTR command?", under the section "Character limits for command line parameters - Extended ASCII transformation". That section explains how FINDSTR transforms (corrupts) many non-ASCII command line characters into an ASCII value.

If you read the referenced section, you will see that character 227 is transformed into 112, which corresponds to the letter p. So your FINDSTR command is looking for the wrong string.

The only way to have FINDSTR search for your exact string is to put the search string in a text file and use the /G:file option. FINDSTR does not corrupt characters when using the /G option.

If "search.txt" contains the single line Conexão falhou, then the following command will match the correct lines:

findstr /I /L /G:search.txt WinSCP.log

That being said, the way the string is displayed may be incorrect, depending on your active code page. My machine defaults to code page 437, so ã is displayed as π on my machine. Either way, the character code is 0xE3. If you pipe the results of the FINDSTR to a file, you should see the correct result.
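The byte-versus-glyph distinction described here can be checked with Python's built-in codecs. This is a minimal sketch with hypothetical helper names; it only illustrates how one byte value is interpreted under two code pages and says nothing about FINDSTR's own internals.

```python
def byte_in_code_page(char: str, code_page: str) -> int:
    """Return the single byte value that encodes char in the given code page."""
    encoded = char.encode(code_page)
    if len(encoded) != 1:
        raise ValueError(f"{char!r} is not a single byte in {code_page}")
    return encoded[0]


def shown_as(byte_value: int, code_page: str) -> str:
    """Return the character a console using code_page displays for the byte."""
    return bytes([byte_value]).decode(code_page)
```

Here byte_in_code_page("ã", "cp1252") gives 0xE3, and shown_as(0xE3, "cp437") gives π: the byte stored in the file is the same, only its rendering differs.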

If you really want to put the search string on the command line, then you can explicitly specify a regular expression search even though you use /C by adding the /R option. You can then use . to match any character at the offending position.

findstr /I /R /C:"Conex.o falhou" WinSCP.log

Another option is to use the FIND command instead:

find /I "Conexão falhou" <"WinSCP.log"

Though on my machine I need the following, due to active code page 437:

find /I "Conexπo falhou" <"WinSCP.log"

Batch file with non-ASCII characters

Neither ANSI nor Unicode is used by default in console windows. By default, Windows uses an OEM code page for the console.

Which OEM code page is used depends on the Windows region and language settings. For the US and Canada the default OEM code page is 437; for Western European countries the default code page is 850.

For the US, Canada and Western European countries, the non-Unicode code page used in GUI windows is Windows-1252.

The character æ has decimal code value 230 (hex. E6) in code page Windows-1252 as well as in the Unicode table. But in OEM code pages 437 and 850 the decimal code value of this character is 145 (hex. 91).

So you need to insert this character into the batch file either with the method suggested by SomethingDark, or by editing the batch file directly in a text editor that uses the appropriate OEM code page.

I use UltraEdit for editing text files. I have configured it to automatically use the OEM code page defined by the system (code page 850 in my case) for files with the extension BAT or CMD, and the system code page for GUI windows (code page 1252 in my case) for all other non-Unicode text files. UltraEdit also converts text from Unicode or Windows-1252 to OEM code page 850 when text copied from, for example, a browser is pasted into the batch file, and it converts OEM-encoded characters from 850 to 1252 and Unicode when selected text in the batch file is copied to the clipboard.
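If no editor with OEM code page support is at hand, the same re-encoding can be sketched with Python's standard codecs. The function name and its defaults are assumptions chosen for this example (Windows-1252 as the source, OEM 850 as the target); adjust them to whatever chcp reports on the machine that runs the batch file.

```python
def transcode(data: bytes, source: str = "cp1252", target: str = "cp850") -> bytes:
    """Re-encode raw file bytes from one single-byte code page to another.

    Raises UnicodeEncodeError if a character has no equivalent in the
    target code page, rather than silently replacing it.
    """
    return data.decode(source).encode(target)
```

For example, transcode(b"\xe6") returns b"\x91", the OEM byte for æ. To convert a whole batch file, read it in binary mode, pass the bytes through transcode, and write the result back in binary mode.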

To find out which OEM code page is used on your machine in console windows, open a command prompt window and run either command chcp or mode con.

Using youtube-dl in a Windows CMD FOR loop strips non ASCII characters

The --encoding utf-8 switch appears to be working here with chcp 65001 (disclaimer: only tried under win10 v1909 using the non-legacy console with the NSimSun font, ymmv with other versions or settings).

C:\etc>chcp 65001
Active code page: 65001

C:\etc>for /f "delims=" %i in ('youtube-dl --encoding utf-8 -e "https://www.youtube.com/watch?v=E_JXrNAxGzM"') do @echo %i
27/12/2016 晚間新聞 楊家駿直播睇手機

________

However, I have been assured that it's not a youtube-dl problem.

The real question to ask the dev is whether youtube-dl does any detection of the output stream being sent to the interactive console vs. being piped or redirected, and whether it changes the output encoding based on that detection. I believe the answer to that might be yes, which would explain the difference between direct console output vs. the for loop.
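To illustrate the kind of detection meant here, the following sketch shows how a Python program can tell an interactive console from a pipe and see which encoding was chosen for its output. It is not taken from youtube-dl, and whether youtube-dl does anything like this remains the open question; describe_stream is a hypothetical helper.

```python
import sys


def describe_stream(stream) -> dict:
    """Report whether a text stream is a terminal and which encoding it uses."""
    return {"is_tty": stream.isatty(), "encoding": stream.encoding}


if __name__ == "__main__":
    # Run once directly and once inside a FOR /F loop (or piped to a file)
    # and compare the two reports.
    print(describe_stream(sys.stdout), file=sys.stderr)
```

Writing the report to stderr keeps it visible even when stdout is captured by the for loop.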

Is there a way to use FINDSTR with non-ASCII (in this case Japanese/Chinese) characters in batch?

You need to use find (which supports Unicode but not regex) instead of findstr (which supports regex but not Unicode). See "Why are there both FIND and FINDSTR programs, with unrelated feature sets?"

D:\kanji>chcp
Active code page: 65001

D:\kanji>find "哀" JouyouKanjiReadings.txt

---------- JOUYOUKANJIREADINGS.TXT
哀 アイ,あわれ,あわれむ

Redirect to NUL to suppress the output if you don't need it.

That said, find isn't a good solution either. Nowadays you should use PowerShell instead of cmd, which has many quirks due to legacy compatibility issues. PowerShell fully supports Unicode and can call any .NET Framework method. To search for strings you can use the cmdlet Select-String or its alias sls:

PS D:\kanji> Select-String '握'  JouyouKanjiReadings.txt

JouyouKanjiReadings.txt:5:握 アク,にぎる

In fact you don't even need to use UTF-8 and code page 65001. Just store the file in UTF-16 with BOM (that will result in a much smaller file because your file contains mostly Japanese characters), then find and sls will automatically do a search in UTF-16.

Of course, if there is a lot of existing batch code, you can call PowerShell from cmd like this:
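The size claim is easy to check for any sample line. A minimal sketch (the helper name is hypothetical; Python's "utf-16" codec writes a BOM, so the figure includes it):

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in UTF-8 and in UTF-16 with a BOM."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16")),  # includes the 2-byte BOM
    }


if __name__ == "__main__":
    print(encoded_sizes("哀 アイ,あわれ,あわれむ"))
```

Kanji and kana take three bytes each in UTF-8 but two in UTF-16, while ASCII characters take one byte in UTF-8 and two in UTF-16, so the saving depends on how much of the file is Japanese text.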

powershell -Command "Select-String '哀'  JouyouKanjiReadings.txt"

But if it's entirely new, then please just avoid the hassle and use PowerShell.


