Why Can't You Use Cat to Read a File Line by Line Where Each Line Has Delimiters

The problem is not in cat, nor in the for loop per se; it is in the command substitution (back quotes or $(...)). When you write either:

for i in `cat file`

or (better):

for i in $(cat file)

or (in bash):

for i in $(<file)

the shell executes the command and captures its output as a single string, then splits that string into words at the characters in $IFS (by default space, tab, and newline). If you want each iteration to receive a whole line in $i, you either have to fiddle with IFS or use a while loop. The while loop is better if there's any danger that the files processed will be large; it doesn't have to read the whole file into memory at once, unlike the versions using $(...).

IFS='
'
for i in $(<file)
do echo "$i"
done

The quotes around "$i" are generally a good idea. In this context, with the modified $IFS, they aren't actually critical, but good habits are good habits even so. They do matter in the following script:

old="$IFS"
IFS='
'
for i in $(<file)
do
    (
        IFS="$old"
        echo "$i"
    )
done

when the data file contains multiple spaces between words:

$ cat file
abc                  123,         comma
the   quick   brown   fox
jumped   over   the   lazy   dog
comma,   comma
$

Output:

$ bash bq.sh
abc                  123,         comma
the   quick   brown   fox
jumped   over   the   lazy   dog
comma,   comma
$

Without the double quotes, each run of spaces collapses to a single space:

$ cat bq.sh
old="$IFS"
IFS='
'
for i in $(<file)
do
    (
        IFS="$old"
        echo $i
    )
done
$ bash bq.sh
abc 123, comma
the quick brown fox
jumped over the lazy dog
comma, comma
$
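
A minimal sketch of the difference between the two loop styles follows. The temporary file and its contents are assumptions made up for illustration; only standard shell features are used.

```shell
# Hypothetical sample data: two lines, three words in total.
tmp=$(mktemp)
printf 'the quick\nfox\n' > "$tmp"

# for + command substitution: the output is split on every $IFS character,
# so each word becomes a separate iteration.
words=0
for i in $(cat "$tmp"); do
    words=$((words + 1))
done

# while + read: one iteration per line, whitespace left intact.
lines=0
while IFS= read -r line; do
    lines=$((lines + 1))
done < "$tmp"

echo "for loop iterations:   $words"
echo "while loop iterations: $lines"
rm -f "$tmp"
```

With the default $IFS, the for loop runs once per word while the while loop runs once per line.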

Read a file line by line assigning the value to a variable

The following reads a file line by line, assigning each line to the variable line:

while IFS= read -r line; do
    echo "Text read from file: $line"
done < my_filename.txt

This is the standard form for reading lines from a file in a loop. Explanation:

  • IFS= (or IFS='') prevents leading/trailing whitespace from being trimmed.
  • -r prevents backslash escapes from being interpreted.
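
To see what the two options protect against, here is a small sketch. The input string is a made-up example with surrounding spaces and a backslash; nothing else is assumed beyond the standard read builtin.

```shell
# Hypothetical input line: leading/trailing spaces and a literal backslash.
input='  a\tb  '

# With IFS= and -r the line is stored exactly as written.
exact=$(printf '%s\n' "$input" | { IFS= read -r line; printf '%s' "[$line]"; })

# Without them, surrounding whitespace is trimmed and the backslash is consumed.
mangled=$(printf '%s\n' "$input" | { read line; printf '%s' "[$line]"; })

printf 'with IFS= and -r: %s\n' "$exact"
printf 'plain read:       %s\n' "$mangled"
```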

Or you can put it in a bash helper script that takes the filename as its first argument. Example contents:

#!/bin/bash
while IFS= read -r line; do
    echo "Text read from file: $line"
done < "$1"

If the above is saved to a script with filename readfile, it can be run as follows:

chmod +x readfile
./readfile filename.txt

If the file isn't a standard POSIX text file (that is, its last line is not terminated by a newline character), the loop can be modified to handle a trailing partial line:

while IFS= read -r line || [[ -n "$line" ]]; do
    echo "Text read from file: $line"
done < "$1"

Here, || [[ -n $line ]] prevents the last line from being ignored if it doesn't end with a \n (since read returns a non-zero exit code when it encounters EOF).
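
The effect can be checked with a short sketch. The sample data is hypothetical, and the portable [ -n "$line" ] test is used in place of [[ ... ]] so the snippet also runs in non-Bash shells.

```shell
# Hypothetical input: two lines, the second lacking a trailing newline.
tmp=$(mktemp)
printf 'first\nlast' > "$tmp"

# Plain loop: read fails at EOF on the partial line, so "last" is skipped.
plain=0
while IFS= read -r line; do
    plain=$((plain + 1))
done < "$tmp"

# Guarded loop: the -n test keeps the partial line that read stored in $line.
guarded=0
while IFS= read -r line || [ -n "$line" ]; do
    guarded=$((guarded + 1))
done < "$tmp"

echo "plain: $plain, guarded: $guarded"
rm -f "$tmp"
```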

If the commands inside the loop also read from standard input, the file descriptor used by read can be changed to something else (avoid the standard file descriptors 0, 1, and 2), e.g.:

while IFS= read -r -u3 line; do
    echo "Text read from file: $line"
done 3< "$1"

(Non-Bash shells might not know read -u3; use read <&3 instead.)
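
A sketch of why the separate descriptor matters, using the portable read <&3 form. The sample file is hypothetical, and standard input is pointed at /dev/null in the second loop only so the snippet never waits on a terminal.

```shell
# Hypothetical two-line input file.
tmp=$(mktemp)
printf 'one\ntwo\n' > "$tmp"

# Problem: the inner cat shares stdin with read and drains the rest of the file.
shared=0
while IFS= read -r line; do
    cat > /dev/null
    shared=$((shared + 1))
done < "$tmp"

# Fix: read takes its input from descriptor 3, so the inner command
# cannot consume the lines meant for the loop.
separate=0
while IFS= read -r line <&3; do
    cat > /dev/null
    separate=$((separate + 1))
done 3< "$tmp" < /dev/null

echo "shared stdin: $shared iteration(s), descriptor 3: $separate iteration(s)"
rm -f "$tmp"
```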

How to cat EOF a file containing code?

You only need a minimal change; single-quote the here-document delimiter after <<.

cat <<'EOF' >> brightup.sh

or equivalently backslash-escape it:

cat <<\EOF >>brightup.sh

Without quoting, the here document will undergo variable substitution, backticks will be evaluated, etc, like you discovered.

If you need to expand some, but not all, values, you need to individually backslash-escape the ones you want to protect. In the following, $(date) is evaluated before cat runs, while \$HOME is not:

cat <<EOF >>brightup.sh
#!/bin/sh
# Created on $(date)
echo "\$HOME will not be evaluated because it is backslash-escaped"
EOF

will produce

#!/bin/sh
# Created on Fri Feb 16 11:00:18 UTC 2018
echo "$HOME will not be evaluated because it is backslash-escaped"

As suggested by @fedorqui, here is the relevant section from man bash:

Here Documents

This type of redirection instructs the shell to read input from the
current source until a line containing only delimiter (with no
trailing blanks) is seen. All of the lines read up to that point are
then used as the standard input for a command.

The format of here-documents is:

      <<[-]word
              here-document
      delimiter

No parameter expansion, command substitution, arithmetic expansion,
or pathname expansion is performed on word. If any characters in word
are quoted, the delimiter is the result of quote removal on word, and
the lines in the here-document are not expanded. If word is
unquoted, all lines of the here-document are subjected to parameter
expansion, command substitution, and arithmetic expansion. In the
latter case, the character sequence \<newline> is ignored, and \
must be used to quote the characters \, $, and `.
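
The quoted and unquoted forms can be compared side by side in a small sketch. The variable name and the text are invented for the example.

```shell
name=world   # hypothetical variable used only for this sketch

# Unquoted delimiter: $name is expanded, \$name is left as a literal $name.
expanded=$(cat <<EOF
hello $name and \$name
EOF
)

# Quoted delimiter: nothing in the body is expanded or unescaped.
literal=$(cat <<'EOF'
hello $name and \$name
EOF
)

printf '%s\n' "$expanded" "$literal"
```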

How to get the part of a file after the first line that matches a regular expression

The following prints everything from the first line matching TERMINATE through the end of the file:

sed -n -e '/TERMINATE/,$p'

Explained: -n disables sed's default behavior of printing each line after running the script on it; -e introduces a script; /TERMINATE/,$ is an address (line) range running from the first line that matches the regular expression TERMINATE (as grep would match it) to the end of the file ($); and p is the print command, which prints the current line.

This prints from the line that follows the first line matching TERMINATE through the end of the file (that is, from AFTER the matching line to EOF, NOT including the matching line):

sed -e '1,/TERMINATE/d'

Explained: 1,/TERMINATE/ is an address (line) range running from the first line of the input to the first line matching the regular expression TERMINATE, and d is the delete command, which deletes the current line and skips to the next one. Because sed prints lines by default, everything after the TERMINATE line is printed through the end of input.

If you want the lines before TERMINATE:

sed -e '/TERMINATE/,$d'

And if you want both lines before and after TERMINATE in two different files in a single pass:

sed -e '1,/TERMINATE/w before
/TERMINATE/,$w after' file

Both the before and after files will contain the TERMINATE line, so strip it when you process each one (a negative count for head -n is a GNU extension):

head -n -1 before
tail -n +2 after

If you do not want to hard-code the filenames in the sed script, you can:

before=before.txt
after=after.txt
sed -e "1,/TERMINATE/w $before
/TERMINATE/,\$w $after" file

But then you have to escape the $ meaning the last line so the shell will not try to expand the $w variable (note that we now use double quotes around the script instead of single quotes).

Note that the newline after each filename in the script is important: it is how sed knows where the filename ends.
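
The three range commands shown earlier can be tried on a small sample. The five-line input below is hypothetical; the sed invocations are the ones from the text.

```shell
# Hypothetical sample input with a single TERMINATE marker line.
sample=$(printf 'a\nb\nTERMINATE\nc\nd\n')

# From the matching line to the end, including the match.
from_match=$(printf '%s\n' "$sample" | sed -n -e '/TERMINATE/,$p')

# Only the lines after the matching line.
after_match=$(printf '%s\n' "$sample" | sed -e '1,/TERMINATE/d')

# Only the lines before the matching line.
before_match=$(printf '%s\n' "$sample" | sed -e '/TERMINATE/,$d')

printf 'from match:\n%s\n\nafter match:\n%s\n\nbefore match:\n%s\n' \
    "$from_match" "$after_match" "$before_match"
```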

How would you replace the hardcoded TERMINATE by a variable?

You would make a variable for the matching text and then do it the same way as the previous example:

matchtext=TERMINATE
before=before.txt
after=after.txt
sed -e "1,/$matchtext/w $before
/$matchtext/,\$w $after" file

To use a variable for the matching text with the earlier examples:

## Print the line containing the matching text, till the end of the file:
## (from the matching line to EOF, including the matching line)
matchtext=TERMINATE
sed -n -e "/$matchtext/,\$p"
## Print from the line that follows the line containing the
## matching text, till the end of the file:
## (from AFTER the matching line to EOF, NOT including the matching line)
matchtext=TERMINATE
sed -e "1,/$matchtext/d"
## Print all the lines before the line containing the matching text:
## (from line-1 to BEFORE the matching line, NOT including the matching line)
matchtext=TERMINATE
sed -e "/$matchtext/,\$d"

The important points about replacing text with variables in these cases are:

  1. Variables ($variablename) enclosed in single quotes ['] won't "expand" but variables inside double quotes ["] will. So, you have to change all the single quotes to double quotes if they contain text you want to replace with a variable.
  2. The sed ranges also contain a $ and are immediately followed by a letter like: $p, $d, $w. They will also look like variables to be expanded, so you have to escape those $ characters with a backslash [\] like: \$p, \$d, \$w.
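
Both points can be observed in a short sketch. The sample input is made up; the two sed commands differ only in their quoting.

```shell
# Hypothetical sample input and match text.
matchtext=TERMINATE
sample=$(printf 'a\nTERMINATE\nb\n')

# Single quotes: sed sees the literal characters $matchtext, so nothing matches.
single=$(printf '%s\n' "$sample" | sed -n -e '/$matchtext/,$p')

# Double quotes: the shell substitutes TERMINATE, and \$ reaches sed as a plain $.
double=$(printf '%s\n' "$sample" | sed -n -e "/$matchtext/,\$p")

printf 'single quotes: [%s]\n' "$single"
printf 'double quotes: [%s]\n' "$double"
```

Keep in mind that the variable's value is pasted into the sed script as-is, so a value containing / or regular-expression metacharacters would need escaping first.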

Using multiple delimiters in awk

The delimiter can be a regular expression.

awk -F'[/=]' '{print $3 "\t" $5 "\t" $8}' file

Produces:

tc0001  tomcat7.1  demo.example.com
tc0001  tomcat7.2  quest.example.com
tc0001  tomcat7.5  www.example.com
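
Since the input file is not shown here, the following sketch uses a hypothetical line shaped to yield the first output row, to show how the bracket expression splits on either character. Note that the leading / makes the first field empty.

```shell
# Hypothetical input line; fields are separated by either "/" or "=".
line='/logs/tc0001/tomcat/tomcat7.1/conf/app.name=demo.example.com'

# Fields: $1 is empty, $2=logs, $3=tc0001, $4=tomcat, $5=tomcat7.1,
# $6=conf, $7=app.name, $8=demo.example.com
result=$(printf '%s\n' "$line" | awk -F'[/=]' '{print $3 "\t" $5 "\t" $8}')

printf '%s\n' "$result"
```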

Java delimiter while reading text file - regex/or not?

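Use the String#split or Pattern#split method.
For example:

String[] list = "AB523:[joe, pierre][charlie][dogs,cat]".split("[:\\[\\]]+");
// Splits on runs of ':', '[' and ']'; prints AB523, "joe, pierre", charlie and "dogs,cat" on separate lines
for (String s : list) {
    System.out.println(s);
}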

how to split one line with customized separator and assign to variables in BASH?

I would suggest using shell arrays for storing individual field values and slightly different awk for this:

IFS=$'\03' read -ra arr < <(awk -F'#\\$' -v OFS='\03' '{$1=$1}1' file)

# check array content
declare -p arr

Output:

declare -a arr='([0]="hah a" [1]="hehe" [2]="hoho")'

We use the control character \03 as awk's output field separator and set IFS to the same character, so that read splits the fields on \03.

Alternatively, you can use sed instead of awk (the \x03 escape is a GNU sed extension):

IFS=$'\03' read -ra arr < <(sed 's/#\$/\x03/g' file)

Using a NUL byte as the separator (readarray -d requires Bash 4.4 or later):

readarray -d $'\0' arr < <(
awk -F'#\\$' -v OFS='\0' '{ORS=OFS; $1=$1} 1' file)

Read txt file to pandas dataframe with unique delimiter and end of line

As matheubv pointed out, there seems to be no option that solves this with pd.read_csv alone. However, it can easily be fixed with a few lines of code. Just open the file (sample.csv in the example) and normalize it with the string method .replace(). Afterwards you can build the DataFrame from the resulting string (string_data) with a very basic list comprehension.

Hope this work-around helps you

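import pandas as pd
from pathlib import Path

p = Path("Data/sample.csv")

with p.open() as f:
    # Map the custom field separator to ';' and the custom end-of-line marker to '\n'
    string_data = f.readline().replace('#%#', ';').replace('##@##', '\n')

# One row per line, one column per ';'-separated field
df = pd.DataFrame([x.split(';') for x in string_data.split('\n')])
print(df)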

Output:

       0      1      2       3
0    cat    dog    rat     cow
1    red   blue  green  yellow
2  north  south   east    west

Import text file with uneven column number and complicated delimiter

To provide another example in addition to the one provided by @JD Long, you could use a regular expression plus a list comprehension:

import re, pandas as pd

string = """
apple  pear  banana  peach orange grape
dog  cat  white horse
salmon
tiger  lion  eagle  hawk monkey
"""

# Columns are separated by runs of two or more spaces
rx = re.compile(r'''[ ]{2,}''')

items = [rx.split(line) for line in string.split("\n") if line]

df = pd.DataFrame.from_records(items)
print(df)

... which yields:

        0     1            2                   3
0   apple  pear       banana  peach orange grape
1     dog   cat  white horse                None
2  salmon  None         None                None
3   tiger  lion        eagle         hawk monkey

