Delete Only Fully Formed Line Ranges from a Text File While Ignoring Those That Only Have a Start Delimiter

This becomes easier if you reverse the file linewise. In the reversed file each END comes before its START, so the range /END/,/START/ can only begin at a real END, and a START that has no END never triggers a deletion:

$ tac test.txt | sed '/END/,/START/d' | tac
START
text1
text2
text3
text5
text6
START
test7
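
For anyone who would rather do this without sed, here is a minimal Python sketch of the same reversal trick. The function name delete_complete_ranges and the sample input are made up for illustration (the input is just one that is consistent with the output above), and markers are matched as plain substrings rather than regexes.

```python
def delete_complete_ranges(lines, start="START", end="END"):
    """Drop every start..end block, keep blocks that have no end marker."""
    kept = []
    deleting = False
    # Walk the file backwards, like `tac file | sed '/END/,/START/d' | tac`.
    for line in reversed(lines):
        if deleting:
            # Inside a range: drop lines up to and including the start marker.
            if start in line:
                deleting = False
            continue
        if end in line:
            # An end marker opens a range in the reversed file.
            deleting = True
            continue
        kept.append(line)
    kept.reverse()
    return kept


# Hypothetical input: one unterminated block, one complete block, one unterminated.
sample = ["START", "text1", "text2", "text3",
          "START", "text4", "END",
          "text5", "text6", "START", "test7"]
print("\n".join(delete_complete_ranges(sample)))
```

As with the sed version, an END that has no START before it would delete everything above it, so this is only a sketch for well-behaved input.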

How to get block printed out with sed

sed -ne '/START/,/END/p' test.txt

is the typical solution. It applies the p command to every line from the first line that matches START through the next line that matches END, both inclusive.

Excluding the start and end of the range is not very clean in sed. One approach is to explicitly match them:

sed -ne '/START/,/END/{/START/d; /END/d; p;}' test.txt

but for this particular use case it's probably cleaner to include those lines in the range of lines that are explicitly deleted:

sed -e '1,/START/d' -e '/END/,$d' test.txt
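
If it helps to see the "inside the range, minus the delimiters" logic spelled out, here is a rough Python sketch of what the `{/START/d; /END/d; p;}` variant does. The name lines_between is hypothetical, and markers are matched as substrings, which is an assumption on my part.

```python
def lines_between(lines, start="START", end="END"):
    """Return the lines strictly inside each start..end range."""
    inside = False
    result = []
    for line in lines:
        if not inside:
            # Outside a range only a start marker matters, and it is not kept.
            inside = start in line
            continue
        if end in line:
            # The end marker closes the range and is not kept either.
            inside = False
        elif start not in line:
            # Mirror /START/d: a repeated start marker inside the range is dropped.
            result.append(line)
    return result


print(lines_between(["a", "START", "x", "y", "END", "b", "START", "z", "END", "c"]))
```

Like the sed range, an unterminated START keeps printing until the end of the input.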

Using sed in Linux to extract lines from a log file

This might work for you (GNU sed):

sed -n '/^..:..:..\./{N;/Summary Report/!D;:a;N;/Sample Text4/!ba;s/\n/&    /gp}' file

Switch off automatic printing. If the current line starts with a timestamp, append the next line; if the result does not contain Summary Report, delete the first line and restart the cycle with what is left. Otherwise, keep appending lines until one contains Sample Text4, indent every line but the first by four spaces, print and repeat.
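
The same steps can be sketched in Python for readers who find the sed one-liner hard to follow. This is a simplified reading, not a line-for-line port: extract_reports is a hypothetical name, the header is only looked for on the line right after the timestamp, and the sample log is invented because the real log is not shown here.

```python
import re

# Same pattern as the sed address: hh:mm:ss followed by a dot.
TIMESTAMP = re.compile(r"^..:..:..\.")


def extract_reports(lines, header="Summary Report", last="Sample Text4"):
    """Collect timestamped blocks whose second line is the report header."""
    reports = []
    i = 0
    while i < len(lines):
        is_start = (
            TIMESTAMP.match(lines[i]) is not None
            and i + 1 < len(lines)
            and header in lines[i + 1]
        )
        if not is_start:
            i += 1
            continue
        # Gather lines until the closing line shows up.
        j = i + 1
        while j < len(lines) and last not in lines[j]:
            j += 1
        if j == len(lines):
            break  # unterminated block: nothing is emitted for it
        # Indent every line but the first by four spaces.
        reports.append("\n    ".join(lines[i:j + 1]))
        i = j + 1
    return reports


# Invented log lines, purely for illustration.
log = ["10:00:01.123 INFO", "noise",
       "10:00:02.456 INFO", "Summary Report", "Sample Text1", "Sample Text4",
       "10:00:03.000 INFO", "other"]
print("\n".join(extract_reports(log)))
```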

Skip or delete line/row in CSV file if the line/row does not start with established character in c#

~~So you want to skip all lines that start with a pipe?~~

List<List<string>> CSV = CSVDump
  .Where(x => !x.StartsWith('|'))
  .Select(x => x.Split('|').ToList()).ToList();

So you want to keep anything that starts with a number, an N or a pipe?

List<List<string>> CSV = CSVDump
  .Where(x => x.Length > 0 && "0123456789N|".Contains(x[0]))
  .Select(x => x.Split('|').ToList()).ToList();

In response to Steve's concerns about performance etc, perhaps the best route to go is:

int posNewColumn = 3;

string input = @"C:\Temp\SO\import.csv";
string output = @"C:\Temp\SO\out.csv";

using (var dest = File.CreateText(output))
{  
    bool adjust = true;

    foreach (string s in File.ReadLines(input))
    {
        string line = s; //copy enum variable so we can adjust it

        if(line.Length == 0 || !"0123456789N|".Contains(line[0])) //skip zero len or line not begin with number/pipe/N
          continue;

        if(adjust)
        {
          string[] bits = line.Split('|');
          
          if(line.StartsWith("N"))
            bits[posNewColumn] += "|END DATE HOUR|LENGHT";
          else
            bits[posNewColumn] += "||";
          
          line = string.Join("|", bits);
        } 

        if(line.StartsWith("|Table2"))
          adjust = false;

        dest.WriteLine(line);
    } 
}

This requires minimal memory and processing: we don't split every line needlessly, we don't create thousands of Lists, and we don't try to hold the whole file in memory. We just read lines in, maybe write them out, and adjust them only while we haven't yet encountered Table2.

Note: I have written it but not debugged/tested it. It might have a typo or a minor logic error; treat it as pseudocode.
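
To make the streaming idea easy to try out, here is a small Python transliteration of the loop above. It is only a sketch: filter_and_extend is a made-up name, it works on any iterable of lines rather than on the real file, and it assumes every kept line has more than pos_new_column fields. The header text (including its spelling) is copied from the C# version.

```python
def filter_and_extend(lines, pos_new_column=3):
    """Yield kept lines one at a time, padding columns until Table2 is seen."""
    adjust = True
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip empty lines and lines not starting with a digit, N or a pipe.
        if not line or line[0] not in "0123456789N|":
            continue
        if adjust:
            bits = line.split("|")
            if line.startswith("N"):
                bits[pos_new_column] += "|END DATE HOUR|LENGHT"
            else:
                bits[pos_new_column] += "||"
            line = "|".join(bits)
        # As in the C# loop, the Table2 line itself is still adjusted.
        if line.startswith("|Table2"):
            adjust = False
        yield line


rows = ["N|a|b|c|d", "# comment", "", "1|x|y|z|w", "|Table2|p|q|r", "2|x|y|z"]
for row in filter_and_extend(rows):
    print(row)
```

Because it is a generator, only one line is held at a time, which is the same point the File.ReadLines version is making.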

python pandas read text file, skip particular lines

Here is an attempt to 'craft magic'. The idea is to call read_csv with increasing skiprows values until it both parses without an exception and yields more than one column.

import pandas as pd
from io import StringIO
data = StringIO(
'''
========================================= 
hello 123
========================================= 
Dir: /x/y/z/RTchoice/release001/data 
Date: 17-Mar-2020 10:0:08 
Output File: /a/b/c/filename.txt 
N: 2842
-----------------------------------------
Subject col1    col2    col3    
001 10.00000    1.00000 3.00000 
002 11.00000    2.00000 4.00000
''')

for n in range(1000):
    try:
        data.seek(0)
        df = pd.read_csv(data, delimiter=r"\s+", skiprows=n)
    except Exception:
        print(f'skiprows = {n} failed (exception)')   
    else:
        if len(df.columns) == 1: # do not let it get away with a single-column df
            print(f'skiprows = {n} failed (single column)')
        else:   
            break
print('\n', df)

output:


skiprows = 0 failed (exception)
skiprows = 1 failed (exception)
skiprows = 2 failed (exception)
skiprows = 3 failed (exception)
skiprows = 4 failed (exception)
skiprows = 5 failed (exception)
skiprows = 6 failed (exception)
skiprows = 7 failed (exception)
skiprows = 8 failed (single column)

    Subject  col1  col2  col3
0        1  10.0   1.0   3.0
1        2  11.0   2.0   4.0
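
A less brute-force alternative (a different technique from the loop above, not a refinement of it) is to look for the header line yourself and hand its index to skiprows. The sketch below assumes you know the first column is called Subject; find_header_row is a hypothetical helper, and it uses only the standard library so it runs without pandas.

```python
SAMPLE = '''
=========================================
hello 123
=========================================
Dir: /x/y/z/RTchoice/release001/data
Date: 17-Mar-2020 10:0:08
Output File: /a/b/c/filename.txt
N: 2842
-----------------------------------------
Subject col1    col2    col3
001 10.00000    1.00000 3.00000
002 11.00000    2.00000 4.00000
'''


def find_header_row(text, first_column="Subject"):
    """Return the 0-based index of the line whose first field is first_column."""
    for index, line in enumerate(text.splitlines()):
        # split() with no argument collapses runs of whitespace.
        if line.split()[:1] == [first_column]:
            return index
    raise ValueError(f"no line starts with {first_column!r}")


print(find_header_row(SAMPLE))
```

For this sample the index is the same value the loop settles on, so it could then be passed as skiprows to read_csv.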

Edit within multi-line sed match

I think you would be better off with Perl.

Specifically because you can work 'per record' by setting $/: if your records are delimited by blank lines, set it to \n\n.

Something like this:

#!/usr/bin/env perl
use strict;
use warnings;

local $/ = "\n\n";
while (<>) {

    #multi-lines of text one at a time here.
    if (m/^start :\d+/) {
        s/(modify \d+)/$1 Appended_DIR\//g;
        s/(delete) /$1 Appended_DIR\//g;
    }
    print;
}

Each iteration of the loop will pick out a blank line delimited chunk, check if it starts with a pattern, and if it does, apply some transforms.

It reads data from STDIN via a pipe, or from a file named on the command line (myscript.pl somefile).

Output is to STDOUT and you can redirect that in the normal way.
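
For comparison, the per-record idea can be sketched in Python as well. This is an approximation rather than an exact port: process_records and transform are made-up names, the input is read as one string and split on blank lines instead of being streamed, and the sample text is invented.

```python
import re


def transform(text):
    """Apply the same two substitutions as the Perl script."""
    text = re.sub(r"(modify \d+)", r"\1 Appended_DIR/", text)
    text = re.sub(r"(delete) ", r"\1 Appended_DIR/", text)
    return text


def process_records(data):
    """Split on blank lines and rewrite only records that begin with 'start :<n>'."""
    records = data.split("\n\n")
    out = []
    for record in records:
        # re.match anchors at the start of the record, like ^ without /m in Perl.
        if re.match(r"start :\d+", record):
            record = transform(record)
        out.append(record)
    return "\n\n".join(out)


# Invented sample: only the first record starts with 'start :'.
sample = "start :1\nmodify 12 file\ndelete file2\n\nother :2\nmodify 3 x\n"
print(process_records(sample), end="")
```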

The limiting factors when processing files this way are typically:

Data transfer from disk
Pattern complexity

The more complex a pattern, and especially if it has variable matching going on, the more backtracking the regex engine has to do, which can get expensive. Your transforms are simple, so how you package them doesn't make much difference, and your limiting factor will likely be disk I/O.

(If you want to do an in-place edit, you can with this approach.)

If, as noted, you can't rely on a record separator, you can use Perl's range operator instead (other answers already do this; I'm just expanding on it a bit):

#!/usr/bin/env perl
use strict;
use warnings;

while (<>) {

    if ( /^start :/ .. /^$/ ) {
        s/(modify \d+)/$1 Appended_DIR\//g;
        s/(delete) /$1 Appended_DIR\//g;
    }
    print;
}

We don't change $/ any more, so it keeps its default of 'each line'. What we add is a range operator that tests "am I currently between these two regular expressions?". It is toggled true when you hit a "start" line and goes back to false after a blank line (assuming that's where you would want to stop).

It applies the pattern transformation if this condition is true; if it is not, it skips the substitutions and just carries on printing.
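
The toggle behind the range operator is easy to picture as a small state machine. Here is a hedged Python sketch of it; process_lines and transform are hypothetical names, lines are assumed to have their trailing newline already stripped, and the sample data is invented.

```python
import re


def transform(line):
    """Apply the same two substitutions as the Perl script."""
    line = re.sub(r"(modify \d+)", r"\1 Appended_DIR/", line)
    line = re.sub(r"(delete) ", r"\1 Appended_DIR/", line)
    return line


def process_lines(lines):
    """Rewrite lines from a 'start :' line through the next blank line."""
    in_range = False
    out = []
    for line in lines:
        if in_range or line.startswith("start :"):
            # The blank line is still inside the range; it closes it for the next line.
            in_range = line != ""
            line = transform(line)
        out.append(line)
    return out


# Invented sample: the 'modify 3 x' line sits outside any range.
sample = ["start :1", "modify 12 a", "", "modify 3 x", "start :2", "delete y"]
print("\n".join(process_lines(sample)))
```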
