Can "Text File Busy" Happen When Two Processes Trying to Execute a Perl File in The Same Time

/usr/bin/perl: bad interpreter: Text file busy

I'd guess this is the issue you ran into.

The Linux kernel will generate a "bad interpreter: Text file busy" error if your Perl script (or any other kind of script) is open for writing when you try to execute it.

You don't say what the disk-intensive processes were doing. Is it possible one of them had the script open for read+write access (even if it wasn't actually writing anything)?
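If you want to reproduce the behaviour, a sketch along these lines usually triggers it on Linux (the file name ./hello.pl is only an assumption for illustration):

use strict;
use warnings;

# create a tiny executable script, then keep it open for writing
open my $fh, '>', './hello.pl' or die "open: $!";
syswrite $fh, qq{#!/usr/bin/perl\nprint "hello\\n";\n} or die "write: $!";
chmod 0755, './hello.pl' or die "chmod: $!";

# $fh is still open for writing, so executing the script will typically be
# refused with ETXTBSY, which the shell reports as
# "bad interpreter: Text file busy"
system('./hello.pl') == 0
    or warn "running ./hello.pl failed: $?\n";

close $fh;   # once every write handle is closed, executing it works again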

Jenkins durable task plugin pipeline Text file busy

The issue here appears to be caused by a Java bug, https://bugs.openjdk.java.net/browse/JDK-8068370.

The issue can occur when multiple threads open a file for writing, close it, and then execute it (each thread using its own file). Even if every file is closed "properly", because of how file handles behave across fork/exec, a child process launched from one thread may inherit the handle to another thread's open file, and thus break that thread's later subprocess call.

See similar issues:

  • https://issues.jenkins-ci.org/browse/JENKINS-53387
  • https://issues.jenkins-ci.org/browse/JENKINS-48258?focusedCommentId=324590&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-324590
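A rough Perl analogue of that race, using fork in place of Java threads (a sketch only; ./job.sh is a made-up file name):

use strict;
use warnings;

# create a small executable job script
open my $fh, '>', './job.sh' or die "open: $!";
syswrite $fh, "#!/bin/sh\necho job done\n" or die "write: $!";
chmod 0755, './job.sh' or die "chmod: $!";

# the child stands in for a subprocess spawned by *another* thread;
# it inherits the still-open write handle to ./job.sh
my $pid = fork() // die "fork: $!";
if ($pid == 0) {
    sleep 2;
    exit 0;
}

close $fh;                        # this process closes its handle "properly" ...
system('./job.sh') == 0           # ... yet the exec can still fail with ETXTBSY
    or warn "exec failed: $?\n";  # ("Text file busy"): the child holds the write fd
waitpid $pid, 0;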

How does Perl interact with the scripts it is running?

It's not quite as simple as pavel's answer states, because Perl doesn't actually have a clean division of "first you compile the source, then you run the compiled code"[1]. But the basic point stands: each source file is read from disk in its entirety before any code in that file is compiled or executed, and any subsequent changes to the source file will have no effect on the running program unless you specifically instruct perl to re-load the file and execute the new version's code[2].

[1] BEGIN blocks will run code during compilation, while commands such as eval and require will compile additional code at run time.

[2] Most likely by using eval or do, since require and use check whether the file has been loaded already and ignore it if it has.
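To make [2] concrete, here is a minimal sketch of an explicit re-load (helper.pl is a purely hypothetical file name):

# re-read and re-execute a source file that may have changed on disk;
# unlike require, do FILE does not consult %INC, so it re-reads every time
my $result = do './helper.pl';
warn "could not parse helper.pl: $@" if $@;
warn "could not read helper.pl: $!"  if !$@ && !defined $result && $!;

A plain require or use of the same file would be skipped after the first load, because both record the file in %INC.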

Speeding up separation of large text file based on line content in Bash

Let's generate an example file:

$ seq -f "%.0f" 3000000 | awk -F $'\t' '{print $1 FS "Col_B" FS int(2000*rand())}' >file

That generates a 3 million line file with 2,000 different values in column 3 similar to this:

$ head -n 3 file; echo "..."; tail -n 3 file
1 Col_B 1680
2 Col_B 788
3 Col_B 1566
...
2999998 Col_B 1562
2999999 Col_B 1803
3000000 Col_B 1252

With a simple awk you can generate the files you describe this way:

$ time awk -F $'\t' '{ print $1 " " $2 >> $3; close($3) }' file
real 3m31.011s
user 0m25.260s
sys 3m0.994s

So that awk generates the 2,000 group files in about 3 minutes 31 seconds. Certainly faster than Bash, but it can be made faster still by presorting the file on the third column and writing each group file in one go.

You can run the Unix sort utility in a pipe and feed its output to a script that separates the sorted groups into different files. Use the -s (stable) option with sort so that the value of the third field is the only thing that changes the relative order of the lines.

Since we can assume sort has partitioned the file into groups based on column 3 of the file, the script only needs to detect when that value changes:

$ time sort -s -k3 file | awk -F $'\t' 'fn != ($3 "") { close(fn); fn = $3 } { print $1 " " $2 > fn }'
real 0m4.727s
user 0m5.495s
sys 0m0.541s

Because of the efficiency gained by presorting, the same net process completes in 5 seconds.

If you are sure that the 'words' in column 3 are ASCII only (i.e. you do not need to deal with UTF-8), you can set LC_ALL=C for additional speed:

$ time LC_ALL=C sort -s -k3 file | awk -F $'\t' 'fn != ($3 "") { close(fn); fn = $3 } { print $1 " " $2 > fn }'
real 0m3.801s
user 0m3.796s
sys 0m0.479s
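For readers who would rather stay in Perl, a rough equivalent of the same streaming split might look like this (a sketch only, assuming tab-separated input already sorted on column 3; the script name split_groups.pl is made up for illustration):

#!/usr/bin/perl
# read sorted, tab-separated lines on STDIN and start a new output file
# whenever the value in column 3 changes; that value is used as the file name
use strict;
use warnings;

my $current = '';
my $out;
while (my $line = <STDIN>) {
    chomp $line;
    my ($c1, $c2, $key) = split /\t/, $line;
    if ($key ne $current) {
        close $out if defined $out;
        open $out, '>', $key or die "open $key: $!";
        $current = $key;
    }
    print {$out} "$c1 $c2\n";
}
close $out if defined $out;

It would be invoked as, for example, sort -s -k3 file | perl split_groups.pl.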

From comments:

1) Please add a line to explain why we need the bracketed expression in fn != ($3 ""):

The awk construct fn != ($3 "") { action } is an effective shortcut for fn != $3 || fn == "" { action } (concatenating "" forces the comparison to be done on strings rather than numbers); use whichever form you consider most readable.

2) Not sure if this also works if the file is larger than the available memory, so this might be a limiting factor:

I ran the first and the last awk with 300 million records and 20,000 output files. The last one with sort did the task in 12 minutes. The first took 10 hours...

It may be that the sort version actually scales better, since opening, appending to, and closing one of the 20,000 files 300 million times takes a while. It is more efficient to group similar data together and stream it out in one go.

3) I was thinking about sort earlier but then felt it might not be the fastest because we have to read the whole file twice with this approach:

That is true for purely random data; if the actual data is somewhat ordered, there is a tradeoff in reading the file twice, and the first awk would be significantly faster with less random data. But it also takes time to determine whether the file is sorted. If you know the file is mostly sorted, use the first; if it is likely somewhat disordered, use the last.

Locking a file for both read and write

flock will do what you need. Note that it is a cooperative measure, so other processes will need to use the same scheme; otherwise there will be nothing to stop them from doing what they like with the file.

With regard to losing the lock on the file, you have two obvious choices:

  • Acquire the lock on a separate file that exists only to control access to the primary file. This is probably the tidiest method.

  • Open the file with both read and write access using a mode of +<. That will allow you to read through the file and then rewrite it, after using seek and truncate.

Each process should try to acquire an exclusive lock with flock $fh, LOCK_EX, and will block until it is its turn to access the file.

You should use

use Fcntl qw/ :flock :seek /;

to import the relevant constants for these operations.
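By default flock blocks until the lock is granted. If you would rather give up (or retry later) instead of waiting, LOCK_NB can be combined with LOCK_EX. A minimal sketch, assuming $fh is a handle you have already opened:

use Fcntl qw/ :flock /;

# try to take the lock without waiting; flock returns false straight away
# if another process already holds it
if ( flock $fh, LOCK_EX | LOCK_NB ) {
    # ... we own the lock: read and update the file here ...
    flock $fh, LOCK_UN;   # or simply close $fh, which also releases the lock
}
else {
    warn "file is locked by another process; trying again later\n";
}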

Here's an example of the first method, which uses a separate lock file to control access to a data file that holds just one record containing a count. Note that the lock file must be created beforehand, outside these processes, as any attempt to check whether it exists and create it if it doesn't would cause a race condition between the sharing processes.

use strict;
use warnings;
use 5.010;
use autodie;

use Fcntl qw/ :flock /;

my ($data_file, $lock_file) = qw/ data.txt lockfile.lock /;

open my $lock_fh, '<', $lock_file;
flock $lock_fh, LOCK_EX;

open my $data_fh, '<', $data_file;
chomp(my $record = <$data_fh>);

open $data_fh, '>', $data_file;
print $data_fh ++$record, "\n";

close $data_fh;

close $lock_fh;

And here's an example of the second method, which does the same thing but without using a separate lock file, instead opening the data file read/write. In the same way that the lock file above must be created independently of the sharing processes, the data file here must be created as a separate action. The locking system won't prevent two processes from creating a new file simultaneously, so it cannot be left to them to do it.

use strict;
use warnings;
use 5.010;
use autodie;

use Fcntl qw/ :flock :seek /;

my $data_file = 'data.txt';

open my $data_fh, '+<', $data_file;
flock $data_fh, LOCK_EX;

chomp(my $record = <$data_fh>);

seek $data_fh, 0, SEEK_SET;
truncate $data_fh, 0;
print $data_fh ++$record, "\n";

close $data_fh;

How to write a program that reads a text file and substitutes certain words and then outputs the text file under a different name

Using various shortcuts (which are probably forbidden for your assignment) you could do:

perl -pe "s/new/old/gi" WK5input.txt > output.txt

For the special situation of your homework assignment, however, you got reasonably close to the goal, but many little details had to be fixed.

(Note that I missed one of those details, the case insensitivity. That would not have happened if you had provided appropriate sample input and desired output. You might want to read up on [mcve].)

I have listed below some assumptions about the special side-requirements that distinguish your assignment from the usual goal of producing an efficient solution.

I intentionally provide a solution here which is as close as possible to your own version, because I believe seeing small changes which make your version work is more helpful to you than the optimised version above. That one is so compact that it completely hides how close you got.

Check the other answers for intermediate versions, neither fully optimised nor homework-oriented.
For an optimised solution, some people would even skip perl and use awk or, for the purpose of historical brain exercise, sed.

These are a few assumptions about implicit rules,

which I guess you are expected to obey:

  • write a perl program which covers all the requirements,

    i.e. no use of shell features, no other tools
  • do not use commandline parameters

    (just because you did not in your attempt; maybe you have not covered them in your course yet)
  • use line by line processing

    (just because you did so in your attempt; maybe you have not covered other methods in your course yet)
  • do not use "obscure perl magic", e.g. the commandline option -pe,

    though they otherwise are highly convenient

    (a special case/interpretation of the "no shell" rule)

If any of these guessed rules are not applicable to your assignment, have a look at the other answers. They provide interesting alternatives.

# nice touch, using these is very good practice
use strict;
use warnings;

# Not necessary, but good practice: collect the "my" declarations in one place near the top.
# This supports self-documenting code.
# Declaring each variable with "my" at its first use is an alternative preferred by some.
my $filename = 'WK5input.txt'; # file name for the input file
my $fhin; # file handle for the input file
my $fhout; # file handle for the output file
my $row; # variable with currently processed line

# Prefer to use the three parameter version of "open", explicitly stating the mode.
open($fhin, '<', $filename) # two file handles are needed, use different names
    or die 'Could not open input file "'.$filename.'" '.$!;
# I chose to concatenate both variables (file name and failure reason) explicitly
# to some text inside '...', which the perl interpreter can handle more efficiently.
# This saves the work of interpolating text inside "..." and is more self-explanatory,
# i.e. it is easier to understand at first reading what the code does.

# The second file handle is set up here, to read from the input and write to the output at the same time.
open($fhout, '>', 'output.txt')
    or die 'Could not open output file output.txt '.$!;

while ($row = <$fhin>) { # you are reading into a dedicated variable here ...

    # There was the code "chomp $row;" here.
    # It removes the newline from the end of the line, if there is one.
    # It is not needed if you are going to append the newline again before printing.

    $row =~ s/new/old/gi; # ... you need to use the variable here, instead of "$_"
    # The additional "i" behind the "g" makes the search for "new", "New", "NEW" case insensitive.
    # Credits to another answer and the comments for finding the requirement I missed.

    # I accept the requirement to replace new->old, though it seems strange.
    # I itched to replace old->new.

    print $fhout $row; # print into the output file instead of to stdout
    # You had an additional "\n" at the end, which was in fact needed, but only
    # because of the "chomp" a few lines above.
    # Also, you had the variable in quotes, i.e. "$row\n". That spends some time on interpreting
    # the text inside the quotes. If you only want to print the content of a variable,
    # print the variable outside of quotes.
}

# There was the code "$_ =~ s/new/old/g;" here, it was moved into the loop.
# Compare to a different answer to see a solution which used a single global replace on
# a variable with all the input. Instead, I decided to go for line by line processing in
# a loop, because it seemed closer to your approach.

# There was the code "open( $fh, '>', 'output.txt');" here it was moved to before the loop.
# There was the code "print $filename ;" here. It was deleted, because it seems not to be
# required by the assignment. Printing the modified content is done line by line inside the
# loop.

# Closing file handles instead of file name:
close $fhin;
close $fhout;

(StackOverflow recommends not to immediately provide complete solutions for homework questions.

I interpret that as "... for homework questions which are not already close to a solution".

I provided a solution because I considered your attempt close enough.

Close enough to show you the last details in comparison to something which works. Take that as a compliment.

StackOverflow also recommends taking students a little further along the way they can already see, in a helpful manner. Providing only the optimised, fine-tuned final version, like the one at the start of this answer, is not constructive for them.

This is of course no excuse for any really bad code in my answer, so everybody should feel free to point it out. When editing, however, please stick to my goal of staying close to the OP's attempt.)

How to prevent a Perl script from running more than once in parallel

use strict;
use warnings;
use Fcntl ':flock';

# Take an exclusive, non-blocking lock on the script's own DATA handle.
# A second copy of the script cannot obtain the lock and dies immediately.
flock(DATA, LOCK_EX|LOCK_NB) or die "There can be only one! [$0]";

# ... the rest of the program goes here ...

# mandatory line: flocking depends on the DATA file handle, which only exists
# if the script ends with a __DATA__ (or __END__) section
__DATA__

