Linux: Move 1 Million Files into Prefix-Based Created Folders

for i in *.*; do mkdir -p "${i:0:1}/${i:1:1}/${i:2:1}"; mv -- "$i" "${i:0:1}/${i:1:1}/${i:2:1}/"; done

The ${i:0:1}/${i:1:1}/${i:2:1} part could probably be pulled into a variable, or made shorter or different, but the command above gets the job done. You will probably face performance issues with a million files; if so, narrow *.* down to a smaller subset (a*.*, b*.*, or whatever fits your naming scheme) and run the loop in batches.

edit: added a $ before i for mv, as noted by Dan
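
A minimal sketch of the variable-based variant mentioned above; it behaves the same, and dir is just a name chosen here for readability:

for i in *.*; do
    dir="${i:0:1}/${i:1:1}/${i:2:1}"   # first three characters of the name become the nested path
    mkdir -p "$dir" && mv -- "$i" "$dir/"
done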

Linux: Update directory structure for millions of images which are already in prefix-based folders

One way to do it is to simply loop over all the directories you already have, and in each bottom-level subdirectory create the new subdirectory and move the files:

for d in ?/?/?/; do (
    cd "$d" &&
    printf '%.4s\0' * | uniq -z |
    xargs -0 bash -c 'for prefix do
        s=${prefix:3:1}
        mkdir -p "$s" && mv "$prefix"* "$s"
    done' _
) done

That probably needs a bit of explanation.

The glob ?/?/?/ matches all directory paths made up of three single-character subdirectories. Because it ends with a /, everything it matches is a directory so there is no need to test.
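
You can preview exactly what the glob will match before running the loop (directory names here are hypothetical):

ls -d ?/?/?/    # lists e.g. a/b/c/ but not ab/c/d/ or a/b/file.txt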

( cd "$d" && ...; )

executes ... after cd'ing to the appropriate subdirectory. Putting that block inside ( ) causes it to be executed in a subshell, so the effect of the cd is confined to the parenthesized block. That's easier and safer than trying to cd back up at the end.
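
A quick illustration of that scoping (paths are just for demonstration):

pwd                  # e.g. /data/images
( cd /tmp && pwd )   # prints /tmp
pwd                  # still /data/images: the cd happened only in the subshell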

We then work out the needed subdirectories first, by finding the unique initial strings of the files:

printf '%.4s\0' * | uniq -z | xargs -0 ...

That extracts the first four letters of each filename, nul-terminating each one. The list goes to uniq to eliminate duplicates, with the -z option because the input is nul-terminated, and then to xargs, again with -0 to indicate that the list is nul-terminated. xargs executes a command with a list of arguments, issuing the command several times only if necessary to avoid exceeding the command-line length limit. (We probably could have avoided the use of xargs, but it doesn't cost much and it's a lot safer.)
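
To make that concrete, suppose a bottom-level directory holds abc1001.jpg, abc1002.jpg and abd2001.jpg (hypothetical names); the pipeline hands xargs just the two unique four-character prefixes:

printf '%.4s\0' abc1001.jpg abc1002.jpg abd2001.jpg | uniq -z | tr '\0' '\n'
# abc1
# abd2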

The command called by xargs is bash itself; we use the -c option to pass it a command to execute. That command iterates over its positional arguments using for prefix do (leaving out the in list makes the loop run over "$@"). Each argument is a unique prefix; we extract the fourth character of the prefix to name the new subdirectory, then mv all files whose names start with that prefix into the newly created directory.

The _ at the end of the xargs invocation will be passed to bash (as with all the rest of the arguments); bash -c uses the first argument following the command as the $0 argument to the script, which is not part of the command line arguments iterated over by the for arg in syntax. So putting the _ there means that the argument list constructed by xargs will be precisely $1, $2, ... in the execution of the bash command.
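
A tiny demonstration of that $0 placeholder behaviour:

bash -c 'echo "0=$0 1=$1 2=$2"' _ foo bar
# 0=_ 1=foo 2=bar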

How can I sort files by their subject number (prefix of the filename) and create a new folder from it?

If I understand you correctly, the following should work:

for i in *.nii; do
    dir="${i%%_*}/mri/orig"
    mkdir -p -- "$dir" && mv -- "$i" "$dir"
done

Here ${i%%_*} expands to the contents of i with any trailing substring matching _* cut off, i.e. it is the value of $i up to the first underscore.
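
For example, with a hypothetical file named subj042_T1w.nii:

i=subj042_T1w.nii
echo "${i%%_*}"            # subj042
echo "${i%%_*}/mri/orig"   # subj042/mri/orig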

Efficient method to parse large number of files

Approaches 1 and 3 expand the list of files on the shell command line. This will not work with a huge number of files. They also do not work if the files are distributed across many directories (which is likely with millions of files).

Approach 2 makes a copy of all data, so it is inefficient as well.

You should use find and pass the file names directly to egrep. Use the -h option to suppress prefixing each match with its file name:

find . -name \*.txt -print0 \
| xargs -0 egrep -i -v -h 'pattern1|...|pattern8' \
| awk '{gsub("\t", ",")}1' > all_in_1.out

xargs will automatically launch multiple egrep processes in sequence to avoid exceeding the command line limit in a single invocation.
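
You can see that batching in action by counting how many times a command is actually launched for a large argument list (the exact count depends on your system's limits):

seq 1 200000 | xargs echo | wc -l
# prints a number greater than 1: each output line is one echo invocation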

Depending on the file contents, it may also be more efficient to avoid the egrep processes altogether, and do the filtering directly in awk:

find . -name \*.txt -print0 \
| xargs -0 awk 'BEGIN { IGNORECASE = 1 } ! /pattern1|...|pattern8/ { gsub("\t", ","); print }' > all_in_1.out

BEGIN { IGNORECASE = 1 } corresponds to the -i option of egrep, and the ! inverts the sense of the matching, just like -v. IGNORECASE appears to be a GNU extension.
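
If you are not on GNU awk, a portable sketch of the same filter lower-cases each line before testing it (the patterns then need to be written in lower case):

find . -name \*.txt -print0 \
| xargs -0 awk 'tolower($0) !~ /pattern1|...|pattern8/ { gsub("\t", ","); print }' > all_in_1.out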

Software to manage 1 Million files on Amazon S3

Well, after trying many S3 tools I've finally found one which handles more than a million files with ease, and can do a sync as well. It's free, though that wasn't important to me; I just wanted something that worked.

Dragon Disk:

http://www.dragondisk.com

How to copy top 100 files of a particular extension to a target folder using terminal

The "argument list too long" error happens because when you do this:

ls -1 11944*.DAT

It tries to construct a huge line like:

foo bar [...] baz quux

And there is of course a limit on how long a command line can be. The good news is that it's easy to fix: just use find to match the files you want, then xargs to launch cp, because xargs takes care of the command-line length limit and will launch cp as many times as required:

find -name '11944*.DAT' | tail -n 1000 | xargs -I{} cp {} /ftp/BSEG_SRC 

By the way, there is no specified sort order here, because your original question didn't have any.
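
With GNU coreutils you can also let xargs batch many files into each cp call by putting the destination first with cp -t; -print0, tail -z and xargs -0 keep unusual file names safe. This variant is a sketch, not part of the original answer:

find . -name '11944*.DAT' -print0 | tail -z -n 1000 | xargs -0 cp -t /ftp/BSEG_SRC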

Best way to store/retrieve millions of files when their meta-data is in a SQL Database

I'd group the files into specific subfolders, and try to organize them (the subfolders) in some business-logic way. Perhaps all files made during a given day? Or during a six-hour window of each day? Or every N files, with N a few thousand at most. (There's probably an ideal number out there; hopefully someone will post it.)
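
A minimal shell sketch of the day-based grouping, assuming GNU date (whose -r option reads a file's modification time) and hypothetical /incoming and /archive paths:

for f in /incoming/*; do
    day=$(date -r "$f" +%Y/%m/%d)            # e.g. 2015/06/30
    mkdir -p "/archive/$day" && mv -- "$f" "/archive/$day/"
done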

Do the files ever age out and get deleted? If so, sort and file them by deletable chunk. If not, can I be your hardware vendor?

There are arguments on both sides of storing files in a database.

  • On the one hand you get enhanced security, 'cause it's more awkward to pull the files from the DB; on the other hand, you get potentially poorer performance, 'cause it's more awkward to pull the files from the DB.
  • In the DB, you don't have to worry about how many files per folder, sector, NAS cluster, whatever--that's the DB's problem, and probably they've got a good implementation for this. On the flip side, it'll be harder to manage/review the data, as it'd be a bazillion blobs in a single table, and, well, yuck. (You could partition the table based on the afore-mentioned business-logic, which would make deletion or archiving infinitely easier to perform. That, or maybe partitioned views, since table partitioning has a limit of 1000 partitions.)
  • SQL Server 2008 has the FileStream data type; I don't know much about it, might be worth looking into.

A last point to worry about is keeping the data "aligned". If the DB stores the info on the file along with the path/name to the file, and the file gets moved, you could get totally hosed.

How to split CSV files as per number of rows specified?

Made it into a function. You can now call splitCsv <Filename> [chunkSize]

splitCsv() {
    # Usage: splitCsv <filename> [chunkSize]  (chunkSize defaults to 1000 data rows)
    HEADER=$(head -1 "$1")
    if [ -n "$2" ]; then
        CHUNK=$2
    else
        CHUNK=1000
    fi
    # Split everything after the header row into chunks of $CHUNK lines.
    tail -n +2 "$1" | split -l "$CHUNK" - "$1"_split_
    # Prepend the header row to every chunk (GNU sed -i).
    for i in "$1"_split_*; do
        sed -i -e "1i$HEADER" "$i"
    done
}

Found on: http://edmondscommerce.github.io/linux/linux-split-file-eg-csv-and-keep-header-row.html


