Linux: Move 1 million files into prefix-based created Folders
for i in *.*; do mkdir -p "${i:0:1}/${i:1:1}/${i:2:1}"; mv -- "$i" "${i:0:1}/${i:1:1}/${i:2:1}/"; done
The ${i:0:1}/${i:1:1}/${i:2:1} part could probably be a variable, or shorter or different, but the command above gets the job done. You'll probably face performance issues, but if you really want to use it, narrow the *.* glob to fewer options (a*.*, b*.*, or whatever fits you).
Edit: added a $ before i for mv, as noted by Dan.
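As hinted above, the repeated prefix expression can be pulled into a variable. A minimal sketch of that variation (quoted so filenames containing spaces survive, and assuming the same *.* naming pattern):

```shell
# Same bucketing as above, with the target path computed once per file
for i in *.*; do
  d="${i:0:1}/${i:1:1}/${i:2:1}"
  mkdir -p -- "$d" && mv -- "$i" "$d"
done
```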
Linux: Update directory structure for millions of images which are already in prefix-based folders
One way to do it is to simply loop over all the directories you already have, and in each bottom-level subdirectory create the new subdirectory and move the files:
for d in ?/?/?/; do (
cd "$d" &&
printf '%.4s\0' * | uniq -z |
xargs -0 bash -c 'for prefix do
s=${prefix:3:1}
mkdir -p "$s" && mv "$prefix"* "$s"
done' _
) done
That probably needs a bit of explanation.
The glob ?/?/?/ matches all directory paths made up of three single-character subdirectories. Because it ends with a /, everything it matches is a directory, so there is no need to test.
( cd "$d" && ...; ) executes ... after cd'ing to the appropriate subdirectory. Putting that block inside ( ) causes it to be executed in a subshell, which means the scope of the cd will be restricted to the parenthesized block. That's easier and safer than putting cd .. at the end.
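A quick way to convince yourself that the subshell confines the cd (a trivial sketch, separate from the answer's pipeline):

```shell
# The cd inside ( ... ) only affects the subshell, not the outer shell
before=$PWD
( cd /tmp && pwd > /dev/null )
echo "$PWD"   # still the original directory, equal to $before
```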
We then collect the subdirectory names first, by finding the unique initial strings of the files:
printf '%.4s\0' * | uniq -z | xargs -0 ...
That extracts the first four characters of each filename, nul-terminating each one, then passes this list to uniq to eliminate duplicates, providing the -z option because the input is nul-terminated, and then passes the list of unique prefixes to xargs, again using -0 to indicate that the list is nul-terminated. xargs executes a command with a list of arguments, issuing the command several times only if necessary to avoid exceeding the command-line limit. (We probably could have avoided the use of xargs, but it doesn't cost that much and it's a lot safer.)
The command called with xargs is bash itself; we use the -c option to pass it a command to be executed. That command iterates over its arguments using the for arg in syntax. Each argument is a unique prefix; we extract the fourth character from the prefix to construct the new subdirectory, and then mv all files whose names start with the prefix into the newly created directory.
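The ${prefix:3:1} expansion picks out a single character by zero-based offset; for example, with a hypothetical prefix:

```shell
prefix=abcd
echo "${prefix:3:1}"   # prints: d  (the fourth character)
```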
The _ at the end of the xargs invocation will be passed to bash (as with all the rest of the arguments); bash -c uses the first argument following the command as the $0 argument to the script, which is not part of the command-line arguments iterated over by the for arg in syntax. So putting the _ there means that the argument list constructed by xargs will be precisely $1, $2, ... in the execution of the bash command.
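The role of the _ placeholder can be seen in a one-line sketch: the first argument after the bash -c script becomes $0, and iteration over the positional parameters starts at $1.

```shell
bash -c 'echo "$0 $1"' _ first
# prints: _ first
```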
How can I sort files by their subject number (prefix of the filename) and create a new folder from it?
If I understand you correctly, the following should work:
for i in *.nii; do
dir="${i%%_*}/mri/orig"
mkdir -p -- "$dir" && mv -- "$i" "$dir"
done
Here ${i%%_*} expands to the contents of i with the longest trailing substring matching _* cut off, i.e. it is the value of $i up to the first underscore.
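For a hypothetical filename like sub01_T1w.nii, the expansion behaves as follows:

```shell
i="sub01_T1w.nii"
echo "${i%%_*}"            # prints: sub01
echo "${i%%_*}/mri/orig"   # prints: sub01/mri/orig
```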
Efficient method to parse large number of files
Approaches 1 and 3 expand the list of files on the shell command line. This will not work with a huge number of files. Approaches 1 and 3 also do not work if the files are distributed across many directories (which is likely with millions of files).
Approach 2 makes a copy of all the data, so it is inefficient as well.
You should use find and pass the file names directly to egrep. Use the -h option to suppress the file-name prefix in the output:
find . -name \*.txt -print0 \
| xargs -0 egrep -i -v -h 'pattern1|...|pattern8' \
| awk '{gsub(/"\t",",")}1' > all_in_1.out
xargs will automatically launch multiple egrep processes in sequence to avoid exceeding the command-line limit in a single invocation.
Depending on the file contents, it may also be more efficient to avoid the egrep processes altogether, and do the filtering directly in awk:
find . -name \*.txt -print0 \
| xargs -0 awk 'BEGIN { IGNORECASE = 1 } ! /pattern1|...|pattern8/ {gsub(/"\t",",")}1' > all_in_1.out
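The batching behaviour of xargs described above can be seen in miniature. Here -n 4 forces small batches so the splitting is visible; with real file lists, xargs instead sizes batches from the system's command-line limit:

```shell
# xargs splits a long nul-terminated list across several echo invocations
printf '%s\0' 1 2 3 4 5 6 7 8 9 10 | xargs -0 -n 4 echo
# prints three lines: "1 2 3 4", "5 6 7 8", "9 10"
```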
BEGIN { IGNORECASE = 1 } corresponds to the -i option of egrep, and the ! inverts the sense of the matching, just like -v. IGNORECASE appears to be a GNU extension.
Software to manage 1 Million files on Amazon S3
Well, after trying many S3 tools I've finally found one which handles over a million files with ease, and can do a sync as well. It's free, though that wasn't important to me; I just wanted something that worked.
Dragon Disk:
http://www.dragondisk.com
How to copy top 100 files of a particular extension to a target folder using terminal
The "argument list too long" error happens because when you do this:
ls -1 11944*.DAT
It tries to construct a huge line like:
foo bar [...] baz quux
And there is of course a limit on how long a command line can be. The good news is that it's easy to fix: just use find to match the files you want, then xargs to launch cp, because xargs knows the maximum length of a single command line and will launch cp several times as required:
find . -name '11944*.DAT' | tail -n 1000 | xargs -I{} cp {} /ftp/BSEG_SRC
By the way, there is no specified sort order here, because your original question didn't have any.
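If a defined order matters, one hedged variation (assuming GNU coreutils for sort -z and head -z, and keeping the same /ftp/BSEG_SRC target from above) sorts the matches lexicographically before taking the first 100, nul-safe throughout:

```shell
# Sketch: copy the first 100 matches in sorted order
find . -name '11944*.DAT' -print0 | sort -z | head -z -n 100 \
  | xargs -0 -I{} cp {} /ftp/BSEG_SRC
```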
Best way to store/retrieve millions of files when their meta-data is in a SQL Database
I'd group the files in specific subfolders, and try to organize them (the subfolders) in some business-logic way. Perhaps all files made during a given day? During a six-hour period of each day? Or every # of files, I'd say a few thousand max. (There's probably an ideal number out there; hopefully someone will post it.)
Do the files ever age out and get deleted? If so, sort and file them by deletable chunk. If not, can I be your hardware vendor?
There are arguments on both sides of storing files in a database.
- On the one hand you get enhanced security, 'cause it's more awkward to pull the files from the DB; on the other hand, you get potentially poorer performance, 'cause it's more awkward to pull the files from the DB.
- In the DB, you don't have to worry about how many files per folder, sector, NAS cluster, whatever--that's the DB's problem, and probably they've got a good implementation for this. On the flip side, it'll be harder to manage/review the data, as it'd be a bazillion blobs in a single table, and, well, yuck. (You could partition the table based on the afore-mentioned business-logic, which would make deletion or archiving infinitely easier to perform. That, or maybe partitioned views, since table partitioning has a limit of 1000 partitions.)
- SQL Server 2008 has the FileStream data type; I don't know much about it, might be worth looking into.
A last point to worry about is keeping the data "aligned". If the DB stores the info on the file along with the path/name to the file, and the file gets moved, you could get totally hosed.
How to split CSV files as per number of rows specified?
Made it into a function. You can now call splitCsv <Filename> [chunkSize]
splitCsv() {
    HEADER=$(head -n 1 "$1")
    if [ -n "$2" ]; then
        CHUNK=$2
    else
        CHUNK=1000
    fi
    tail -n +2 "$1" | split -l "$CHUNK" - "${1}_split_"
    for i in "${1}_split_"*; do
        sed -i -e "1i$HEADER" "$i"
    done
}
Found on: http://edmondscommerce.github.io/linux/linux-split-file-eg-csv-and-keep-header-row.html
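A quick usage sketch with a hypothetical sample.csv; the function is repeated so the snippet is self-contained (note that the 1i form of sed insertion used here is a GNU extension):

```shell
splitCsv() {
    HEADER=$(head -n 1 "$1")
    CHUNK=${2:-1000}
    tail -n +2 "$1" | split -l "$CHUNK" - "${1}_split_"
    for i in "${1}_split_"*; do
        sed -i -e "1i$HEADER" "$i"
    done
}

# Split a 3-row CSV into 2-row chunks, each keeping the header row
printf 'id,name\n1,a\n2,b\n3,c\n' > sample.csv
splitCsv sample.csv 2
head -n 1 sample.csv_split_aa   # prints: id,name
```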