Detect If Something Is Modified in Directory, and If So, Backup - Otherwise Do Nothing

detect if something is modified in directory, and if so, backup - otherwise do nothing

If I understand correctly, you just want to see if any files have been modified so you can figure out whether to proceed to the rsync portion of your script?

It's a pretty simple task to figure out when the data was last synced, especially if you do this nightly. As soon as you find one file with mtime greater than the time of the last sync, you know you have to proceed to the full rsync.

find has this functionality built in:

# find all files modified in the last 24 hours
find -mtime 1

How to work out if a file has been modified?

Going by modified date will be unreliable - the computer clock can go backwards when it synchronizes, or when manually adjusted. Some programs might not behave well when modifying or copying files in terms of managing the modified date.

Going by the archive bit might work in a controlled environment but what happens if another piece of software is running that uses the archive bit as well?

The Windows archive bit is evil and must be stopped

If you want (almost) complete reliability then what you should do is store a hash value of the last backed up version using a good hashing function like SHA1, and if the hash value changes then you upload the new copy.

Here is the SHA1 class along with a code sample on the bottom:

http://msdn.microsoft.com/en-us/library/system.security.cryptography.sha1.aspx

Just run the file bytes through it and store the hash value. Pass a FileStream to it instead of loading your file into memory with a byte array to reduce memory usage, especially for large files.

You can combine this with modified date in various ways to tweak your program as needed for speed and reliability. For example, you can check modified dates for most backups and periodically run a hash checker that runs while the system is idle to make sure nothing got missed. Sometimes the modified date will change but the file contents are still the same (i.e. got overwritten with the same data), in which case you can avoid resending the whole file after you recompute the hash and realize it is still the same.

Most version control systems use some kind of combined approach with hashes and modified dates.

Your approach will generally involve some kind of risk management with a compromise between performance and reliability if you don't want to do a full backup and send all the data over each time. It's important to do "full backups" once in a while for this reason.

How to see if a subfile of a directory has changed

This article should help. Basically, you create one or more notification object such as:


HANDLE dwChangeHandles[2];
dwChangeHandles[0] = FindFirstChangeNotification(
lpDir, // directory to watch
FALSE, // do not watch subtree
FILE_NOTIFY_CHANGE_FILE_NAME); // watch file name changes

if (dwChangeHandles[0] == INVALID_HANDLE_VALUE)
{
printf("\n ERROR: FindFirstChangeNotification function failed.\n");
ExitProcess(GetLastError());
}

// Watch the subtree for directory creation and deletion.
dwChangeHandles[1] = FindFirstChangeNotification(
lpDrive, // directory to watch
TRUE, // watch the subtree
FILE_NOTIFY_CHANGE_DIR_NAME); // watch dir name changes

if (dwChangeHandles[1] == INVALID_HANDLE_VALUE)
{
printf("\n ERROR: FindFirstChangeNotification function failed.\n");
ExitProcess(GetLastError());
}

and then you wait for a notification:


while (TRUE)
{
// Wait for notification.
printf("\nWaiting for notification...\n");

DWORD dwWaitStatus = WaitForMultipleObjects(2, dwChangeHandles,
FALSE, INFINITE);

switch (dwWaitStatus)
{
case WAIT_OBJECT_0:

// A file was created, renamed, or deleted in the directory.
// Restart the notification.
if ( FindNextChangeNotification(dwChangeHandles[0]) == FALSE )
{
printf("\n ERROR: FindNextChangeNotification function failed.\n");
ExitProcess(GetLastError());
}
break;

case WAIT_OBJECT_0 + 1:

// Restart the notification.
if (FindNextChangeNotification(dwChangeHandles[1]) == FALSE )
{
printf("\n ERROR: FindNextChangeNotification function failed.\n");
ExitProcess(GetLastError());
}
break;

case WAIT_TIMEOUT:

// A time-out occurred. This would happen if some value other
// than INFINITE is used in the Wait call and no changes occur.
// In a single-threaded environment, you might not want an
// INFINITE wait.

printf("\nNo changes in the time-out period.\n");
break;

default:
printf("\n ERROR: Unhandled dwWaitStatus.\n");
ExitProcess(GetLastError());
break;
}
}
}

How can I detect only deleted, changed, and created files on a volume?

You can enumerate all the files on a volume using FSCTL_ENUM_USN_DATA. This is a fast process (my tests returned better than 6000 records per second even on a very old machine, and 20000+ is more typical) and only includes files that currently exist.

The data returned includes the file flags as well as the USNs so you could check for changes whichever way you prefer.

You will still need to work out the full path for the files by matching the parent IDs with the file IDs of the directories. One approach would be to use a buffer large enough to hold all the file records simultaneously, and search through the records to find the matching parent for each file you need to back up. For large volumes you would probably need to process the directory records into a more efficient data structure, perhaps a hash table.

Alternately, you can read/reread the records for the parent directories as needed. This would be less efficient, but the performance might still be satisfactory depending on how many files are being backed up. Windows does appear to cache the data returned by FSCTL_ENUM_USN_DATA.

This program searches the C volume for files named test.txt and returns information about any files found, as well as about their parent directories.

#include <Windows.h>

#include <stdio.h>

#define BUFFER_SIZE (1024 * 1024)

HANDLE drive;
USN maxusn;

void show_record (USN_RECORD * record)
{
void * buffer;
MFT_ENUM_DATA mft_enum_data;
DWORD bytecount = 1;
USN_RECORD * parent_record;

WCHAR * filename;
WCHAR * filenameend;

printf("=================================================================\n");
printf("RecordLength: %u\n", record->RecordLength);
printf("MajorVersion: %u\n", (DWORD)record->MajorVersion);
printf("MinorVersion: %u\n", (DWORD)record->MinorVersion);
printf("FileReferenceNumber: %lu\n", record->FileReferenceNumber);
printf("ParentFRN: %lu\n", record->ParentFileReferenceNumber);
printf("USN: %lu\n", record->Usn);
printf("Timestamp: %lu\n", record->TimeStamp);
printf("Reason: %u\n", record->Reason);
printf("SourceInfo: %u\n", record->SourceInfo);
printf("SecurityId: %u\n", record->SecurityId);
printf("FileAttributes: %x\n", record->FileAttributes);
printf("FileNameLength: %u\n", (DWORD)record->FileNameLength);

filename = (WCHAR *)(((BYTE *)record) + record->FileNameOffset);
filenameend= (WCHAR *)(((BYTE *)record) + record->FileNameOffset + record->FileNameLength);

printf("FileName: %.*ls\n", filenameend - filename, filename);

buffer = VirtualAlloc(NULL, BUFFER_SIZE, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);

if (buffer == NULL)
{
printf("VirtualAlloc: %u\n", GetLastError());
return;
}

mft_enum_data.StartFileReferenceNumber = record->ParentFileReferenceNumber;
mft_enum_data.LowUsn = 0;
mft_enum_data.HighUsn = maxusn;

if (!DeviceIoControl(drive, FSCTL_ENUM_USN_DATA, &mft_enum_data, sizeof(mft_enum_data), buffer, BUFFER_SIZE, &bytecount, NULL))
{
printf("FSCTL_ENUM_USN_DATA (show_record): %u\n", GetLastError());
return;
}

parent_record = (USN_RECORD *)((USN *)buffer + 1);

if (parent_record->FileReferenceNumber != record->ParentFileReferenceNumber)
{
printf("=================================================================\n");
printf("Couldn't retrieve FileReferenceNumber %u\n", record->ParentFileReferenceNumber);
return;
}

show_record(parent_record);
}

void check_record(USN_RECORD * record)
{
WCHAR * filename;
WCHAR * filenameend;

filename = (WCHAR *)(((BYTE *)record) + record->FileNameOffset);
filenameend= (WCHAR *)(((BYTE *)record) + record->FileNameOffset + record->FileNameLength);

if (filenameend - filename != 8) return;

if (wcsncmp(filename, L"test.txt", 8) != 0) return;

show_record(record);
}

int main(int argc, char ** argv)
{
MFT_ENUM_DATA mft_enum_data;
DWORD bytecount = 1;
void * buffer;
USN_RECORD * record;
USN_RECORD * recordend;
USN_JOURNAL_DATA * journal;
DWORDLONG nextid;
DWORDLONG filecount = 0;
DWORD starttick, endtick;

starttick = GetTickCount();

printf("Allocating memory.\n");

buffer = VirtualAlloc(NULL, BUFFER_SIZE, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);

if (buffer == NULL)
{
printf("VirtualAlloc: %u\n", GetLastError());
return 0;
}

printf("Opening volume.\n");

drive = CreateFile(L"\\\\?\\c:", GENERIC_READ, FILE_SHARE_DELETE | FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_ALWAYS, FILE_FLAG_NO_BUFFERING, NULL);

if (drive == INVALID_HANDLE_VALUE)
{
printf("CreateFile: %u\n", GetLastError());
return 0;
}

printf("Calling FSCTL_QUERY_USN_JOURNAL\n");

if (!DeviceIoControl(drive, FSCTL_QUERY_USN_JOURNAL, NULL, 0, buffer, BUFFER_SIZE, &bytecount, NULL))
{
printf("FSCTL_QUERY_USN_JOURNAL: %u\n", GetLastError());
return 0;
}

journal = (USN_JOURNAL_DATA *)buffer;

printf("UsnJournalID: %lu\n", journal->UsnJournalID);
printf("FirstUsn: %lu\n", journal->FirstUsn);
printf("NextUsn: %lu\n", journal->NextUsn);
printf("LowestValidUsn: %lu\n", journal->LowestValidUsn);
printf("MaxUsn: %lu\n", journal->MaxUsn);
printf("MaximumSize: %lu\n", journal->MaximumSize);
printf("AllocationDelta: %lu\n", journal->AllocationDelta);

maxusn = journal->MaxUsn;

mft_enum_data.StartFileReferenceNumber = 0;
mft_enum_data.LowUsn = 0;
mft_enum_data.HighUsn = maxusn;

for (;;)
{
// printf("=================================================================\n");
// printf("Calling FSCTL_ENUM_USN_DATA\n");

if (!DeviceIoControl(drive, FSCTL_ENUM_USN_DATA, &mft_enum_data, sizeof(mft_enum_data), buffer, BUFFER_SIZE, &bytecount, NULL))
{
printf("=================================================================\n");
printf("FSCTL_ENUM_USN_DATA: %u\n", GetLastError());
printf("Final ID: %lu\n", nextid);
printf("File count: %lu\n", filecount);
endtick = GetTickCount();
printf("Ticks: %u\n", endtick - starttick);
return 0;
}

// printf("Bytes returned: %u\n", bytecount);

nextid = *((DWORDLONG *)buffer);
// printf("Next ID: %lu\n", nextid);

record = (USN_RECORD *)((USN *)buffer + 1);
recordend = (USN_RECORD *)(((BYTE *)buffer) + bytecount);

while (record < recordend)
{
filecount++;
check_record(record);
record = (USN_RECORD *)(((BYTE *)record) + record->RecordLength);
}

mft_enum_data.StartFileReferenceNumber = nextid;
}
}

Additional notes

  • As discussed in the comments, you may need to replace MFT_ENUM_DATA with MFT_ENUM_DATA_V0 on versions of Windows later than Windows 7. (This may also depend on what compiler and SDK you are using.)

  • I'm printing the 64-bit file reference numbers as if they were 32-bit. That was just a mistake on my part. Probably in production code you won't be printing them anyway, but FYI.

How to recursively find the latest modified file in a directory?

find . -type f -printf '%T@ %p\n' \
| sort -n | tail -1 | cut -f2- -d" "

For a huge tree, it might be hard for sort to keep everything in memory.

%T@ gives you the modification time like a unix timestamp, sort -n sorts numerically, tail -1 takes the last line (highest timestamp), cut -f2 -d" " cuts away the first field (the timestamp) from the output.

Edit: Just as -printf is probably GNU-only, ajreals usage of stat -c is too. Although it is possible to do the same on BSD, the options for formatting is different (-f "%m %N" it would seem)

And I missed the part of plural; if you want more then the latest file, just bump up the tail argument.

How to check if directory contents has changed with PHP?

Uh. I'd simply store the md5 of a directory listing. If the contents change, the md5(directory-listing) will change. You might get the very occasional md5 clash, but I think that chance is tiny enough..

Alternatively, you could store a little file in that directory that contains the "last modified" date. But I'd go with md5.


PS. on second thought, seeing as how you're looking at performance (caching) requesting and hashing the directory listing might not be entirely optimal..

Check if file has been modified

One option is to check, if file has been modified. You can achieve with adding extension of backup file to -i option:

perl -pi.orig -e 's/contoso/'"$hostname"'/g' /etc/inet/hosts

This command will store original content of /etc/inet/hosts into /etc/inet/hosts.orig. Then run the specified command. Then you can check if the files are different with, for example cmp command:

if ! cmp -s foo.txt foo.txt.orig; then
echo OK
else
echo ERROR
fi

Remove the .orig file after that.

The other option is to modify the script to read the content of the file, replace required entry, check is change actually happened and return proper status at the end to verify in the shell using $?. You have been given solution in this answer.



Related Topics



Leave a reply



Submit