Uniquely Identify Files

In addition to the excellent suggestion above, you might consider using the inode number property of the files, viewable in a shell with ls -i.

Using index.php on one of my boxes:

ls -i

yields

196237 index.php

I then rename the file using mv index.php index1.php, after which the same ls -i yields:

196237 index1.php

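(Note that the inode number is the same. Keep in mind that an inode number is only unique within a single filesystem, and that it can be reused for a new file once the old one is deleted.)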
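The same check can be sketched from Java. This is only an illustration, assuming a Unix-like system where `BasicFileAttributes.fileKey()` returns a non-null key (device plus inode); the class name `RenameKeepsFileKey` and the scratch file names are made up for the example. Where no file key is available, the method simply reports `false`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

class RenameKeepsFileKey {

    // Returns the platform's file key, or null if the platform does not provide one.
    static Object fileKey(Path path) throws IOException {
        return Files.readAttributes(path, BasicFileAttributes.class).fileKey();
    }

    // Creates a scratch file, renames it inside the same directory,
    // and reports whether the file key is the same before and after.
    static boolean keySurvivesRename() {
        try {
            Path dir = Files.createTempDirectory("filekey-demo");
            Path before = Files.createFile(dir.resolve("index.php"));
            Object keyBefore = fileKey(before);

            Path after = Files.move(before, dir.resolve("index1.php"));
            Object keyAfter = fileKey(after);

            // Clean up the scratch file and directory.
            Files.delete(after);
            Files.delete(dir);

            return keyBefore != null && keyBefore.equals(keyAfter);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("File key unchanged after rename: " + keySurvivesRename());
    }
}
```

The rename stays within one directory on purpose: moving a file to a different filesystem copies it, and the copy gets a new inode.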

Uniquely Identify a file across different machines in a web application

The simple answer to your question is that it can't be done.

Let me summarize what you're trying to accomplish.

If I have a file on my office computer, and upload that to your web application, you want to store that into your system as a new file. Then, if I copy the file from my office computer to my home computer, edit the file contents, rename the file, and then upload it into your web application, you want to identify that this is the same file as the one I previously uploaded.

It can't be done.

Not with a 100% guarantee that you can identify this.

When you are uploading files to a web application, what is sent is this:

  • The name of the file
  • The length of the file
  • The contents of the file

Things such as alternate data streams (NTFS), from the other answer here, or inode or similar identifiers, from the comments, are not sent. Your web application will not see them. Nor would these things be "across multiple computers".

So bottom line, this is impossible.

Your options are:

  1. Let the user uniquely pick the file they want to overwrite, meaning that the user could pick unrelated files and thus be "wrong"
  2. Work out a reasonable chance that you identified the right file, accepting the chance that you identified incorrectly
  3. Embed a unique id into the file itself, however since the file contents can be edited (and the id can be changed) this is not guaranteed
  4. ... other options that don't have a 100% guarantee of being right

The first option is of course the easiest.

The second option could use an approach similar to what Git does when it tracks renames, but even this will fail depending on how much the file was edited between the uploads. Git fails in this respect too, except that "failure" there simply means it doesn't show you the full history of a file; it doesn't break down and become unusable.
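To make the "reasonable chance" idea concrete, here is a rough sketch of a similarity check. It is not Git's rename detection, just a simple stand-in: it compares the sets of lines in two text files (Jaccard similarity) and treats them as probably the same file above a threshold. The class name, method names, and the 0.5 threshold are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SimilarityGuess {

    // Jaccard similarity of the two texts' line sets:
    // (lines in both) / (lines in either), from 0.0 to 1.0.
    static double similarity(String a, String b) {
        Set<String> linesA = new HashSet<>(Arrays.asList(a.split("\\R")));
        Set<String> linesB = new HashSet<>(Arrays.asList(b.split("\\R")));

        Set<String> union = new HashSet<>(linesA);
        union.addAll(linesB);
        if (union.isEmpty()) {
            return 1.0; // two texts without any lines are treated as identical
        }

        Set<String> common = new HashSet<>(linesA);
        common.retainAll(linesB);
        return (double) common.size() / union.size();
    }

    // A guess, not a guarantee: heavy edits fall below the threshold,
    // and unrelated files with shared boilerplate can rise above it.
    static boolean probablySameFile(String a, String b, double threshold) {
        return similarity(a, b) >= threshold;
    }

    public static void main(String[] args) {
        String firstUpload = "one\ntwo\nthree\nfour";
        String secondUpload = "one\ntwo\nthree\nFOUR";
        System.out.println("similarity = " + similarity(firstUpload, secondUpload));
        System.out.println("probably same = " + probablySameFile(firstUpload, secondUpload, 0.5));
    }
}
```

Whatever measure you pick, the threshold is a trade-off between false matches and missed matches, which is exactly the chance of being wrong that this option accepts.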

The third option might work if the file should be edited by a program similar to Word or Excel or Photoshop, etc. You could embed the ID and just make sure that program doesn't change it. It would probably have a higher and acceptable chance of being right, but it might still be possible to edit.
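As a sketch of the third option for plain-text files, the ID could live in a marker line that survives ordinary edits. The `#file-id:` marker format and the `EmbeddedId` class are invented for this example; a real document format would need its own place for metadata, and nothing here stops a user from deleting or altering the line.

```java
import java.util.Optional;
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmbeddedId {

    // Hypothetical marker line: "#file-id: <uuid>" on a line of its own.
    private static final Pattern MARKER = Pattern.compile(
            "^#file-id: ([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$",
            Pattern.MULTILINE);

    // Prepends the marker line to the content.
    static String embed(String content, UUID id) {
        return "#file-id: " + id + "\n" + content;
    }

    // Finds the first marker line, if any, and returns its ID.
    static Optional<UUID> extract(String content) {
        Matcher m = MARKER.matcher(content);
        return m.find() ? Optional.of(UUID.fromString(m.group(1))) : Optional.empty();
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        String stamped = embed("first draft\n", id);

        // The user edits the body but leaves the marker line alone.
        String edited = stamped.replace("first draft", "second draft");

        System.out.println("ID survived the edit: " + extract(edited).equals(Optional.of(id)));
    }
}
```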

So you will have to decide what would be acceptable to you, but you cannot create a system in which you are guaranteed to identify the file even if it was renamed and its contents changed, because at that point you cannot rule out that the user is simply uploading a different file altogether.

Uniquely identify file in Java

This Java example demonstrates how to get the Unix inode number of a file.

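import java.nio.file.*;
import java.nio.file.attribute.*;

public class MyFile {

    public static void main(String[] args) throws Exception {
        Path path = Paths.get("MyFile.java");
        BasicFileAttributes attr = Files.readAttributes(path, BasicFileAttributes.class);

        // fileKey() may return null on platforms that do not provide a file key.
        Object fileKey = attr.fileKey();
        if (fileKey == null) {
            System.out.println("No file key available on this platform");
            return;
        }

        // On Unix-like systems the key prints as "(dev=...,ino=...)". That format is an
        // implementation detail, so this string parsing is fragile.
        String s = fileKey.toString();
        int start = s.indexOf("ino=") + 4;
        String inode = s.substring(start, s.indexOf(')', start));
        System.out.println("Inode: " + inode);
    }
}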

The output

$ java MyFile
Inode: 664938

$ ls -i MyFile.java
664938 MyFile.java

credit where credit is due: https://www.javacodex.com/More-Examples/1/8

Unique file identifier in windows

If you call GetFileInformationByHandle, you'll get a file ID in BY_HANDLE_FILE_INFORMATION.nFileIndexHigh/Low. This index is unique within a volume, and stays the same even if you move the file (within the volume) or rename it.

If you can assume that NTFS is used, you may also want to consider using Alternate Data Streams to store the metadata.

Uniquely identify Files and Directories on a Server for Comparison

$filePath = '/var/www/site/public/uploads/foo.txt';
$data = file_get_contents($filePath);

// Hash the file contents; the file name plays no part in the key.
$key = sha1($data); // or: $key = sha1_file($filePath);

Save this $key in a table column and mark that column as UNIQUE, so that the same file contents cannot be stored twice.

Use sha1 rather than md5; many version control systems, Git among them, use SHA-1 hashes to identify content.
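For readers following the Java examples on this page, the same idea might look like the sketch below. The in-memory map is only a stand-in for the UNIQUE column, and `ContentKeyedStore` is a made-up name; in a real application the database constraint would do the rejecting.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

class ContentKeyedStore {

    // Stand-in for a table whose key column is declared UNIQUE.
    private final Map<String, byte[]> byKey = new HashMap<>();

    // Lowercase hex SHA-1 of the given bytes.
    static String sha1Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Not expected in practice: SHA-1 is a standard MessageDigest algorithm.
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the content was stored, false if identical content already exists.
    boolean store(byte[] data) {
        return byKey.putIfAbsent(sha1Hex(data), data) == null;
    }

    public static void main(String[] args) {
        ContentKeyedStore store = new ContentKeyedStore();
        byte[] upload = "hello world".getBytes(StandardCharsets.UTF_8);
        System.out.println("first upload stored:  " + store.store(upload));
        System.out.println("second upload stored: " + store.store(upload));
    }
}
```

Note that this identifies identical contents only. Change a single byte and the key changes, so it cannot tell you that an edited file is "the same file" as an earlier upload.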

Uniquely identify file on Windows

The next best method (but one that involves reading every file completely, which I'd avoid when it can be helped) would be to compare file size and a hash (e.g. SHA-256) of the file contents. The probability that both collide is fairly slim, especially under normal circumstances.
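A minimal sketch of that size-plus-hash comparison, written in Java to match the other examples here. The `FileFingerprint` class and its `"<size>:<hash>"` string format are assumptions for illustration. The file is read in chunks, so large files do not have to fit in memory, but every byte is still read, which is the cost mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class FileFingerprint {

    // Returns "<size in bytes>:<lowercase hex SHA-256>" for the file's contents.
    static String of(Path path) {
        try (InputStream in = Files.newInputStream(path)) {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] buffer = new byte[8192];
            long size = 0;
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n); // feed each chunk into the running hash
                size += n;
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return size + ":" + hex;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            // Not expected in practice: SHA-256 is a standard MessageDigest algorithm.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println("usage: java FileFingerprint.java <file1> <file2>");
            return;
        }
        boolean same = of(Paths.get(args[0])).equals(of(Paths.get(args[1])));
        System.out.println(same ? "same contents (very likely)" : "different contents");
    }
}
```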

I'd use the GetFileInformationByHandle way on NTFS and fall back to hashing on FAT volumes.

In Dropbox's case, though, I think there is a service or process running in the background that observes file system changes. That is the most reliable way, even though it stops working as soon as said service/process is stopped.


