How to Determine If a File Is Binary or Text in C#

How can I determine if a file is binary or text in C#?

I would probably look for an abundance of control characters, which are typically present in a binary file but rare in a text file. Binary files tend to contain enough 0 bytes that just testing for many of them would probably be sufficient to catch most files. If you care about localization, you'd need to test multi-byte patterns as well.

As stated though, you can always be unlucky and get a binary file that looks like text or vice versa.
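The heuristic above can be sketched as follows. This is a minimal Python sketch rather than C#; the names `looks_binary` and `file_looks_binary`, the 8000-byte sample size, and the 10% threshold are assumptions chosen for illustration, and the same logic ports directly to a `FileStream` read in C#.

```python
# Control bytes that are normal in text: tab, line feed, form feed, carriage return.
TEXT_CONTROL = {0x09, 0x0A, 0x0C, 0x0D}

def looks_binary(data: bytes, threshold: float = 0.10) -> bool:
    """Guess whether a sample of bytes comes from a binary file."""
    if not data:
        return False  # assumption: an empty file is treated as text
    if b"\x00" in data:
        return True   # NUL bytes are rare in single-byte text encodings
    # Count control bytes other than ordinary whitespace.
    control = sum(1 for b in data if b < 0x20 and b not in TEXT_CONTROL)
    return control / len(data) > threshold

def file_looks_binary(path: str, sample_size: int = 8000) -> bool:
    """Apply the guess to the first sample_size bytes of a file."""
    with open(path, "rb") as f:
        return looks_binary(f.read(sample_size))
```

Note that UTF-16 text contains many 0 bytes, so this sketch reports it as binary. That is the localization caveat mentioned above.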

C: Opening a file to check if it is binary and, if so, printing that it is binary

No, there isn't, because it's impossible to tell for sure. If you expect a specific encoding, you can check yourself whether the file contents are valid in this encoding, e.g. if you expect ASCII, all bytes must be <= 0x7f. If you expect UTF-8, it's a bit more complicated, because every multi-byte sequence must consist of a valid lead byte followed by the right number of continuation bytes.

In any case, there's no guarantee that a "binary" file would not by accident look like a valid file in any given text encoding. In fact, the term "binary file" doesn't make too much sense, as all files contain binary data.
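Checking whether content is valid in an expected encoding can be sketched as below. This is a Python sketch that leans on the standard library's strict decoders instead of validating bytes by hand; the helper name `is_valid_in` is an assumption made for illustration.

```python
def is_valid_in(data: bytes, encoding: str) -> bool:
    """Return True if data decodes without error under the given encoding."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False

sample = "é".encode("utf-8")          # two bytes: 0xC3 0xA9
print(is_valid_in(sample, "ascii"))   # False: both bytes are > 0x7f
print(is_valid_in(sample, "utf-8"))   # True
```

A successful decode only shows that the bytes are valid in that encoding, not that the file was meant as text: Latin-1, for example, accepts every possible byte sequence.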

How to tell if a file is text-readable in C#

There is no general way of figuring out the type of information stored in a file.

Even if you know in advance that it is some sort of text, you may not be able to load it properly if you don't know which encoding was used to create the file.

Note that HTTP gives you some hints about the type of a file through the Content-Type header, but the file system carries no such information.

C# - Check if File is Text Based

I guess you could just check through the first 1000 (arbitrary number) characters and see if there are unprintable characters, or if they are all ASCII in a certain range. If the latter, assume that it is text?

Whatever you do is going to be a guess.
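A strict version of that guess might look like the sketch below, written in Python. The function names, the choice of tab, line feed, and carriage return as acceptable whitespace, and the 0x20 to 0x7E printable range are assumptions; the 1000-byte sample is the arbitrary number from above.

```python
ALLOWED_WHITESPACE = b"\t\n\r"

def is_printable_ascii(sample: bytes) -> bool:
    """True if every byte is printable ASCII (0x20-0x7E) or common whitespace."""
    return all(0x20 <= b <= 0x7E or b in ALLOWED_WHITESPACE for b in sample)

def sample_is_printable_ascii(path: str, sample_size: int = 1000) -> bool:
    """Check only the first sample_size bytes of the file."""
    with open(path, "rb") as f:
        return is_printable_ascii(f.read(sample_size))
```

Because a single stray byte fails the check, this variant rejects any text that uses characters outside ASCII.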

How can I distinguish between a binary file and a text file using .NET languages

How to detect if a file is in the PDF format?

Allow me to quote ISO 32000-1:

The first line of a PDF file shall be a header consisting of the 5 characters %PDF– followed by a version number of the form 1.N, where N is a digit between 0 and 7.

And ISO 32000-2:

The PDF file begins with the 5 characters “%PDF–” and offsets shall be calculated from the PERCENT SIGN (25h).

What's the difference? When you encounter a file that starts with %PDF-1.0 to %PDF-1.7, you have an ISO 32000-1 file; starting with ISO 32000-2, a PDF file can also start with %PDF-2.0.
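Reading the version from the header can be sketched as follows, in Python. The name `pdf_version` is an assumption, and the sketch requires the header at offset 0, which is stricter than some real-world PDF readers.

```python
import re

# "%PDF-" followed by a one-digit major and one-digit minor version.
PDF_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")

def pdf_version(head: bytes):
    """Return (major, minor) if head starts with a PDF header, else None."""
    m = PDF_HEADER.match(head)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

print(pdf_version(b"%PDF-1.7\n"))  # (1, 7)
print(pdf_version(b"%PDF-2.0\n"))  # (2, 0)
print(pdf_version(b"<html>"))      # None
```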

How to detect if a file is a binary file?

This is also explained in ISO 32000:

If a PDF file contains binary data, as most do, the header line shall be immediately followed by a comment line containing at least four binary characters–that is, characters whose codes are 128 or greater. This ensures proper behaviour of file transfer applications that inspect data near the beginning of a file to determine whether to treat the file’s contents as text or as binary.

If you open a PDF in a text editor instead of in a PDF viewer, you'll often see that the second line looks like this:

%âãÏÓ

There is no such thing as a "plain text file"; a file always has an encoding. However, when people talk about plain text files, they often mean ASCII files. ASCII files are files in which every byte has a value lower than 128 (binary 10000000).

Back in the old days, transfer protocols often treated PDF documents as if they were ASCII files. Instead of sending 8-bit bytes, they sent only 7 bits of each byte (this is sometimes referred to as "byte shaving"). When this happens, the ASCII bytes of a PDF file are preserved, but all the binary content gets corrupted. When you open such a PDF in a PDF viewer, you see the pages of the PDF file, but every page is empty.

To avoid this problem, four non-ASCII characters are added in the PDF header. Transfer protocols check the first series of bytes, see that some of these bytes have a value higher than 127 (binary 01111111), and therefore treat the file as a binary file.
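Both effects can be illustrated with a small Python sketch. The function names are hypothetical, and modelling "byte shaving" as clearing the top bit of every byte is an assumption about how such a 7-bit transfer behaves.

```python
def shave_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit transfer by clearing the top bit of every byte."""
    return bytes(b & 0x7F for b in data)

def has_binary_marker(head: bytes, minimum: int = 4) -> bool:
    """True if the second line is a comment holding at least `minimum` bytes >= 128."""
    lines = head.splitlines()
    if len(lines) < 2 or not lines[1].startswith(b"%"):
        return False
    return sum(1 for b in lines[1] if b >= 128) >= minimum

head = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
print(has_binary_marker(head))                  # True
print(shave_high_bit(b"\xe2\xe3\xcf\xd3"))      # b'bcOS'
print(has_binary_marker(shave_high_bit(head)))  # False: the marker is destroyed
```

The ASCII header survives the shaving unchanged, while the four marker bytes turn into ordinary letters.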

How to detect if a file is in the HTML format?

That's more tricky, as HTML allows people to be sloppy. You'd expect the first non-white space of an HTML file to be a < character, but such a file can also be a simple XML file that is not in the HTML format.

You'd expect <!doctype html>, <html> or <body> somewhere in the file (with or without attributes inside the tag), but some people create HTML files without mentioning the DocType, and even without an <html> or a <body> tag.

Note that HTML files can come in many different encodings. For instance: when they are encoded using UTF-8, they will contain bytes with a value higher than 127.
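A weak HTML sniffing heuristic along these lines is sketched below in Python. The name `might_be_html` and the exact pattern are assumptions, and as noted above, sloppy but legitimate HTML slips through.

```python
import re

# <!doctype html>, <html> or <body>, with or without attributes, any letter case.
HTML_HINT = re.compile(rb"<(!doctype\s+html|html|body)[\s>]", re.IGNORECASE)

def might_be_html(data: bytes) -> bool:
    """Guess: the first non-whitespace byte is '<' and a typical HTML tag appears."""
    if not data.lstrip().startswith(b"<"):
        return False
    return HTML_HINT.search(data) is not None

print(might_be_html(b"  <!DOCTYPE html>\n<html><body>hi</body></html>"))  # True
print(might_be_html(b"<?xml version='1.0'?><root/>"))                     # False
print(might_be_html(b"<p>no doctype, html or body tag</p>"))              # False
```

The last call is a false negative: the fragment is HTML that a browser would render, but it carries none of the expected tags.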

How to detect if a file is an ASCII text file?

Just loop over all the bytes. If you find a byte with a value higher than 127, you have a file that is not in ASCII format.
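The loop can be sketched in Python as follows. The function name and the chunk size are assumptions; returning the offset of the first offending byte instead of a plain yes/no is a design choice that makes the result easier to inspect.

```python
def first_non_ascii_offset(path: str, chunk_size: int = 65536):
    """Return the offset of the first byte > 127, or None if the file is pure ASCII."""
    offset = 0
    with open(path, "rb") as f:
        # Read in chunks so large files are never loaded into memory at once.
        while chunk := f.read(chunk_size):
            for i, b in enumerate(chunk):
                if b > 127:
                    return offset + i
            offset += len(chunk)
    return None
```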

What about files in Unicode?

In that case, there may be a Byte Order Mark (BOM) at the start of the file that allows you to detect its encoding. A BOM is optional, though, and UTF-8 files in particular are often saved without one.
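A BOM check can be sketched in Python with the constants from the standard `codecs` module. The function name `encoding_from_bom` and the returned labels are assumptions; a result of None only means that no BOM was found, not that the file isn't Unicode.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def encoding_from_bom(head: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None

print(encoding_from_bom(b"\xef\xbb\xbfhello"))  # utf-8
print(encoding_from_bom(b"\xff\xfeh\x00"))      # utf-16-le
print(encoding_from_bom(b"hello"))              # None
```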

Are there other encodings?

Of course there are! See for instance ISO/IEC 8859. In many cases, a text file doesn't know which encoding was used as the encoding isn't stored as a property of the file.
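To see why the missing encoding matters, the Python sketch below decodes the same three bytes under two parts of ISO/IEC 8859. The helper name `decode_candidates` is an assumption made for illustration.

```python
def decode_candidates(raw: bytes, encodings):
    """Decode the same bytes under each candidate encoding."""
    return {name: raw.decode(name) for name in encodings}

raw = bytes([0xE4, 0xF6, 0xFC])
results = decode_candidates(raw, ["iso-8859-1", "iso-8859-5"])
for name, text in results.items():
    print(name, text)
```

Both decodes succeed, yet they produce different characters (Latin letters in one case, Cyrillic in the other), and nothing in the bytes themselves says which reading was intended.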

Detect if file contains text

Generally, you cannot reliably detect if the file is a text file. It starts with the general issue, what actually is "a text file". You already hinted at encodings, but especially those cannot be reliably detected (for example see Notepad's struggle).

That said, you might be able to employ heuristics to do your best (including, but of course not limited to, file extensions; excluding well-known non-text file types like EXE, DLL, ZIP, and image files by recognizing their signatures; maybe combined with the approach used by browsers or Notepad).

Depending on your application, I guess it would be quite feasible to just let the user select the files to scan (maybe with a default list of extensions to include, like *.cs, *.txt, *.resx, *.xml, ...). If a file type or extension is not in the default list and was not added by the user, it is not counted. If the user adds a file type or extension to the list that is not a "text file", the results are not useful.

Weighing the effort against the fact that an automatic result will never be 100% exact (at detecting all possible files), this should be good enough.
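Combining an extension list with signature checks could look like the Python sketch below. The signature table is a small illustrative selection rather than a complete list, and the names `known_binary_type` and `should_scan` are assumptions.

```python
import os

# A few well-known file signatures ("magic numbers"); illustrative, not complete.
SIGNATURES = {
    b"MZ": "EXE/DLL",
    b"PK\x03\x04": "ZIP",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"\xff\xd8\xff": "JPEG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
}

# Default list of extensions to include; the user could extend it.
DEFAULT_TEXT_EXTENSIONS = {".cs", ".txt", ".resx", ".xml"}

def known_binary_type(head: bytes):
    """Return the name of a recognised binary format, or None."""
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None

def should_scan(path: str, head: bytes, extensions=DEFAULT_TEXT_EXTENSIONS) -> bool:
    """Scan only listed extensions whose content has no known binary signature."""
    ext = os.path.splitext(path)[1].lower()
    return ext in extensions and known_binary_type(head) is None

print(should_scan("Program.cs", b"using System;"))  # True
print(should_scan("renamed.txt", b"PK\x03\x04"))    # False: ZIP content
```

Short signatures such as "MZ" can also occur at the start of genuine text, so this remains a guess like every other approach here.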

Ruby: How to determine if file being read is binary or text

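# Install the gem first, from a shell: gem install ptools
require 'ptools'

# `file` is the path to check; returns true if the content looks binary.
File.binary?(file)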

