How to Identify the File Content as ASCII or Binary

How do I distinguish between 'binary' and 'text' files?

The spreadsheet software my company makes reads a number of binary file formats as well as text files.

We first look at the first few bytes for a magic number which we recognize. If we do not recognize the magic number of any of the binary types we read, then we look at up to the first 2K bytes of the file to see whether it appears to be a UTF-8, UTF-16 or a text file encoded in the current code page of the host operating system. If it passes none of these tests, we assume that it is not a file we can deal with and throw an appropriate exception.
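A rough sketch of that two-stage check might look like the following. The magic-number table is a hypothetical stand-in (a real application would list its own formats), UTF-16 is recognised only by its byte order mark, and the test against the host's current code page is left out because single-byte code pages accept almost any byte sequence.

```python
import codecs

# Hypothetical table of magic numbers; a real application would list its own formats.
MAGIC_NUMBERS = {
    b"PK\x03\x04": "zip container",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE2 compound document",
}

SAMPLE_SIZE = 2048  # "up to the first 2K bytes"


def sniff(data: bytes) -> str:
    """Classify the leading bytes of a file as a known binary type, text, or unknown."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    sample = data[:SAMPLE_SIZE]
    # Assumption: UTF-16 is only recognised here via its byte order mark.
    if sample.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16 text"
    if b"\x00" in sample:
        return "unknown"
    try:
        # An incremental decoder tolerates a multi-byte character cut off at the sample boundary.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return "unknown"
    return "utf-8 text"
```

A caller would read the first 2K bytes of the file in binary mode, pass them to sniff(), and raise its own exception when the result is "unknown".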

How to check if the file is a binary file and read all the files which are not?

Use the file utility. Sample usage:

$ file /bin/bash
/bin/bash: Mach-O universal binary with 2 architectures
/bin/bash (for architecture x86_64): Mach-O 64-bit executable x86_64
/bin/bash (for architecture i386): Mach-O executable i386

$ file /etc/passwd
/etc/passwd: ASCII English text

$ file code.c
code.c: ASCII c program text

See the file manual page (man file) for details.

How can I distinguish between a binary file and a text file using dot-net languages

How to detect if a file is in the PDF format?

Allow me to quote ISO 32000-1:

The first line of a PDF file shall be a header consisting of the 5 characters %PDF– followed by a version number of the form 1.N, where N is a digit between 0 and 7.

And ISO 32000-2:

The PDF file begins with the 5 characters “%PDF–” and offsets shall be calculated from the PERCENT SIGN (25h).

What's the difference? When you encounter a file that starts with %PDF-1.0 to %PDF-1.7, you have an ISO 32000-1 file; starting with ISO 32000-2, a PDF file can also start with %PDF-2.0.
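As a minimal sketch, the header check could be written like this. The function name pdf_version is made up for illustration, and the check is deliberately strict: it only accepts data that starts with the header at offset 0.

```python
import re

# Matches the header forms named above: %PDF-1.0 to %PDF-1.7 and %PDF-2.0.
PDF_HEADER = re.compile(rb"%PDF-(1\.[0-7]|2\.0)")


def pdf_version(data: bytes):
    """Return the version string from a PDF header, or None if the data does not start with one."""
    match = PDF_HEADER.match(data)
    return match.group(1).decode("ascii") if match else None
```

Passing the first few bytes of a file (read in binary mode) is enough; a result of None means the data does not look like a PDF under these assumptions.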

How to detect if a file is a binary file?

This is also explained in ISO 32000:

If a PDF file contains binary data, as most do, the header line shall be immediately followed by a comment line containing at least four binary characters–that is, characters whose codes are 128 or greater. This ensures proper behaviour of file transfer applications that inspect data near the beginning of a file to determine whether to treat the file’s contents as text or as binary.

If you open a PDF in a text editor instead of in a PDF viewer, you'll often see that the second line looks like this:

%âãÏÓ

There is no such thing as a "plain text file"; a file always has an encoding. However, when people talk about plain text files, they often mean ASCII files: files in which every byte has a value lower than 128 (binary 10000000).

Back in the old days, transfer protocols often treated PDF documents as if they were ASCII files. Instead of sending 8-bit bytes, they sent only the lower 7 bits of each byte (this is sometimes referred to as "byte shaving"). When this happens, the ASCII bytes of a PDF file are preserved, but all the binary content gets corrupted. When you open such a PDF in a PDF viewer, you see the pages of the PDF file, but every page is empty.

To avoid this problem, four non-ASCII characters are added in the PDF header. Transfer protocols check the first series of bytes, see that some of these bytes have a value higher than 127 (binary 01111111), and therefore treat the file as a binary file.
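To make this concrete, here is a small sketch with two illustrative helpers (both names are my own): one checks whether the second line is a comment holding at least four bytes of 128 or more, the other simulates a 7-bit transfer by clearing the top bit of every byte.

```python
def has_binary_marker(data: bytes) -> bool:
    """Check whether the second line is a comment with at least four bytes >= 128."""
    lines = data.splitlines()
    if len(lines) < 2:
        return False
    second = lines[1]
    high_bytes = sum(1 for b in second if b >= 128)
    return second.startswith(b"%") and high_bytes >= 4


def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit transfer by clearing the top bit of every byte."""
    return bytes(b & 0x7F for b in data)


# The sample header uses the bytes of the comment line shown earlier.
header = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
damaged = strip_high_bit(header)
```

With this sample, has_binary_marker(header) is true, while the damaged copy keeps its %PDF-1.7 line but no longer carries the marker. For a real file you would pass only the leading bytes, since splitlines() would otherwise split the whole file.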

How to detect if a file is in the HTML format?

That's trickier, as HTML allows people to be sloppy. You'd expect the first non-whitespace character of an HTML file to be a < character, but such a file can also be a simple XML file that is not in the HTML format.

You'd expect <!doctype html>, <html> or <body> somewhere in the file (with or without attributes inside the tag), but some people create HTML files without mentioning the DocType, and even without an <html> or a <body> tag.

Note that HTML files can come in many different encodings. For instance: when they are encoded using UTF-8, they will contain bytes with a value higher than 127.
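A heuristic along those lines could be sketched as follows. It is only a guess, not a parser: it works on raw bytes, so it assumes an ASCII-compatible encoding such as UTF-8 without a byte order mark, and it reports sloppy HTML that lacks all three markers as "not HTML".

```python
import re

# Looks for the markers mentioned above, with or without attributes inside the tag.
HTML_MARKER = re.compile(rb"<!doctype\s+html|<html[\s>]|<body[\s>]", re.IGNORECASE)


def looks_like_html(data: bytes) -> bool:
    """Heuristic only: sloppy HTML without these markers is reported as False."""
    return data.lstrip().startswith(b"<") and HTML_MARKER.search(data) is not None
```

A plain XML document starts with < as well, but fails the marker search, which is the distinction the text above is after.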

How to detect if a file is an ASCII text file?

Just loop over all the bytes. If you find a byte with a value higher than 127, you have a file that is not in ASCII format.
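In Python, that loop could be sketched like this. The function name and chunk size are arbitrary choices; reading in chunks simply avoids loading a large file into memory at once.

```python
import io


def is_ascii_stream(stream, chunk_size: int = 65536) -> bool:
    """Return True if every byte read from a binary stream has a value below 128."""
    while chunk := stream.read(chunk_size):
        if any(b > 127 for b in chunk):
            return False
    return True


# Usage with an in-memory stream; a real file would be opened with open(path, "rb").
result = is_ascii_stream(io.BytesIO(b"hello"))
```

Note that an empty file counts as ASCII under this definition, since it contains no byte above 127.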

What about files in Unicode?

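In that case, there may be a Byte Order Mark (BOM) at the start of the file that allows you to detect the encoding. A BOM is optional, though, and UTF-8 files in particular are often saved without one, so its absence doesn't prove that the file isn't Unicode.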
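A minimal sketch of BOM sniffing, using the constants from Python's codecs module. The function name is my own, and a result of None only means that no mark was found, which is common for UTF-8 files.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE mark starts with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding implied by a leading byte order mark, or None if there is none."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    return None
```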

Are there other encodings?

Of course there are! See for instance ISO/IEC 8859. In many cases, a text file doesn't know which encoding was used as the encoding isn't stored as a property of the file.

How can I determine if a file is binary or text in c#?

I would probably look for an abundance of control characters, which would typically be present in a binary file but rarely in a text file. Binary files tend to contain enough zero bytes that just testing for many 0 bytes would probably be sufficient to catch most files. If you care about localization, you'd need to test multi-byte patterns as well.

As stated though, you can always be unlucky and get a binary file that looks like text or vice versa.
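The same idea, sketched in Python rather than C# to keep the examples on this page in one language. The 10% threshold is an arbitrary assumption, and the NUL test will misclassify UTF-16 text, which is exactly the kind of bad luck mentioned above.

```python
# Control bytes that are normal in text: tab, line feed, form feed, carriage return.
TEXT_CONTROL_BYTES = frozenset(b"\t\n\x0c\r")


def looks_binary(data: bytes, threshold: float = 0.10) -> bool:
    """Guess whether data is binary from NUL bytes and the share of control bytes."""
    if not data:
        return False
    if b"\x00" in data:
        return True
    control = sum(1 for b in data if b < 32 and b not in TEXT_CONTROL_BYTES)
    return control / len(data) > threshold
```

Running it on a sample from the start of the file, rather than on the whole file, is usually enough for a guess of this kind.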

How to Check if File is ASCII or Binary in PHP

This only works for PHP>=5.3.0, and isn't 100% reliable, but hey, it's pretty darn close.

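// Returns true if fileinfo reports a MIME type that starts with 'text'.
// (is_text_file is just an illustrative name.)
function is_text_file($filename) {
    // FILEINFO_MIME yields the MIME type plus charset, e.g. "text/plain; charset=us-ascii"
    $finfo = finfo_open(FILEINFO_MIME);
    $mime = finfo_file($finfo, $filename);
    // finfo_file() returns false on failure, so check before comparing the prefix
    return $mime !== false && substr($mime, 0, 4) === 'text';
}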

http://us.php.net/manual/en/ref.fileinfo.php

How can I detect if a file is binary (non-text) in Python?

You can also use the mimetypes module:

import mimetypes
...
# guess_type() looks only at the file name and returns a (type, encoding) tuple
mime, encoding = mimetypes.guess_type(file)

It's fairly easy to compile a list of binary mime types. For example, Apache ships with a mime.types file that you could parse into two lists, binary and text, and then check whether the mime type is in your text list or your binary list.
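Here is a small sketch of that classification step. The set of textual application types is a hypothetical starting point, not a complete list, and the result depends purely on the file name, so a misnamed file will be misclassified.

```python
import mimetypes

# Hypothetical list of non-"text/" types that are still textual; extend as needed.
TEXTUAL_APPLICATION_TYPES = {"application/json", "application/xml", "application/javascript"}


def is_text_by_name(filename: str):
    """Classify by file name: True for text, False for binary, None if the type is unknown."""
    mime, encoding = mimetypes.guess_type(filename)
    if mime is None:
        return None
    if encoding is not None:
        # A compressed file (e.g. notes.txt.gz) is binary on disk, whatever it contains.
        return False
    return mime.startswith("text/") or mime in TEXTUAL_APPLICATION_TYPES
```

Returning None for unknown names leaves the caller free to fall back on a content-based check.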


