How to Identify File Type by Base64 Encoded String of an Image

How to identify file type by Base64 encoded string of an image

I solved my problem by using URLConnection.guessContentTypeFromStream(inputStream):

// Requires: java.io.ByteArrayInputStream, java.io.IOException, java.io.InputStream,
// java.net.URLConnection, java.util.Base64

// The Base64 encoded data looks like "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAoAAAAKAC",
// so split off the "data:...;base64," prefix at the comma and keep the payload.
String[] parts = base64ImageString.split(",");
String imageString = parts[parts.length - 1];

// Decode the Base64 encoded string into a byte array.
// java.util.Base64 (Java 8+) is used here in place of the library-specific Base64.decode(...).
byte[] imageByteArray = Base64.getDecoder().decode(imageString);

// Find out the image type
String mimeType = null;
String fileExtension = null;
try (InputStream is = new ByteArrayInputStream(imageByteArray)) {
    // mimeType is something like "image/jpeg", or null if the type is not recognized
    mimeType = URLConnection.guessContentTypeFromStream(is);
    if (mimeType != null) {
        fileExtension = mimeType.substring(mimeType.indexOf('/') + 1);
    }
} catch (IOException ioException) {
    // Reading from an in-memory stream should not fail; the type stays unknown if it does
}

How can i check a base64 string is a file(what type?) or not?

Many filetypes have a header (the first few bytes of the file) with some fixed information by which a file can be identified as a gz, png, pdf, etc.

So every base64 encoded gz file would also start with a certain sequence of base64 characters, by which it can be recognized.

A gzip file always starts with the two-byte sequence 0x1f 0x8b, which in base64 encoding is H4 plus a third character in the range of s to v.

The reason is that every base64 character represents 6 bits of the original bytes, so the two bytes 0x1f 0x8b are encoded with two base64 characters (12 bits) plus the first 4 bits of the third character. The remaining 2 bits of that third character come from the next byte, which is why it can take four different values.

Based on that, I would say that's no base64 encoded gzip that you show there.
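
The bit arithmetic above can be checked with a short Python sketch. It takes the gzip magic bytes 0x1f 0x8b, appends every possible next byte, and collects the three-character base64 prefixes that result. The helper name `possible_prefixes` is made up for this illustration.

```python
import base64

GZIP_MAGIC = bytes([0x1F, 0x8B])

def possible_prefixes(magic: bytes, length: int) -> set:
    """Collect the base64 prefixes of `length` characters that can follow from
    `magic`, trying every possible value for the byte that comes next."""
    prefixes = set()
    for next_byte in range(256):
        encoded = base64.b64encode(magic + bytes([next_byte])).decode("ascii")
        prefixes.add(encoded[:length])
    return prefixes

print(sorted(possible_prefixes(GZIP_MAGIC, 3)))
# prints ['H4s', 'H4t', 'H4u', 'H4v']
```

Only the top two bits of the third byte reach the third base64 character, so 256 candidate bytes collapse into four prefixes.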

Other examples are:

  • png

    starts with: 0x89 0x50 0x4e 0x47 0x0d 0x0a 0x1a 0x0a

    base64 encoded: iVBORw0KGg...

  • jpg

    starts with: 0xFF 0xD8 0xFF, usually followed by 0xE0 (JFIF) or 0xE1 (Exif)

    base64 encoded: /9j/4...

  • gif

    starts with: GIF

    base64 encoded: R0lG

  • tif

    a) little endian:
    starts with: 0x49 0x49 0x2A 0x00

    base64 encoded: SUkqA

    b) big endian:
    starts with: 0x4D 0x4D 0x00 0x2A

    base64 encoded: TU0AK

  • flv

    starts with FLV

    base64 encoded: RkxW

  • wav/avi/webp and others

    Several audio, video, image, and graphics formats are based on RIFF (Resource Interchange File Format).
    The common part is that all of these files start with RIFF

    base64 encoded: UklGR

    After the RIFF header, the specific format is named in the 4 bytes that start at the 9th byte.
    In the following _ is used as a placeholder for any character.

    wav

    starts with: RIFF____WAVE
    base64 encoded: UklGR______XQVZF

    webp

    starts with: RIFF____WEBP
    base64 encoded: UklGR______XRUJQ

    avi

    starts with: RIFF____AVI followed by a space (0x20)
    base64 encoded: UklGR______BVkkg
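
As a rough sketch of how the list above could be used, the following Python function compares the start of a base64 string with the listed prefixes. For RIFF containers it decodes the first 16 characters (12 bytes) and reads the format name at bytes 9 to 12. The names `PREFIXES`, `RIFF_TYPES`, and `guess_type` are invented for this example, and the JPEG entry uses the shorter prefix /9j/ so that it does not depend on the fourth byte.

```python
import base64
from typing import Optional

# Base64 prefixes taken from the list above.
PREFIXES = {
    "iVBORw0KGg": "png",
    "/9j/": "jpg",
    "R0lG": "gif",
    "SUkqA": "tif",   # little endian
    "TU0AK": "tif",   # big endian
    "RkxW": "flv",
}

# Format names found at bytes 9 to 12 of a RIFF container.
RIFF_TYPES = {b"WAVE": "wav", b"WEBP": "webp", b"AVI ": "avi"}

def guess_type(b64: str) -> Optional[str]:
    """Return a short type name for a base64 string, or None if unknown."""
    for prefix, kind in PREFIXES.items():
        if b64.startswith(prefix):
            return kind
    # gzip: "H4" plus a third character in the range s to v
    if b64.startswith("H4") and b64[2:3] in ("s", "t", "u", "v"):
        return "gz"
    # RIFF: decode 16 characters (12 bytes) and look at the format name
    if b64.startswith("UklGR") and len(b64) >= 16:
        header = base64.b64decode(b64[:16])
        return RIFF_TYPES.get(header[8:12])
    return None
```

A match only says that the data starts like a file of that type; it does not show that the rest of the file is valid.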


Regarding the specific example in the question:

In the updated question there's a hint in the attached picture that the data went through both base32 and base64 encoding.

When we feed an online base32 decoder with the string given in the question (JA2HGSKBJI4DSZ2WGRAS...), we get:

H4sIAJ89gV4A/+1ZURaEIAi8SkfQ+1/O3f7MtEBfMgz9rC/diXmIA5hSzun3HNdBbgbtVP2v/2+LowM837wFHKxZbmE9pQfsLOaiLAL8kvIk4MBma17ufHQbIJCXoWNZZKGPWB5QljvXIuXOmm0SgLixJw8HRC8Tbmz7x5eIspypaZHSWbj8cAhdjli2WUkR1sv2dZmwXhZlDnIcCl0GyrFX6fKkBEBTBsq+9uY2Ecug2Rf0xtaJlNdYJuxjP9kcd1LOW/fQXtb1sd3fSTGXFTx3UjfGFx6uJGjeIAAA

It starts with H4s, so according to what I wrote about how to recognize file types in base64 encoding, it's a base64 encoded gzip file.

This can be saved in a text file and then uploaded on base64decode.org where it will be converted into a gzip file. When you download and open that gzip file it contains a file with text like this:

00110000 00110000 00110001 00110001 00110000 00110001 00110000 00110000 00100000 00110000 00110000 00110001 00110001 00110000 00110001 00110000 00110001 00100000 ...

Conclusion for this case: The original string/file is a gzip file that was first base64 encoded and the base64 encoded part was again encoded with base32.

How do I know file type encrypted in base64 string

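In relation to your switch statement, the string for a WAV file would be "UklGR", and the string for an MP3 file that begins with an ID3v2 tag would start with "SUQz". The fifth character depends on the tag version: "SUQzB" matches ID3v2.4, while ID3v2.3 gives "SUQzA". An MP3 file without an ID3 tag starts differently.

These strings are the base64 encoding of the first bytes of the file itself, so each one is essentially the first part of the file header.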
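
To see where such strings come from, here is a small Python sketch that encodes a few header byte sequences and keeps the first five base64 characters. The helper `header_prefix` is hypothetical. The ID3 headers are padded with zero bytes purely for illustration; real files carry flags and a tag size there.

```python
import base64

def header_prefix(header: bytes, length: int = 5) -> str:
    """Base64-encode header bytes and keep the first `length` characters."""
    return base64.b64encode(header).decode("ascii")[:length]

# "ID3" followed by the major version byte (3 for ID3v2.3, 4 for ID3v2.4)
print(header_prefix(b"ID3\x03\x00\x00"))
print(header_prefix(b"ID3\x04\x00\x00"))
# RIFF container, as used by WAV files
print(header_prefix(b"RIFF\x00\x00"))
```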

Python, can someone guess the type of a file only by its base64 encoding?

You can't, at least not without decoding, because the bytes that help identify the filetype are spread across the base64 characters, which don't directly align with whole bytes. Each character encodes 6 bits, which means that for every 4 characters, there are 3 bytes encoded.

Identifying a filetype requires access to those bytes in different block sizes. A JPEG image, for example, starts with the bytes FF D8 and ends with FF D9, but that's two bytes; the third byte that follows must also be decoded as part of the 4-character block.

What you can do is decode just enough of the base64 string to do your filetype fingerprinting. So you can decode the first 4 characters to get the 3 bytes, and then use the first two to see if the object is a JPEG image. A large number of file formats can be identified from just the first or last series of bytes (a PNG image can be identified by the first 8 bytes, a GIF by the first 6, etc.). Decoding just those bytes from the base64 string is trivial.
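
A minimal Python sketch of this partial decoding could look as follows. It rounds the wanted byte count up to whole 3-byte groups, decodes only the matching 4-character groups, and compares the result with a few well-known signatures. The names `first_bytes` and `sniff_image` are made up for this example.

```python
import base64
from typing import Optional

def first_bytes(b64: str, count: int) -> bytes:
    """Decode only as many 4-character groups as are needed for `count` bytes."""
    groups = -(-count // 3)  # ceiling division: each group holds 3 bytes
    return base64.b64decode(b64[: groups * 4])[:count]

def sniff_image(b64: str) -> Optional[str]:
    """Guess an image type from the first 8 decoded bytes, or return None."""
    head = first_bytes(b64, 8)
    if head == b"\x89PNG\r\n\x1a\n":
        return "png"
    if head[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if head[:2] == b"\xff\xd8":
        return "jpeg"
    return None
```

For 8 bytes this decodes 12 characters, no matter how long the full string is.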

Your sample is a PNG image; you can test for image types using the imghdr module (deprecated since Python 3.11 and removed in Python 3.13):

>>> import base64
>>> import imghdr
>>> image_data = """iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="""
>>> sample = base64.b64decode(image_data[:44])  # 33 bytes / 3 times 4 is 44 base64 chars
>>> for tf in imghdr.tests:
...     res = tf(sample, None)
...     if res:
...         break
...
>>> print(res)
png

I only used the first 33 bytes from the base64 data, to echo what the imghdr.what() function will read from the file you pass it (it reads 32 bytes, but that number doesn't divide by 3).

There is an equivalent sndhdr module for sound files, and there is also the python-magic project that lets you pass in a number of bytes to determine a file type.

How to know MIME-type of a file from base64 encoded data in python?

In the general case, there is no way to reliably identify the MIME type of a piece of untagged data.

Many file formats have magic markers which can be used to determine the type of the file with reasonable accuracy, but some magic markers are poorly chosen and might e.g. coincide with text in unrelated files; and of course, a completely random sequence of bits is not in any well-defined file format.

libmagic is the central component of the file command which is commonly used to perform this task. There are several Python bindings but https://pypi.org/project/python-libmagic/ seems to be the most popular and active.

Of course, base64 is just a way to encode untyped binary data. Here's a quick demo with your sample data.

import base64

import magic

encoded_data = '/9j/4AAQSkZJRgABAQEASABIAAD//gA7Q1JFQVRPUjogZ2QtanBlZyB2MS4wICh1c2luZyBJSkcgSlBFRyB2NjIpLCBxdWFsaXR5ID0gOTUK/9sAQwAGBAUGBQQGBgUGBwcGCAoQCgoJCQoUDg8MEBcUGBgXFB==='
with magic.Magic() as m:
    print(m.from_buffer(base64.b64decode(encoded_data)))

Output:

image/jpeg

(Notice I had to fix the padding at the end of your encoded_data.)

Javascript - get extension from base64 image

For a String (which you can parse out of an image) you can do this:

// Define the Base64 string (the raw payload, without a "data:...;base64," prefix)
var encoded = "Base64 encoded image returned from your service";

// Decode only the first 16 characters (12 bytes), which is enough for the signatures below.
// The built-in atob() is used here instead of a hand-written decoder, because a decoder
// that treats the result as UTF-8 text corrupts binary data such as images.
// atob() returns a binary string with one character per byte.
var header = atob(encoded.slice(0, 16));

// if the file extension is unknown, compare the leading bytes with known signatures
var extension = undefined;
var mimeType = undefined;
if (header.indexOf("\x89PNG") === 0) { extension = "png"; mimeType = "image/png"; }
else if (header.indexOf("\xFF\xD8\xFF") === 0) { extension = "jpg"; mimeType = "image/jpeg"; }
else if (header.indexOf("II*\x00") === 0 || header.indexOf("MM\x00*") === 0) { extension = "tiff"; mimeType = "image/tiff"; }

// and then to display the image
var img = document.createElement("img");
if (mimeType) img.src = "data:" + mimeType + ";base64," + encoded;

For completion's sake here's the source and I hope this helps!

Retrieve MIME type from Base64 encoded String

In general, a base 64-encoded string could contain absolutely any data, so there is no way to know its file type.

To determine if it is an instance of a JPEG image, you'd need to base64-decode it, and then do something like checking its magic number, which is useful in telling you what the file isn't. You'd still need to do more work to determine if it is a valid JPEG image.
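
The decode-then-check step could be sketched in Python like this. The function name `looks_like_jpeg` is invented here, and the check is deliberately shallow: it only looks at the start-of-image marker FF D8 FF and the end-of-image marker FF D9. It also assumes that no extra bytes follow the end marker, which some real files violate.

```python
import base64

def looks_like_jpeg(b64: str) -> bool:
    """Cheap structural check: valid base64 whose bytes begin and end like a JPEG."""
    try:
        data = base64.b64decode(b64, validate=True)
    except ValueError:
        # binascii.Error (invalid base64) is a subclass of ValueError
        return False
    return data[:3] == b"\xff\xd8\xff" and data[-2:] == b"\xff\xd9"
```

A `False` result rules JPEG out under these assumptions. A `True` result only means the data is worth handing to a real image decoder.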

How to find file extension of base64 encoded image in Python

It is best practice to examine the file's contents rather than rely on something external to the file. Many email attacks, for example, rely on mis-identifying the MIME type so that an unsuspecting computer executes a file that it shouldn't. Fortunately, most image file extensions can be determined by looking at the first few bytes (after decoding the base64). Best practice, though, might be to use file magic, which can be accessed via a Python package.

Most image file extensions are obvious from the MIME type. For gif, pxc, png, tiff, and jpeg, the file extension is just whatever follows the 'image/' part of the MIME type. To handle the obscure types as well, Python provides a standard package:

>>> from mimetypes import guess_extension
>>> guess_extension('image/x-corelphotopaint')
'.cpt'
>>> guess_extension('image/png')
'.png'
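
When the base64 string arrives as a data URI such as "data:image/png;base64,...", the declared MIME type can be passed to guess_extension. The sketch below assumes that URI form, and the function name `extension_from_data_uri` is made up. It reads only the declared type and does not verify the content, so it is no substitute for examining the bytes.

```python
import mimetypes
from typing import Optional

def extension_from_data_uri(data_uri: str) -> Optional[str]:
    """Return the extension for the MIME type declared in a data URI, if any."""
    header, sep, _payload = data_uri.partition(",")
    if not sep or not header.startswith("data:"):
        return None  # not a data URI, so no declared type to read
    # The header looks like "data:image/png;base64"; keep the part before ";"
    mime_type = header[len("data:"):].split(";")[0]
    return mimetypes.guess_extension(mime_type) if mime_type else None
```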

