Detect If PDF File Is Correct (Header PDF)

Detect if PDF file is correct (header PDF)

a. Unfortunately, there is no easy way to determine is pdf file corrupt. Usually, the problem files have a correct header so the real reasons of corruption are different. PDF file is effectively a dump of PDF objects. The file contains a reference table giving the exact byte offset locations of each object from the start of the file. So, most probably corrupted files have a broken offsets or may be some object is missed.

The best way to detect the corrupted file is to use specialized PDF libraries.
There are lots of both free and commercial PDF libraries for .NET. You may simply try to load PDF file with one of such libraries. iTextSharp will be a good choice.

b. According to the PDF reference the header of a PDF file usually looks like %PDF−1.X (where X is a number, for the present from 0 to 7). And 99% of PDF files have such header. However, there are some other kinds of headers which Acrobat Viewer accepts and even absence of a header isn't a real problem for PDF viewers. So, you shouldn't treat file as corrupted if it does not contain a header.
E.g., the header may be appeared somewhere within the first 1024 bytes of the file or be in the form %!PS−Adobe−N.n PDF−M.m

Just for your information I am a developer of the Docotic PDF library.

Determine if a byte[] is a pdf file

Check the first 4 bytes of the array.

If those are 0x25 0x50 0x44 0x46 then it's most probably a PDF file.

How can I determine if a file is a PDF file?

you can find out the mime type of a file (or byte array), so you dont dumbly rely on the extension. I do it with aperture's MimeExtractor (http://aperture.sourceforge.net/) or I saw some days ago a library just for that (http://sourceforge.net/projects/mime-util)

I use aperture to extract text from a variety of files, not only pdf, but have to tweak thinks for pdfs for example (aperture uses pdfbox, but i added another library as fallback when pdfbox fails)

How can I test a PDF document if it is PDF/A compliant?

A list of PDF/A validators is on the pdfa.org web site here:

verapdf

A free online PDF/A validator is available here:

http://www.validatepdfa.com/

A report on the accuracy of many of these PDF/A validators is available from PDFLib:

https://www.pdflib.com/fileadmin/pdflib/pdf/pdfa/2009-05-04-Bavaria-report-on-PDFA-validation-accuracy.pdf

Se as well:

https://www.pdflib.com/knowledge-base/pdfa/

How to detect if a file is PDF or TIFF?

OK, enough people are getting this wrong that I'm going to post some code I have to identify TIFFs:

private const int kTiffTagLength = 12;
private const int kHeaderSize = 2;
private const int kMinimumTiffSize = 8;
private const byte kIntelMark = 0x49;
private const byte kMotorolaMark = 0x4d;
private const ushort kTiffMagicNumber = 42;

private bool IsTiff(Stream stm)
{
stm.Seek(0);
if (stm.Length < kMinimumTiffSize)
return false;
byte[] header = new byte[kHeaderSize];

stm.Read(header, 0, header.Length);

if (header[0] != header[1] || (header[0] != kIntelMark && header[0] != kMotorolaMark))
return false;
bool isIntel = header[0] == kIntelMark;

ushort magicNumber = ReadShort(stm, isIntel);
if (magicNumber != kTiffMagicNumber)
return false;
return true;
}

private ushort ReadShort(Stream stm, bool isIntel)
{
byte[] b = new byte[2];
_stm.Read(b, 0, b.Length);
return ToShort(_isIntel, b[0], b[1]);
}

private static ushort ToShort(bool isIntel, byte b0, byte b1)
{
if (isIntel)
{
return (ushort)(((int)b1 << 8) | (int)b0);
}
else
{
return (ushort)(((int)b0 << 8) | (int)b1);
}
}

I hacked apart some much more general code to get this.

For PDF, I have code that looks like this:

public bool IsPdf(Stream stm)
{
stm.Seek(0, SeekOrigin.Begin);
PdfToken token;
while ((token = GetToken(stm)) != null)
{
if (token.TokenType == MLPdfTokenType.Comment)
{
if (token.Text.StartsWith("%PDF-1."))
return true;
}
if (stm.Position > 1024)
break;
}
return false;
}

Now, GetToken() is a call into a scanner that tokenizes a Stream into PDF tokens. This is non-trivial, so I'm not going to paste it here. I'm using the tokenizer instead of looking at substring to avoid a problem like this:

% the following is a PostScript file, NOT a PDF file
% you'll note that in our previous version, it started with %PDF-1.3,
% incorrectly marking it as a PDF
%
clippath stroke showpage

this code is marked as NOT a PDF by the above code snippet, whereas a more simplistic chunk of code will incorrectly mark it as a PDF.

I should also point out that the current ISO spec is devoid of the implementation notes that were in the previous Adobe-owned specification. Most importantly from the PDF Reference, version 1.6:

Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.

magick image library writing incorrect header to PDF files?

I have held off from answering this because the solution is embarrassingly obvious. In retrospect, I just assumed that, because I used image_read_pdf it should and would save in PDF format. What I needed to do was specify it explicitly. Adding a format = "pdf" argument to the image_write call achieved that.

image_write(pdfimage3, path = "C:/temp/test.pdf", density = 300, format = "pdf", flatten = TRUE)

This results in a well-formed PDF. Problem solved. Lesson learned.

PDF file with proper header

Is it possible to verify that a posted file is pdf or not?

As all PDF files start with the ASCII string "%PDF-", simply test the first few bytes of the file to ensure that they start with this string.

bool IsPdf(string path)
{
var pdfString = "%PDF-";
var pdfBytes = Encoding.ASCII.GetBytes(pdfString);
var len = pdfBytes.Length;
var buf = new byte[len];
var remaining = len;
var pos = 0;
using(var f = File.OpenRead(path))
{
while(remaining > 0)
{
var amtRead = f.Read(buf, pos, remaining);
if(amtRead == 0) return false;
remaining -= amtRead;
pos += amtRead;
}
}
return pdfBytes.SequenceEqual(buf);
}


Related Topics



Leave a reply



Submit