Delphi HTML Decode

delphi html decode

Look at the HTTPApp unit, which provides HTTPDecode and HTMLDecode (as well as the corresponding Encode functions). You should find it in your Source/Win32/Internet folder.

Issue with html decoding in Delphi

According to the official documentation of THTMLEncoding, it only supports character entities for the reserved HTML characters ", &, <, and >:

THTMLEncoding only encodes reserved HTML characters: "&<>.

But it also is able to decode numeric character references:

THTMLEncoding supports decoding any HTML numeric character reference, such as &#169; or &#xFE;, as well as the character entity references of reserved HTML characters: &quot;, &amp;, &lt;, &gt;.

So the only named character entities it supports are the ones for the HTML reserved characters ", &, <, and >.

Indeed, the documentation explicitly warns:

Warning: Decoding character entity references of non-reserved characters, such as &apos; or &copy;, is not supported. The input data must not contain any other character entity references. Otherwise, the output data may be corrupted.

Fortunately, this SO question (and the answer by Ian Boyd) contains code to decode HTML character entities other than those for the reserved characters.
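As a rough illustration of working within that limitation, the sketch below rewrites a few named entities as numeric references before handing the text to THTMLEncoding. It assumes a Delphi version that ships System.NetEncoding and whose THTMLEncoding behaves as the documentation quoted above describes. DecodeWithNamedEntities and its three-entry table are hypothetical and not part of the RTL.

```pascal
program DecodeSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.NetEncoding;

// Hypothetical helper (not an RTL routine): maps a few named entities to
// numeric references, then lets THTMLEncoding decode the result.
function DecodeWithNamedEntities(const S: string): string;
const
  Names: array[0..2] of string = ('&copy;', '&apos;', '&nbsp;');
  Codes: array[0..2] of string = ('&#169;', '&#39;', '&#160;');
var
  I: Integer;
begin
  Result := S;
  for I := Low(Names) to High(Names) do
    Result := StringReplace(Result, Names[I], Codes[I], [rfReplaceAll]);
  // Assumption: Decode handles numeric references as documented.
  Result := TNetEncoding.HTML.Decode(Result);
end;

begin
  // The console may not render non-ASCII characters correctly.
  Writeln(DecodeWithNamedEntities('&lt;p&gt;&copy; 2020 &amp; later&lt;/p&gt;'));
end.
```

An escaped ampersand such as &amp;copy; is not touched by the replacement pass, because it does not contain the literal text &copy;. A complete solution would need the full HTML entity table.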

Read HTML special characters in Delphi string

Since your HTML file is encoded in UTF-8, you should specify it when calling LoadFromFile():

S := TStringList.Create;
S.LoadFromFile('path\index.html', TEncoding.UTF8);

Otherwise the ANSI encoding is used.

Is there a Delphi standard function for escaping HTML?

I am 99% sure that such a function does not exist in the RTL (as of Delphi 2009). It is, however, trivial to write one.

Update

HTTPUtil.HTMLEscape is what you are looking for:

function HTMLEscape(const Str: string): string;

I don't dare to publish the code here (copyright violation, probably), but the routine is very simple. It encodes "<", ">", "&", and """ to &lt;, &gt;, &amp;, and &quot;. It also replaces characters #92 and #160..#255 with decimal codes, e.g. &#92;.

This latter step is unnecessary if the file is UTF-8, and also illogical, because higher special characters, such as ∮, are left as they are, while lower special characters, such as ×, are encoded.

Update 2

In response to the answer by Stijn Sanders, I made a simple performance test.

program Project1;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

var
  t1, t2, t3, t4: Int64;
  i: Integer;
  str: string;
const
  N = 100000;

// Character-by-character encoder
function HTMLEncode(const Data: string): string;
var
  i: Integer;
begin
  result := '';
  for i := 1 to length(Data) do
    case Data[i] of
      '<': result := result + '&lt;';
      '>': result := result + '&gt;';
      '&': result := result + '&amp;';
      '"': result := result + '&quot;';
    else
      result := result + Data[i];
    end;
end;

// StringReplace-based encoder; '&' must be replaced first
function HTMLEncode2(Data: string):string;
begin
  Result:=
    StringReplace(
    StringReplace(
    StringReplace(
    StringReplace(
      Data,
      '&','&amp;',[rfReplaceAll]),
      '<','&lt;',[rfReplaceAll]),
      '>','&gt;',[rfReplaceAll]),
      '"','&quot;',[rfReplaceAll]);
end;

begin

  QueryPerformanceCounter(t1);
  for i := 0 to N - 1 do
    str := HTMLEncode('Testing. Is 3*4<3+4? Do you like "A & B"');
  QueryPerformanceCounter(t2);

  QueryPerformanceCounter(t3);
  for i := 0 to N - 1 do
    str := HTMLEncode2('Testing. Is 3*4<3+4? Do you like "A & B"');
  QueryPerformanceCounter(t4);

  Writeln(IntToStr(t2-t1));
  Writeln(IntToStr(t4-t3));

  Readln;

end.

The output (in performance-counter ticks, so lower is faster) is

532031
801969
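Raw QueryPerformanceCounter values are only comparable on the same machine. A minimal sketch, assuming Windows, of converting a tick difference to milliseconds with QueryPerformanceFrequency follows; the Sleep call is merely a stand-in for the code being timed.

```pascal
program TicksToMs;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

var
  Freq, StartTicks, StopTicks: Int64;
begin
  QueryPerformanceFrequency(Freq);      // ticks per second on this machine
  QueryPerformanceCounter(StartTicks);
  Sleep(50);                            // placeholder for the work being measured
  QueryPerformanceCounter(StopTicks);
  // elapsed ticks / (ticks per second) * 1000 = milliseconds
  Writeln(Format('%.3f ms', [(StopTicks - StartTicks) * 1000 / Freq]));
end.
```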

Numeric equivalence between HTML Character Entities and Delphi?

This is 'MATHEMATICAL DOUBLE-STRUCK SMALL A' (U+1D552). It is outside the Basic Multilingual Plane, and so in UTF-16 it is encoded using a surrogate pair, which means that two UTF-16 character elements are required.

Look at your attempt: Chr(120146). Now, 120146 > high(Word) (= 65535) which tells you that your code cannot succeed. Remember that each UTF-16 character element is 16 bits in size. It would be nice if the compiler warned about this. Does it?

The link above tells you how to encode it. It is given by this surrogate pair:

0xD835 0xDD52

In Delphi that would be most easily written as:

#$D835#$DD52

If you are starting with the UTF-32 code as a numeric value then you can convert it to a Delphi string using TCharacter.ConvertFromUtf32 from the System.Character unit:

TCharacter.ConvertFromUtf32($1D552)

Obviously the argument to this function can be a variable.
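For readers who want to see where the surrogate pair comes from, here is a small sketch of the UTF-16 arithmetic done by hand. ToSurrogates is a hypothetical helper written for illustration; in real code the library routine above is the better choice.

```pascal
program SurrogateSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: splits a code point above U+FFFF into its
// UTF-16 high and low surrogates.
procedure ToSurrogates(CodePoint: Cardinal; out HighSur, LowSur: Word);
var
  V: Cardinal;
begin
  V := CodePoint - $10000;                 // leaves a 20-bit value
  HighSur := Word($D800 or (V shr 10));    // top 10 bits
  LowSur := Word($DC00 or (V and $3FF));   // bottom 10 bits
end;

var
  HighSur, LowSur: Word;
begin
  ToSurrogates($1D552, HighSur, LowSur);
  Writeln(IntToHex(HighSur, 4), ' ', IntToHex(LowSur, 4)); // D835 DD52
end.
```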

If much of the above Unicode terminology is unknown to you, read these articles:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), Joel Spolsky.
  • A Programmer’s Introduction to Unicode, Nathan Reed.

Delphi speed up decode and show a custom image

Here is an example of how you could process the binary data.

DISCLAIMER
This code sample is far from optimized; I tried to keep it simple so that the concept of processing the binary data is easy to grasp.

The main concept here is that we have a 40-bit sync word (marker), but since we are dealing with individual bits, it may not start on a byte boundary. So all we need to do is read at least 48 bits (6 bytes) into a 64-bit integer and shift the bits to the right until we find our marker. I did not include the RGB pixel extraction logic; I leave that as an exercise for you :). I think you can decode it with WIC as GUID_WICPixelFormat32bppBGR101010.
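Before the full program, the shift-and-compare idea can be sketched in isolation. FindMarkerShift and Mask40 are hypothetical names used only for this illustration; the sketch assumes the 48-bit window has already been read big-endian into a UInt64.

```pascal
program MarkerShiftSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  Marker: UInt64 = $AC543265FC;   // the 40-bit sync word
  Mask40: UInt64 = $FFFFFFFFFF;   // keeps the low 40 bits

// Hypothetical helper: returns the right shift (0..7) that aligns the marker
// inside a 48-bit window, or -1 if the marker is not present.
function FindMarkerShift(Window: UInt64): Integer;
var
  Shift: Integer;
begin
  for Shift := 0 to 7 do
    if ((Window shr Shift) and Mask40) = Marker then
      Exit(Shift);
  Result := -1;
end;

begin
  Writeln(FindMarkerShift(Marker shl 3));  // 3
  Writeln(FindMarkerShift($1234));         // -1
end.
```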

program SO59584303;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  Classes,
  System.SysUtils;

type
  ImageArray = TArray<Byte>;

const
  FrameSync : UInt64 = $AC543265FC; // we need UInt64 as our marker is > 32 bits

// returns byte number ByteNum of Value (1 = least significant byte)
function GetByte(const Value : UInt64; const ByteNum : Byte) : Byte; inline;
begin
  Result := (Value shr ((ByteNum-1)*8)) and $FF;
end;

procedure WriteInt64BigEndian(const Value: UInt64; NumberOfBytes : Integer; var Stream : TBytes; var Ps : Integer);
var
  I : Integer;
begin
  for I := NumberOfBytes downto 1 do
  begin
    Stream[Ps] := GetByte(Value, I);
    Inc(Ps);
  end;
end;

function ReadInt64BigEndian(const NumberOfBytes : Integer; const Stream : TBytes; var Ps : Integer) : UInt64;
var
  I : Integer;
  B : Byte;
begin
  Result := 0;
  for I := NumberOfBytes downto 1 do
  begin
    B := Stream[Ps];
    Result := Result or (UInt64(B) shl ((I-1)* 8));
    Inc(Ps);
    // sanity check
    if Ps >= Length(Stream) then
      Exit;
  end;
end;

procedure ReadPixelData(const Stream : TBytes; Var Ps : Integer; const Shift : Byte; var Buffer : ImageArray);
// our buffer
var
  I : UInt64;
  BPos : Integer;
begin
  BPos := 0;
  // 1024 * 10 bit pixel = 10240 bits = 1280 bytes // initialize buffer
  SetLength(Buffer, 1280);
  // fill with 0's
  FillChar(Buffer[0], Length(Buffer), 0);
  if Shift = 0 then
  begin
    // if we are byte boundary, we can just copy our data
    Move(Stream[Ps], Buffer[0], Length(Buffer));
    Inc(Ps, Length(Buffer));
  end
  else
    while Bpos < Length(Buffer) do
    begin
      // Read 8 bytes at a time and shift x bits to the right, mask off highest byte
      // this means we can get max 7 bytes at a time
      I := (ReadInt64BigEndian(8, Stream, Ps) shr Shift) and $00FFFFFFFFFFFFFF;
      // Write 7 bytes to our image data buffer
      WriteInt64BigEndian(I, 7, Buffer, BPos);
      // go one position back for the next msb bits
      Dec(Ps);
    end;
end;

procedure WritePixelData(var Stream : TBytes; Var Ps : Integer; var Shift : Byte);
var
  Count : Integer;
  ByteNum : Byte;
  Data : UInt64;
begin
  for Count := 1 to 160 do
  begin
    // write four bytes at a time, due to the shifting we get 5 bytes in total
    Data := $F1F2F3F4;
    if (Shift > 0) then
    begin
      // special case, we need to fillup shift bits on last written byte in the buffer with highest byte from our UInt64
      Data := Data shl Shift;
      Stream[Ps-1] := Stream[Ps-1] or GetByte(Data, 5);
    end;
    WriteInt64BigEndian(Data, 4, Stream, Ps);
    Data := $F5F6F7F8;
    if (Shift > 0) then
    begin
      // special case, we need to fillup shift bits on last written byte in the buffer with highest byte from our UInt64
      Data := Data shl Shift;
      Stream[Ps-1] := Stream[Ps-1] or GetByte(Data, 5);
    end;
    WriteInt64BigEndian(Data, 4, Stream, Ps);
  end;
end;

procedure GenerateData(var Stream : TBytes);
var
  Count : Integer;
  I : UInt64;
  Ps : Integer;
  Shift : Byte;
begin
  Count := 1285*4+10;
  SetLength(Stream, Count); // make room for 4 Imageframes (1280 bytes or 10240 bits) and 5 byte marker (40 bits) + 10 bytes extra room
  FillChar(Stream[0], Count, 0);
  Ps := 1;
  // first write some garbage
  Stream[0] := $AF;
  // our first marker will be shifted 3 bits to the left
  Shift := 3;
  I := FrameSync shl Shift;
  // write our Framesync (40+ bits = 6 bytes)
  WriteInt64BigEndian(I, 6, Stream, Ps);
  // add our data, 1280 bytes or 160 times 8 bytes, we use $F1 F2 F3 F4 F5 F6 F7 F8 as sequence
  // (fits in Int 64) so that we can verify our decoding stage later on
  WritePixelData(Stream, Ps, Shift);
  // write some garbage
  Stream[Ps] := $AE;
  Inc(Ps);
  // our second marker will be shifted 2 bits to the left
  Shift := 2;
  I := FrameSync shl Shift;
  WriteInt64BigEndian(I, 6, Stream, Ps);
  WritePixelData(Stream, Ps, Shift);
  // write some garbage
  Stream[Ps] := $AD;
  Inc(Ps);
  // our third marker will be shifted 1 bit to the left
  Shift := 1;
  I := FrameSync shl Shift;
  WriteInt64BigEndian(I, 6, Stream, Ps);
  WritePixelData(Stream, Ps, Shift);
  // write some garbage
  Stream[Ps] := $AC;
  Inc(Ps);
  // our fourth marker will be shifted 5 bits to the left
  Shift := 5;
  I := FrameSync shl Shift;
  WriteInt64BigEndian(I, 6, Stream, Ps);
  WritePixelData(Stream, Ps, Shift);
  SetLength(Stream, Ps-1)
end;

procedure DecodeData(const Stream : TBytes);
var
  Ps : Integer;
  OrgPs : Integer;
  BPos : Integer;
  I : UInt64;
  Check : UInt64;
  Shift : Byte;
  ByteNum : Byte;
  ImageData : ImageArray;
begin
  Ps := 0;
  Shift := 0;
  while Ps < Length(Stream) do
  begin
    // try to find a marker
    // determine the number of bytes we need to read, 40bits = 5 bytes,
    // when we have shifted bits this will require 6 bytes
    if Shift = 0 then
      ByteNum := 5
    else
      ByteNum := 6;
    // save initial position in the stream
    OrgPs := Ps;
    // read our marker
    I := ReadInt64BigEndian(ByteNum, Stream, Ps);
    // if we have shifted bits, shift them on byte boundary and make sure we only have the 40 lower bits
    if Shift > 0 then
      I := (I shr Shift) and $FFFFFFFFFF;
    if I = FrameSync then
    begin
      // we found our marker, process pixel data (ie read next 10240 bits, taking shift into account)
      // If we have shift, our first bits will be found in the last marker byte, so go back one position in the stream
      if Shift > 0 then
        Dec(Ps);
      ReadPixelData(Stream, Ps, Shift, ImageData);
      // process Image array accordingly, here we will just check that we have our written data back
      BPos := 0;
      Check := $F1F2F3F4F5F6F7F8;
      for ByteNum := 1 to 160 do
      begin
        I := ReadInt64BigEndian(8, ImageData, BPos);
        // if our data is not correct, raise error
        Assert(I = Check, 'error decoding image data');
      end;
    end
    else
    begin
      Ps := OrgPs;
      // we did not find our marker, advance 1 bit
      Inc(Shift);
      if Shift > 7 then
      begin
        // reset shift value
        Shift := 0;
        // advance to next byte boundary
        Inc(Ps);
      end;
    end;
  end;
end;

var
  AStream : TBytes;

begin
  try
    GenerateData(AStream);
    DecodeData(AStream);
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.

Best HTML encoder for Delphi?

Replacing the <, >, &, and " characters in a string is trivial. You could thus easily write your own routine for this. (And if your HTML page is UTF-8, there is absolutely no reason to encode any other characters, such as U+222B (the integral sign).)

But if you wish to stick to the Delphi RTL, you can have a look at HTTPUtil.HTMLEscape, which has exactly the same signature as HTTPApp.HTMLEncode.

Or, have a look at this SO question.


