Minifying Final HTML Output Using Regular Expressions with Codeigniter

Minifying final HTML output using regular expressions with CodeIgniter

For those curious about how Alan Moore's regex works (and yes, it does work), I've taken the liberty of commented it so it can be read by mere mortals:

function process_data_alan($text) // 
{
    $re = '%# Collapse ws everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          (?:           # Begin (unnecessary) group.
            (?:         # Zero or more of...
              [^<]++    # Either one or more non-"<"
            | <         # or a < starting a non-blacklist tag.
              (?!/?(?:textarea|pre)\b)
            )*+         # (This could be "unroll-the-loop"ified.)
          )             # End (unnecessary) group.
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre)\b
          | \z          # or end of file.
          )             # End alternation group.
        )  # If we made it here, we are not in a blacklist tag.
        %ix';
    $text = preg_replace($re, " ", $text);
    return $text;
}

I'm new around here, but I can see right off that Alan is quite good at regex. I would only add the following suggestions.

There is an unnecessary capture group which can be removed.
Although the OP did not say so, the <SCRIPT> element should be added to the <PRE> and <TEXTAREA> blacklist.
Adding the 'S' PCRE "study" modifier speeds up this regex by about 20%.
There is an alternation group in the lookahead which is ripe for applying Friedl's "unrolling-the-loop" efficiency construct.
On a more serious note, this same alternation group: (i.e. (?:[^<]++|<(?!/?(?:textarea|pre)\b))*+) is susceptible to excessive PCRE recursion on large target strings, which can result in a stack-overflow causing the Apache/PHP executable to silently seg-fault and crash with no warning. (The Win32 build of Apache httpd.exe is particularly susceptible to this because it has only 256KB stack compared to the *nix executables, which are typically built with 8MB stack or more.) Philip Hazel (the author of the PCRE regex engine used in PHP) discusses this issue in the documentation: PCRE DISCUSSION OF STACK USAGE. Although Alan has correctly applied the same fix as Philip shows in this document (applying a possessive plus to the first alternative), there will still be a lot of recursion if the HTML file is large and has a lot of non-blacklisted tags. e.g. On my Win32 box (with an executable having a 256KB stack), the script blows up with a test file of only 60KB. Note also that PHP unfortunately does not follow the recommendations and sets the default recursion limit way too high at 100000. (According to the PCRE docs this should be set to a value equal to the stack size divided by 500).

Here is an improved version which is faster than the original, handles larger input, and gracefully fails with a message if the input string is too large to handle:

// Set PCRE recursion limit to sane value = STACKSIZE / 500
// ini_set("pcre.recursion_limit", "524"); // 256KB stack. Win32 Apache
ini_set("pcre.recursion_limit", "16777");  // 8MB stack. *nix
function process_data_jmr1($text) // 
{
    $re = '%# Collapse whitespace everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          [^<]*+        # Either zero or more non-"<" {normal*}
          (?:           # Begin {(special normal*)*} construct
            <           # or a < starting a non-blacklist tag.
            (?!/?(?:textarea|pre|script)\b)
            [^<]*+      # more non-"<" {normal*}
          )*+           # Finish "unrolling-the-loop"
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre|script)\b
          | \z          # or end of file.
          )             # End alternation group.
        )  # If we made it here, we are not in a blacklist tag.
        %Six';
    $text = preg_replace($re, " ", $text);
    if ($text === null) exit("PCRE Error! File too big.\n");
    return $text;
}

p.s. I am intimately familiar with this PHP/Apache seg-fault problem, as I was involved with helping the Drupal community while they were wrestling with this issue. See: Optimize CSS option causes php cgi to segfault in pcre function "match". We also experienced this with the BBCode parser on the FluxBB forum software project.

Hope this helps.

Possible to use CodeIgniter output compression with pre to display code blocks?

Let's start by looking at the code you're using now.

 $search = array(
    '/\n/',
    '/\>[^\S ]+/s',
    '/[^\S ]+\</s',
    '/(\s)+/s'
  );

 $replace = array(
    ' ',
    '>',
    '<',
    '\\1'
  );

The intention appears to be to convert all whitespace characters to simple spaces, and to compress every run of multiple spaces down to one. Except it's possible for carriage-returns, tabs, formfeeds and other whitespace characters to slip through, thanks to the \\1 in the fourth replacement string. I don't think that's what the author intended.

If that code was working for you (aside from matching inside <pre> elements), this would probably work just as well, if not better:

$search = '/(?>[^\S ]\s*|\s{2,})/`;

$replace = ' ';

And now we can add a lookahead to prevent it from matching inside <pre> elements:

$search = 
  '#(?>[^\S ]\s*|\s{2,})(?=(?:(?:[^<]++|<(?!/?pre\b))*+)(?:<pre>|\z))#`;

But really, this is not the right tool for the job you're doing. I mean, look at that monster! You'll never be able to maintain it, and complicated as it is, it's still nowhere near as robust as it should be.

I was going to urge you to drop this approach and use a dedicated HTML minifier instead, but that one seems to have its own problems with <pre> elements. If that problem has been fixed, or if there's another minifier out there that would meet your needs, you should definitely go that route.

EDIT: In response to a comment, here's a version that excludes <textarea> as well as <pre> elements:

$search = 
  '#(?ix)
    (?>[^\S ]\s*|\s{2,})
    (?=
      (?:(?:[^<]++|<(?!/?(?:textarea|pre)\b))*+)
      (?:<(?>textarea|pre)\b|\z)
    )
    #'

Regex: how to uglify HTML without losing formatting in certain tags

It can be done using this regexp

Minifying final HTML output using regular expressions with CodeIgniter

How to minify php page html output?

CSS and Javascript

Consider the following link to minify Javascript/CSS files: https://github.com/mrclay/minify

HTML

Tell Apache to deliver HTML with GZip - this generally reduces the response size by about 70%. (If you use Apache, the module configuring gzip depends on your version: Apache 1.3 uses mod_gzip while Apache 2.x uses mod_deflate.)

Accept-Encoding: gzip, deflate
Content-Encoding: gzip

Use the following snippet to remove white-spaces from the HTML with the help ob_start's buffer:

<?php

function sanitize_output($buffer) {

    $search = array(
        '/\>[^\S ]+/s',     // strip whitespaces after tags, except space
        '/[^\S ]+\</s',     // strip whitespaces before tags, except space
        '/(\s)+/s',         // shorten multiple whitespace sequences
        '/<!--(.|\s)*?-->/' // Remove HTML comments
    );

    $replace = array(
        '>',
        '<',
        '\\1',
        ''
    );

    $buffer = preg_replace($search, $replace, $buffer);

    return $buffer;
}

ob_start("sanitize_output");

?>

Minify HTML with Boost regex in C++

Regexps are a powerful tool, but I think that using them in this case will be a bad idea. For example, regexp you provided is maintenance nightmare. By looking at this regexp you can't quickly understand what the heck it is supposed to match.

You need a html parser that would tokenize input file, or allow you to access tokens either as a stream or as an object tree. Basically read tokens, discards those tokens and attributes you don't need, then write what remains into output. Using something like this would allow you to develop solution faster than if you tried to tackle it using regexps.

I think you might be able to use xml parser or you could search for xml parser with html support.

In C++, libxml (which might have HTML support module), Qt 4, tinyxml, plus libstrophe uses some kind of xml parser that could work.

Please note that C++ (especially C++03) might not be the best language for this kind of program. Although I strongly dislike python, python has "Beautiful Soup" module that would work very well for this kind of problem.

Qt 4 might work because it provides decent unicode string type (and you'll need it if you're going to parse html).

PHP regex not working - returns NULL on local server, but works properly on other

You say you solved the problem, but if your solution was to increase the backtrack_limit setting, that's not a solution. In fact, you're probably setting yourself up for bigger problems later on. You need to find out why it's doing so much backtracking.

After \{\s?joomla-tag\s+ locates the beginning of the tag, the first .* initially gobbles up the remainder of the document. Then it starts backing off, trying to let the rest of the regex match. When it reaches a point where <+ can match, the .+ again consumes the rest of the document, and another wave of backtracking begins. And with yet another .* after that, you're making it do a ridiculous amount of unnecessary work.

This is the reason for the rule of thumb,

Don't use the dot metacharacter (especially .* or .+) if you can use something more specific. If you do use the dot, don't use it in single-line or DOTALL mode (i.e., the /s modifier or its inline, (?s) form).

In this case, you know the match should end at the next closing brace (}), so don't let it match any braces before that:

\{\s?joomla-tag\s+([^}]*)\}

Auto Trim whitespace for optimize reading HTML/JS page

This should work

$document = preg_replace( '/\n/', '', $document );
$document = preg_replace( '/>\s+</', '><', $document );
$document = preg_replace( '/\s{2,}/', ' ', $document );

Bye

Minifying Final HTML Output Using Regular Expressions with Codeigniter