Using PHP Substr() and Strip_Tags() While Retaining Formatting and Without Breaking HTML

Using PHP substr() and strip_tags() while retaining formatting and without breaking HTML

Not amazing, but works.

function html_cut($text, $max_length)
{
    $tags   = array();
    $result = "";

    $is_open   = false;
    $grab_open = false;
    $is_close  = false;
    $in_double_quotes = false;
    $in_single_quotes = false;
    $tag = "";

    $i = 0;
    $stripped = 0;

    $stripped_text = strip_tags($text);

    while ($i < strlen($text) && $stripped < strlen($stripped_text) && $stripped < $max_length)
    {
        $symbol  = $text[$i]; // square brackets: the {} offset syntax was removed in PHP 8
        $result .= $symbol;

        switch ($symbol)
        {
           case '<':
                $is_open   = true;
                $grab_open = true;
                break;

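           case '"':
               // quotes only delimit attribute values inside a tag
               if ($is_open && !$in_single_quotes)
                   $in_double_quotes = !$in_double_quotes;
               else if (!$is_open && !$is_close)
                   $stripped++; // a quote in the text is a visible character

            break;

            case "'":
              if ($is_open && !$in_double_quotes)
                  $in_single_quotes = !$in_single_quotes;
              else if (!$is_open && !$is_close)
                  $stripped++; // a quote in the text is a visible character

            break;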

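            case '/':
                if ($is_open && !$in_double_quotes && !$in_single_quotes)
                {
                    // "</" starts a closing tag; a later "/" (as in <br />)
                    // self-closes the element, so nothing must be popped
                    $is_close  = ($tag === "");
                    $is_open   = false;
                    $grab_open = false;
                    $tag       = "";
                }
                else if (!$is_open && !$is_close)
                    $stripped++; // a slash in the text is a visible character

                break;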

            case ' ':
                if ($is_open)
                    $grab_open = false;
                else
                    $stripped++;

                break;

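            case '>':
                if ($is_open)
                {
                    $is_open   = false;
                    $grab_open = false;
                    // void elements such as <br> or <img> never get a closing tag
                    if (!in_array(strtolower($tag), array('br', 'hr', 'img', 'input', 'link', 'meta')))
                        array_push($tags, $tag);
                    $tag = "";
                }
                else if ($is_close)
                {
                    $is_close = false;
                    array_pop($tags);
                    $tag = "";
                }

                break;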

            default:
                if ($grab_open || $is_close)
                    $tag .= $symbol;

                if (!$is_open && !$is_close)
                    $stripped++;
        }

        $i++;
    }

    while ($tags)
        $result .= "</".array_pop($tags).">";

    return $result;
}

Usage example:

$content = html_cut($content, 100);

How to truncate an html text and still maintain the format, in PHP

I saw this answered in another question here. Dennis Pedrie posted the link http://snippets.dzone.com/posts/show/7125
in his answer to "How to truncate HTML to certain number of characters?"

Hope this helps. It's basically a short class that does what you want with a very simple call. If you scroll down, some people have posted a few improvements too.

Regards
Luke

PHP substr() function that allows you to set start and stop point AND keeps HTML formatting?

What you are asking for (essentially, generating a valid HTML subset from a string offset) involves so many complications that it would be better to reformulate the problem: express it as the number of text characters you want to keep, rather than as cutting an arbitrary string that happens to contain HTML. The problem then becomes much easier, because you can use a real HTML parser. You will not need to worry about:

Accidentally cutting elements in half.
Accidentally cutting entities in half.
Not counting text inside elements.
Making sure a character entity counts as a single character.
Making sure all elements are properly closed.
Making sure you don't destroy the string because you're using substr() on a UTF-8 string.
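
Two of the pitfalls listed above are easy to reproduce. The following is only a small sketch, assuming the mbstring extension is loaded; the sample strings are made up for illustration and are not from the question.

```php
<?php
// Byte-based substr() can split a multibyte UTF-8 character in half.
$word  = "na\xC3\xAFve";                     // "naive" with a diaeresis; the third character is two bytes
$bytes = substr($word, 0, 3);                // stops in the middle of that character
$chars = mb_substr($word, 0, 3, 'UTF-8');    // stops after three whole characters

var_dump(mb_check_encoding($bytes, 'UTF-8')); // bool(false): the result is no longer valid UTF-8
var_dump(mb_check_encoding($chars, 'UTF-8')); // bool(true)

// A character entity is several bytes of markup but a single character of text.
$markup = 'Fish &amp; chips';
$text   = html_entity_decode(strip_tags($markup), ENT_QUOTES, 'UTF-8');

var_dump(strlen($markup));           // int(16)
var_dump(mb_strlen($text, 'UTF-8')); // int(12)
```

Counting on the decoded text and cutting with mb_substr() avoids both problems, but neither call knows anything about tags, which is why the parser-based approach is attractive.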

It is possible to accomplish this with regexes (using the u flag) and mb_substr() and a tag stack (I've done it before), but there are many edge cases and you are generally in for a hard slog.
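
For comparison, here is a rough sketch of what the regex, mb_substr() and tag stack approach can look like. It is not the code I used before: truncate_html() and its list of void elements are hypothetical, it assumes valid UTF-8 input and reasonably well-formed markup, and it does not handle comments, scripts or a ">" inside an attribute value.

```php
<?php
// Hypothetical helper, for illustration only.
function truncate_html(string $html, int $limit): string
{
    // Elements that never take a closing tag.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'];

    // Split into tags, character entities and runs of plain text.
    // The u flag makes the pattern work on UTF-8 characters, not bytes.
    $tokens = preg_split('/(<[^>]*>|&[a-zA-Z0-9#]+;)/u', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    if ($tokens === false) {
        return ''; // preg_split() fails on invalid UTF-8
    }

    $open  = []; // stack of currently open element names
    $out   = '';
    $count = 0;  // visible characters emitted so far

    foreach ($tokens as $token) {
        if ($count >= $limit) {
            break;
        }
        if ($token[0] === '<' && substr($token, -1) === '>') {
            if (preg_match('~^</([a-z][a-z0-9]*)~i', $token, $m)) {
                // Closing tag: pop it if it matches the innermost open element.
                if ($open && end($open) === strtolower($m[1])) {
                    array_pop($open);
                }
            } elseif (preg_match('~^<([a-z][a-z0-9]*)~i', $token, $m)) {
                // Opening tag: remember it unless it is void or self-closed.
                $name = strtolower($m[1]);
                if (!in_array($name, $void, true) && substr($token, -2) !== '/>') {
                    $open[] = $name;
                }
            }
            $out .= $token; // markup costs no visible characters
        } elseif (preg_match('/^&[a-zA-Z0-9#]+;$/', $token)) {
            $out .= $token; // an entity counts as one character
            $count++;
        } else {
            $piece  = mb_substr($token, 0, $limit - $count, 'UTF-8');
            $out   .= $piece;
            $count += mb_strlen($piece, 'UTF-8');
        }
    }

    // Close whatever is still open, innermost first.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

echo truncate_html('<p>Hello <b>world</b> again</p>', 8); // <p>Hello <b>wo</b></p>
```

Even this short version has to make decisions about void elements, entities and mismatched closing tags, which gives a feel for the edge cases.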

However, a DOM solution is fairly straightforward: walk through all the text nodes counting up string lengths and either remove or substring their text content as needed. The code below does this:

$html = <<<'EOT'
<p>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going <a href="#">through the cites</a> of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus</p> <p>Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the <strong>Renaissance</strong>. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.</p>
EOT;

function substr_html($html, $start, $length=null, $removeemptyelements=true) {
    if (is_int($length)) {
        if ($length===0) return '';
        $end = $start + $length;
    } else {
        $end = null;
    }
    $d = new DOMDocument();
    $d->loadHTML('<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><title></title></head><body>'.$html.'</body>');
    $body = $d->getElementsByTagName('body')->item(0);
    $dxp = new DOMXPath($d);
    $t_start = 0; // text node's start pos relative to all text
    $t_end   = null; // text node's end pos relative to all text

    // copy because we may modify result of $textnodes
    // './/text()' is relative to $body; a path starting with '/' would ignore the context node
    $textnodes = iterator_to_array($dxp->query('.//text()', $body));

// PHP 5.2 doesn't seem to implement Traversable on DOMNodeList,
// so `iterator_to_array()` won't work. Use this instead:
// $textnodelist = $dxp->query('.//text()', $body);
// $textnodes = array();
// for ($i = 0; $i < $textnodelist->length; $i++) {
//  $textnodes[] = $textnodelist->item($i);
// }
// unset($textnodelist);

    foreach($textnodes as $text) {
        $t_end = $t_start + $text->length;
        $parent = $text->parentNode;
        if ($start >= $t_end || ($end!==null && $end < $t_start)) {
            $parent->removeChild($text);
        } else {
            $n_offset = max($start - $t_start, 0);
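            // substringData() counts from $n_offset, so subtract it or a range inside one node comes out too long
            $n_length = ($end===null) ? $text->length : $end - $t_start - $n_offset;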
            if (!($n_offset===0 && $n_length >= $text->length)) {
                $substr = $text->substringData($n_offset, $n_length);
                if (strlen($substr)) {
                    $text->deleteData(0, $text->length);
                    $text->appendData($substr);
                } else {
                    $parent->removeChild($text);
                }
            }
        }

        // if removing this text emptied the parent of nodes, remove the node!
        if ($removeemptyelements && !$parent->hasChildNodes()) {
            $parent->parentNode->removeChild($parent);
        }

        $t_start = $t_end;
    }
    unset($textnodes);
    $newstr = $d->saveHTML($body);

    // mb_substr() is to remove <body></body> tags
    return mb_substr($newstr, 6, -7, 'utf-8');
}

echo substr_html($html, 480, 30);

This will output:

<p> of "de Finibus</p> <p>Bonorum et Mal</p>

Notice it is not confused by the fact that your "substring" spans multiple p elements.

PHP substr breaks my table

The description probably contains HTML markup, and cutting it with substr() can leave a tag half-written or unclosed, which breaks the table. Escape the output with htmlspecialchars():

echo '<td>' . htmlspecialchars(substr($value['description'], 0, 10)) . '</td>';
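
If you would rather not show escaped markup in the cell at all, a variant is to strip the tags first and escape afterwards. This is only a sketch: cell_preview() is a hypothetical helper, and it assumes UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical helper, not part of the original answer.
function cell_preview(string $description, int $length = 10): string
{
    // Drop markup and decode entities so the cut is measured in visible characters.
    $text = html_entity_decode(strip_tags($description), ENT_QUOTES, 'UTF-8');

    // Cut on character boundaries, then escape again for output.
    return htmlspecialchars(mb_substr($text, 0, $length, 'UTF-8'), ENT_QUOTES, 'UTF-8');
}

echo '<td>' . cell_preview('<b>Tom &amp; Jerry</b> cartoons') . '</td>';
// <td>Tom &amp; Jerr</td>
```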

Copy a file's contents while ignoring characters between < and > in Java

// The class name StripTags is arbitrary; it only makes the snippet compile on its own.
public class StripTags {

    public static void main(String[] args) {
        String html = " <html>\n"
                + " <head>\n"
                + " <title>My web page</title>\n"
                + " </head>\n"
                + " <body>\n"
                + " <p>There are many pictures of my cat here,\n"
                + " as well as my <b>very cool</b> blog page,\n"
                + " which contains <font color=\"red\">awesome\n"
                + " stuff about my trip to Vegas.</p>\n"
                + "\n"
                + "\n"
                + " Here's my cat now:<img src=\"cat.jpg\">\n"
                + " </body>\n"
                + " </html>";

        // true while the current position is between '<' and '>'
        boolean inTag = false;
        StringBuilder finalString = new StringBuilder();

        int length = html.length();
        for (int i = 0; i < length; i++) {

            char c = html.charAt(i);

            if ('<' == c) {
                inTag = true;
            } else if ('>' == c) {
                inTag = false;
            } else if (!inTag) {
                // keep only the characters outside tags
                finalString.append(c);
            }

        }

        System.out.print(finalString);

    }
}

How to clip HTML fragments without breaking up tags?

You should check out HTML Tidy. Cut the markup after the first 50 non-HTML characters, then run the result through Tidy to repair the HTML.
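
A minimal sketch of that idea in PHP, assuming the tidy and mbstring extensions are installed. Both function names are hypothetical, and the cut is naive: it does not know about entities or about a ">" inside an attribute value.

```php
<?php
// Hypothetical helper: return the raw prefix of $html that holds the first
// $limit characters sitting outside tags. The result may contain unclosed tags.
function cut_after_text_chars(string $html, int $limit): string
{
    $inTag  = false;
    $seen   = 0; // characters counted outside tags
    $length = mb_strlen($html, 'UTF-8');

    for ($i = 0; $i < $length && $seen < $limit; $i++) {
        $char = mb_substr($html, $i, 1, 'UTF-8');
        if ($char === '<') {
            $inTag = true;
        } elseif ($char === '>') {
            $inTag = false;
        } elseif (!$inTag) {
            $seen++;
        }
    }
    return mb_substr($html, 0, $i, 'UTF-8');
}

// Hypothetical helper: cut first, then let Tidy close whatever was left open.
function clip_html_with_tidy(string $html, int $limit)
{
    $cut = cut_after_text_chars($html, $limit);
    // show-body-only asks Tidy for the fragment without the <html>/<body> wrappers.
    return tidy_repair_string($cut, ['show-body-only' => true], 'utf8');
}

var_dump(cut_after_text_chars('<p>Hello <b>world</b></p>', 8)); // string(14) "<p>Hello <b>wo"

if (function_exists('tidy_repair_string')) {
    echo clip_html_with_tidy('<p>Hello <b>world</b></p>', 8);
}
```

Tidy decides how the repaired markup is laid out, so the exact whitespace of the result depends on its configuration.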
