PHP Parsing Problem - &nbsp; and Â

The non-breaking space is encoded in UTF-8 as two bytes: 0xC2 and 0xA0.

When those bytes are interpreted as ISO-8859-1 (a single-byte encoding) instead of UTF-8 (a multi-byte encoding), each byte becomes its own character: 0xC2 becomes Â and 0xA0 becomes another non-breaking space.

Apparently you're parsing the HTML as UTF-8 and echoing the results as ISO-8859-1. To fix this problem, you need to either parse the HTML as ISO-8859-1 or echo the results as UTF-8. I'd recommend using UTF-8 all the way. Go through the PHP UTF-8 cheatsheet to get everything aligned.
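
The byte-level explanation above can be checked directly in PHP. This is only a minimal sketch, assuming the mbstring extension is available; the variable names are illustrative and not from the original question.

```php
<?php
// U+00A0 (non-breaking space) written out as its two UTF-8 bytes.
$nbsp = "\xC2\xA0";
echo bin2hex($nbsp), PHP_EOL; // c2a0

// Misread those two bytes as ISO-8859-1 and re-encode the result as UTF-8:
// each byte becomes its own character, "Â" followed by a non-breaking space.
$misread = mb_convert_encoding($nbsp, 'UTF-8', 'ISO-8859-1');
echo bin2hex($misread), PHP_EOL; // c382c2a0
```

Here c382 is the UTF-8 encoding of Â (U+00C2), and c2a0 is the non-breaking space again.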

unable to split a string that is parsed from a webpage?

@user1518659 Try this: to fix the issue, just replace the &nbsp; with a space before passing the source to DOMDocument. I also added the split into first name and last name :) Hope it helps.

<?php 
header('Content-Type: text/html; charset=utf-8'); //Required if you're outputting, as the description contains UTF-8 characters
//Load the source (input)
$html_source = file_get_contents('http://www.reuters.com/finance/stocks/companyOfficers?symbol=AOS');
$html_source = str_replace('&nbsp;',' ',$html_source); //Replace the &nbsp; entity with a plain space before parsing

//Dom document
$dom = new DOMDocument('1.0');
@$dom->loadHTML($html_source);

$out =array();
$i=0;
foreach($dom->getElementsByTagName('table') as $table) {
    if($table->getAttribute('class')=='dataTable'){

        foreach($table->getElementsByTagName('tr') as $tr){
            if(isset($tr->getElementsByTagName('td')->item(0)->nodeValue)){

                $out[$i]['fullname'] = $tr->getElementsByTagName('td')->item(0)->nodeValue;

                $name = explode(' ',$out[$i]['fullname']);
                $out[$i]['first_name'] = $name[0];
                $out[$i]['last_name'] = $name[1];

                if(!isset($tr->getElementsByTagName('td')->item(2)->nodeValue)){

                    foreach ($out as $key=>$value){
                        if($value['fullname'] == $tr->getElementsByTagName('td')->item(0)->nodeValue &&
                        !is_numeric(substr($tr->getElementsByTagName('td')->item(1)->nodeValue,0,1)) && 
                        $tr->getElementsByTagName('td')->item(1)->nodeValue != "--" ){
                            $out[$key]['description']= $tr->getElementsByTagName('td')->item(1)->nodeValue;
                        }
                    }

                }else{
                    if(!isset($tr->getElementsByTagName('td')->item(2)->nodeValue)){continue;}
                    if(isset($tr->getElementsByTagName('td')->item(3)->nodeValue)){
                        $out[$i]['age']= $tr->getElementsByTagName('td')->item(1)->nodeValue;
                        $out[$i]['since']= $tr->getElementsByTagName('td')->item(2)->nodeValue;
                        $out[$i]['position']= $tr->getElementsByTagName('td')->item(3)->nodeValue;
                    }
                }
                $i++;
            }
        }
    }
}

//Clean up
$return = array();
foreach ($out as $key=>$row){
    if(isset($row['fullname']) && isset($row['age']) && isset($row['since']) && isset($row['position']) && isset($row['description'])){
        $return[$key] = $out[$key];
    }
}

print_r($return);

/*
Array
(
    [0] => Array
        (
            [fullname] => Paul Jones
            [first_name] => Paul
            [last_name] => Jones
            [age] => 63
            [since] => 2011
            [position] => Chairman of the Board, Chief Executive Officer
            [description] => Mr. Paul W. Jones serves as the Chairman of the Board, Chief Executive Officer of A. O. Smith Corp. He has been a director of company since 2004. He is a member of the Investment Policy Committee of the Board. He was elected chairman of the board, president and chief executive officer effective December 31, 2005. He was president and chief operating officer from 2004 to 2005. Prior to joining the company, he was chairman and chief executive officer of U.S. Can Company, Inc. from 1998 to 2002. He previously was president and chief executive officer of Greenfield Industries, Inc. from 1993 to 1998 and president from 1989 to 1992. Mr. Jones has been a director of Federal Signal Corporation since 1998, where he chairs the Nominating and Governance Committee and is a member of the Compensation and Benefits Committee and the Executive Committee, and Integrys Energy Group, Inc. since 2011, where he is a member of the Compensation and Financial Committees. He was also a director of Bucyrus International, Inc. from 2006 until its acquisition by Caterpillar, Inc. in 2011, and chaired the Compensation Committee.
        )

    [1] => Array
        (
            [fullname] => Ajita Rajendra
            [first_name] => Ajita
            [last_name] => Rajendra
            [age] => 60
            [since] => 2011
            [position] => President, Chief Operating Officer, Director
            [description] => Mr. Ajita G. Rajendra serves as the President, Chief Operating Officer and Director of A. O. Smith Corp. He was elected a director of company in December 2011, based on the recommendation of the Nominating and Governance Committee, following his election as President and Chief Operating Officer in September 2011. Mr. Rajendra joined the company as President of A. O. Smith Water Products Company in 2005, and was named Executive Vice President of the company in 2006. Prior to joining the company, Mr. Rajendra was Senior Vice President at Kennametal, Inc., a manufacturer of cutting tools, from 1998 to 2004. Mr. Rajendra also serves on the board of Donaldson Company, Inc., where he is a member of the Audit Committee and Human Resources Committee. Further, Mr. Rajendra was a director of Industrial Distribution Group, Inc. from 2007 until its acquisition by Eiger Holdco, LLC in 2008.
        )
        ...
        ...
*/
?>

Does html_entity_decode replace &nbsp; also? If not, how to replace it?

Quote from html_entity_decode() manual:

You might wonder why
trim(html_entity_decode('&nbsp;'));
doesn't reduce the string to an empty
string, that's because the '&nbsp;'
entity is not ASCII code 32 (which is
stripped by trim()) but ASCII code 160
(0xa0) in the default ISO 8859-1
characterset.

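You can use str_replace() to replace character 160 (0xA0) with a space:

<?php
// Decode explicitly as ISO-8859-1 so that &nbsp; becomes the single byte 0xA0.
// If you decode to UTF-8 instead, replace the two bytes "\xC2\xA0".
$a = html_entity_decode('>&nbsp;<', ENT_QUOTES, 'ISO-8859-1');
echo 'before ' . $a . PHP_EOL;
$a = str_replace("\xA0", ' ', $a);
echo ' after ' . $a . PHP_EOL;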

replace characters that are hidden in text

This solution will work, I tested it:

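// Turn special characters into named entities, so a non-breaking space becomes "&nbsp;"
$string = htmlentities($content, ENT_NOQUOTES, 'UTF-8');
// Remove the entity, then decode everything else back
$content = str_replace("&nbsp;", "", $string);
$content = html_entity_decode($content, ENT_NOQUOTES, 'UTF-8');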

Why can't I get rid of this Â ?

Alright - I think I've got a handle on this now - I want to expand on some of the encoding errors that people are getting at:

This seems to be an advanced case of Mojibake, but here is what I think is going on. MikeAinOz's original suspicion that this is UTF-8 data is probably true. If we take the following UTF-8 data:

4&nbsp;minutes

Now, remove the HTML entity, and replace it with the character it actually corresponds to: U+00A0. (It's a non-breaking space, so I can't exactly "show" you.) You get the string: "4 minutes". Encode this as UTF-8, and you get the following byte sequence:

characters:  4  [nbsp]   m   i   n ...
bytes     : 34  C2  A0  6D  69  6E ...

(I'm using [nbsp] above to mean a literal non-breaking space: the character, not the HTML entity &nbsp;, but the character that the entity represents. It's just white-space, and thus difficult to show.) Note that the [nbsp]/U+00A0 (non-breaking space) takes 2 bytes to encode in UTF-8.

Now, to go from the byte stream back to readable text, we should decode using UTF-8, since that's what we encoded with. Suppose we decode with ISO-8859-1 ("latin1") instead; if you use the wrong encoding, this is almost always the one.

bytes     : 34  C2      A0  6D  69  6E ...
characters:  4   Â  [nbsp]   m   i   n ...

Now switch the raw non-breaking space back into its HTML entity representation, and you get Â&nbsp;, which is what you have.

So, either your PHP stuff is interpreting your text in the wrong character set, and you need to tell it otherwise, or you are outputting the result somehow in the wrong character set. More code would be useful here -- where are you getting the data you're passing to this loadHTML, and how are you going about getting the output you're seeing?

Some background: A "character encoding" is just a means of going from a series of characters to a series of bytes. What bytes represent "é"? UTF-8 says C3 A9, whereas ISO-8859-1 says E9. To get the original text back from a series of bytes, we must know what we encoded it with. If we decode C3 A9 as UTF-8 data, we get "é" back; if we (mistakenly) decode it as ISO-8859-1, we get "Ã©". Junk. In pseudo-code:

utf8-decode ( utf8-encode ( text-data ) )           // OK
iso8859_1-decode ( iso8859_1-encode ( text-data ) ) // OK
iso8859_1-decode ( utf8-encode ( text-data ) )      // Fails
utf8-decode ( iso8859_1-encode ( text-data ) )      // Fails

This isn't PHP code, and isn't your fix... it's just the crux of the problem. Somewhere, over the large scale, that's happening, and things are confused.
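
For anyone who wants to see those cases with real bytes, here is a rough PHP sketch. It assumes the mbstring extension is available; the variable names are made up for illustration, and it is not a fix for the original problem either.

```php
<?php
// "é" written out as its UTF-8 bytes, so the result does not depend on
// how this source file itself is saved.
$utf8 = "\xC3\xA9";

// The same character encoded as ISO-8859-1: a single byte.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
echo bin2hex($utf8), PHP_EOL;   // c3a9
echo bin2hex($latin1), PHP_EOL; // e9

// Wrong decoder: treat the UTF-8 bytes as ISO-8859-1 and re-encode as UTF-8.
// Each byte turns into its own character, giving "Ã©".
$junk = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($junk), PHP_EOL;   // c383c2a9

// The other wrong direction: the ISO-8859-1 byte is not valid UTF-8 at all.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
```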

simplexml_load_string is turning &nbsp; into Â

I think it parses correctly. It's just the way that function works: it replaces those entities with the actual characters.

You can fix the resulting string by converting it to cp1251:

$str = iconv('utf-8', 'cp1251', $str);

Also, I would collapse double spaces before writing it to the CSV file:

$str = str_replace(chr(160), ' ', $str);
$str= trim(preg_replace('/\s+/', ' ', $str));
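
Note that chr(160) matches a single byte, which fits a single-byte encoding such as cp1251. If the string stays in UTF-8 instead (an assumption; the answer above converts it first), the non-breaking space is the two bytes C2 A0, and a UTF-8-aware pattern is one way to handle it. A minimal sketch, where normalize_spaces is a hypothetical helper name:

```php
<?php
// Hypothetical helper, not part of the original answer.
function normalize_spaces(string $str): string
{
    // With the /u modifier, pattern and subject are treated as UTF-8,
    // so \x{00A0} matches the non-breaking space as one character.
    // preg_replace() returns null on invalid UTF-8; fall back to the input then.
    $clean = preg_replace('/[\x{00A0}\s]+/u', ' ', $str) ?? $str;
    return trim($clean);
}

echo normalize_spaces("4\xC2\xA0 minutes  ago"), PHP_EOL; // 4 minutes ago
```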

Parse error: syntax error, unexpected 'Â Â Â ' (T_STRING), expecting function (T_FUNCTION)

It's a problem with encoding.

Please convert your file encoding to UTF-8 without BOM
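
The Â characters in the error message suggest that the file may also contain non-breaking spaces (bytes C2 A0), which often arrive with code pasted from a web page. That is an inference from the message, not something the question confirms. A small sketch for checking a file's contents, where inspect_source is a hypothetical helper name:

```php
<?php
// Hypothetical helper: reports a UTF-8 BOM and counts UTF-8 non-breaking
// spaces in a string of source code.
function inspect_source(string $code): array
{
    return [
        'has_bom'    => strncmp($code, "\xEF\xBB\xBF", 3) === 0,
        'nbsp_count' => substr_count($code, "\xC2\xA0"),
    ];
}

// Sample input: a BOM followed by code indented with two non-breaking spaces.
$sample = "\xEF\xBB\xBF<?php\n\xC2\xA0\xC2\xA0function f() {}\n";
print_r(inspect_source($sample));
```

For a real file, you could pass in the result of file_get_contents() for that path.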

HTML encoding issues - "Â" character showing up instead of "&nbsp;"