How to Convert HTML to Plain Text

How do you convert HTML to plain text?

If you are talking about tag stripping, it is relatively straightforward as long as you don't have to worry about things like <script> tags. If all you need to do is display the text without the tags, you can accomplish that with a regular expression:

<[^>]*>

If you do have to worry about <script> tags and the like, then you'll need something a bit more powerful than regular expressions, because you need to track state: something more like a context-free grammar (CFG). You might still be able to accomplish it with 'left to right' or non-greedy matching, though.

If you can use regular expressions, there are many web pages out there with good info:

  • http://weblogs.asp.net/rosherove/archive/2003/05/13/6963.aspx
  • http://www.google.com/search?hl=en&q=html+tag+stripping+&btnG=Search

If you need the more complex behaviour of a CFG, I would suggest using a third-party tool. Unfortunately, I don't know of a good one to recommend.
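To make the "track state" point concrete, here is a minimal JavaScript sketch of a scanner that drops the contents of <script> and <style> elements instead of only removing their tags. The names stripTags and RAW_TEXT are made up for this illustration. It assumes that no attribute value or comment contains a ">" character, which a real HTML parser would handle and this sketch does not.

```javascript
// Illustrative sketch only: stripTags and RAW_TEXT are hypothetical names.
// Elements whose content is code or CSS, not text to display.
const RAW_TEXT = new Set(["script", "style"]);

function stripTags(html) {
  let out = "";
  let i = 0;
  while (i < html.length) {
    const lt = html.indexOf("<", i);
    if (lt === -1) {
      out += html.slice(i); // no more tags: the rest is text
      break;
    }
    out += html.slice(i, lt); // text before the tag
    const gt = html.indexOf(">", lt);
    if (gt === -1) break; // unterminated tag: drop the remainder
    i = gt + 1;

    // Opening tag name, if any (closing tags and comments do not match).
    const m = /^<([a-zA-Z][a-zA-Z0-9]*)/.exec(html.slice(lt, gt));
    const name = m ? m[1].toLowerCase() : null;
    if (name && RAW_TEXT.has(name)) {
      // State: inside <script>/<style>. Skip ahead to the matching close tag.
      const close = new RegExp("</" + name + "\\s*>", "gi");
      close.lastIndex = i;
      const end = close.exec(html);
      if (end === null) break; // never closed: drop the remainder
      i = end.index + end[0].length;
    }
  }
  return out;
}

console.log(stripTags('<p>Hi <b>there</b></p><script>if (a < b) { alert("x"); }</script>!'));
// prints "Hi there!"
```

The plain <[^>]*> pattern would leave the script source in the output here, because the text between the two script tags is not itself a tag.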

How to convert an HTML string into plain text in React?

You can write your own piece of code to make that happen, no library needed.

var htmlString = "<h1><b>test</b></h1>";
var plainString = htmlString.replace(/<[^>]+>/g, '');

console.log(plainString); // logs "test"

Or, if a DOM is available (as it is in the browser):

function getText(html) {
  var divContainer = document.createElement("div");
  divContainer.innerHTML = html;
  return divContainer.textContent || divContainer.innerText || "";
}

var yourString = "<div><h1>Hello World</h1>\n<p>We are in SOF</p></div>";

console.log(getText(yourString));
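One difference between the two versions above: the DOM-based getText decodes character references such as &amp;, while the regex one-liner leaves them in the output. A rough sketch of closing that gap without a DOM follows. The names decodeEntities, htmlToPlain and NAMED_ENTITIES are hypothetical, and the entity table is a deliberately small subset, not the full HTML list.

```javascript
// Illustrative sketch only: these names are not part of any library.
// A small subset of named character references; extend as needed.
const NAMED_ENTITIES = new Map([
  ["amp", "&"], ["lt", "<"], ["gt", ">"],
  ["quot", '"'], ["apos", "'"], ["nbsp", "\u00A0"],
]);

function decodeEntities(text) {
  // Matches &#xHH; (hex), &#DD; (decimal) and &name; forms in a single pass.
  return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g, (whole, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : whole;
    }
    // Unknown names are left as they are.
    return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : whole;
  });
}

function htmlToPlain(html) {
  // Strip tags first, then decode, so that "&lt;b&gt;" ends up as the text "<b>".
  return decodeEntities(html.replace(/<[^>]+>/g, ""));
}

console.log(htmlToPlain("<p>Fish &amp; chips &lt; &#163;5 &#xA9; &unknown;</p>"));
// prints "Fish & chips < £5 © &unknown;"
```

Decoding after stripping matters: done the other way round, escaped markup in the text would be turned into real tags and then removed.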

Convert HTML to plain text in JS without a browser environment

A converter from HTML to plain text, similar to Gmail's:

html = html.replace(/<style([\s\S]*?)<\/style>/gi, '');
html = html.replace(/<script([\s\S]*?)<\/script>/gi, '');
html = html.replace(/<\/div>/ig, '\n');
html = html.replace(/<\/li>/ig, '\n');
html = html.replace(/<li>/ig, ' * ');
html = html.replace(/<\/ul>/ig, '\n');
html = html.replace(/<\/p>/ig, '\n');
html = html.replace(/<br\s*[\/]?>/gi, "\n");
html = html.replace(/<[^>]+>/ig, '');

If you can use jQuery (which itself needs a DOM to work with):

var text = jQuery('<div>').html(html).text();
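The regex chain above can be wrapped into one reusable function. This is only a sketch: htmlToText is a made-up name, the closing-tag rules are merged into a single pattern, <li> is also matched when it carries attributes, and the whitespace clean-up at the end is an addition that is not part of the original chain.

```javascript
// Illustrative sketch only: htmlToText is a hypothetical helper name.
function htmlToText(html) {
  return html
    .replace(/<style[\s\S]*?<\/style>/gi, "")   // drop CSS
    .replace(/<script[\s\S]*?<\/script>/gi, "") // drop scripts
    .replace(/<\/(div|p|ul|li)>/gi, "\n")       // block ends become line breaks
    .replace(/<li(\s[^>]*)?>/gi, " * ")         // list items become bullets
    .replace(/<br\s*\/?>/gi, "\n")              // explicit line breaks
    .replace(/<[^>]+>/g, "")                    // strip every remaining tag
    .replace(/[ \t]+\n/g, "\n")                 // added: trim trailing spaces
    .replace(/\n{3,}/g, "\n\n")                 // added: collapse blank-line runs
    .trim();
}

const sample =
  '<div>Hello</div><ul><li>one</li><li class="x">two</li></ul>' +
  "<p>Bye<br>now</p><script>var a = 1;</script>";

console.log(htmlToText(sample));
```

The order matters: style and script blocks have to go before the generic tag stripper runs, otherwise their contents would survive as text.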

How to convert a paragraph HTML string to plain text without HTML tags in Google Apps Script?

In your situation, how about modifying toStringFromHtml as follows?

Modified script:

function toStringFromHtml(html) {
  html = '<div>' + html + '</div>';
  html = html.replace(/<br>/g, "").replace(/<p><\/p><p><\/p>/g, "<p></p>").replace(/<span>|<\/span>/g, "");
  var document = XmlService.parse(html);
  var strText = XmlService.getPrettyFormat().setIndent("").format(document);
  strText = strText.replace(/<[^>]*>/g, "");
  return strText.trim();
}

  • With this modified script, your sample HTML is converted as follows.

    • From

      <p><span>Hi Katy</span></p>
      <p></p>
      <p><span>The illustration (examples) paragraph is useful when we want to explain or clarify something, such as an object, a person, a concept, or a situation. Sample Illustration Topics:</span></p>
      <p></p>
      <p></p>
      <p><span>1. Examples of annoying habits people have on the Skytrain.</span></p>
      <p><span>2. Positive habits that you admire in other people. </span></p>
      <p><span>3. Endangered animals in Asia. </span></p>
    • To

      <div>
      <p>Hi Katy</p>
      <p></p>
      <p>The illustration (examples) paragraph is useful when we want to explain or clarify something,
      such as an object,
      a person,
      a concept,
      or a situation. Sample Illustration Topics:</p>
      <p></p>
      <p>1. Examples of annoying habits people have on the Skytrain.</p>
      <p>2. Positive habits that you admire in other people. </p>
      <p>3. Endangered animals in Asia. </p>
      </div>
    • This conversion produces the following result.

      Hi Katy

      The illustration (examples) paragraph is useful when we want to explain or clarify something, such as an object, a person, a concept, or a situation. Sample Illustration Topics:

      1. Examples of annoying habits people have on the Skytrain.
      2. Positive habits that you admire in other people.
      3. Endangered animals in Asia.

Note:

  • The modified script works for the sample HTML shown in your question. I have not tested it against other HTML, so I can't say whether it will work for your actual data. Please be careful about this.

Converting HTML to plain text in PHP for e-mail

Use html2text (example HTML to text), licensed under the Eclipse Public License. It uses PHP's DOM methods to load the HTML, and then iterates over the resulting DOM to extract plain text. Usage:

// when installed using the Composer package
$text = Html2Text\Html2Text::convert($html);

// usage when installed using html2text.php
require('html2text.php');
$text = convert_html_to_text($html);

Although incomplete, it is open source and contributions are welcome.

Issues with other conversion scripts:

  • html2text (GPL) is not EPL-compatible.
  • lkessler's link (attribution) is incompatible with most open source licenses.

Convert HTML to plain text keeping links, bold and italic in JavaScript

I had some time on my hands and played around. This is what I came up with:

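// `container` refers to the element with id="container" in the sample HTML.
const copy = document.createElement("div");
// Turn newlines into spaces, then drop tabs.
copy.innerHTML = container.innerHTML.replace(/\n/g, " ").replace(/[\t\n]+/g, "");

// [<prefix>, <postfix>, <sequence-number>]
const tags = {
  B:   ["**", "**", 1],
  I:   ["*", "*", 2],
  H2:  ["##", "\n", 3],
  P:   ["\n", "\n", 4],
  DIV: ["", "\n", 5],
  TD:  ["", "\t", 6]
};

// Wrap each matching element's content in its prefix and postfix, in sequence order.
[...copy.querySelectorAll(Object.keys(tags).join(","))]
  .sort((a, b) => tags[a.tagName][2] - tags[b.tagName][2])
  .forEach(e => {
    const [a, b] = tags[e.tagName];
    // The first cell of a table row starts a new line.
    e.innerHTML = (e.matches("TD:first-child") ? "\n" : a) + e.innerHTML + b;
  });

// Strip leading spaces from every line of the result.
console.log(copy.textContent.replace(/^ */mg, ""));

The sample HTML it runs against:
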
<div id="container">
<H2>Second level heading</H2>
<div><div>
A <b>first div</b> with a
<a href="abc.html">link (abc)</a> and a
<p>paragraph having itself another <a href="def.html">link (def)</a> in it.</p>
</div>
</div>
And here is some more <i>"lost" text</i> ...
<table>
<tr><td>one</td><td><b>two</b></td><td>three</td></tr>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>d</td><td>e</td><td>f</td></tr>
</table>
</div>
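
The snippet above marks bold, italic and headings, but the link targets are lost once textContent is read. Below is a regex-only sketch that keeps them in a Markdown-like form and also runs without a DOM. htmlToMarkdownish is a made-up name, and the sketch assumes simple markup: double-quoted href attributes, and no <b> or <i> nested inside an element of the same kind.

```javascript
// Illustrative sketch only: htmlToMarkdownish is a hypothetical helper name.
function htmlToMarkdownish(html) {
  return html
    // <a href="url">text</a> becomes [text](url)
    .replace(/<a\s[^>]*href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, "[$2]($1)")
    // <b>/<strong> becomes **text**
    .replace(/<(b|strong)>([\s\S]*?)<\/\1>/gi, "**$2**")
    // <i>/<em> becomes *text*
    .replace(/<(i|em)>([\s\S]*?)<\/\1>/gi, "*$2*")
    // strip every remaining tag
    .replace(/<[^>]+>/g, "")
    // collapse all whitespace, so the result is a single line
    .replace(/\s+/g, " ")
    .trim();
}

console.log(htmlToMarkdownish(
  'A <b>first div</b> with a <a href="abc.html">link (abc)</a> and some <i>"lost" text</i>.'
));
// prints: A **first div** with a [link (abc)](abc.html) and some *"lost" text*.
```

Links are converted first so that bold or italic text inside a link ends up inside the square brackets.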

