How to Convert a Formatted String into Plain Text

How to convert a formatted string into plain text

Those letters are from the Mathematical Alphanumeric Symbols block.

Since they have a fixed offset to their ASCII counterparts, you could use tr to map them, e.g.:

"𝓳𝓸𝓿𝔂 𝓭𝓮𝓫𝓫𝓲𝓮".tr("𝓪-𝔃", "a-z")
#=> "jovy debbie"

The same approach can be used for the other styles, e.g.:

"𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔".tr("𝒂-𝒛𝑨-𝒁", "a-zA-Z")
#=> "Jenica Dugos"

This gives you full control over the character mapping.
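
The same fixed-offset idea can be sketched in Python with str.translate. This is only an illustration, assuming the input uses the bold script small letters, which occupy 26 contiguous code points starting at U+1D4EA. The names BOLD_SCRIPT_A, TABLE, to_ascii and stylize are made up for this example.

```python
# Sketch: map Mathematical Bold Script small letters back to ASCII.
# Assumption: the styled letters are the 26 contiguous code points
# starting at U+1D4EA (MATHEMATICAL BOLD SCRIPT SMALL A).
BOLD_SCRIPT_A = 0x1D4EA

# Translation table: styled code point -> plain ASCII letter.
TABLE = {BOLD_SCRIPT_A + i: chr(ord("a") + i) for i in range(26)}


def to_ascii(text: str) -> str:
    """Replace bold script small letters; leave everything else untouched."""
    return text.translate(TABLE)


def stylize(text: str) -> str:
    """Inverse helper, used here only to build sample input."""
    return "".join(
        chr(BOLD_SCRIPT_A + ord(c) - ord("a")) if "a" <= c <= "z" else c
        for c in text
    )


print(to_ascii(stylize("jovy debbie")))  # prints "jovy debbie"
```

Other styles would need their own start offsets, and some styles have gaps in the block, so a table per style is safer than one global offset.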

Alternatively, you could try Unicode normalization. The NFKC / NFKD forms should remove most formatting and seem to work for your examples:

"𝓳𝓸𝓿𝔂 𝓭𝓮𝓫𝓫𝓲𝓮".unicode_normalize(:nfkc)
#=> "jovy debbie"

"𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔".unicode_normalize(:nfkc)
#=> "Jenica Dugos"
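
For comparison, the normalization approach might look like this in Python, using the standard unicodedata module. The function name strip_styling is hypothetical, and the sample is built from code points so that it does not depend on how the styled letters render.

```python
import unicodedata


def strip_styling(text: str) -> str:
    """Apply NFKC, which folds compatibility characters to their base form."""
    return unicodedata.normalize("NFKC", text)


# U+1D471 is MATHEMATICAL BOLD ITALIC CAPITAL J,
# U+1D4EA is MATHEMATICAL BOLD SCRIPT SMALL A.
sample = chr(0x1D471) + chr(0x1D4EA)
print(strip_styling(sample))  # prints "Ja"
```

Keep in mind that NFKC rewrites other compatibility characters as well (for example, the ligature ﬁ becomes fi), so it offers less control than an explicit mapping.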

How do I convert a formatted email into plain text in Java?

import java.io.IOException;
import java.io.StringReader;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Attribute;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.Parser;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class ExtractEmailBody
{
    public static void main(String[] args) throws IOException
    {
        String email = "<div dir=\"ltr\">420</div><div class=\"gmail_extra\"><br><br><div class=\"gmail_quote\">On Thu, Aug 8, 2013 at 4:14 PM, <span dir=\"ltr\">< 3:50 AM+11111111111: (2/6)<a href=\"mailto:xxxxxx@gmail.com\" target=\"_blank\">xxxxxx@gmail.com</a>></span> wrote:<br> <blockquote class=\"gmail_quot 3:50 AM +14411111111: (3/6)e\" style=\"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex\">414<div class=\"HOEnZb\"><div class=\"h5\"><br>DO_NOT_REPLY:This i 3:50 AM" +
                ": (4/6)s an email notification that you have received a text message from a customer in Kaarma. If you reply to this email, a text message or 3:50 AM" +
                "(5/6)email message will NOT go to the customer. Access the customer text message to send a reply. </div></div></blockquote></div> 3:50 AM" +
                "(6/6)<br></div>";

        class EmailCallback extends ParserCallback
        {
            private String body_;
            private boolean divStarted_;

            public String getBody()
            {
                return body_;
            }

            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos)
            {
                if (t.equals(Tag.DIV) && "ltr".equals(a.getAttribute(Attribute.DIR)))
                {
                    divStarted_ = true;
                }
            }

            @Override
            public void handleEndTag(Tag t, int pos)
            {
                if (t.equals(Tag.DIV))
                {
                    divStarted_ = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos)
            {
                if (divStarted_)
                {
                    body_ = new String(data);
                }
            }
        }
        EmailCallback callback = new EmailCallback();
        Parser parser = new ParserDelegator();
        StringReader reader = new StringReader(email);
        parser.parse(reader, callback, true);
        reader.close();
        System.out.println(callback.getBody());
    }
}

How to convert any string text into plain-text format in objective-c

I wonder if you're running into this problem...

By default, pasting formatted / rich text into a text field will not keep the formatting. It will be pasted as plain-text, and will be displayed with the font you've set for your field.

Try a test. Copy and paste this next line into your text field:

This should lose formatting.

Then try copy/paste with this line:

𝗧𝗵𝗶𝘀 should 𝙠𝙚𝙚𝙥 formatting.

The first test line uses html tags. If you inspect the source, it should look like this:

<strong>This</strong> should <strong><em>lose</em></strong> formatting.

The second test line, however, uses unicode characters, and if you inspect the source it will look just like it looks here in this answer.

If that is what you're seeing, there is no (easy) way to "remove" the formatting, because there is no formatting.

It's much the same as if the user pastes an emoji into your text field.
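
The check itself is language-independent. As a rough sketch in Python (not Objective-C), and assuming only the Mathematical Alphanumeric Symbols block (U+1D400 to U+1D7FF) is of interest, you could detect such "formatted" characters like this. The function and constant names are made up for the example.

```python
# Sketch: detect "formatting" that is really just Unicode characters.
# Assumption: only the Mathematical Alphanumeric Symbols block
# (U+1D400 to U+1D7FF) is of interest.
MATH_ALNUM_START = 0x1D400
MATH_ALNUM_END = 0x1D7FF


def has_styled_letters(text: str) -> bool:
    """Return True if any character lies in the Mathematical Alphanumeric block."""
    return any(MATH_ALNUM_START <= ord(ch) <= MATH_ALNUM_END for ch in text)


# U+1D5E7 is MATHEMATICAL SANS-SERIF BOLD CAPITAL T.
print(has_styled_letters(chr(0x1D5E7) + "his"))  # prints True
print(has_styled_letters("This"))                # prints False
```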

How to convert a paragraph html string to plain text without html tags in google app script?

In your situation, how about modifying toStringFromHtml as follows?

Modified script:

function toStringFromHtml(html) {
  html = '<div>' + html + '</div>';
  html = html.replace(/<br>/g, "").replace(/<p><\/p><p><\/p>/g, "<p></p>").replace(/<span>|<\/span>/g, "");
  var document = XmlService.parse(html);
  var strText = XmlService.getPrettyFormat().setIndent("").format(document);
  strText = strText.replace(/<[^>]*>/g, "");
  return strText.trim();
}
  • In this modified script, your following sample HTML is converted as follows.

    • From

        <p><span>Hi Katy</span></p>
      <p></p>
      <p><span>The illustration (examples) paragraph is useful when we want to explain or clarify something, such as an object, a person, a concept, or a situation. Sample Illustration Topics:</span></p>
      <p></p>
      <p></p>
      <p><span>1. Examples of annoying habits people have on the Skytrain.</span></p>
      <p><span>2. Positive habits that you admire in other people. </span></p>
      <p><span>3. Endangered animals in Asia. </span></p>
    • To

        <div>
      <p>Hi Katy</p>
      <p></p>
      <p>The illustration (examples) paragraph is useful when we want to explain or clarify something,
      such as an object,
      a person,
      a concept,
      or a situation. Sample Illustration Topics:</p>
      <p></p>
      <p>1. Examples of annoying habits people have on the Skytrain.</p>
      <p>2. Positive habits that you admire in other people. </p>
      <p>3. Endangered animals in Asia. </p>
      </div>
    • By this conversion, the following result is obtained.

        Hi Katy

      The illustration (examples) paragraph is useful when we want to explain or clarify something, such as an object, a person, a concept, or a situation. Sample Illustration Topics:

      1. Examples of annoying habits people have on the Skytrain.
      2. Positive habits that you admire in other people.
      3. Endangered animals in Asia.

Note:

  • The modified script achieves your goal for the sample HTML shown in your question. I have not tested it against other HTML, so I cannot say whether it works for your actual data. Please verify it before relying on it.

Python: Format string to appear as plaintext in Markdown or HTML?

Ok I finally solved it. Posting here for future reference.

I could never get HTML to work so I'm still unsure about that.

First, I had parse_mode=Markdown. Instead, it needs to be parse_mode=MarkdownV2.

Next, there are a few specific characters for which the \ operator does not work to display them as a literal. Instead, you need to use Percent-Encoding to retain those symbols.

Here is the code I used to fix that portion.

message_body=message_body.replace('%', '\\%25')
message_body=message_body.replace('#', '\\%23')
message_body=message_body.replace('+', '\\%2B')
message_body=message_body.replace('*', '\\%2A')
message_body=message_body.replace('&', '\\%26')

This fixes %, #, +, * and &. I could probably make this more elegant/faster, but this works for now.

Finally, there's a group of characters for which the \ operator DOES work to create a literal. Here's the code I used to fix those:

    message_body = re.sub(r"([_*\[\]()~`>\#\+\-=|\.!{}])", r"\\\1", message_body)
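
As a self-contained sketch, the backslash-escaping step above could be wrapped in a small helper. The function name escape_markdown_v2 is hypothetical, and the character set is simply copied from the pattern above.

```python
import re

# Characters that get a backslash prefix, copied from the pattern above.
_SPECIAL = re.compile(r"([_*\[\]()~`>#+\-=|.!{}])")


def escape_markdown_v2(text: str) -> str:
    """Prefix each special character with a backslash."""
    return _SPECIAL.sub(r"\\\1", text)


print(escape_markdown_v2("1+1=2."))  # prints 1\+1\=2\.
```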

How do you convert Html to plain text?

If you are talking about tag stripping, it is relatively straightforward if you don't have to worry about things like <script> tags. If all you need to do is display the text without the tags, you can accomplish that with a regular expression:

<[^>]*>

If you do have to worry about <script> tags and the like, then you'll need something a bit more powerful than regular expressions because you need to track state, something more like a Context Free Grammar (CFG). Although you might be able to accomplish it with 'Left To Right' or non-greedy matching.

If you can use regular expressions there are many web pages out there with good info:

  • http://weblogs.asp.net/rosherove/archive/2003/05/13/6963.aspx
  • http://www.google.com/search?hl=en&q=html+tag+stripping+&btnG=Search

If you need the more complex behaviour of a CFG, I would suggest using a third-party tool. Unfortunately, I don't know of a good one to recommend.
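
To illustrate what "tracking state" means here, the following is a minimal Python sketch built on the standard library's html.parser. It is one possible approach, not a recommendation from the answer above. The class and function names are made up for the example, and it only skips <script> and <style> content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        return "".join(self._parts)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(html_to_text("<p>Hi <b>there</b></p><script>if (a < b) {}</script>"))
# prints "Hi there"
```

A plain <[^>]*> regex would leave the script body in the output for this input, which is the case the state flag is there to handle.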

Convert plain text data into an object with arrays based on formatting

(1) Start off by calling split() with [ as the separator, which should give you one entry per group in the string, plus a leading empty string:

[ '', 'group a]\na\nb\nc\n', 'group b]\na\nd\nf' ]

(2) Use .filter() to remove undesired empty strings:

[ 'group a]\na\nb\nc\n', 'group b]\na\nd\nf' ]

(3) Now use .map() to transform each group using ] or \n as separators:

[ [ 'group a', '', 'a', 'b', 'c', '' ], [ 'group b', '', 'a', 'd', 'f' ] ]

(4) Use .filter() to remove undesired empty strings:

[ [ 'group a', 'a', 'b', 'c' ], [ 'group b', 'a', 'd', 'f' ] ]

(5) Lastly use .reduce() and Object.assign() to convert each group into a key-value pair:

{ 'group a': [ 'a', 'b', 'c' ], 'group b': [ 'a', 'd', 'f' ] }

Please note that the fragment,

.split(' ').map((a,i) => i === 0 ? a : a.toUpperCase()).join('')

is what converts group a into groupA, etc.

DEMO

let data = `[group a]
a
b
c
[group b]
a
d
f
[group c]
g
h
t`;

let obj = data
  .split(/\[/)
  .filter(v => v)
  .map(v => v.split(/\]|[\r\n]+/)
    .filter(v => v))
  .reduce((acc, val) =>
    Object.assign(acc,
      {[val[0].split(' ').map((a,i) => i === 0 ? a : a.toUpperCase()).join('')]: val.slice(1)}), {});

console.log( obj );
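
For comparison, here is a rough Python sketch that produces the same shape of result. Note that it uses a different technique (a single line-by-line pass instead of the split/filter/map/reduce chain), and the function name parse_groups is made up for the example.

```python
import re


def parse_groups(data: str) -> dict:
    """Turn '[name]' headers followed by item lines into {name: [items]}."""
    groups = {}
    current = None
    for line in data.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines, like the .filter() steps
        header = re.fullmatch(r"\[(.+)\]", line)
        if header:
            # "group a" -> "groupA", mirroring the split(' ') fragment above
            first, *rest = header.group(1).split(" ")
            key = first + "".join(word.upper() for word in rest)
            current = groups.setdefault(key, [])
        elif current is not None:
            current.append(line)
    return groups


sample = "[group a]\na\nb\nc\n[group b]\na\nd\nf"
print(parse_groups(sample))
# prints {'groupA': ['a', 'b', 'c'], 'groupB': ['a', 'd', 'f']}
```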

How to convert html string into plain text in React?

You can write your own piece of code to make that happen, no library needed.

var htmlString = "<h1><b>test</b></h1>";
var plainString = htmlString.replace(/<[^>]+>/g, '');

console.log(plainString ); // you will have your plain text

or

function getText(html) {
  var divContainer = document.createElement("div");
  divContainer.innerHTML = html;
  return divContainer.textContent || divContainer.innerText || "";
}

var yourString= "<div><h1>Hello World</h1>\n<p>We are in SOF</p></div>";

console.log(getText(yourString));

