How to Tokenize (Words) Classifying Punctuation as Space


How to tokenize a stream in C++ is already covered by a lot of questions.

Example: How to read a file and get words in C++

But what is harder to find is how to get the same functionality as strtok():

Basically, strtok() allows you to split the string on a whole set of user-defined characters, while the C++ stream only allows you to use whitespace as a separator. Fortunately, the definition of whitespace is controlled by the locale, so we can modify the locale to treat other characters as space, and this will then allow us to tokenize the stream in a more natural fashion.

#include <locale>
#include <string>
#include <sstream>
#include <iostream>
#include <vector>
#include <iterator>
#include <algorithm>

// This is my facet that will treat the characters ,.- as space and thus ignore them.
class WordSplitterFacet: public std::ctype<char>
{
    public:
        typedef std::ctype<char>    base;
        typedef base::char_type     char_type;

        WordSplitterFacet(std::locale const& l)
            : base(table)
        {
            std::ctype<char> const&  defaultCType = std::use_facet<std::ctype<char> >(l);

            // Copy the default masks from the provided locale.
            static char data[256];
            for(int loop = 0; loop < 256; ++loop)   { data[loop] = loop; }
            defaultCType.is(data, data+256, table);

            // Modifications to the default: mark the extra characters as space.
            table[','] |= base::space;
            table['.'] |= base::space;
            table['-'] |= base::space;
        }
    private:
        base::mask  table[256];
};

We can then use this facet in a locale like this:

std::ctype<char>*   wordSplitter(new WordSplitterFacet(std::locale()));

<stream>.imbue(std::locale(std::locale(), wordSplitter));

The next part of your question is how you would store these words in an array. Well, in C++ you would not. You would delegate this functionality to std::vector/std::string. If you read your code, you will see that it is doing two major things in the same part of the code.

  • It is managing memory.
  • It is tokenizing the data.

There is a basic principle, Separation of Concerns, where your code should only try to do one of these things. It should either do resource management (memory management in this case) or it should do business logic (tokenization of the data). By separating these into different parts of the code you make the code generally easier to use and easier to write. Fortunately, in this example all the resource management is already done by std::vector/std::string, allowing us to concentrate on the business logic.

As has been shown many times, the easy way to tokenize a stream is using operator >> and a string. This will break the stream into words. You can then use iterators to automatically loop across the stream, tokenizing it as you go.

std::vector<std::string>  data;
for(std::istream_iterator<std::string> loop(<stream>); loop != std::istream_iterator<std::string>(); ++loop)
{
    // Here loop is an iterator that has tokenized the stream using
    // operator >> (which for std::string reads one space-separated word).

    data.push_back(*loop);
}

We can combine this with some standard algorithms to simplify the code:

std::copy(std::istream_iterator<std::string>(<stream>), std::istream_iterator<std::string>(), std::back_inserter(data));

Now, combining all the above into a single application:

int main()
{
    // Create the facet.
    std::ctype<char>*   wordSplitter(new WordSplitterFacet(std::locale()));

    // Here I am using a string stream, but any stream can be used.
    // Note: you must imbue a stream before it is used.
    // Otherwise the imbue() will silently fail.
    std::stringstream   teststr;
    teststr.imbue(std::locale(std::locale(), wordSplitter));

    // Now that it is imbued we can use it.
    // If this were a file stream then you could open it here.
    teststr << "This, stri,plop";

    std::cout << "die monster !";
    std::vector<std::string>    data;
    std::copy(std::istream_iterator<std::string>(teststr), std::istream_iterator<std::string>(), std::back_inserter(data));

    // Copy the array to std::cout, one word per line.
    std::copy(data.begin(), data.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
}

PHP tokenize a tweet in words, punctuation, hashtag, mentions, emoticons

You can use this pattern with preg_match_all:

~[#@]?\w+|\pP+|\S~u


Note: You can easily extend this pattern if you need to group other kinds of characters. Example with currency symbols:

~[#@]?\w+|\pP+|\p{Sc}+|\S~u
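
For illustration, here is a rough Python equivalent of the extended pattern, using the third-party regex module (which, unlike the built-in re, supports Unicode property classes such as \p{P} and \p{Sc}); the module choice and the sample tweet are assumptions, not part of the original answer.

import regex  # pip install regex; the built-in re module does not support \p{P}

tweet = "@alice loves #python, doesn't she? :) $5 well spent..."

tokens = regex.findall(r"[#@]?\w+|\p{P}+|\p{Sc}+|\S", tweet)
print(tokens)
# Mentions/hashtags and plain words come out as single tokens, runs of
# punctuation are grouped together, currency symbols are grouped, and
# anything else that is not whitespace falls back to \S.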

Why tokenize/preprocess words for language analysis?

Perhaps I'm being overly correct, but doesn't tokenization simply refer to splitting up the input stream (of characters, in this case) based on delimiters to receive whatever is regarded as a "token"?

Your tokens can be arbitrary: you can perform analysis on the word level where your tokens are words and the delimiter is any space or punctuation character. It's just as likely that you analyse n-grams, where your tokens correspond to a group of words and delimiting is done e.g. by sliding a window.
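
As a quick illustration of both granularities, here is a small Python sketch (the sentence is made up for the example):

# Word-level tokens vs. n-gram tokens produced by sliding a window over the words.
text = "the quick brown fox jumps"

words = text.split()                      # word-level tokens
bigrams = [tuple(words[i:i + 2])          # 2-gram tokens via a sliding window
           for i in range(len(words) - 1)]

print(words)    # ['the', 'quick', 'brown', 'fox', 'jumps']
print(bigrams)  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]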

So in short, in order to analyse words in a stream of text, you need to tokenize to receive "raw" words to operate on.

Tokenization however is often followed by stemming and lemmatization to reduce noise. This becomes quite clear when thinking about sentiment analysis: if you see the tokens happy, happily and happiness, do you want to treat them each separately, or wouldn't you rather combine them into three instances of happy to better convey a stronger notion of "being happy"?
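
For example, with NLTK's Porter stemmer (one possible choice; the exact reduced forms depend on which stemmer or lemmatizer you pick):

# Reducing inflected variants toward a common stem with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["happy", "happily", "happiness"]:
    print(word, "->", stemmer.stem(word))
# "happy" and "happiness" typically collapse to the same stem ("happi");
# a lemmatizer or a different stemmer may group the variants differently.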

Tokenize sentence based on existing punctuation (TF-IDF vectorizer)

You can use the Keras library to tokenize the sentences in your dataframe. Before tokenization, remove the punctuation from the text in the dataframe, then apply the TF-IDF vectorizer.

I have attached a link, check it:

Keras

The example code there shows how to tokenize sentences.
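
A minimal sketch of that pipeline (strip punctuation, tokenize with Keras, then build TF-IDF features) might look like the following; the DataFrame and its "text" column are assumptions for the example, not code from the linked page.

# Sketch: remove punctuation, tokenize with Keras, and build TF-IDF features.
import string
import pandas as pd
from keras.preprocessing.text import Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({"text": ["Tomorrow will be cold.", "Don't forget your coat!"]})

# Remove punctuation before tokenization.
df["clean"] = df["text"].apply(
    lambda s: s.translate(str.maketrans("", "", string.punctuation)))

# Keras word-level tokenization.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df["clean"])
print(tokenizer.word_index)

# TF-IDF features over the cleaned text.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df["clean"])
print(X.shape, tfidf.vocabulary_)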

Combining text stemming and removal of punctuation in NLTK and scikit-learn

There are several options. You can try removing the punctuation before tokenization, but this means that don't -> dont:

import string
import nltk

# stem_tokens() and stemmer are assumed to be defined elsewhere, as in the question.
def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

Or try removing punctuation after tokenization.

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    tokens = [i for i in tokens if i not in string.punctuation]
    stems = stem_tokens(tokens, stemmer)
    return stems

EDITED

The above code will work but it's rather slow because it's looping through the same text multiple times:

  • Once to remove punctuation
  • Second time to tokenize
  • Third time to stem.

If you have more steps, like removing digits, removing stopwords, or lowercasing, it would be better to lump the steps together as much as possible. Here are several answers that are more efficient if your data requires more pre-processing steps (a combined single-pass sketch follows the links):

  • Applying NLTK-based text pre-proccessing on a pandas dataframe
  • Why is my NLTK function slow when processing the DataFrame?
  • https://www.kaggle.com/alvations/basic-nlp-with-nltk
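
As a rough illustration of lumping the steps together, here is a sketch that lowercases, strips punctuation and digits, removes stopwords, and stems in a single pass over the tokens (the exact steps depend on what your data needs):

# One pass over the tokens: lowercase, drop punctuation/digits/stopwords, stem.
# Requires nltk.download('punkt') and nltk.download('stopwords').
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))
drop = set(string.punctuation)

def preprocess(text):
    out = []
    for token in nltk.word_tokenize(text.lower()):
        if token in drop or token in stop_words or token.isdigit():
            continue
        out.append(stemmer.stem(token))
    return out

print(preprocess("Don't remove the 2 dogs, they're happily running!"))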

How do I tokenize a string sentence in NLTK?

This is actually on the main page of nltk.org:

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

Removing commas after processing lists of strings, when ' '.join(x) does not work

The result string really, really looks like a string representation of an otherwise perfectly normal list, so let's have Python convert it back to a list, safely, per Convert string representation of list to list:

import ast

result = """['[CLS]', 'You', 'couldn', "'", 't', 'have', 'done', 'any', 'better', 'because', 'if', 'you', 'could', 'have', ',', 'you', 'would', 'have', '.', '[SEP]']"""

result_as_list = ast.literal_eval(result)

Now we have this

['[CLS]', 'You', 'couldn', "'", 't', 'have', 'done', 'any', 'better', 'because', 'if', 'you', 'could', 'have', ',', 'you', 'would', 'have', '.', '[SEP]']

Now let's go over your steps again. First, "remove the quote marks". But there aren't any (superfluous) quote marks, because this is a list of strings; the extra quotes you see in the representation are only there because that is how a string is represented in Python.

Next, "remove the beginning and end markers". As this is a list, they're just the first and last elements, no further counting needed:

result_as_list = result_as_list[1:-1]

Next, "remove the commas". As in the first step, there are no (superfluous) commas; they are part of how Python shows a list and are not there in the actual data.

So we end up with

['You', 'couldn', "'", 't', 'have', 'done', 'any', 'better', 'because', 'if', 'you', 'could', 'have', ',', 'you', 'would', 'have', '.']

which can be joined back into the original string using

result_as_string = ' '.join(result_as_list)

and the only problem remaining is that BERT apparently treats apostrophes, commas and full stops as separate 'words':

You couldn ' t have done any better because if you could have , you would have .

which need a bit o'replacing:

result_as_string = result_as_string.replace(' ,', ',').replace(' .','.').replace(" ' ", "'")

and you have your sentence back:

You couldn't have done any better because if you could have, you would have.

The only problem I see is if there are leading or closing quotes that aren't part of a contraction. If this is necessary, you can replace the space-quote-space replacement with a more focused one targeting specifically "couldn't", "can't", "aren't" etc.
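
One possible way to make that replacement more focused is to only join an apostrophe to a following contraction suffix, e.g. with a small regex; the suffix list below is an assumption and can be extended as needed.

# Join " ' " back into contractions only when the apostrophe is followed by a
# known suffix (t, s, re, ve, ll, d, m), leaving stand-alone quotes untouched.
import re

result_as_string = "You couldn ' t have done any better , ' quoted ' words stay ."
result_as_string = re.sub(r" ' (t|s|re|ve|ll|d|m)\b", r"'\1", result_as_string)
result_as_string = result_as_string.replace(' ,', ',').replace(' .', '.')
print(result_as_string)
# -> You couldn't have done any better, ' quoted ' words stay.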

Include punctuation in keras tokenizer

This is possible if you do some pre-processing on the text.

First you want to make sure that the punctuation is not filtered out by the Tokenizer. You can see from the documentation that the Tokenizer takes a filters argument on initialization. You can replace the default value with the set of characters you would like to filter, and exclude the ones you want to have in your index.

The second part is making sure that the punctuation is recognized as its own token. If you tokenize the example sentence, the result would take "cold." as a token instead of "cold" and ".". What you need is a separator between the word and the punctuation. A naive approach is to replace the punctuation in the text with a space + punctuation.

The following code does what you ask:

from keras.preprocessing.text import Tokenizer

t = Tokenizer(filters='!"#$%&()*+,-/:;<=>?@[\\]^_`{|}~\t\n') # all without .
text = "Tomorrow will be cold."
text = text.replace(".", " .")
t.fit_on_texts([text])
print(t.word_index)

-> prints: {'will': 2, 'be': 3, 'cold': 4, 'tomorrow': 1, '.': 5}

The replace logic can be done in a smarter way (e.g. with a regex if you want to capture all punctuation), but you get the gist.
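
For instance, a regex can insert the separating space in front of every punctuation character in one go; this is just a sketch, and the character class can be trimmed to the punctuation you actually want to keep as tokens.

# Put a space before each punctuation character so it becomes its own token.
import re
from keras.preprocessing.text import Tokenizer

text = "Tomorrow will be cold. Or will it, really?"
text = re.sub(r'([!"#$%&()*+,\-./:;<=>?@\[\]^_`{|}~])', r" \1", text)

t = Tokenizer(filters='')   # keep everything; punctuation is now separated
t.fit_on_texts([text])
print(t.word_index)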


