How to Keep PostgreSQL from Collapsing Punctuation and Spaces When Collating Using a Language

Postgresql invalid class for [:^punct]

The question you link to is tagged with r, where the stringr library uses the ICU regex flavor, which supports POSIX character classes in its own way, not necessarily a POSIX-compatible one.

To match any whitespace or any punctuation character except /, you may use

[^/[:alnum:]]

It matches any character that is not alphanumeric (which means it is either whitespace or punctuation) and not a / character.
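For example, in PostgreSQL this class can be used with regexp_replace to strip everything except alphanumerics and / (the sample string here is just an illustration):

SELECT regexp_replace('foo, bar/baz!', '[^/[:alnum:]]', '', 'g');
-- result: foobar/baz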

How do I SQL query for words with punctuation in PostgreSQL?

tsvector

Use the tsvector type, which is part of the PostgreSQL text-search feature.

postgres> select 'What are Q-type Operations?'::tsvector;
tsvector
-------------------------------------
'Operations?' 'Q-type' 'What' 'are'
(1 row)

You can use familiar operators on tsvectors as well:

postgres> select 'What are Q-type Operations?'::tsvector
postgres> || 'A.B.C''s of Coding'::tsvector;
?column?
--------------------------------------------------------------
'A.B.C''s' 'Coding' 'Operations?' 'Q-type' 'What' 'are' 'of'

From the tsvector documentation:

A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word (see Chapter 12 for details). Sorting and duplicate-elimination are done automatically during input
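A quick illustration of that sorting and de-duplication (the input phrase is just an example):

postgres> select 'the quick quick brown fox'::tsvector;
tsvector
-------------------------------
'brown' 'fox' 'quick' 'the'
(1 row)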

If you also want language-specific normalization, like removing common words ('the', 'a', etc.) and stemming word variants, use the to_tsvector function. It also records the position of each lexeme, which is used for ranking in text search:

postgres> select to_tsvector('english',
postgres> 'What are Q-type Operations? A.B.C''s of Coding');
to_tsvector
--------------------------------------------------------
'a.b.c':7 'code':10 'oper':6 'q':4 'q-type':3 'type':5
(1 row)

Full-blown text search

Obviously, doing this for every row in a query will be expensive, so you should store the tsvector in a separate column and use a tsquery to search it. This also allows you to create a GiST index on the tsvector column.
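A minimal sketch of the table and index that the following statements assume (the table and column names here simply match the INSERT below):

postgres> create table text (phrase text, tsvec tsvector);
CREATE TABLE
postgres> create index text_tsvec_idx on text using gist (tsvec);
CREATE INDEX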

postgres> insert into text (phrase, tsvec)
postgres> values('What are Q-type Operations?',
postgres> to_tsvector('english', 'What are Q-type Operations?'));
INSERT 0 1

Searching is done using tsquery and the @@ operator:

postgres> select phrase from text where tsvec @@ to_tsquery('q-type');
phrase
-----------------------------
What are Q-type Operations?
(1 row)

How do you ignore white spaces and punctuation?

Check out String.replaceAll: you define what you want to replace (in this case whitespace and punctuation), so we use \\W, which matches any non-word character, as the pattern to find and replace it with an empty string.

import java.util.Scanner;

public class StringUtil {

    // Recursively checks whether a string reads the same forwards and backwards
    public static boolean Palindrome(String s)
    {
        if (s.length() == 0 || s.length() == 1)
            return true;

        if (s.charAt(0) == s.charAt(s.length() - 1))
            return Palindrome(s.substring(1, s.length() - 1));

        return false;
    }

    public static void main(String[] args)
    {
        Scanner check = new Scanner(System.in);
        System.out.println("type in a string to check if its a palindrome or not");
        String p = check.nextLine();

        // We replace all of the whitespace and punctuation
        p = p.replaceAll("\\W", "");

        if (Palindrome(p))
            System.out.println(p + " is a palindrome");
        else
            System.out.println(p + " is not a palindrome");
    }
}

Sample Output

type in a string to check if its a palindrome or not
r';:.,?!ace car
racecar is a palindrome

Is it possible to tokenize text in PL/pgSQL using regular expressions?

There are a number of functions for tasks like that.

To retrieve the 2nd word of a text:

SELECT split_part('split this up', ' ', 2);

Split the whole text and return one word per row:

SELECT regexp_split_to_table('split this up', E'\\s+');

(Actually, the last example splits on any stretch of whitespace.)
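If you also need to split on punctuation, not only whitespace, a broader character class should work (the exact class is up to you):

SELECT regexp_split_to_table('split,this;up', E'[\\s[:punct:]]+');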

Regex to remove special characters. Can't get rid of trailing ellipsis

You are probably looking for the fourth, optional parameter of regexp_replace():

SELECT regexp_replace('If...', '[^\w\s]', '', 'g');

The g stands for "globally", i.e. replace every match in the string, not just the first.
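Without the flag, only the first match is replaced, which is why the trailing ellipsis survives:

SELECT regexp_replace('If...', '[^\w\s]', '');       -- 'If..'
SELECT regexp_replace('If...', '[^\w\s]', '', 'g');  -- 'If'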

Postgres and Word Clouds

There is a simple way, but it can be slow (depending on your table size). You can split your text into an array:

SELECT string_to_array(lower(words), ' ') FROM table;

With those arrays, you can use unnest to aggregate them:

WITH words AS (
SELECT unnest(string_to_array(lower(words), ' ')) AS word
FROM table
)
SELECT word, count(*) FROM words
GROUP BY word;

This is a simple way of doing it, but it has some issues: for example, it only splits words on spaces, not on punctuation marks.
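One way around that is to split on runs of whitespace and punctuation instead of a single space; a sketch using the same placeholder table and column names as above:

WITH words AS (
SELECT regexp_split_to_table(lower(words), '[\s[:punct:]]+') AS word
FROM table
)
SELECT word, count(*) FROM words
WHERE word <> ''
GROUP BY word;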

Another, and probably better, option is to use PostgreSQL full-text search.
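If you go that route, ts_stat can produce the word frequencies directly (again using the placeholder table and column from above):

SELECT word, nentry
FROM ts_stat('SELECT to_tsvector(''english'', words) FROM table')
ORDER BY nentry DESC;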

Python - How do I separate punctuation from words by white space leaving only one space between the punctuation and the word?

Looks like re can do it for you...

>>> import re
>>> re.sub(r"([\w/'+$\s-]+|[^\w/'+$\s-]+)\s*", r"\1 ", input)
"I love programming with Python-3 . 3 ! Do you ? It's great ... I give it a 10/10 . It's free- to-use , no $$$ involved ! "

and

>>> re.sub(r"([\w/'+$\s-]+|[^\w/'+$\s-]+)\s*", r"\1 ", "Hello. (hi)")
'Hello . ( hi ) '

If the trailing space is a problem, theresult.rstrip() should fix it for you :-)


