Postgresql invalid class for [:^punct]
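The question you link to is tagged with r, where the stringr library uses the ICU regex flavor. ICU supports POSIX character classes in its own way, which is not necessarily POSIX compatible, so a negated class such as [:^punct:] that works there is not accepted by PostgreSQL.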
To match any whitespace or any punctuation except /, you may use
[^/[:alnum:]]
It matches any char that is neither alphanumeric nor a / char. In practice that means whitespace or punctuation, although strictly speaking it also covers other non-alphanumeric chars such as control characters.
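As a rough cross-check outside the database, the same character test can be sketched in Python. Python's re module does not support POSIX bracket classes such as [:alnum:], so this sketch assumes str.isalnum() as a stand-in; it is Unicode-aware, so results may differ from PostgreSQL for non-ASCII text depending on the locale. The function name is made up for illustration.

```python
def matches_class(ch: str) -> bool:
    """Approximate [^/[:alnum:]]: True for one char that is neither alphanumeric nor '/'."""
    return len(ch) == 1 and ch != "/" and not ch.isalnum()

sample = "a/b c,d!"
matched = [ch for ch in sample if matches_class(ch)]
print(matched)  # [' ', ',', '!']
```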
How do I SQL query for words with punctuation in Postgresql?
tsvector
Use the tsvector type, which is part of the PostgreSQL text-search feature.
postgres> select 'What are Q-type Operations?'::tsvector;
tsvector
-------------------------------------
'Operations?' 'Q-type' 'What' 'are'
(1 row)
You can use familiar operators on tsvectors as well:
postgres> select 'What are Q-type Operations?'::tsvector
postgres> || 'A.B.C''s of Coding'::tsvector;
?column?
--------------------------------------------------------------
'A.B.C''s' 'Coding' 'Operations?' 'Q-type' 'What' 'are' 'of'
From the tsvector documentation:
A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word (see Chapter 12 for details). Sorting and duplicate-elimination are done automatically during input
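The "sorted list of distinct" part of that description can be mimicked in a few lines of Python. This is only a toy analogy, assuming plain whitespace splitting and code-point ordering; it says nothing about how PostgreSQL implements tsvector internally, and the function name is hypothetical.

```python
def toy_tsvector(text: str) -> list[str]:
    """Split on whitespace, drop duplicates, and sort (no normalization at all)."""
    return sorted(set(text.split()))

print(toy_tsvector("What are Q-type Operations?"))
# ['Operations?', 'Q-type', 'What', 'are']
print(toy_tsvector("a fat cat sat on a mat"))
# ['a', 'cat', 'fat', 'mat', 'on', 'sat']
```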
If you also want to do language-specific normalization, like removing common words ('the', 'a', etc.) and reducing plurals to their stem, use the to_tsvector function. It also records the position of each lexeme in the text (the numbers after the colons in the output), which text search can use for ranking:
postgres> select to_tsvector('english',
postgres> 'What are Q-type Operations? A.B.C''s of Coding');
to_tsvector
--------------------------------------------------------
'a.b.c':7 'code':10 'oper':6 'q':4 'q-type':3 'type':5
(1 row)
Full-blown text search
Obviously doing this for every row in a query will be expensive, so you should store the tsvector in a separate column and use to_tsquery() to search it. This also allows you to create a GIN or GiST index on the tsvector column.
postgres> insert into text (phrase, tsvec)
postgres> values('What are Q-type Operations?',
postgres> to_tsvector('english', 'What are Q-type Operations?'));
INSERT 0 1
Searching is done using tsquery and the @@ operator:
postgres> select phrase from text where tsvec @@ to_tsquery('q-type');
phrase
-----------------------------
What are Q-type Operations?
(1 row)
How do you ignore white spaces and punctuation?
Check out String.replaceAll: you define what you want to replace, in this case whitespace and punctuation, so we will use \\W as the pattern to find and replace it with nothing. Note that \W matches any non-word character, so letters, digits and underscores are kept.
import java.util.Scanner;

public class StringUtil {
    public static boolean Palindrome(String s)
    {
        if (s.length() == 0 || s.length() == 1)
            return true;
        if (s.charAt(0) == s.charAt(s.length() - 1))
            return Palindrome(s.substring(1, s.length() - 1));
        return false;
    }

    public static void main(String[] args)
    {
        Scanner check = new Scanner(System.in);
        System.out.println("type in a string to check if its a palindrome or not");
        String p = check.nextLine();
        // We replace all of the whitespace and punctuation
        p = p.replaceAll("\\W", "");
        if (Palindrome(p))
            System.out.println(p + " is a palindrome");
        else
            System.out.println(p + " is not a palindrome");
    }
}
Sample Output
type in a string to check if its a palindrome or not
r';:.,?!ace car
racecar is a palindrome
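For comparison, the same clean-then-check idea can be sketched in Python. This is an assumption-laden port rather than the answer's code: re.ASCII is used so that \W behaves like Java's default ASCII-only \W, the check is iterative instead of recursive, and the function names are invented for the example. Like the Java version, it keeps underscores and is case-sensitive.

```python
import re

def strip_non_word(s: str) -> str:
    """Remove every character that is not an ASCII letter, digit, or underscore."""
    return re.sub(r"\W", "", s, flags=re.ASCII)

def is_palindrome(s: str) -> bool:
    """Compare characters from both ends, moving inward."""
    left, right = 0, len(s) - 1
    while left < right:
        if s[left] != s[right]:
            return False
        left += 1
        right -= 1
    return True

cleaned = strip_non_word("r';:.,?!ace car")
print(cleaned, is_palindrome(cleaned))  # racecar True
```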
Is it possible to tokenize text in PL/PGSQL using regular expressions?
There are a number of functions for tasks like that.
To retrieve the 2nd word of a text:
SELECT split_part('split this up', ' ', 2);
Split the whole text and return one word per row:
SELECT regexp_split_to_table('split this up', E'\\s+');
(Actually, the last example splits on any stretch of whitespace.)
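For readers more at home in Python, the two calls have close analogues. This is a sketch for comparison only: str.split with an explicit separator mirrors split_part (which is 1-based, while Python lists are 0-based), and re.split mirrors regexp_split_to_table.

```python
import re

text = "split this up"

# Like split_part(text, ' ', 2): second field when splitting on a single space.
second_word = text.split(" ")[1]

# Like regexp_split_to_table with a whitespace pattern: split on any run of whitespace.
words = re.split(r"\s+", "split   this\tup")

print(second_word)  # this
print(words)        # ['split', 'this', 'up']
```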
Regex to remove special characters. Can't get rid of trailing ellipsis
You are probably looking for the fourth, optional parameter of regexp_replace():
SELECT regexp_replace('If...', '[^\w\s]', '', 'g');
The g flag stands for "globally", i.e. replace every match in the string, not just the first.
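The first-match versus every-match distinction can be illustrated in Python, where the default is the other way round: re.sub replaces every match unless count limits it. This is just an analogy for the behavior of the flag, not PostgreSQL code.

```python
import re

pattern = r"[^\w\s]"  # any char that is neither a word char nor whitespace

first_only = re.sub(pattern, "", "If...", count=1)  # like regexp_replace without 'g'
every_match = re.sub(pattern, "", "If...")          # like regexp_replace with 'g'

print(first_only)   # If..
print(every_match)  # If
```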
Postgres and Word Clouds
There is a simple way, but it can be slow (depending on your table size). You can split your text into an array:
SELECT string_to_array(lower(words), ' ') FROM table;
With those arrays, you can use unnest to expand them into one row per word and then aggregate:
WITH words AS (
SELECT unnest(string_to_array(lower(words), ' ')) AS word
FROM table
)
SELECT word, count(*) FROM words
GROUP BY word;
This is a simple approach, but it has some issues; for example, it only splits words on spaces, not on punctuation marks.
Another, and probably better, option is to use PostgreSQL full text search.
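To see the effect of the space-only split, here is a small Python sketch that counts words both ways. The sample sentence is made up, and \w+ is just one simple way to ignore punctuation (it also splits on apostrophes and hyphens).

```python
import re
from collections import Counter

text = "Red fish, blue fish. Red!"

# Space-only split, like string_to_array(lower(words), ' '): punctuation sticks to words.
by_space = Counter(text.lower().split(" "))

# Keep only runs of word characters, so "fish," and "fish." both count as "fish".
by_word = Counter(re.findall(r"\w+", text.lower()))

print(by_space)
print(by_word["fish"], by_word["red"])  # 2 2
```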
Python - How do I separate punctuation from words by white space leaving only one space between the punctuation and the word?
Looks like re can do it for you...
>>> import re
>>> re.sub(r"([\w/'+$\s-]+|[^\w/'+$\s-]+)\s*", r"\1 ", input)
"I love programming with Python-3 . 3 ! Do you ? It's great ... I give it a 10/10 . It's free- to-use , no $$$ involved ! "
and
>>> re.sub(r"([\w/'+$\s-]+|[^\w/'+$\s-]+)\s*", r"\1 ", "Hello. (hi)")
'Hello . ( hi ) '
If the trailing space is a problem, theresult.rstrip(' ') should fix it for you:-)
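Putting the two pieces together, a small wrapper might look like this. The function and constant names are hypothetical; the pattern is the one from the answer above.

```python
import re

# Either a run of "word-like" chars (incl. / ' + $ - and spaces) or a run of anything else.
_TOKEN = re.compile(r"([\w/'+$\s-]+|[^\w/'+$\s-]+)\s*")

def space_out_punctuation(text: str) -> str:
    """Follow each token with exactly one space, then drop the trailing space."""
    return _TOKEN.sub(r"\1 ", text).rstrip(" ")

print(repr(space_out_punctuation("Hello. (hi)")))  # 'Hello . ( hi )'
```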