Remove Repeating Character

Use backreferences:

echo preg_replace("/(.)\\1+/", "$1", "cakkke");

Output:

cake

Explanation:

(.) captures any character

\\1 is a backreference to the first capture group, which here is the character matched by (.).

+ makes the backreference match at least once, so the whole pattern matches aa, aaa, aaaa, but not a single a.

Replacing the match with $1 swaps the complete matched text (kkk in this case) for the first capture group (k in this case).
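For comparison, here is roughly the same substitution in Python. This is only a sketch: the helper name collapse_repeats is made up for illustration and is not part of any library.

```python
import re

# collapse_repeats is a hypothetical helper name, not a library function.
def collapse_repeats(text):
    # (.) captures one character; \1+ matches one or more further copies of it.
    # The whole run is replaced by the single captured character.
    return re.sub(r"(.)\1+", r"\1", text)

print(collapse_repeats("cakkke"))  # cake
```

Note that . does not match a newline by default, so repeated blank lines are left alone.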

How to remove repeating letter in a dataframe?

You may try this:

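df["Col"] = df["Col"].str.replace(u"h{4,}", "", regex=True)

Here {4,} sets the minimum number of repeated characters to match, 4 in my case. regex=True is needed on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default.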

Before:

                                        Col
0  hello, I'm today hh hhhh hhhhhhhhhhhhhhh
1                               Hello World

After:

                     Col
0  hello, I'm today hh  
1            Hello World

I used Unicode matching, since you mentioned you are working with tweets (the u prefix only matters on Python 2; on Python 3 every str is already Unicode).
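If you want to try the same idea outside pandas, the sketch below uses only the standard re module. The helper drop_long_runs and its min_len parameter are hypothetical names of mine, and it goes one step beyond the answer by using a backreference so that long runs of any character are removed, not just h.

```python
import re

# drop_long_runs and min_len are illustrative names, not from pandas or re.
def drop_long_runs(text, min_len=4):
    # For min_len=4 this builds (.)\1{3,}: one character plus at least
    # min_len - 1 further copies of it. Use min_len >= 2.
    pattern = r"(.)\1{%d,}" % (min_len - 1)
    return re.sub(pattern, "", text)

sample = "hello, I'm today hh hhhh hhhhhhhhhhhhhhh"
print(repr(drop_long_runs(sample)))  # "hello, I'm today hh  "
```

The two trailing spaces are the separators that used to sit between the removed runs, which matches the dataframe output above.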

How to remove repeated/same characters in a sequence from a string using C#?

As I understand it, you want to eliminate duplicates only when they occur consecutively. You could achieve that as follows.

Using List<char>

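// Needs System.Collections.Generic (List<T>) and System.Linq (Last()).
var nonDuplicates = new List<char>();

foreach (var element in str.ToCharArray())
{
    // Keep a character only if it differs from the last one kept.
    if (nonDuplicates.Count == 0 || nonDuplicates.Last() != element)
        nonDuplicates.Add(element);
}

var result = new string(nonDuplicates.ToArray());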

Update

With reference to a comment, I have updated the answer with two more solutions and ran a benchmark on all three. The results are shown below.

Using String Append

  var str = "aaaabbcccghbcccciippppkkllk";
  var strResult = string.Empty;

  foreach (var element in str.ToCharArray())
  {
     if (strResult.Length == 0 || strResult[strResult.Length - 1] != element)
        strResult = $"{strResult}{element}";
  }

Using StringBuilder

  var str = "aaaabbcccghbcccciippppkkllk";
  var strResult = new StringBuilder();

  foreach (var element in str.ToCharArray())
  {
     if (strResult.Length == 0 || strResult[strResult.Length - 1] != element)
       strResult.Append(element);
  }
  var result = strResult.ToString();

Benchmark Results

             Method |       Mean |     Error |     StdDev |     Median |
------------------- |-----------:|----------:|-----------:|-----------:|
          UsingList |   809.7 ns | 11.975 ns |  11.202 ns |   806.5 ns |
  UsingStringAppend | 1,738.0 ns | 39.269 ns | 109.467 ns | 1,697.2 ns |
 UsingStringBuilder |   201.6 ns |  1.960 ns |   1.834 ns |   201.1 ns |

As seen in the results, the StringBuilder approach is much faster than the List approach. The string append approach is the slowest.

Input

aaaabbcccghbcccciippppkkllk

Output

abcghbcipklk
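The same "drop consecutive duplicates only" behaviour can be sketched in Python with itertools.groupby. This is not a port of the benchmarked C# code, just an illustration of the idea; squeeze is a name I picked.

```python
from itertools import groupby

# squeeze is a hypothetical helper name.
def squeeze(text):
    # groupby yields one (key, run) pair per run of equal adjacent characters;
    # keeping only the keys drops the consecutive duplicates.
    return "".join(key for key, _run in groupby(text))

print(squeeze("aaaabbcccghbcccciippppkkllk"))  # abcghbcipklk
```

Characters that reappear later in the string (such as the second b and the final k) are kept, because only adjacent repeats are merged.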

How can I remove repeated characters in a string with R?

I have not thought about this very carefully, but here is my quick solution using backreferences in regular expressions:

gsub('([[:alpha:]])\\1+', '\\1', 'Buenaaaaaaa Suerrrrte')
# [1] "Buena Suerte"

() captures a letter, \\1 refers to that letter, and + means to match it once or more; putting these pieces together, the pattern matches a letter repeated two or more times.

To include other characters besides letters, replace [[:alpha:]] with a regex matching whatever you wish to include.
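To see what the choice of character class changes, here is a small Python sketch (Python rather than R, and the sample string is made up). Python's re module has no [[:alpha:]] class, so I assume [^\W\d_] as the "any letter" equivalent.

```python
import re

text = "Suerrrrte!!! 1000"  # made-up sample, not from the question

# Letters only: [^\W\d_] is a word character that is neither a digit nor "_".
letters_only = re.sub(r"([^\W\d_])\1+", r"\1", text)

# Any character (except newline): punctuation and digits collapse as well.
any_char = re.sub(r"(.)\1+", r"\1", text)

print(letters_only)  # Suerte!!! 1000
print(any_char)      # Suerte! 10
```

The second variant also turns 1000 into 10, which is usually a good reason to keep the class restricted to letters.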

Remove repeating characters from sentence but retain the words' meaning

You can combine regex and NLP here: iterate over all the words in the string, and whenever a word contains three or more identical consecutive letters, reduce each such run to at most 2 occurrences of the letter, then run an automatic spellcheck to fix the spelling.

Here is some example Python code:

import re
from textblob import Word

# tweet is assumed to already hold the input text.
# rx matches a letter repeated three or more times in a row.
rx = re.compile(r'([^\W\d_])\1{2,}')
print( re.sub(r'[^\W\d_]+', lambda x: Word(rx.sub(r'\1\1', x.group())).correct() if rx.search(x.group()) else x.group(), tweet) )
# => "I'm so happy about offline school"

The code uses the TextBlob library, but you may use any spellchecker you like.

Note that ([^\W\d_])\1{2,} matches the same letter three or more times in a row, while [^\W\d_]+ matches one or more letters.
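If you only want to look at the regex half of this, without installing a spellchecker, the reduction step on its own can be sketched like this. cap_runs_at_two is a hypothetical name and the sample sentence is invented.

```python
import re

rx = re.compile(r"([^\W\d_])\1{2,}")

# cap_runs_at_two is an illustrative helper, not part of TextBlob.
def cap_runs_at_two(text):
    # Replace every run of three or more identical letters with exactly two.
    return rx.sub(r"\1\1", text)

print(cap_runs_at_two("I'm sooo happppy"))  # I'm soo happy
```

Capping at two rather than one keeps legitimate double letters (the pp in happy) and leaves leftovers such as soo for the spellcheck step to correct.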

How to remove duplicate chars in a string?

It seems from your example that you want to remove REPEATED SEQUENCES of characters, not duplicate chars across the whole string. So this is what I'm solving here.

You can use a regular expression. Not sure how horribly inefficient it is, but it works.

>>> import re
>>> phrase = str("oo rarato roeroeu aa rouroupa dodo rerei dde romroma")
>>> re.sub(r'(.+?)\1+', r'\1', phrase)
'o rato roeu a roupa do rei de roma'

How this substitution proceeds down the string:

oo -> o
" " -> " "
rara -> ra
to -> to
" "-> " "
roeroe -> roe

etc..

Edit: Works for the other example string which should not be modified:

>>> phrase = str("Barbara Bebe com Bernardo")
>>> re.sub(r'(.+?)\1+', r'\1', phrase)
'Barbara Bebe com Bernardo'
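To check which blocks the pattern actually collapses, you can list the matches with finditer. A short sketch on a shortened version of the phrase follows; the banana line is an extra example of mine showing that ordinary words containing a repeated block get shortened too.

```python
import re

pattern = re.compile(r"(.+?)\1+")

phrase = "oo rarato roeroeu"
# Each match is a block (group 1) followed by one or more copies of itself.
for m in pattern.finditer(phrase):
    print(repr(m.group(0)), "->", repr(m.group(1)))
# 'oo' -> 'o'
# 'rara' -> 'ra'
# 'roeroe' -> 'roe'

print(pattern.sub(r"\1", phrase))    # o rato roeu
print(pattern.sub(r"\1", "banana"))  # bana (the repeated "an" is collapsed)
```

So this approach is only safe when the input is known not to contain words with naturally repeated syllables.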

Regex remove repeated characters from a string by javascript

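Use a lookahead that means "this character, followed by anything and then the same character again". Every character that occurs again later is removed, so only the last occurrence of each one survives: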

var str = "aaabbbccccabbbbcccccc";
console.log(str.replace(/(.)(?=.*\1)/g, "")); // "abc"
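The same lookahead works in Python's re module. The sketch below also shows a non-regex alternative that keeps the first occurrence instead of the last; both helper names are made up for illustration.

```python
import re

# keep_last and keep_first are hypothetical helper names.
def keep_last(text):
    # Delete every character that occurs again later in the string.
    return re.sub(r"(.)(?=.*\1)", "", text)

def keep_first(text):
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return "".join(dict.fromkeys(text))

print(keep_last("aaabbbccccabbbbcccccc"))  # abc
print(keep_last("abca"))                   # bca
print(keep_first("abca"))                  # abc
```

The two only differ in ordering when a character reappears after other characters, as in abca.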

Remove characters which repeat more than twice in a string

Try using gsub, with the pattern (.)\\1{2,}:

F <- ("hhhappy birthhhhhhdayyy")
gsub("(.)\\1{2,}", "\\1", F)

[1] "happy birthday"

Explanation of regex:

(.)          match and capture any single character
\\1{2,}      then match the same character two or more times

We replace with just the single matching character. The quantity \\1 represents the first capture group in gsub.
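If you would rather avoid regex, the "more than twice" rule can also be sketched in Python with itertools.groupby. This is a different technique from the gsub call above, and collapse_long_runs and min_len are names I made up.

```python
from itertools import groupby

# collapse_long_runs and min_len are illustrative names.
def collapse_long_runs(text, min_len=3):
    pieces = []
    for char, run in groupby(text):
        length = len(list(run))
        # Runs of min_len or more shrink to one character; shorter runs are kept.
        pieces.append(char if length >= min_len else char * length)
    return "".join(pieces)

print(collapse_long_runs("hhhappy birthhhhhhdayyy"))  # happy birthday
```

The pp in happy survives because a run of two is below the threshold, which is the same reason the regex uses {2,} after the captured character.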

How can we remove word with repeated single character?

A better approach here is to use a set:

def modify(s):

    #Create a set from the string
    c = set(s)

    #If you have only one character in the set, convert set to string
    if len(c) == 1:
        return ''.join(c)
    #Else return original string
    else:
        return s

print(modify('good'))
print(modify('gggggggg'))

If you want to use a regex instead, mark the start and end of the string in the pattern with ^ and $ (inspired by @bobblebubble's comment):

import re

def modify(s):

    #Create the sub string with a regex which only matches if a single character is repeated
    #Marking the start and end of string as well
    out = re.sub(r'^([a-z])\1+$', r'\1', s)
    return out

print(modify('good'))
print(modify('gggggggg'))

The output will be

good
g