Efficiently Find Repeated Characters in a String

As this is a performance question, let's do some timings:

    import collections

    def test_set(xs):
        seen = set()  # O(1) lookups
        for x in xs:
            if x not in seen:
                seen.add(x)
            else:
                return x

    def test_counter(xs):
        freq = collections.Counter(xs)
        for k in freq:
            if freq[k] > 1:
                return k

    def test_dict(xs):
        d = {}
        for x in xs:
            if x in d:
                return x
            d[x] = 1

    def test_sort(xs):
        ys = sorted(xs)

        for n in range(1, len(xs)):
            if ys[n] == ys[n-1]:
                return ys[n]

##

    import sys, timeit
    print(sys.version + "\n")
    xs = list(range(10000)) + [999]
    fns = [p for name, p in globals().items() if name.startswith('test')]
    for fn in fns:
        assert fn(xs) == 999
        print('%50s %.5f' % (fn, timeit.timeit(lambda: fn(xs), number=100)))

I'm testing on a list of integers rather than a string, because a string over a 256-character alphabet is guaranteed to contain a repeat within its first 257 characters, so no loop would run for long. The results on my machine look like this:

    3.2.3 (v3.2.3:3d0686d90f55, Apr 10 2012, 11:25:50)
    [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]

    <function test_set at 0x1020f7380> 0.19265
    <function test_dict at 0x1020f7490> 0.12725
    <function test_sort at 0x1020f7518> 0.04683
    <function test_counter at 0x1020f7408> 0.92485

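So the sort method appears to be the winner. I guess this is because it doesn't spend time computing hashes and allocating dict/set structures. Keep in mind that this input is already almost sorted, which is close to the best case for Python's sort, so the gap may shrink on shuffled data. Also, if you don't mind the source list being modified, you can call xs.sort() instead of ys = sorted(xs), which avoids allocating a copy of the list.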

On the other hand, if repeated items are more likely to occur near the beginning of the input (as in xs = 'abcdef' * 10000), the set method will perform best because, unlike sort or Counter, it returns as soon as a repeat is found and doesn't need to preprocess the whole list. You should also use set if you need the first repeating element, not just any one of them.
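To illustrate the difference between "the first repeat" and "any repeat", here is a minimal sketch. The function names first_repeat and any_repeat_by_sorting are made up for this example and are not part of the benchmark above.

```python
def first_repeat(chars):
    """Return the first element whose second occurrence is reached, or None."""
    seen = set()
    for ch in chars:
        if ch in seen:
            return ch  # early exit: the rest of the input is never touched
        seen.add(ch)
    return None


def any_repeat_by_sorting(chars):
    """Return some repeated element (the smallest one), or None."""
    ordered = sorted(chars)
    for prev, cur in zip(ordered, ordered[1:]):
        if prev == cur:
            return cur
    return None


print(first_repeat("abcba"))           # b
print(any_repeat_by_sorting("abcba"))  # a
```

Both functions find a repeat, but only the set-based one respects the original order of the input.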

Counter is a nice tool, but it's not designed for performance, so if you really have to deal with "gigantic inputs", go with sets if they fit in memory, or an external merge sort if they don't.

Efficiently find character in a string formed by repeated substring by integer

If you know that some string s is another string t repeated n times then the character with index k in string s is equal to the character with index k2 = k mod t.length in the string t. We can use that to solve this task:

  1. Determine the length of the result string:

    len = 0
    for each character ch in s
        if ch is digit
            len = len * digit
        else
            len = len + 1

  2. Iterate in reverse order through the string:

    reverseS = reverse(s)
    curLen = len
    for each character ch in reverseS
        if ch is digit
            curLen = curLen / digit
            k = k mod curLen
        else
            if k == (curLen-1) then return ch as answer
            curLen = curLen - 1

As a result, the algorithm needs only O(1) additional memory (provided you walk s backwards by index instead of building a reversed copy) and runs in O(n) time, where n is the size of the input string. Note that k is a 0-based index here.

Sample C++ code: https://ideone.com/l8JxdQ
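The two steps above can be sketched in Python as follows. This is only an illustration under a few assumptions: the function name char_at is made up, k is 0-based, and every digit in the encoded string is a single ASCII digit from 2 to 9 that repeats everything decoded so far.

```python
def char_at(encoded, k):
    """Return the character at 0-based index k of the decoded string."""
    # Step 1: compute the decoded length without building the string.
    total = 0
    for ch in encoded:
        if "0" <= ch <= "9":
            total *= int(ch)
        else:
            total += 1
    if not 0 <= k < total:
        raise IndexError("k is outside the decoded string")

    # Step 2: walk backwards, shrinking the length and folding k into it.
    cur_len = total
    for ch in reversed(encoded):  # reversed() iterates lazily, no copy is made
        if "0" <= ch <= "9":
            cur_len //= int(ch)
            k %= cur_len
        else:
            if k == cur_len - 1:
                return ch
            cur_len -= 1
    raise ValueError("unreachable for well-formed input")


# "ab2c3" decodes to "ababcababcababc"
print(char_at("ab2c3", 7))  # a
```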

Efficiently find first repeated character in a string without using any additional data structure in one traversal

In Java, (char)-1 is the same as '\uffff'. It will typically be printed as ? because U+FFFF is a Unicode noncharacter, so it has no printable representation.

What is the most efficient way to detect duplicate characters in a String in Java?

If you need to support Unicode characters that aren't represented by surrogate char pairs, this will do it:

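    private static boolean isUnique(String inputString) {
        // 65536 possible char values / 64 bits per long = 1024 longs
        long[] used = new long[1024];
        for (char c : inputString.toCharArray()) {
            // 1L forces a 64-bit shift; with an int 1 the shift distance
            // wraps at 32, so e.g. 'A' and 'a' would share the same bit
            long mask = 1L << c;
            // compare with != 0: bit 63 makes the masked value negative
            if ((used[c >>> 6] & mask) != 0) {
                return false;
            }
            used[c >>> 6] |= mask;
        }
        return true;
    }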

It uses bit flags to save memory. It's essentially the same as using an array of booleans:

    private static boolean isUnique2(String inputString) {
        boolean[] used = new boolean[65536];
        for (char c : inputString.toCharArray()) {
            if (used[c]) {
                return false;
            }
            used[c] = true;
        }
        return true;
    }

If you only need to support 8-bit characters, you could shrink used in either case to reduce the memory required (long[4] and boolean[256]; for 7-bit ASCII, long[2] and boolean[128] suffice). Below a certain length of inputString, though, the O(n^2) pairwise check is probably faster than allocating this memory. So ideally you combine the two approaches based on the length.

If you need to support all possible Unicode characters, you'll have to modify this to handle surrogate char pairs. You can detect them with Character.isHighSurrogate(c).
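For comparison, the same bit-flag idea can be sketched in Python. This is an illustration rather than a translation of the Java code: the name is_unique is made up, a single arbitrary-precision int stands in for the long[] array, and Python strings iterate over code points, so no surrogate handling is needed.

```python
def is_unique(text):
    """Return True if no character occurs more than once in text."""
    seen = 0  # bit n is set once the code point n has been seen
    for ch in text:
        bit = 1 << ord(ch)
        if seen & bit:
            return False
        seen |= bit
    return True


print(is_unique("Aa"))     # True
print(is_unique("hello"))  # False
```

The int grows as needed, so memory use depends on the largest code point seen rather than on a fixed table size.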

Finding repetitions of a string by length

Here's what I did:

    import pandas as pd

    # find frequency of each length 3 substring
    Phrase = "Maryhadalittlarymbada"
    substring = []
    # len(Phrase) - 2 start positions, so the final substring is included
    for i in range(len(Phrase) - 2):
        substring.append(Phrase[i:i+3])
    Frequency = pd.Series(substring).value_counts()

    # find repetition's position in string
    # (.items() replaces .iteritems(), which was removed in pandas 2.0)
    for index, value in Frequency.items():
        if value > 1:
            positions = []
            for i in range(len(Phrase) - 2):
                if index == Phrase[i:i+3]:
                    positions.append(i)
            print(index, ": ", positions)
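If pulling in pandas is not desirable, the same result can be obtained in a single pass with the standard library. This is a sketch of an alternative, not the approach used above; the function name repeated_ngrams and its length parameter are made up for illustration.

```python
from collections import defaultdict


def repeated_ngrams(text, length=3):
    """Map each substring of the given length that occurs more than once
    to the list of its start positions."""
    positions = defaultdict(list)
    for start in range(len(text) - length + 1):
        positions[text[start:start + length]].append(start)
    return {sub: where for sub, where in positions.items() if len(where) > 1}


print(repeated_ngrams("Maryhadalittlarymbada"))
# {'ary': [1, 13], 'ada': [5, 18]}
```

Recording the positions while counting avoids the second scan over the string for every repeated substring.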

Testing for repeated characters in a string

If the string is short, then just looping and testing may well be the simplest and most efficient way. I mean you could create a hash set (in whatever platform you're using) and iterate through the characters, failing if the character is already in the set and adding it to the set otherwise - but that's only likely to provide any benefit when the strings are longer.

EDIT: Now that we know it's sorted, mquander's answer is the best one IMO. Here's an implementation:

    public static bool IsSortedNoRepeats(string text)
    {
        if (text.Length == 0)
        {
            return true;
        }
        char current = text[0];
        for (int i=1; i < text.Length; i++)
        {
            char next = text[i];
            if (next <= current)
            {
                return false;
            }
            current = next;
        }
        return true;
    }

A shorter alternative if you don't mind repeating the indexer use:

    public static bool IsSortedNoRepeats(string text)
    {
        for (int i=1; i < text.Length; i++)
        {
            if (text[i] <= text[i-1])
            {
                return false;
            }
        }
        return true;
    }

EDIT: Okay, with the "frequency" side, I'll turn the problem round a bit. I'm still going to assume that the string is sorted, so what we want to know is the length of the longest run. When there are no repeats, the longest run length will be 0 (for an empty string) or 1 (for a non-empty string). Otherwise, it'll be 2 or more.

First a string-specific version:

    public static int LongestRun(string text)
    {
        if (text.Length == 0)
        {
            return 0;
        }
        char current = text[0];
        int currentRun = 1;
        int bestRun = 0;

        for (int i=1; i < text.Length; i++)
        {
            if (current != text[i])
            {
                bestRun = Math.Max(currentRun, bestRun);
                currentRun = 0;
                current = text[i];
            }
            currentRun++;
        }
        // It's possible that the final run is the best one
        return Math.Max(currentRun, bestRun);
    }

Now we can also do this as a general extension method on IEnumerable<T>:

    // Must be declared inside a static class to work as an extension method
    public static int LongestRun<T>(this IEnumerable<T> source)
    {
        bool first = true;
        T current = default(T);
        int currentRun = 0;
        int bestRun = 0;

        foreach (T element in source)
        {
            if (first || !EqualityComparer<T>.Default.Equals(element, current))
            {
                first = false;
                bestRun = Math.Max(currentRun, bestRun);
                currentRun = 0;
                current = element;
            }
            // Count the current element as part of the current run
            currentRun++;
        }
        // It's possible that the final run is the best one
        return Math.Max(currentRun, bestRun);
    }

Then you can call "AABCD".LongestRun() for example.
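For readers working in Python, the same longest-run idea can be sketched with itertools.groupby, which groups consecutive equal elements. The function name longest_run is made up for this example; as with the C# version, it works on any iterable, not just strings.

```python
from itertools import groupby


def longest_run(items):
    """Return the length of the longest run of consecutive equal elements."""
    # groupby yields one group per run; default=0 covers empty input.
    return max((sum(1 for _ in run) for _, run in groupby(items)), default=0)


print(longest_run("AABCD"))  # 2
```

On sorted input, a result of 0 or 1 means there are no repeats, and anything larger means at least one character is repeated.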

Finding repeated character combinations in string

This is in Python 2 because I'm not doing Python 3 at this time. So you'll have to adapt it to Python 3 yourself.

    #!python2

    # import module
    from collections import Counter

    # get the indices
    def getIndices(length):
        # holds the indices
        specific_range = []; all_sets = []

        # start building the indices
        for i in range(0, length - 2):

            # build a set of indices of a specific range
            for j in range(1, length + 2):
                specific_range.append([j - 1, j + i + 3])

                # append 'specific_range' to 'all_sets', reset 'specific_range'
                if specific_range[j - 1][1] == length:
                    all_sets.append(specific_range)
                    specific_range = []
                    break

        # return all of the calculated indices ranges
        return all_sets

    # store search strings
    tmplst = []; combos = []; found = []

    # string to be searched
    mystring = "abcdthisisatextwithsampletextforasampleabcd"
    # mystring = "abcdthisisatextwithtextsampletextforasampleabcdtext"

    # get length of string
    length = len(mystring)

    # get all of the indices ranges, 4 and greater
    all_sets = getIndices(length)

    # get the search string combinations
    for sublst in all_sets:
        for subsublst in sublst:
            tmplst.append(mystring[subsublst[0]: subsublst[1]])
        combos.append(tmplst)
        tmplst = []

    # search for matching string patterns
    for sublst in all_sets:
        for subsublst in sublst:
            for sublstitems in combos:
                if mystring[subsublst[0]: subsublst[1]] in sublstitems:
                    found.append(mystring[subsublst[0]: subsublst[1]])

    # make a dictionary containing the strings and their counts
    d1 = Counter(found)

    # filter out counts of 2 or more and print them
    for k, v in d1.items():
        if v > 1:
            print k, v
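As a possible starting point for a Python 3 adaptation, here is a compact sketch that counts every substring of length 4 or more directly with Counter. It is a simplified rewrite rather than a line-by-line port of the code above, and the name repeated_combinations and its min_length parameter are made up for illustration.

```python
from collections import Counter


def repeated_combinations(text, min_length=4):
    """Count all substrings of at least min_length characters and keep
    those that occur more than once (overlapping occurrences included)."""
    counts = Counter(
        text[start:start + size]
        for size in range(min_length, len(text) + 1)
        for start in range(len(text) - size + 1)
    )
    return {sub: n for sub, n in counts.items() if n > 1}


for sub, n in repeated_combinations("abcdthisisatextwithsampletextforasampleabcd").items():
    print(sub, n)
```

This enumerates a quadratic number of substrings, so it is only suitable for short inputs like the one shown.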

Find no of repeated characters in a string using one for loop with no variables

This one follows both rules:

    x='ABCDEAB'
    for i in x:
        try:
            if(i in x[x.index(i)+1:]):
                print(i,end=" ")
                x=x.replace(i,"",1)
        except ValueError:
            pass

