I Want to Find the Maximum Frequency of a Common Digit in a Consecutive Subset of Integer Array

Better Solution for Finding out Maximums for Consecutive Array Subsequences

O(N) Space and O(N) Time complexity algorithm.

Traverse the array from left to right. Maintain an external stack of elements.

A[0] = 34, push 34 onto stack. Count[0] = 0.  
A[1] = 10,
Now keep popping from the stack until either it is empty
or you encounter a number greater than the current element (10).
Here 34 > 10, so nothing is popped; we push 10 onto the stack and set Count[1] = 0.

A[2] = 15.
So we pop out 10 from the stack, push 15 onto stack,
and make Count[2] = Count[ last popped out index ] + Number of indices popped out
= 0 + 1 = 1.
A[3] = 14.
Since 15 > 14, nothing is popped; we push 14 onto the stack and set Count[3] = 0.
A[4] = 30.
So we pop 14 and 15 from the stack, push 30 onto the stack,
and make Count[4] = Count[ index of 15 ] + Number of indices popped out (2)
= 1 + 2 = 3.

Do this till you reach the end of the array.

Since each item is pushed onto the stack exactly once and popped at most once, the time complexity is O(2N) = O(N).

EDIT: As suggested, it will be better if you store the indices in the stack, instead of values.
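The procedure above can be sketched in Python (a minimal version, storing indices on the stack as the edit suggests; `span_counts` is a hypothetical name, and the array and expected counts are the worked example from above):

```python
def span_counts(a):
    """For each i, count the consecutive elements immediately before a[i]
    that are smaller than a[i], using a stack of indices."""
    count = [0] * len(a)
    stack = []  # indices whose values form a decreasing sequence
    for i, x in enumerate(a):
        # pop until the stack is empty or its top holds a larger value
        while stack and a[stack[-1]] < x:
            stack.pop()
        # everything between the previous larger element and i was popped
        count[i] = i - (stack[-1] + 1) if stack else i
        stack.append(i)
    return count

print(span_counts([34, 10, 15, 14, 30]))  # [0, 0, 1, 0, 3]
```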

Detecting consecutive integers in a list

From the docs:

>>> from itertools import groupby
>>> from operator import itemgetter
>>> data = [ 1, 4,5,6, 10, 15,16,17,18, 22, 25,26,27,28]
>>> for k, g in groupby(enumerate(data), lambda ix: ix[0] - ix[1]):
...     print(list(map(itemgetter(1), g)))
...
[1]
[4, 5, 6]
[10]
[15, 16, 17, 18]
[22]
[25, 26, 27, 28]

You can adapt this fairly easily to get a printed set of ranges.
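For instance, one adaptation (a sketch; `ranges` collects each group's first and last element, which works because `groupby` never yields an empty group):

```python
from itertools import groupby
from operator import itemgetter

data = [1, 4, 5, 6, 10, 15, 16, 17, 18, 22, 25, 26, 27, 28]

ranges = []
for k, g in groupby(enumerate(data), lambda ix: ix[0] - ix[1]):
    group = list(map(itemgetter(1), g))
    ranges.append((group[0], group[-1]))  # (start, end) of each run

print(ranges)  # [(1, 1), (4, 6), (10, 10), (15, 18), (22, 22), (25, 28)]
```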

how to count distinct elements of an array in one pass?

Supposing that your elements are integers and their values are between 0 and MAXVAL-1.

#include <stdio.h>
#include <string.h>

#define MAXVAL 50

unsigned int CountDistinctElements(unsigned int* iArray, unsigned int iNbElem) {
    unsigned int ret = 0;

    //this array will contain the count of each value
    //for example, c[3] will contain the count of the value 3 in your original array
    unsigned int c[MAXVAL];
    memset(c, 0, MAXVAL*sizeof(unsigned int));

    for (unsigned int i = 0; i < iNbElem; i++) {
        unsigned int elem = iArray[i];
        if (elem < MAXVAL) {   //guard against out-of-range writes
            if (c[elem] == 0) {
                ret++;
            }
            c[elem]++;
        }
    }
    return ret;
}

int main() {
    unsigned int myElements[10] = {0, 25, 42, 42, 1, 2, 42, 0, 24, 24};
    printf("Distinct elements : %u\n", CountDistinctElements(myElements, 10));
    return 0;
}

Output : (Ideone link)

Distinct elements : 6

Write a program to find 100 largest numbers out of an array of 1 billion numbers

You can keep a priority queue of the 100 biggest numbers, iterate through the 1 billion numbers. Whenever you encounter a number greater than the smallest number in the queue (the head of the queue), remove the head of the queue and add the new number to the queue.

A priority queue implemented with a heap has insert + delete complexity of O(log K). (Where K = 100, the number of elements to find. N = 1 billion, the number of total elements in the array).

In the worst case you get billion*log2(100), which is better than the billion*log2(billion) of an O(N log N) comparison-based sort (see footnote 1).

In general, if you need the largest K numbers from a set of N numbers, the complexity is O(N log K) rather than O(N log N); this can be very significant when K is very small compared to N.
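The loop described above maps directly onto a min-heap; as a sketch in Python (`top_k` is a hypothetical name; the standard library's `heapq.nlargest` does essentially the same thing):

```python
import heapq

def top_k(numbers, k):
    heap = []  # min-heap holding the k largest elements seen so far
    for x in numbers:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:  # larger than the smallest of the current top k
            heapq.heapreplace(heap, x)  # pop the head, push x
    return sorted(heap, reverse=True)

print(top_k(range(10**6), 5))  # [999999, 999998, 999997, 999996, 999995]
```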


The expected time of this priority queue algorithm is pretty interesting, since in each iteration an insertion may or may not occur.

The probability that the i-th number gets inserted into the queue is the probability that a random variable is larger than at least i-k random variables from the same distribution (the first k numbers are automatically added to the queue). We can use order statistics (see link) to calculate this probability.

For example, let's assume the numbers were selected uniformly at random from [0, 1]. The expected value of the (i-k)-th number (out of i numbers) is (i-k)/i, and the chance of a random variable being larger than this value is 1 - (i-k)/i = k/i.

Thus, the expected number of insertions is:

E[insertions] = sum over i from k+1 to n of k/i  ≈  k * ln(n/k)

And the expected running time can be expressed as:

E[time] = k + (n - k) + (sum over i from k+1 to n of k/i) * log(k)/2

(k time to generate the queue with the first k elements, then n-k comparisons, and the expected number of insertions as described above, each takes an average log(k)/2 time)

Note that when N is very large compared to K, this expression is a lot closer to N than to N log K. This is somewhat intuitive: in the case of the question, even after 10,000 iterations (which is very small compared to a billion), the chance of a number being inserted into the queue is very small.

But we don't know that the array values are uniformly distributed. They might trend towards increasing, in which case most or all numbers will be new candidates for the set of 100 largest numbers seen. The worst case for this algorithm is O(N log K).

Or if they trend towards decreasing, most of the largest 100 numbers will be very early, and our best-case run time is essentially O(N + K log K), which is just O(N) for K much smaller than N.


Footnote 1: O(N) integer sorting / histogramming

Counting Sort or Radix Sort are both O(N), but often have larger constant factors that make them worse than comparison sorts in practice. In some special cases they're actually quite fast, primarily for narrow integer types.

For example, Counting Sort does well if the numbers are small. 16-bit numbers would only need an array of 2^16 counters. And instead of actually expanding back into a sorted array, you could just scan the histogram you build as part of Counting Sort.

After histogramming an array, you can quickly answer queries for any order statistic, e.g. the 99th largest number, or the 100th through 200th largest numbers. 32-bit numbers would scatter the counts over a much larger array or hash table of counters, potentially needing 16 GiB of memory (4 bytes for each of 2^32 counters), and on real CPUs would probably incur lots of TLB and cache misses, unlike an array of 2^16 elements where L2 cache would typically hit.
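A sketch of that histogram scan for narrow integer values (`largest_k` is a hypothetical helper; it returns the k largest values, duplicates included, without ever building a fully sorted array):

```python
def largest_k(values, k, maxval=2**16):
    counts = [0] * maxval  # one counter per possible value (16-bit here)
    for v in values:
        counts[v] += 1
    out = []
    for v in range(maxval - 1, -1, -1):  # scan the histogram from the top
        if counts[v]:
            out.extend([v] * min(counts[v], k - len(out)))
            if len(out) == k:
                break
    return out

print(largest_k([3, 1, 4, 1, 5, 9, 2, 6], 3))  # [9, 6, 5]
```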

Similarly, Radix Sort could look at only the top buckets after a first pass. But the constant factors may still be larger than log K, depending on K.

Note that each counter must be large enough not to overflow even if all N integers are duplicates. 1 billion is somewhat below 2^30, so a 30-bit unsigned counter would be sufficient; a 32-bit signed or unsigned integer is just fine.

If you had many more numbers, you might need 64-bit counters, doubling the memory footprint to initialize to zero and to access randomly. Or you could use a sentinel value for the few counters that overflow a 16- or 32-bit integer, indicating that the rest of the count is kept elsewhere (in a small dictionary such as a hash table mapping to 64-bit counters).

How to find the minimum number of operation(s) to make the string balanced?

I don't think you really need dynamic programming here.

One approach in O(length(S)) time:

  • Iterate over S, build a frequency-map (a mapping from distinct letters A–Z to counts). For your ABCB example, that would be A->1 B->2 C->1 D->0 E->0 ... Z->0, which we can represent as the array [1, 2, 1, 0, 0, ..., 0].
    • We can do this because we don't actually care about the positions of letters at all; there's no real difference between ABCB and ABBC, in that each can be balanced by replacing its C with an A.
  • Sort the array. For your example, that gives [0, 0, ..., 0, 1, 1, 2].
    • We can do this because we don't actually care about which letter was which; there's no real difference between ABCB and ABDB, in that each can be balanced by replacing one of its singleton letters with its other one.
  • Assuming the string is nonempty (since if it's empty then the answer is just 0), the final balanced string will contain at least 1 and at most 26 distinct letters. For each integer i between 1 and 26, determine how many operations you'd need to perform in order to produce a balanced string with i distinct letters:
    • First, check that length(S) is a multiple of i; if not, this isn't possible, so skip to the next integer.
    • Next, compute length(S)/i, the count of each distinct letter in the final balanced string.
    • To count how many operations need to be performed, we're going to examine all the counts that need to be increased. (We could, equivalently, examine the counts that need to be decreased: they'll have to match.)
    • We're only interested in the last i elements of the sorted array. Any elements before that point represent letters that won't occur in the balanced string: either the counts are already zero and will remain so, or they are nonzero but will be decreased to zero. Either way, since we're only tracking increases, we can ignore them.
    • For each of the last i counts that are less than length(S)/i, add the difference. This sum is the number of operations. (Note that, since the counts are sorted, you can exit this inner loop as soon as you get to a count that's greater than or equal to the target count.)
    • You can exit this loop after the first i that's greater than or equal to the number of distinct letters in the original S (aside from values of i that we had to skip because they don't evenly divide length(S)). For example, if length(S) = 100, and the original S has 19 distinct letters, then we only need to consider i as high as 20. (Hat-tip to Eric Wang for suggesting something along these lines.)
  • Return the least of these up-to-26 sums. (Note that you don't actually need to store all the sums; you can just keep track of the minimum.)
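Put together, the steps above can be sketched as follows (`min_ops` is a hypothetical name; this version skips the early-exit optimizations for brevity):

```python
from collections import Counter

def min_ops(s):
    n = len(s)
    if n == 0:
        return 0
    # frequency map, padded out to 26 letters and sorted ascending
    counts = sorted(Counter(s).values())
    counts = [0] * (26 - len(counts)) + counts
    best = n
    for i in range(1, 27):  # i = number of distinct letters in the result
        if n % i:           # length must be divisible by i
            continue
        target = n // i     # count of each letter in the balanced string
        # sum the increases needed among the i most frequent letters
        ops = sum(max(0, target - c) for c in counts[-i:])
        best = min(best, ops)
    return best

print(min_ops("ABCB"))  # 1
```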

