Splitting (1:N)[Boolean] into Contiguous Sequences

How do I find ranges of successive numbers in a vector in R

From @eddi's answer to my similar question...

runs = split(seq_along(data), cumsum(c(0, diff(data) > 1)))
lapply(runs[lengths(runs) > 1], range)

# $`2`
# [1] 3 6
#
# $`4`
# [1] 8 13

How it works:

  • seq_along(data) are the indices of data, from 1..length(data)
  • c(0, diff(data) > 1) has a 1 at each index where data "jumps" (increases by more than 1)
  • cumsum(c(0, diff(data) > 1)) is an identifier for consecutive runs between jumps

So runs is a division of data's indices into runs where data's values are consecutive.
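The same diff/cumsum idea translates to plain Python (a sketch of my own, not from the original answer): start a new run at every jump greater than 1, then group values by their run id.

```python
from itertools import groupby

def split_runs(data):
    """Split a sorted list of integers into runs of consecutive values."""
    run_id = 0
    keyed = []
    for i, v in enumerate(data):
        if i > 0 and v - data[i - 1] > 1:   # a "jump": start a new run
            run_id += 1
        keyed.append((run_id, v))
    return [[v for _, v in grp] for _, grp in groupby(keyed, key=lambda kv: kv[0])]

data = [1, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13]
runs = split_runs(data)
# runs -> [[1], [3, 4, 5, 6], [8, 9, 10, 11, 12, 13]]
ranges = [(r[0], r[-1]) for r in runs if len(r) > 1]
# ranges -> [(3, 6), (8, 13)], matching the R output above
```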

How to group sequential event time sequences (with breaks between events) to find duration of events

Suppose first we have just Bob's rows of the dataframe, called bob.
We will assume bob is already ordered by event, increasing.

Along the same lines as you mentioned (looking at diff(event) > 1), you can additionally use cumsum to group each event to the 'run' of events it belongs to:

library(plyr)
bob2 <- mutate(bob, start = c(1, diff(bob$event) > 1), run=cumsum(start))
   event seconds person start run
1      1     0.0    Bob     1   1
2      2    15.0    Bob     0   1
3      3    28.5    Bob     0   1
4     41   256.0    Bob     1   2
5     42   261.0    Bob     0   2
6     43   266.0    Bob     0   2
7     44   268.5    Bob     0   2
8     45   272.0    Bob     0   2
9     46   273.0    Bob     0   2
10    49   569.0    Bob     1   3
11    80   570.5    Bob     1   4
12    81   581.0    Bob     0   4

start indicates whether a row starts a new run of sequential events, and run identifies which run each row belongs to.

Then you can just find the duration:

ddply(bob2, .(run), summarize, length=diff(range(seconds)))
  run length
1   1   28.5
2   2   17.0
3   3    0.0
4   4   10.5

Now supposing you have your original dataframe with everyone mixed together in it, we can use ddply again to split it up by person:

tmp <- ddply(df, .(person), transform, run=cumsum(c(1, diff(event) != 1)))
ddply(tmp, .(person, run), summarize, length=diff(range(seconds)), start_event=head(event, 1), end_event=tail(event, 1))

  person run length start_event end_event
1   Anne   1    3.0           8         9
2    Bob   1   28.5           1         3
3    Bob   2   17.0          41        46
4    Bob   3    0.0          49        49
5    Bob   4   10.5          80        81
6    Joe   1   10.5           4         7

Note: my df is your bob table rbind-ed to your other table, then unique()d (just to show it works when there is more than one run per person).
There is probably a clever way to do this that combines the two ddply calls (or uses the dplyr pipe syntax, which I am not familiar with), but I do not know what it is.
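The same two-step grouping can be sketched with only the Python standard library (my own translation of the logic above; the tuple layout (person, event, seconds) is an assumption made for illustration):

```python
from itertools import groupby

# rows: (person, event, seconds), assumed already sorted by person, then event
rows = [
    ("Bob", 1, 0.0), ("Bob", 2, 15.0), ("Bob", 3, 28.5),
    ("Bob", 41, 256.0), ("Bob", 42, 261.0), ("Bob", 46, 273.0),
]

def runs_per_person(rows):
    out = []
    for person, grp in groupby(rows, key=lambda r: r[0]):
        grp = list(grp)
        run, start_idx = 1, 0
        for i in range(1, len(grp) + 1):
            # close the current run at a jump in event numbers, or at the end
            if i == len(grp) or grp[i][1] - grp[i - 1][1] != 1:
                chunk = grp[start_idx:i]
                out.append({
                    "person": person,
                    "run": run,
                    "length": chunk[-1][2] - chunk[0][2],
                    "start_event": chunk[0][1],
                    "end_event": chunk[-1][1],
                })
                run += 1
                start_idx = i
    return out
```

With the sample rows above this yields three runs for Bob: events 1-3 (length 28.5), 41-42 (length 5.0), and 46 alone (length 0.0).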

How to find the groups of consecutive elements in a NumPy array

Here's a lil func that might help:

def group_consecutives(vals, step=1):
    """Return list of consecutive lists of numbers from vals (number list)."""
    run = []
    result = [run]
    expect = None
    for v in vals:
        if (v == expect) or (expect is None):
            run.append(v)
        else:
            run = [v]
            result.append(run)
        expect = v + step
    return result

>>> group_consecutives(a)
[[0], [47, 48, 49, 50], [97, 98, 99]]
>>> group_consecutives(a, step=47)
[[0, 47], [48], [49], [50, 97], [98], [99]]

P.S. This is pure Python. For a NumPy solution, see unutbu's answer.

How can I partition (split up, divide) a list based on a condition?

good = [x for x in mylist if x in goodvals]
bad = [x for x in mylist if x not in goodvals]

How can I do this more elegantly?

That code is already perfectly elegant.

There might be slight performance improvements from using sets, but the difference is trivial. Set-based approaches also discard duplicates and do not preserve the order of elements. I find the list comprehensions far easier to read, too.
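If goodvals is large, one compromise is to build a set once for faster membership tests while keeping the comprehensions (a sketch, assuming the elements are hashable):

```python
mylist = [1, 2, 5, 8, 3, 9]
goodvals = [1, 2, 3, 4]

goodset = set(goodvals)          # O(1) average-case membership tests
good = [x for x in mylist if x in goodset]
bad = [x for x in mylist if x not in goodset]
# good -> [1, 2, 3], bad -> [5, 8, 9]
```

Note that only the lookup structure is a set here; the outputs are still lists, so order and duplicates in mylist are preserved.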

In fact, we could even more simply just use a for loop:

good, bad = [], []

for x in mylist:
    if x in goodvals:
        good.append(x)
    else:
        bad.append(x)

This approach makes it easier to add additional logic. For example, the code is easily modified to discard None values:

good, bad = [], []

for x in mylist:
    if x is None:
        continue
    if x in goodvals:
        good.append(x)
    else:
        bad.append(x)

Transforming a numeric vector into list of intervals

It is not clear what you are looking to do, but here is how you can reproduce the expected result:

lapply(split(x, c(0, cumsum(diff(x) != 1))),
       function(y) if (length(y) > 2) c(y[1], tail(y, 1)) else y)

$`0`
[1] 1

$`1`
[1] 3 5

$`2`
[1] 8 10

$`3`
[1] 13
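For comparison, the same transformation can be sketched in plain Python (compress_runs is my own helper name, not from the answer): runs of more than two consecutive values collapse to their [start, end] endpoints, while shorter runs are kept verbatim.

```python
def compress_runs(x):
    """Collapse each run of consecutive integers into [start, end] if it has
    more than two elements; shorter runs are returned as-is."""
    runs, current = [], [x[0]]
    for prev, v in zip(x, x[1:]):
        if v - prev == 1:
            current.append(v)
        else:
            runs.append(current)
            current = [v]
    runs.append(current)
    return [[r[0], r[-1]] if len(r) > 2 else r for r in runs]

x = [1, 3, 4, 5, 8, 9, 10, 13]
# compress_runs(x) -> [[1], [3, 5], [8, 10], [13]]
```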

Split string with delimiters in C

You can use the strtok() function to split a string (and specify the delimiter to use). Note that strtok() will modify the string passed into it. If the original string is required elsewhere make a copy of it and pass the copy to strtok().

EDIT:

Example (note it does not handle consecutive delimiters, "JAN,,,FEB,MAR" for example):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

char** str_split(char* a_str, const char a_delim)
{
    char** result    = 0;
    size_t count     = 0;
    char* tmp        = a_str;
    char* last_comma = 0;
    char delim[2];
    delim[0] = a_delim;
    delim[1] = 0;

    /* Count how many elements will be extracted. */
    while (*tmp)
    {
        if (a_delim == *tmp)
        {
            count++;
            last_comma = tmp;
        }
        tmp++;
    }

    /* Add space for trailing token. */
    count += last_comma < (a_str + strlen(a_str) - 1);

    /* Add space for terminating null string so caller
       knows where the list of returned strings ends. */
    count++;

    result = malloc(sizeof(char*) * count);

    if (result)
    {
        size_t idx  = 0;
        char* token = strtok(a_str, delim);

        while (token)
        {
            assert(idx < count);
            *(result + idx++) = strdup(token);
            token = strtok(0, delim);
        }
        assert(idx == count - 1);
        *(result + idx) = 0;
    }

    return result;
}

int main()
{
    char months[] = "JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC";
    char** tokens;

    printf("months=[%s]\n\n", months);

    tokens = str_split(months, ',');

    if (tokens)
    {
        int i;
        for (i = 0; *(tokens + i); i++)
        {
            printf("month=[%s]\n", *(tokens + i));
            free(*(tokens + i));
        }
        printf("\n");
        free(tokens);
    }

    return 0;
}

Output:

$ ./main.exe
months=[JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC]

month=[JAN]
month=[FEB]
month=[MAR]
month=[APR]
month=[MAY]
month=[JUN]
month=[JUL]
month=[AUG]
month=[SEP]
month=[OCT]
month=[NOV]
month=[DEC]

How to test if a list contains another list as a contiguous subsequence?

Here is my version:

def contains(small, big):
    for i in xrange(len(big) - len(small) + 1):
        for j in xrange(len(small)):
            if big[i + j] != small[j]:
                break
        else:
            return i, i + len(small)
    return False

It returns a tuple of (start, end+1) since I think that is more pythonic, as Andrew Jaffe points out in his comment. It does not slice any sublists so should be reasonably efficient.

One point of interest for newbies is that it uses the else clause on the for statement - this is not something I use very often but can be invaluable in situations like this.

This is identical to finding substrings in a string, so for large lists it may be more efficient to implement something like the Boyer-Moore algorithm.

Note: If you are using Python 3, change xrange to range.
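For instance, under Python 3 (with range in place of xrange) the function behaves like this; the definition is restated here so the example is self-contained:

```python
def contains(small, big):
    """Return (start, end+1) if small occurs contiguously in big, else False."""
    for i in range(len(big) - len(small) + 1):
        for j in range(len(small)):
            if big[i + j] != small[j]:
                break
        else:
            return i, i + len(small)
    return False

contains([2, 3], [1, 2, 3, 4])   # -> (1, 3): big[1:3] == [2, 3]
contains([9], [1, 2, 3])         # -> False
```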

How to split text without spaces into list of words

A naive algorithm won't give good results when applied to real-world data. Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-world text.

(If you want an answer to your original question which does not use word frequency, you need to refine what exactly is meant by "longest word": is it better to have a 20-letter word and ten 3-letter words, or is it better to have five 10-letter words? Once you settle on a precise definition, you just have to change the line defining wordcost to reflect the intended meaning.)

The idea

The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is the word with rank n in the list of words has probability roughly 1/(n log N) where N is the number of words in the dictionary.

Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probability of each individual word, and it's easy to compute it with dynamic programming. Instead of directly using the probability we use a cost defined as the logarithm of the inverse of the probability to avoid overflows.

The code

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1, len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i > 0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

which you can use with

s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))

The results

I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.

Before: thumbgreenappleactiveassignmentweeklymetaphor.

After: thumb green apple active assignment weekly metaphor.

Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearen
odelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho
rapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquery
whetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.

After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.

Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.

After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.

As you can see it is essentially flawless. The most important part is to make sure your word list was trained to a corpus similar to what you will actually encounter, otherwise the results will be very bad.


Optimization

The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.

If you need to process a very large consecutive string, it would be reasonable to split the string to avoid excessive memory usage. For example, you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This keeps memory usage to a minimum and will almost certainly have no effect on the quality.


