Why Does 1..99,999 == "1".."99,999" in R, But 100,000 != "100,000"

Why does 1..99,999 == "1".."99,999" in R, but 100,000 != "100,000"?

Have a look at as.character(100000). Its value is "1e+05", not "100000", and R is essentially just telling you so.

as.character(100000)
# [1] "1e+05"

Here, from ?Comparison, are R's rules for applying relational operators to values of different types:

If the two arguments are atomic vectors of different types, one is
coerced to the type of the other, the (decreasing) order of
precedence being character, complex, numeric, integer, logical and
raw.

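Those rules mean that when you test whether 1=="1", say, R first converts the numeric value on the LHS to a character string, and then tests the character strings on the LHS and RHS for equality. In some cases those will be equal, but in other cases they will not. Which cases produce inequality depends on the current settings of options("scipen") and options("digits"). With the default settings, 100000 is the first whole number whose scientific form ("1e+05", five characters) is shorter than its fixed form ("100000", six characters), so that is where the conversion switches to scientific notation.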

So, when you type 100000=="100000", it is as if you were actually performing the following test. (Note that internally, R may well use something other than as.character() to perform the conversion.)

as.character(100000)=="100000"
# [1] FALSE

Why does "one" < 2 equal FALSE in R?

From help("<"):

If the two arguments are atomic vectors of different types, one is
coerced to the type of the other, the (decreasing) order of precedence
being character, complex, numeric, integer, logical and raw.

So in this case, the numeric is of lower precedence than the character. So 2 is coerced to the character "2". Comparison of strings in character vectors is lexicographic which, as I understand it, is alphabetic but locale-dependent.

Why is the expression "1"==1 evaluating to TRUE?

From the help("=="):

If the two arguments are atomic vectors of different types, one is
coerced to the type of the other, the (decreasing) order of precedence
being character, complex, numeric, integer, logical and raw.

So the numeric 1 is converted to the character "1", and the comparison becomes "1" == "1".

random forest by group processing/scoring

Assuming your sales dataset is 3,000 * 300 = 900,000 rows and both dataframes have a customer_id column, you can do something like:

library(randomForest)

pred_groups <- split(seq_len(nrow(sales_score)), sales_score$customer_id)
# pred_groups is now a list whose names are the customer_ids and whose
# elements are integer vectors of row numbers in sales_score. Now iterate
# over each customer, train on that customer's rows in sales, and make
# predictions for that customer's rows in sales_score.
preds <- unsplit(structure(lapply(names(pred_groups), function(customer_id) {
  # Train using only observations for this customer.
  # Note we are comparing character to integer but R's natural type
  # coercion should still give the correct answer.
  train_rows <- sales$customer_id == customer_id
  sales.rf <- randomForest(Sales ~ ., ntree = 500,
                           data = sales[train_rows, ],importance=TRUE)

  # Now make predictions only for this customer.
  predict(sales.rf, sales_score[pred_groups[[customer_id]], ])
}), .Names = names(pred_groups)), sales_score$customer_id)

print(head(preds)) # Should now be a vector of predicted scores whose length
  # is the number of rows in sales_score.

Edit: Per @joran, here is a solution with a for loop:

pred_groups <- split(seq_len(nrow(sales_score)), sales_score$customer_id)
preds <- numeric(nrow(sales_score))
for(customer_id in names(pred_groups)) {
  train_rows <- sales$customer_id == customer_id
  sales.rf <- randomForest(Sales ~ ., ntree = 500,
                           data = sales[train_rows, ],importance=TRUE)
  pred_rows <- pred_groups[[customer_id]]
  preds[pred_rows] <- predict(sales.rf, sales_score[pred_rows, ])
}

My FB HackerCup code too slow for large inputs

Here is my O(k) solution, which is based on the same idea as above, but runs much faster.

import os, sys

f = open(sys.argv[1], 'r')

T = int(f.readline())

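def next(ary, start):
    # Scan forward from `start` while the entries are non-zero and return the
    # index of the last non-zero entry in that run (start - 1 if ary[start]
    # is zero or out of range). Note that this shadows the builtin next().
    j = start
    l = len(ary)
    ret = start - 1
    while j < l and ary[j]:
        ret = j
        j += 1
    return ret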

for t in range(T):
    n, k = map(int, f.readline().strip().split(' '))
    a, b, c, r = map(int, f.readline().strip().split(' '))

    m = [0] * (4 * k)
    s = [0] * (k+1)
    m[0] = a
    if m[0] <= k:
        s[m[0]] = 1
    for i in xrange(1, k):
        m[i] = (b * m[i-1] + c) % r
        if m[i] < k+1:
            s[m[i]] += 1

    p = next(s, 0)
    m[k] = p + 1
    p = next(s, p+2)

    for i in xrange(k+1, n):
        if m[i-k-1] > p or s[m[i-k-1]] > 1:
            m[i] = p + 1
            if m[i-k-1] <= k:
                s[m[i-k-1]] -= 1
            s[m[i]] += 1
            p = next(s, p+2)
        else:
            m[i] = m[i-k-1]
        if p == k:
            break

    if p != k:
        print 'Case #%d: %d' % (t+1, m[n-1])
    else:
        print 'Case #%d: %d' % (t+1, m[i-k + (n-i+k+k) % (k+1)])

The key point is that m[i] never exceeds k for i >= k. If p tracks the largest value such that every number from 0 to p appears among the previous k numbers, then p never decreases.

If m[i-k-1] is larger than p, then m[i] is obviously p+1, and p increases by at least 1.

If m[i-k-1] is smaller than or equal to p, we have to check whether the same number still appears in m[i-k:i]. If it does not, m[i] equals m[i-k-1]; if it does, m[i] is p+1, just as in the case where m[i-k-1] is larger than p.

Once p equals k, the sequence starts repeating with period k+1, so we can stop computing and print the answer directly.
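The period-(k+1) claim can be checked with a small brute-force sketch. This is not the O(k) solution above: it assumes the problem is "each m[i] for i >= k is the smallest non-negative integer missing from the previous k values", and the names mex_sequence and nth_value are made up for illustration. The prefix is built naively, so it only suits small k.

```python
def mex_sequence(first_k, n):
    """Brute force: extend first_k to n values, each new value being the
    smallest non-negative integer absent from the previous k values."""
    k = len(first_k)
    m = list(first_k)
    for i in range(k, n):
        window = set(m[i - k:i])
        v = 0
        while v in window:
            v += 1
        m.append(v)
    return m


def nth_value(first_k, n):
    """Return m[n-1] using the period-(k+1) shortcut.

    m[k..2k] holds k+1 distinct values from 0..k, so it is one full period
    and every later value repeats it.
    """
    k = len(first_k)
    if n - 1 < k:
        return first_k[n - 1]
    prefix = mex_sequence(first_k, 2 * k + 1)  # indices 0..2k
    return prefix[k + (n - 1 - k) % (k + 1)]


if __name__ == "__main__":
    start = [2, 0, 5]
    print(mex_sequence(start, 11))
    print(nth_value(start, 10**9))
```

For the starting values [2, 0, 5] the generated part settles into the repeating block 1, 2, 0, 3, so any index beyond the prefix can be answered with a single modulo.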

How to round an integer to the closest hundred?

Try the Math.Round method. Here's how:

Math.Round(76d / 100d, 0) * 100;    // 100
Math.Round(121d / 100d, 0) * 100;   // 100
Math.Round(9660d / 100d, 0) * 100;  // 9700
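The same divide, round, multiply idea can be sketched in Python rather than C#, mainly to show what happens at exact midpoints such as 250. Python's round(), like Math.Round with its default midpoint mode, sends exact halves to the nearest even value. The two function names are hypothetical, and the half-up variant assumes non-negative input.

```python
def round_to_hundred_half_even(n):
    # Divide, round, multiply back. Exact midpoints go to the even
    # neighbour, so 250 becomes 200 and 350 becomes 400.
    return int(round(n / 100)) * 100


def round_to_hundred_half_up(n):
    # Integer-only variant for non-negative n: midpoints always round up.
    return (n + 50) // 100 * 100


if __name__ == "__main__":
    for value in (76, 121, 9660, 250, 350):
        print(value, round_to_hundred_half_even(value), round_to_hundred_half_up(value))
```

If 250 should become 300, the integer variant avoids the midpoint rule altogether. In C#, passing MidpointRounding.AwayFromZero to Math.Round is the usual way to get that behaviour.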

knitr vs. interactive R behaviour

Thanks to Aleksey Vorona and Duncan Murdoch, this bug is now fixed in R-devel!

See: https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=15411
