Automated Formula Construction

Automated formula construction

reformulate(), a nifty function for creating formulas from character vectors, might come in handy. Here's an example of what it does:

reformulate(response="Y", termlabels=c("X1", "X2", "X3"))
# Y ~ X1 + X2 + X3

And here's how you might use it in practice. (Note that I here create the formulas inside of the lm() calls. Because formula objects carry with them info about the environment they were created in, I'd be a bit hesitant to create them outside of the lm() call within which you actually want to use them.):

evars <- names(mtcars)[2:5]
ii <- lapply(1:4, seq_len)

lapply(ii, 
       function(X) {
          coef(lm(reformulate(response="mpg", termlabels=evars[X]), data=mtcars))
})
# [[1]]
# (Intercept)         cyl 
#    37.88458    -2.87579 
# 
# [[2]]
# (Intercept)         cyl        disp 
# 34.66099474 -1.58727681 -0.02058363 
# 
# [[3]]
# (Intercept)         cyl        disp          hp 
# 34.18491917 -1.22741994 -0.01883809 -0.01467933 
# 
# [[4]]
# (Intercept)         cyl        disp          hp        drat 
# 23.98524441 -0.81402201 -0.01389625 -0.02317068  2.15404553

Variable length formula construction

How about something like:

diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))

simp <- function(x, species) { 
  spcs <- grep(species, colnames(x)) # which column names have "nuse"
  total <- rowSums(x[,spcs]) # sum by row
  div <- round(1 - rowSums(apply(x[,spcs], 2, function(s) s*(s-1))) / (total*(total - 1)), digits = 4)
  return(div)
}

diverse$Simpson2 <- simp(diverse, species = "nuse")
diverse

# nuse1 nuse2 nuse3 nuse4 total Simpson2
# 1     0     5     0     5    10   0.5556
# 2    20     5     2     8    35   0.6151
# 3    40     3     8     2    53   0.4107
# 4    20    20    20    20    80   0.7595

All it does is find out which columns start with "nuse" or any other species you have in your dataset. It constructs the "total" value within the function and does not require a total column in the dataset.

SymPy: automatic construction of equations

The expressions above are sums of products of symbols, in sympy instances of Add

From the documentation:

All symbolic things are implemented using subclasses of the Basic
class. First, you need to create symbols using Symbol("x") or numbers
using Integer(5) or Float(34.3). Then you construct the expression
using any class from SymPy. For example Add(Symbol("a"), Symbol("b"))
gives an instance of the Add class. You can call all methods, which
the particular class supports.

In your example you need to declare x1, x2, x3 as symbols:

x1, x2, x3 = symbols('x1 x2 x3')

Once the symbols are defined, sympy converts strings automatically to its own expressions for computing. To check the type of sympy expressions use sympify

a = Symbol("a")
b = Symbol("b")
c = 'a**2 + b'
print c
(a + b)**2
type(c)
# string

from sympy import sympify
sympify(c)
type(sympify(c))
# <′...′>

How use to write programme that involve with a lot of active calculation? In excel 1M+ Row and 20+ column

I can't believe I'm about to say this (for most things I do it would be the wrong choice) but:

If the computations aren't that complex (just lots of them) Python might be a good bet.

If you can get the input as a CSV file than, for about 10 lines of code, you can write a loop that will be run for each line of input and hands you the values to play with.

for line in open('filename', 'w')
  values = line.split(',')
  #values has the values from this line as strings.
  #these can be converted to numbers:

  x = float(values[0])
  n = int(values[1])
  #... and then processed

That might not be the cleanest/best approach but it's simple and straight forward.

p.s. For 1M+ rows, don't expect it to be blazing fast (10 sec to a min or so, depending on what you do to the data)

Google Sheets: Arrayformula for summarizing textjoined arrays (query/filter...)

solution

Try this unique arrayformula that will take into account added lines (German Notation)

=ARRAY_CONSTRAIN( transpose({transpose(unique(A2:A));arrayformula(trim(query(arrayformula(if(A2:A=transpose(unique(A2:A));B2:B&",";));;9^9)))});counta(unique(A2:A));2)

Sample Image

explanation about the construction

the most important step is in D2 (US notation)=arrayformula(if(A2:A10=transpose(unique(A2:A10)),B2:B10,)), then apply query(,,9*9) to gather all items in each column

Sample Image

How to automatically proof that two first-order formulas are equivalent?

As mentioned, to prove that F <=> G where both are closed (universally quantified) formulas, you need to prove F => G and also G => F. To prove each of these two formulas, you can use various calculi. I'll describe [resolution calculus]:

Negate the conjecture, so F => G becomes F & -G.
Convert to CNF.
Run resolution procedure.
If you derive an empty clause, you've proved the original conjecture F => G. If the search saturates and no more new clauses can be derived, the conjecture doesn't hold.

Under your conditions, all atomic formulas coming from F will be predicate symbols applied only to variables and all atomic formulas from G will be predicate symbols applied only toto skolem constants. So the resolution procedure will only produce substitutions that either map variables to other variables, or variables to those skolem constants. This implies that it can only derive a finite amount of distinct literals, and so the resolution procedure will always stop - it will be decidable.

You can also use automated tool for this purpose that will do all that work for you. I use The E Theorem Prover for such problems. As the input language I use the language of The TPTP Problem Library, which is easy to read/write for humans.

To give an example: Input file:

fof(my_formula_name, conjecture, (![X]: p(X)) <=> (![Y]: p(Y)) ).

then I run

eprover --tstp-format -xAuto -tAuto myfile

(-tAuto and -xAuto do some auto-configurations, most likely not needed in your case), and the result is

# Garbage collection reclaimed 59 unused term cells.

# Auto-Ordering is analysing problem.
# Problem is type GHNFGFFSF00SS
# Auto-mode selected ordering type KBO6
# Auto-mode selected ordering precedence scheme <invfreq>
# Auto-mode selected weight ordering scheme <precrank20>
#
# Auto-Heuristic is analysing problem.
# Problem is type GHNFGFFSF00SS
# Auto-Mode selected heuristic G_E___107_C41_F1_PI_AE_Q4_CS_SP_PS_S0Y
# and selection function SelectMaxLComplexAvoidPosPred.
#
# No equality, disabling AC handling.
#
# Initializing proof state
#
#cnf(i_0_2,negated_conjecture,(~p(esk1_0)|~p(esk2_0))).
#
#cnf(i_0_1,negated_conjecture,(p(X1)|p(X2))).
# Presaturation interreduction done
#
#cnf(i_0_2,negated_conjecture,(~p(esk1_0)|~p(esk2_0))).
#
#cnf(i_0_1,negated_conjecture,(p(X2)|p(X1))).
#
#cnf(i_0_3,negated_conjecture,(p(X3))).

# Proof found!
# SZS status Theorem
# Parsed axioms                        : 1
# Removed by relevancy pruning         : 0
# Initial clauses                      : 2
# Removed in clause preprocessing      : 0
# Initial clauses in saturation        : 2
# Processed clauses                    : 5
# ...of these trivial                  : 0
# ...subsumed                          : 0
# ...remaining for further processing  : 5
# Other redundant clauses eliminated   : 0
# Clauses deleted for lack of memory   : 0
# Backward-subsumed                    : 1
# Backward-rewritten                   : 1
# Generated clauses                    : 4
# ...of the previous two non-trivial   : 4
# Contextual simplify-reflections      : 0
# Paramodulations                      : 2
# Factorizations                       : 2
# Equation resolutions                 : 0
# Current number of processed clauses  : 1
#    Positive orientable unit clauses  : 1
#    Positive unorientable unit clauses: 0
#    Negative unit clauses             : 0
#    Non-unit-clauses                  : 0
# Current number of unprocessed clauses: 0
# ...number of literals in the above   : 0
# Clause-clause subsumption calls (NU) : 0
# Rec. Clause-clause subsumption calls : 0
# Unit Clause-clause subsumption calls : 1
# Rewrite failures with RHS unbound    : 0
# Indexed BW rewrite attempts          : 4
# Indexed BW rewrite successes         : 4
# Unification attempts                 : 12
# Unification successes                : 9
# Backwards rewriting index :     2 leaves,   1.00+/-0.000 terms/leaf
# Paramod-from index        :     1 leaves,   1.00+/-0.000 terms/leaf
# Paramod-into index        :     1 leaves,   1.00+/-0.000 terms/leaf

where the most important lines are

# Proof found!
# SZS status Theorem

Nested lapply with substitute in R

Thanks for the suggestion Josh O'Brien! This worked:

DVs <- c('mpg', 'wt')
IVs <- c('disp', 'hp')

lapply(DVs, function(x) lapply(IVs, function(y) {lm(reformulate(response=x, termlabels=y), data=mtcars)}))

I couldn't figure out how to scale my data within the call, but I can just scale my whole dataframe:

mtcarsSC <- as.data.frame(scale(mtcars))

DVs <- c('mpg', 'wt')
IVs <- c('disp', 'hp')

lapply(DVs, function(x) lapply(IVs, function(y) {lm(reformulate(response=x, termlabels=y), data=mtcarsSC)}))

Automated Formula Construction