SQL Join or R's Merge() Function in Numpy

If you want to use only NumPy, you can use structured arrays and the numpy.lib.recfunctions.join_by function (see http://pyopengl.sourceforge.net/pydoc/numpy.lib.recfunctions.html). A small example:

In [1]: import numpy as np
...: import numpy.lib.recfunctions as rfn
...: a = np.array([(1, 10.), (2, 20.), (3, 30.)], dtype=[('id', int), ('A', float)])
...: b = np.array([(2, 200.), (3, 300.), (4, 400.)], dtype=[('id', int), ('B', float)])

In [2]: rfn.join_by('id', a, b, jointype='inner', usemask=False)
Out[2]:
array([(2, 20.0, 200.0), (3, 30.0, 300.0)],
      dtype=[('id', '<i4'), ('A', '<f8'), ('B', '<f8')])

Another option is to use pandas. I have no experience with it, but it provides more powerful data structures and functionality than plain NumPy, aiming "to make working with “relational” or “labeled” data both easy and intuitive". It certainly has joining and merging functions; see, for example, http://pandas.sourceforge.net/merging.html#joining-on-a-key.
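
A minimal sketch of what that looks like (the frames here are hypothetical, mirroring the structured arrays above):

import pandas as pd

df_a = pd.DataFrame({'id': [1, 2, 3], 'A': [10.0, 20.0, 30.0]})
df_b = pd.DataFrame({'id': [2, 3, 4], 'B': [200.0, 300.0, 400.0]})

# pd.merge defaults to an inner join, like join_by above
print(pd.merge(df_a, df_b, on='id'))
#    id     A      B
# 0   2  20.0  200.0
# 1   3  30.0  300.0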

Numpy: Simple Join of 2 Arrays

If b contains every consecutive id (i.e. its first column runs 1, 2, ..., n):

import numpy as np

# hypothetical example data (the original arrays are not shown):
a = np.array([[1, 11], [3, 33]])              # [id, value]; ids may have gaps
b = np.array([[1, 100], [2, 200], [3, 300]])  # [id, value]; consecutive ids
z = np.zeros((b.shape[0], 1), dtype=int)
c = np.hstack((b, z))
ai = a[:, 0] - 1    # ids are 1-based, row indices are 0-based
c[ai, 2] = a[:, 1]
print(c)

A more general solution, for when both a and b have missing ids (assuming both are sorted by their first column):

b = np.array([[1, 100], [3, 300], [4, 400]])  # now b may have gaps too

d = np.union1d(a[:, 0], b[:, 0]).reshape(-1, 1)
z = np.zeros((d.shape[0], 2), dtype=int)
c = np.hstack((d, z))
mask = np.in1d(c[:, 0], b[:, 0])
c[mask, 1] = b[:, 1]
mask = np.in1d(c[:, 0], a[:, 0])
c[mask, 2] = a[:, 1]
print(c)

How to merge 2 numpy ndarrays using the values of a column?

You can use numpy.searchsorted:

a = np.array([[1, 2], [5, 0], [6, 4]])     # inputs reconstructed from the output
b = np.array([[1, 10], [6, 30], [5, 20]])  # below; a[:, 0] must be sorted

c = np.c_[a, b[np.searchsorted(a[:, 0], b[:, 0]), 1]]

print(c)

array([[ 1,  2, 10],
       [ 5,  0, 20],
       [ 6,  4, 30]])

Breaking this down, note the row indexing applied to b retrieves the indices of a[:, 0] for each value in b[:, 0]:

print(np.searchsorted(a[:, 0], b[:, 0]))

[0 2 1]
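
One caveat: indexing b's rows by these positions only lines everything up when that index array is its own inverse permutation, as [0 2 1] happens to be here. A sketch of a more general variant (same hypothetical a and b as above) that scatters b's values into place instead:

idx = np.searchsorted(a[:, 0], b[:, 0])  # position of each b id in a's sorted id column
col = np.empty(len(a), dtype=b.dtype)
col[idx] = b[:, 1]                       # scatter each b value to its matching row of a
c = np.c_[a, col]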

Making Loop-Finding Faster: Numpy Equivalent of Pandas Merge?

You could try:

import pandas as pd
import networkx as nx

df = pd.DataFrame(columns=['parent', 'child'])
df.loc[0] = (0, 1)
df.loc[1] = (1, 2)
df.loc[2] = (2, 0)

dg = nx.from_pandas_edgelist(df, source='parent', target='child', create_using=nx.DiGraph)
res = list(nx.simple_cycles(dg))
print(res)

Output

[[0, 1, 2]]

From the documentation on simple_cycles:

Find simple cycles (elementary circuits) of a directed graph.

A simple cycle, or elementary circuit, is a closed path where no node
appears twice. Two elementary circuits are distinct if they are not
cyclic permutations of each other.

The documentation linked above also points to additional algorithms that may be of interest.

SQL style inner join in Python?

I would suggest a hash-based discriminator join, along these lines:

from itertools import product

l = [('a', 'beta'), ('b', 'alpha'), ('c', 'beta')]
r = [('b', 37), ('c', 22), ('j', 93)]
d = {}
for t in l:
    d.setdefault(t[0], ([], []))[0].append(t[1:])
for t in r:
    d.setdefault(t[0], ([], []))[1].append(t[1:])
ans = [(k,) + l + r for k, v in d.items() for l, r in product(*v)]

results in:

[('b', 'alpha', 37), ('c', 'beta', 22)]

This runs in time closer to O(n + m) than O(nm), because it avoids computing the full product(l, r) and then filtering it, as the naive method would.
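
For comparison, a sketch of the naive quadratic version this avoids, using the same l and r:

# O(n*m): pair every l tuple with every r tuple, then filter on the key
naive = [(t1[0],) + t1[1:] + t2[1:] for t1 in l for t2 in r if t1[0] == t2[0]]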

Mostly from: Fritz Henglein's Relational algebra with discriminative joins and lazy products

It can also be written as:

from itertools import product

def accumulate(it):
    d = {}
    for e in it:
        d.setdefault(e[0], []).append(e[1:])
    return d

l = accumulate([('a', 'beta'), ('b', 'alpha'), ('c', 'beta')])
r = accumulate([('b', 37), ('c', 22), ('j', 93)])
ans = [(k,) + lt + rt for k in set(l) & set(r) for lt, rt in product(l[k], r[k])]

This accumulates each input separately (turning [(a, b, ...)] into {a: [(b, ...)]}) and then intersects their key sets, which reads more cleanly. Two details: dictionaries don't support l & r directly, so the intersection is taken as set(l) & set(r); and the inner loop variables are renamed lt, rt so they don't shadow the dictionaries l and r inside the comprehension. The result contains the same pairs as before, though set iteration order is arbitrary.

Replicating SQL's 'Join' in Python

Suppose you represent the equivalent of a SQL table, in Python, as a list of dicts, all dicts having the same (say, string) keys; other representations, including those enabled by numpy, can logically be boiled down to an equivalent form. An inner join is then (again, from a logical point of view) a projection of the two tables' cartesian product. In the general case it takes a predicate argument on (which takes two arguments, one "record" [[dict]] from each table, and returns a true value if the two records are to be joined), and a simple approach would be (using per-table prefixes to disambiguate, against the risk that the two tables might otherwise have homonymous "fields"):

def inner_join(tab1, tab2, prefix1, prefix2, on):
    for r1 in tab1:
        for r2 in tab2:
            if on(r1, r2):
                row = dict((prefix1 + k1, v1) for k1, v1 in r1.items())
                row.update((prefix2 + k2, v2) for k2, v2 in r2.items())
                yield row

Now, of course you don't want to do it this way, because performance is O(M * N) -- but, for the generality you've specified ("simulate SQL's join by clause (inner, outer, full)") there is really no alternative, because the ON clause of a JOIN is pretty unrestricted.
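
For instance, a hypothetical equi-join on an id field:

tab1 = [{'id': 1, 'x': 'a'}, {'id': 2, 'x': 'b'}]
tab2 = [{'id': 2, 'y': 10}, {'id': 3, 'y': 20}]
rows = list(inner_join(tab1, tab2, 't1.', 't2.',
                       on=lambda r1, r2: r1['id'] == r2['id']))
# [{'t1.id': 2, 't1.x': 'b', 't2.id': 2, 't2.y': 10}]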

For outer and full joins, you additionally need to keep track of which records [[from one or both tables]] have not been yielded yet, and yield them too -- e.g., for a left join you'd add a bool, reset to yielded = False before the inner for r2 loop, set it to True whenever the yield executes, and after the inner loop, if not yielded:, produce an artificial joined record (presumably using None to stand for NULL in place of the missing v2 values, since there's no r2 to actually use for the purpose), as sketched below.
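
A minimal sketch of that left-join variant, assuming all tab2 records share the same keys (so the padded field names can be read off tab2[0]):

def left_join(tab1, tab2, prefix1, prefix2, on):
    tab2_keys = list(tab2[0]) if tab2 else []  # assumes homogeneous records
    for r1 in tab1:
        yielded = False
        for r2 in tab2:
            if on(r1, r2):
                yielded = True
                row = dict((prefix1 + k1, v1) for k1, v1 in r1.items())
                row.update((prefix2 + k2, v2) for k2, v2 in r2.items())
                yield row
        if not yielded:
            # no match: pad the tab2 side with None, standing in for SQL NULL
            row = dict((prefix1 + k1, v1) for k1, v1 in r1.items())
            row.update((prefix2 + k2, None) for k2 in tab2_keys)
            yield row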

To get any substantial efficiency improvement, you need to clarify what constraints you're willing to accept on the on predicate and on the tables -- we already know from your question that you can't live with a unique constraint on either table's keys, but there are many other constraints that could potentially help, and having us guess which constraints actually apply in your case would be a pretty unproductive endeavor.


