SQL join or R's merge() function in NumPy?
If you want to use only numpy, you can use structured arrays and the lib.recfunctions.join_by
function (see the numpy.lib.recfunctions module documentation). A little example:
In [1]: import numpy as np
...: import numpy.lib.recfunctions as rfn
...: a = np.array([(1, 10.), (2, 20.), (3, 30.)], dtype=[('id', int), ('A', float)])
...: b = np.array([(2, 200.), (3, 300.), (4, 400.)], dtype=[('id', int), ('B', float)])
In [2]: rfn.join_by('id', a, b, jointype='inner', usemask=False)
Out[2]:
array([(2, 20.0, 200.0), (3, 30.0, 300.0)],
dtype=[('id', '<i4'), ('A', '<f8'), ('B', '<f8')])
Another option is to use pandas. I have no experience with it, but it provides more powerful data structures and functionality than plain numpy, aiming "to make working with 'relational' or 'labeled' data both easy and intuitive". And it certainly has joining and merging functions; see the pandas documentation on merging and joining on a key.
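For what it's worth, a quick sketch of the pandas route, reusing the same structured arrays as the join_by example above (pd.merge's inner-join default is assumed here):

```python
import numpy as np
import pandas as pd

# Same structured arrays as the join_by example above
a = np.array([(1, 10.), (2, 20.), (3, 30.)], dtype=[('id', int), ('A', float)])
b = np.array([(2, 200.), (3, 300.), (4, 400.)], dtype=[('id', int), ('B', float)])

# pd.merge joins on the shared 'id' column; how='inner' is the default
merged = pd.merge(pd.DataFrame(a), pd.DataFrame(b), on='id')
print(merged)
```

Passing how='outer', how='left' or how='right' gives the other SQL join flavors.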
Numpy simple join 2 arrays
If all consecutive rows are present in b (so a's ids can be used directly as row indices):
z = np.zeros((b.shape[0], 1), dtype=int)
c = np.hstack((b, z))
ai = a[:, 0] - 1   # ids are 1-based, row indices are 0-based
c[ai, 2] = a[:, 1]
print(c)
A more general solution, if both a and b have missing rows:
d = np.union1d(a[:, 0], b[:, 0]).reshape(-1, 1)   # every id from either array
z = np.zeros((d.shape[0], 2), dtype=int)
c = np.hstack((d, z))
mask = np.in1d(c[:, 0], b[:, 0])
c[mask, 1] = b[:, 1]
mask = np.in1d(c[:, 0], a[:, 0])
c[mask, 2] = a[:, 1]
print(c)
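The arrays a and b from the question aren't shown; here is a self-contained sketch of the general version, with hypothetical two-column integer arrays sorted by their key (first) column, which the positional masked assignments rely on:

```python
import numpy as np

# Hypothetical sample data: first column is the join key, sorted ascending
a = np.array([[1, 10], [3, 30]])
b = np.array([[1, 100], [2, 200], [3, 300]])

d = np.union1d(a[:, 0], b[:, 0]).reshape(-1, 1)  # every key from either array
z = np.zeros((d.shape[0], 2), dtype=int)
c = np.hstack((d, z))
# Masked assignment lines up positionally because both arrays are key-sorted
c[np.in1d(c[:, 0], b[:, 0]), 1] = b[:, 1]
c[np.in1d(c[:, 0], a[:, 0]), 2] = a[:, 1]
print(c)
# [[  1 100  10]
#  [  2 200   0]
#  [  3 300  30]]
```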
How to merge 2 numpy ndarray on a ndarray using values of a column?
You can use numpy.searchsorted:
c = np.c_[a, b[np.searchsorted(a[:, 0], b[:, 0]), 1]]
print(c)
array([[ 1, 2, 10],
[ 5, 0, 20],
[ 6, 4, 30]])
Breaking this down: the row indexing applied to b uses the indices that np.searchsorted returns, i.e. the position in a[:, 0] of each value in b[:, 0]:
print(np.searchsorted(a[:, 0], b[:, 0]))
[0 2 1]
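The inputs aren't shown in the question either; reconstructing plausible a and b from the output above (hypothetical data, with a's first column sorted):

```python
import numpy as np

# Hypothetical inputs reconstructed from the output shown above
a = np.array([[1, 2], [5, 0], [6, 4]])     # first column: sorted keys
b = np.array([[1, 10], [6, 30], [5, 20]])  # same keys, second column: values

idx = np.searchsorted(a[:, 0], b[:, 0])  # where each b key sits in a's keys
c = np.c_[a, b[idx, 1]]
print(idx)  # [0 2 1]
print(c)
```

Note that b[idx, 1] lines up with a's rows here because idx happens to be its own inverse permutation; in the general case it is safer to invert the mapping explicitly, e.g. out = np.empty(len(a), dtype=b.dtype); out[idx] = b[:, 1].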
Making Loop-Finding Faster: Numpy Equivalent of Pandas Merge?
You could try:
import pandas as pd
import networkx as nx
df = pd.DataFrame(columns=['parent', 'child'])
df.loc[0] = (0, 1)
df.loc[1] = (1, 2)
df.loc[2] = (2, 0)
dg = nx.from_pandas_edgelist(df, source='parent', target='child', create_using=nx.DiGraph)
res = list(nx.simple_cycles(dg))
print(res)
Output
[[0, 1, 2]]
From the documentation on simple_cycles:
Find simple cycles (elementary circuits) of a directed graph.
A simple cycle, or elementary circuit, is a closed path where no node
appears twice. Two elementary circuits are distinct if they are not
cyclic permutations of each other.
In the documentation link above there are some links to additional algorithms that may be of interest.
SQL style inner join in Python?
I would suggest a hash-based discriminator-join method:
l = [('a', 'beta'), ('b', 'alpha'), ('c', 'beta')]
r = [('b', 37), ('c', 22), ('j', 93)]
d = {}
for t in l:
    d.setdefault(t[0], ([], []))[0].append(t[1:])
for t in r:
    d.setdefault(t[0], ([], []))[1].append(t[1:])

from itertools import product
ans = [(k,) + l + r for k, v in d.items() for l, r in product(*v)]
results in:
[('c', 'beta', 22), ('b', 'alpha', 37)]
This has lower complexity, closer to O(n+m) than to O(nm), because it avoids computing the full product(l, r)
and then filtering, as the naive method would.
Mostly from: Fritz Henglein's Relational algebra with discriminative joins and lazy products
It can also be written as:
def accumulate(it):
    d = {}
    for e in it:
        d.setdefault(e[0], []).append(e[1:])
    return d

l = accumulate([('a', 'beta'), ('b', 'alpha'), ('c', 'beta')])
r = accumulate([('b', 37), ('c', 22), ('j', 93)])

from itertools import product
ans = [(k,) + lt + rt for k in l.keys() & r.keys() for lt, rt in product(l[k], r[k])]
This accumulates both lists separately (turning [(a, b, ...)] into {a: [(b, ...)]}) and then computes the intersection between their sets of keys, which reads more cleanly. Note that & is not supported directly between dicts; use l.keys() & r.keys() (or set(l) & set(r)), and make sure the comprehension's inner loop variables do not reuse the names l and r, or they will shadow the dicts.
Replicating SQL's 'Join' in Python
Suppose you represent the equivalent of a SQL table, in Python, as a list of dicts, with all dicts having the same (assume string) keys (other representations, including those enabled by numpy, can be logically boiled down to an equivalent form). An inner join is then (again, from a logical point of view) a projection of their cartesian product. In the general case, taking a predicate argument on (a function of two arguments, one "record" [[dict]] from each table, returning a true value if the two records should be joined), a simple approach would be (using per-table prefixes to disambiguate, against the risk that the two tables might otherwise have homonymous "fields"):
def inner_join(tab1, tab2, prefix1, prefix2, on):
    for r1 in tab1:
        for r2 in tab2:
            if on(r1, r2):
                row = dict((prefix1 + k1, v1) for k1, v1 in r1.items())
                row.update((prefix2 + k2, v2) for k2, v2 in r2.items())
                yield row
Now, of course you don't want to do it this way, because performance is O(M * N)
-- but, for the generality you've specified ("simulate SQL's join by clause (inner, outer, full)") there is really no alternative, because the ON
clause of a JOIN
is pretty unrestricted.
For outer and full joins, you additionally need to keep track of which records [[from one or both tables]] have not been yielded yet, and yield them at the end -- e.g. for a left join you'd add a flag, reset to yielded = False
before the for r2
inner loop and set to True
when the yield
executes; after the inner loop, if not yielded:
, produce an artificial joined record (presumably using None
to stand in for NULL in place of the missing v2
values, since there's no r2
to actually use for the purpose).
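The left-join bookkeeping described above can be sketched like this (a minimal illustration; the sample tables and field names are invented):

```python
def left_join(tab1, tab2, prefix1, prefix2, on):
    # Collect tab2's field names once so unmatched rows can be NULL-padded
    fields2 = set()
    for r2 in tab2:
        fields2.update(r2)
    for r1 in tab1:
        yielded = False
        for r2 in tab2:
            if on(r1, r2):
                yielded = True
                row = dict((prefix1 + k1, v1) for k1, v1 in r1.items())
                row.update((prefix2 + k2, v2) for k2, v2 in r2.items())
                yield row
        if not yielded:
            # No match: emit r1 with None standing in for NULL
            row = dict((prefix1 + k1, v1) for k1, v1 in r1.items())
            row.update((prefix2 + k2, None) for k2 in fields2)
            yield row

# Hypothetical sample tables
people = [{'id': 1, 'name': 'ann'}, {'id': 2, 'name': 'bob'}]
pets = [{'owner': 1, 'pet': 'cat'}]
rows = list(left_join(people, pets, 'p_', 'q_',
                      lambda r1, r2: r1['id'] == r2['owner']))
```

Here the first person matches a pet and the second yields a padded record with None in the q_ fields.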
To get any substantial efficiency improvement, you need to clarify what constraints you're willing to accept on the on
predicate and the tables -- we already know from your question that you can't live with a unique
constraint on either table's keys, but there are many other constraints that could potentially help, and having us guess which constraints actually apply in your case would be a pretty unproductive endeavor.