Why Does Merge Result in More Rows Than Original Data

Why does merge result in more rows than original data?

First, from ?merge:

The rows in the two data frames that match on the specified columns are extracted, and joined together. If there is more than one match, all possible matches contribute one row each.

Using your link in the comments:

url    <- "http://koeppen-geiger.vu-wien.ac.at/data/KoeppenGeiger.UScounty.txt"
koppen <- read.table(url, header=T, sep="\t")
nrow(koppen)
# [1] 3594
length(unique(koppen$FIPS))
# [1] 2789
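
A quick way to count the rows that repeat a FIPS code (this is just the difference between the two numbers above):

sum(duplicated(koppen$FIPS))
# [1] 805    (3594 - 2789 rows share a FIPS with an earlier row)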

So clearly koppen has duplicated FIPS codes. Examining the dataset and the website, it appears that many counties fall into more than one climate class; for example, the county of Anchorage, Alaska has three climate classes:

koppen[koppen$FIPS==2020,]
#     STATE    COUNTY FIPS CLS  PROP
# 73 Alaska Anchorage 2020 Dsc 0.010
# 74 Alaska Anchorage 2020 Dfc 0.961
# 75 Alaska Anchorage 2020  ET 0.029
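
This is exactly what inflates the merge: every county with several climate classes contributes one merged row per class. A minimal illustration, using a hypothetical one-row stand-in for your all data frame:

one.county <- data.frame(FIPS=2020, some.value=1)   # hypothetical row standing in for `all`
nrow(merge(one.county, koppen, by="FIPS"))
# [1] 3    one merged row per climate class for Anchorage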

The solution depends on what you are trying to accomplish. If you want to extract all rows of all whose FIPS appears in koppen, either of these should work:

merge(all, data.frame(FIPS=unique(koppen$FIPS)), by="FIPS")

all[all$FIPS %in% unique(koppen$FIPS),]

If you need to append the county and state name to all, use this:

merge(all,unique(koppen[c("STATE","COUNTY","FIPS")]),by="FIPS")
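
Since unique(koppen[c("STATE","COUNTY","FIPS")]) should contain exactly one row per FIPS, this merge cannot add rows. A quick sanity check (assuming all itself has one row per FIPS and every FIPS in all appears in koppen):

lookup <- unique(koppen[c("STATE","COUNTY","FIPS")])
anyDuplicated(lookup$FIPS)                         # 0 means every FIPS appears exactly once
nrow(merge(all, lookup, by="FIPS")) == nrow(all)   # TRUE under the assumptions above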

EDIT: Based on the exchange in the comments below.

So, since there are sometimes multiple rows in koppen with the same FIPS, but different CLS, we need a way to decide which of the rows (e.g., which CLS) to pick. Here are two ways:

# this extracts the row with the largest value of PROP, for that FIPS
url <- "http://koeppen-geiger.vu-wien.ac.at/data/KoeppenGeiger.UScounty.txt"
koppen <- read.csv(url, header=T, sep="\t")
koppen <- with(koppen,koppen[order(FIPS,-PROP),])
sub.koppen <- aggregate(koppen,by=list(koppen$FIPS),head,n=1)
result <- merge(all, sub.koppen, by="FIPS")

# this extracts a row at random
sub.koppen <- aggregate(koppen, by=list(koppen$FIPS),
                        function(x) x[sample(1:length(x), 1)])
result <- merge(all, sub.koppen, by="FIPS")
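
One caveat with the random version: aggregate() applies the function to each column separately, so the sampled CLS and PROP for a given FIPS can come from different rows. A sketch that instead keeps whole rows together by sampling one complete row per FIPS:

set.seed(42)   # only to make the random choice reproducible
sub.koppen <- do.call(rbind,
                      lapply(split(koppen, koppen$FIPS),
                             function(d) d[sample(nrow(d), 1), ]))
result <- merge(all, sub.koppen, by="FIPS")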

R: why, when merging data, is the result larger than both of the original data frames

Each newPCODE1 match will add a row to your merged data frame, including repeats of a code. For example, if T306 appears twice in dataA and three times in dataB, you'll get six rows in your merged data frame from that value of newPCODE1 (because the first T306 in dataA matches three rows in dataB and the second T306 in dataA matches those same three rows in dataB).
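
A toy illustration of that 2 x 3 = 6 arithmetic (tmpA and tmpB are made-up stand-ins for your data):

tmpA <- data.frame(newPCODE1=c("T306","T306"), a=1:2)
tmpB <- data.frame(newPCODE1=c("T306","T306","T306"), b=1:3)
nrow(merge(tmpA, tmpB, by="newPCODE1"))
# [1] 6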

To calculate the number of rows in the merged data frame:

First, only newPCODE1 values that are in both data frames will be returned from the merge (in the form used in your question). In other words, we need the intersection of newPCODE1 from the two data frames:

common.codes = intersect(dataA$newPCODE1,dataB$newPCODE1)

The number of rows returned for each value of newPCODE1 will be the product of the number of rows with that value of newPCODE1 in each of the two data frames. The total number of rows in the merged data frame is the sum of these products:

sum(table(dataA$newPCODE1[dataA$newPCODE1 %in% common.codes]) *
    table(dataB$newPCODE1[dataB$newPCODE1 %in% common.codes]))

In the example below, note that the merged data frame has 25 rows, even though the original two data frames have only a total of 20. If all 10 rows had the same newPCODE1 value in both data frames, the merged data frame would have had 100 rows. (If, say, T306 appeared in all 400,000 rows of dataA and all 40,000 rows of dataB, your merged data frame would have 400,000*40,000 = 16 billion rows!)

dataA = data.frame(newPCODE1=c(1,3,4,4,5,5,6,6,6,6), value1=letters[1:10])
dataB = data.frame(newPCODE1=c(3,4,5,5,5,6,6,6,6,10), value2=LETTERS[1:10])
merge(dataA,dataB, by="newPCODE1")
   newPCODE1 value1 value2
1          3      b      A
2          4      c      B
3          4      d      B
4          5      e      C
5          5      e      D
6          5      e      E
7          5      f      C
8          5      f      D
9          5      f      E
10         6      g      G
11         6      g      H
12         6      g      I
13         6      g      F
14         6      h      G
15         6      h      H
16         6      h      I
17         6      h      F
18         6      i      G
19         6      i      H
20         6      i      I
21         6      i      F
22         6      j      G
23         6      j      H
24         6      j      I
25         6      j      F
common.codes = intersect(dataA$newPCODE1,dataB$newPCODE1)

sum(table(dataA$newPCODE1[dataA$newPCODE1 %in% common.codes]) *
    table(dataB$newPCODE1[dataB$newPCODE1 %in% common.codes]))
[1] 25

If you want to keep rows from one or both data frames even when there is no matching row in the other data frame, you can do this:

merge(dataA, dataB, by="newPCODE1", all.x=TRUE)   # Keep all rows from first data frame
merge(dataA, dataB, by="newPCODE1", all.y=TRUE)   # Keep all rows from second data frame
merge(dataA, dataB, by="newPCODE1", all=TRUE)     # Keep all rows from both data frames
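
Rows whose newPCODE1 has no match on the other side are kept with NA in the columns coming from the other data frame; in the example above, all=TRUE adds one row for newPCODE1 1 (NA in value2) and one for newPCODE1 10 (NA in value1).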

pandas: merged (inner join) data frame has more rows than the original ones

Because you have duplicates of the merge column in both data sets, you'll get k * m rows with that merge column value, where k is the number of rows with that value in data set 1 and m is the number of rows with that value in data set 2.

Try drop_duplicates:

dfa = df_A.drop_duplicates(subset=['my_icon_number'])
dfb = df_B.drop_duplicates(subset=['my_icon_number'])

new_df = pd.merge(dfa, dfb, how='inner', on='my_icon_number')
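
Note that drop_duplicates() keeps only the first row for each my_icon_number by default, so the values in the other duplicate rows are discarded before the merge; make sure that is what you want for your data.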

Example

In this example, the only value in common is 4, but I have it 3 times in each data set. That means I should get 9 rows in total in the resulting merge, one for every combination.

df_A = pd.DataFrame(dict(my_icon_number=[1, 2, 3, 4, 4, 4], other_column1=range(6)))
df_B = pd.DataFrame(dict(my_icon_number=[4, 4, 4, 5, 6, 7], other_column2=range(6)))

pd.merge(df_A, df_B, how='inner', on='my_icon_number')

   my_icon_number  other_column1  other_column2
0               4              3              0
1               4              3              1
2               4              3              2
3               4              4              0
4               4              4              1
5               4              4              2
6               4              5              0
7               4              5              1
8               4              5              2

Python - Old pandas merge results in more rows than new pandas

In the end, the issue lay in how the new version of pandas handles NaN values.
While the old pandas replaced the NaN values with the string <NA>, the new pandas just left them as NaN.

I made sure to call df.fillna('<NA>', inplace=True) and it worked fine. The resulting DataFrame now has the same number of rows as the one produced by the old pandas.


