Why does merge result in more rows than original data?
First, from ?merge
:
The rows in the two data frames that match on the specified columns are extracted, and joined together. If there is more than one match, all possible matches contribute one row each.
Using your link in the comments:
url <- "http://koeppen-geiger.vu-wien.ac.at/data/KoeppenGeiger.UScounty.txt"
koppen <- read.table(url, header=T, sep="\t")
nrow(koppen)
# [1] 3594
length(unique(koppen$FIPS))
# [1] 2789
So clearly koppen
has duplicated FIPS codes. Examining the dataset and the website, it appears that many of the counties are in more than one climate class, so for example, the county of Ankorage, Alaska has three climate classes:
koppen[koppen$FIPS==2020,]
# STATE COUNTY FIPS CLS PROP
# 73 Alaska Anchorage 2020 Dsc 0.010
# 74 Alaska Anchorage 2020 Dfc 0.961
# 75 Alaska Anchorage 2020 ET 0.029
The solution depends on what you are trying to accomplish. If you want to extract all rows in all
with any FIPS
that appear in koppen
, either of these should work:
merge(all,unique(koppen$FIPS))
all[all$FIPS %in% unique(koppen$FIPS),]
If you need to append the county and state name to all
, use this:
merge(all,unique(koppen[c("STATE","COUNTY","FIPS")]),by="FIPS")
EDIT Based on the exchange below in the comments.
So, since there are sometimes multiple rows in koppen
with the same FIPS
, but different CLS
, we need a way to decide which of the rows (e.g., which CLS
) to pick. Here are two ways:
# this extracts the row with the largest value of PROP, for that FIPS
url <- "http://koeppen-geiger.vu-wien.ac.at/data/KoeppenGeiger.UScounty.txt"
koppen <- read.csv(url, header=T, sep="\t")
koppen <- with(koppen,koppen[order(FIPS,-PROP),])
sub.koppen <- aggregate(koppen,by=list(koppen$FIPS),head,n=1)
result <- merge(all, sub.koppen, by="FIPS")
# this extracts a row at random
sub.koppen <- aggregate(koppen,by=list(koppen$FIPS),
function(x)x[sample(1:length(x),1)])
result <- merge(all, sub.koppen, by="FIPS")
R: why when merging data - got a result that is larger than both of the data
Each newPCODE1
match will add a row to your merged data frame, including repeats of a code. For example, if T306
appears twice in dataA
and three times in dataB
, you'll get six rows in your merged data frame from that value of newPCODE1
(because the first T306
in dataA
matches three rows in dataB
and the second T306
in dataA
matches those same three rows in dataB
).
To calculate the number of rows in the merged data frame:
First, only newPCODE1
values that are in both data frames will be returned from the merge
(in the form used in your question). In other words, we need the intersection of newPCODE1
from the two data frames:
common.codes = intersect(dataA$newPCODE1,dataB$newPCODE1)
The number of rows returned for each value of newPCODE1
will be the product of the number of rows with that value of newPCODE1
in each of the two data frames. The total number of rows in the merged data frame is the sum of these products:
sum(table(dataA$newPCODE1[dataA$newPCODE1 %in% common.codes]) *
table(dataB$newPCODE1[dataB$newPCODE1 %in% common.codes]))
In the example below, note that the merged data frame has 25 rows, even though the original two data frames have only a total of 20. If all 10 rows had the same newPCODE1
value in both data frames, the merged data frame would have had 100 rows. (If, say, T306
appeared in all 400,000 rows of dataA
and all 40,000 rows of dataB
, your merged data frame would have 400,000*40,000 = 16 billion rows!)
dataA = data.frame(newPCODE1=c(1,3,4,4,5,5,6,6,6,6), value1=letters[1:10])
dataB = data.frame(newPCODE1=c(3,4,5,5,5,6,6,6,6,10), value2=LETTERS[1:10])
merge(dataA,dataB, by="newPCODE1")
newPCODE1 value1 value2
1 3 b A
2 4 c B
3 4 d B
4 5 e C
5 5 e D
6 5 e E
7 5 f C
8 5 f D
9 5 f E
10 6 g G
11 6 g H
12 6 g I
13 6 g F
14 6 h G
15 6 h H
16 6 h I
17 6 h F
18 6 i G
19 6 i H
20 6 i I
21 6 i F
22 6 j G
23 6 j H
24 6 j I
25 6 j F
common.codes = intersect(dataA$newPCODE1,dataB$newPCODE1)
sum(table(dataA$newPCODE1[dataA$newPCODE1 %in% common.codes]) *
table(dataB$newPCODE1[dataB$newPCODE1 %in% common.codes]))
[1] 25
If you want to keep rows from one or both data frames even when there is no matching row in the other data frame, you can do this:
merge(dataA,dataB, by="newPCODE1", all.x=TRUE) # Keep all rows from first data frame
merge(dataA,dataB, by="newPCODE1", all.y=TRUE) # Keep all rows from second data frame
merge(dataA,dataB, by="newPCODE1", all=TRUE) # Keep all rows from both data frames
pandas: merged (inner join) data frame has more rows than the original ones
Because you have duplicates of the merge column in both data sets, you'll get k * m
rows with that merge column value, where k
is the number of rows with that value in data set 1 and m
is the number of rows with that value in data set 2.
try drop_duplicates
dfa = df_A.drop_duplicates(subset=['my_icon_number'])
dfb = df_B.drop_duplicates(subset=['my_icon_number'])
new_df = pd.merge(dfa, dfb, how='inner', on='my_icon_number')
Example
In this example, the only value in common is 4
but I have it 3 times in each data set. That means I should get 9 total rows in the resulting merge, one for every combination.
df_A = pd.DataFrame(dict(my_icon_number=[1, 2, 3, 4, 4, 4], other_column1=range(6)))
df_B = pd.DataFrame(dict(my_icon_number=[4, 4, 4, 5, 6, 7], other_column2=range(6)))
pd.merge(df_A, df_B, how='inner', on='my_icon_number')
my_icon_number other_column1 other_column2
0 4 3 0
1 4 3 1
2 4 3 2
3 4 4 0
4 4 4 1
5 4 4 2
6 4 5 0
7 4 5 1
8 4 5 2
Python - Old pandas merge results in more rows than new pandas
At the end the issue lied within new Pandas' approach to handle NaN
values.
While in the old Pandas the code changed the NaN
values with <NA>
(as string), in the new Pandas it just left it as nan
(pd.nan
type).
I made sure to do df.fillna('<NA>', inplace=True)
and it worked fine. The resulted DataFrame now has the same number of rows as produced by the old Pandas.
Related Topics
Why Does As.Factor Return a Character When Used Inside Apply
How Subset a Data Frame by a Factor and Repeat a Plot for Each Subset
How to Move or Position a Legend in Ggplot2
What Is Integer Overflow in R and How Can It Happen
Change the Default Colour Palette in Ggplot
Sample Rows of Subgroups from Dataframe with Dplyr
Marker Mouse Click Event in R Leaflet for Shiny
Subfigures or Subcaptions with Knitr
Ggmap Error: Geomrasterann Was Built with an Incompatible Version of Ggproto
How to Concatenate Factors, Without Them Being Converted to Integer Level
Displaying a Greater Than or Equal Sign
Emulate Split() with Dplyr Group_By: Return a List of Data Frames
How to Change Order of Boxplots When Using Ggplot2
Subset Based on Variable Column Name