Working of Merge in SAS (with IN=)
Basically, this is a result of the difference in how the SAS data step and SQL process their respective join/merges.
SQL creates a separate record for each possible combination of keys. This is a Cartesian Product (at the key level).
SAS data step, however, process merges very differently. MERGE
is really nothing more than a special case of SET
. It still processes rows iteratively, one at a time - it never goes back, and never has more than one row from any dataset in the PDV at once. Thus, it cannot create a Cartesian product in its normal process - that would require random access, which the SAS datastep doesn't do normally.
What it does:
For each unique BY value
Take the next record from the left side dataset, if one exists with that BY value
Take the next record from the right side dataset, if one exists with that BY value
Output a row
Continue until both datasets are exhausted for that BY value
With BY values that yield unique records per value on either side (or both), it is effectively identical to SQL. However, with BY values that yield duplicates on BOTH sides, you get what you have there: a side-by-side merge, and if one runs out before the other, the values from the last row of the shorter dataset (for that by value) are more-or-less copied down. (They're actually RETAINED, so if you overwrite them with changes, they will not reset on new records from the longer dataset).
So, if left
has 3 records and right
has 4 records for key value a
, like in your example, then you get data from the following records (assuming you don't alter the data after):
left right
1 1
2 2
3 3
3 4
about data merge in SAS
yourdata(in=a)
creates a flag variable in the program data vector called 'a' that contains 1 if the record is from yourdata and 0 if it isn't. You can then use these variables to perform conditional operations based on the source of the record.
It might be easier to understand if you saw
data newdata;
merge yourdata(in=ThisRecordIsFromYourData) otherdata(in=ThisRecordIsFromOtherData);
by permno date;
run;
Suppose that records from yourdata needed to be manipulated in this step, but not those from otherdata, you could then do something like
data newdata;
merge yourdata(in=ThisRecordIsFromYourData) otherdata(in=ThisRecordIsFromOtherData);
by permno date;
if ThisRecordIsFromYourData then do;
* some operation here for yourdata records only ;
end;
run;
An obvious use for these variables is to control what kind of 'merge' will occur, using if
statements. For example, if ThisRecordIsFromYourData and ThisRecordIsFromOtherData;
will make SAS only include rows that match on the by variables from both input data sets (like an inner join).
Horizontally combine three SAS datasets with renaming of columns
data want;
merge a(rename=apples=green_apples)
b(rename=apples=red_apples)
c(rename=apples=yellow_apples)
;
run;
That will do as you say, merge the datasets together with no by
statement. There is an option (options mergenoby=ERROR|WARN|NOWARN
) which will possibly make this statement error or produce a warning (as this is something that often is done by mistake), but you can turn it to NOWARN
to avoid that.
There isn't a good way to do this in SQL without creating an artificial row identifier.
Related Topics
Update All SQL Null Values in Multiple Columns Using Column Level Where Clause
Is Too Many Left Joins a Code Smell
Order by Column1 If Column1 Is Not Null, Otherwise Order by Column2
Oracle SQL Syntax - Check Multiple Columns for Is Not Null
Crosstab Query with Dynamic Columns in SQL Server 2005 Up
How to Set a Default Value for One Column in SQL Based on Another Column
Is There a Tool to Generate a Full Database Ddl for SQL Server? What About Postgres and MySQL
How to Remove The Default Value from a Column in Oracle
How to Count Unique Pairs of Values in Sql
Increment Counter or Insert Row in One Statement, in Sqlite
Inserting Guid into SQL Server
What Is This Operand (*= Star-Equals) in SQL Server 2000
Sql Select Rows Containing Part of String
Sql Help: Select Statement Concatenate a One to Many Relationship