Working of Merge in SAS (With In=)

Working of Merge in SAS (with IN=)

Basically, this is a result of the difference in how the SAS data step and SQL process their respective join/merges.

SQL creates a separate record for each possible combination of keys. This is a Cartesian Product (at the key level).

SAS data step, however, process merges very differently. MERGE is really nothing more than a special case of SET. It still processes rows iteratively, one at a time - it never goes back, and never has more than one row from any dataset in the PDV at once. Thus, it cannot create a Cartesian product in its normal process - that would require random access, which the SAS datastep doesn't do normally.

What it does:

For each unique BY value
  Take the next record from the left side dataset, if one exists with that BY value
  Take the next record from the right side dataset, if one exists with that BY value
  Output a row
Continue until both datasets are exhausted for that BY value

With BY values that yield unique records per value on either side (or both), it is effectively identical to SQL. However, with BY values that yield duplicates on BOTH sides, you get what you have there: a side-by-side merge, and if one runs out before the other, the values from the last row of the shorter dataset (for that by value) are more-or-less copied down. (They're actually RETAINED, so if you overwrite them with changes, they will not reset on new records from the longer dataset).

So, if left has 3 records and right has 4 records for key value a, like in your example, then you get data from the following records (assuming you don't alter the data after):

left  right
1     1
2     2
3     3
3     4

about data merge in SAS

yourdata(in=a) creates a flag variable in the program data vector called 'a' that contains 1 if the record is from yourdata and 0 if it isn't. You can then use these variables to perform conditional operations based on the source of the record.

It might be easier to understand if you saw

data newdata;
merge yourdata(in=ThisRecordIsFromYourData) otherdata(in=ThisRecordIsFromOtherData);
by permno date;
run;

Suppose that records from yourdata needed to be manipulated in this step, but not those from otherdata, you could then do something like

data newdata;
merge yourdata(in=ThisRecordIsFromYourData) otherdata(in=ThisRecordIsFromOtherData);
by permno date;
if ThisRecordIsFromYourData then do;
  * some operation here for yourdata records only ;
end;
run;

An obvious use for these variables is to control what kind of 'merge' will occur, using if statements. For example, if ThisRecordIsFromYourData and ThisRecordIsFromOtherData; will make SAS only include rows that match on the by variables from both input data sets (like an inner join).

Horizontally combine three SAS datasets with renaming of columns

data want;
  merge a(rename=apples=green_apples) 
        b(rename=apples=red_apples) 
        c(rename=apples=yellow_apples)
  ;
run;

That will do as you say, merge the datasets together with no by statement. There is an option (options mergenoby=ERROR|WARN|NOWARN) which will possibly make this statement error or produce a warning (as this is something that often is done by mistake), but you can turn it to NOWARN to avoid that.

There isn't a good way to do this in SQL without creating an artificial row identifier.

Working of Merge in SAS (With In=)