How to replicate a SAS merge

Gordon's answer is close, but it misses one point. Here's its output:

person  visit  code1  type1  seqnum  person  visit  code2  type2  seqnum
1       1      1      1      1       1       1      1      1      1
1       1      2      2      2       1       1      2      2      2
NULL    NULL   NULL   NULL   NULL    1       1      3      3      3
1       2      1      3      1       NULL    NULL   NULL   NULL   NULL

The third row's nulls are incorrect, while the fourth's are correct.

As far as I know, in SQL there's not a really good way to do this other than splitting things up into a few queries. I think there are five possibilities:

  • Matching person/visit, matching seqnums
  • Matching person/visit, left has more seqnums
  • Matching person/visit, right has more seqnums
  • Left has unmatched person/visit
  • Right has unmatched person/visit

I think the last two could be combined into one query, but the second and third probably have to be separate queries. You can UNION everything together, of course.

So here's an example, using some temporary tables that make it a little easier to see what's going on. Note that the third row is now filled in for code1 and type1, even though those are 'extra'. I've only added three of the five criteria - the three you had in your initial example - but the other two aren't too hard.

Note that this is an example of something that is far faster in SAS, because SAS has a row-wise concept, i.e., it's capable of going one row at a time. SQL tends to take a lot longer at these with large tables, unless things can be partitioned very neatly and the indexes are very good - and even then I've never seen a SQL DBA do anywhere near as well as SAS at some of these types of problems. That's something you'll have to accept, of course - SQL has its own advantages, one of which is probably price...

Here's my example code. I'm sure it's not terribly elegant; hopefully one of the SQL folk can improve it. This is written to work in SQL Server (using table variables); the same thing should work, with some changes (to use temporary tables), in other variants, assuming they implement windowing. (SAS of course can't do this particular thing, as even FedSQL implements ANSI 1999, not ANSI 2008.) This is based on Gordon's initial query, then modified with the additional bits at the end. Anyone who wants to improve this, please feel free to edit and/or copy any bit you wish to a new or existing answer.

declare @t1 table (person INT, visit INT, code1 INT, type1 INT);
declare @t2 table (person INT, visit INT, code2 INT, type2 INT);

insert into @t1 values (1,1,1,1)
insert into @t1 values (1,1,2,2)
insert into @t1 values (1,2,1,3)

insert into @t2 values (1,1,1,1)
insert into @t2 values (1,1,2,2)
insert into @t2 values (1,1,3,3)

-- Matching person/visit, matching seqnums
select coalesce(t1.person, t2.person) as person,
       coalesce(t1.visit, t2.visit) as visit,
       t1.code1, t1.type1, t2.code2, t2.type2
from (select *,
             row_number() over (partition by person, visit order by type1) as seqnum
      from @t1
     ) t1
     inner join
     (select *,
             row_number() over (partition by person, visit order by type2) as seqnum
      from @t2
     ) t2
     on t1.person = t2.person and t1.visit = t2.visit
        and t1.seqnum = t2.seqnum

union all

-- Matching person/visit, right has more seqnums:
-- join the last row of each @t1 group to the extra @t2 rows
select coalesce(t1.person, t2.person) as person,
       coalesce(t1.visit, t2.visit) as visit,
       t1.code1, t1.type1, t2.code2, t2.type2
from (
     (select person, visit, MAX(seqnum) as max_rownum
      from (select person, visit,
                   row_number() over (partition by person, visit order by type1) as seqnum
            from @t1) t1_f
      group by person, visit
     ) t1_m
     inner join
     (select *,
             row_number() over (partition by person, visit order by type1) as seqnum
      from @t1
     ) t1
     on t1.person = t1_m.person and t1.visit = t1_m.visit
        and t1.seqnum = t1_m.max_rownum
     inner join
     (select *,
             row_number() over (partition by person, visit order by type2) as seqnum
      from @t2
     ) t2
     on t1.person = t2.person and t1.visit = t2.visit
        and t1.seqnum < t2.seqnum
     )

union all

-- Left has unmatched person/visit
select t1.person, t1.visit, t1.code1, t1.type1, t2.code2, t2.type2
from @t1 t1
     left join @t2 t2
     on t2.person = t1.person and t2.visit = t1.visit
where t2.code2 is null
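
For reference, here is a minimal sketch of the SAS match-merge that the SQL above is trying to reproduce (the data set names t1, t2 and want are mine, not from the original question). Within the 1/1 BY group, once t1 runs out of rows, SAS retains its last values of code1 and type1 onto the extra row contributed by t2, which is why the third row should be filled in rather than NULL:

data t1;
input person visit code1 type1;
datalines;
1 1 1 1
1 1 2 2
1 2 1 3
;

data t2;
input person visit code2 type2;
datalines;
1 1 1 1
1 1 2 2
1 1 3 3
;

data want;
merge t1 t2; /* within the 1/1 BY group, t1's last code1/type1 values */
by person visit; /* are carried onto the third row contributed by t2 */
run;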

How to repeat or merge rows in SAS

Use RETAIN to carry forward and track the values of the variables retrieved or assigned in the prior row.

data want;
set have;

retain pcol1 pcol2 pcol3;

if missing(col1) then col1 = pcol1;
if missing(col2) then col2 = pcol2;
if missing(col3) then col3 = pcol3;

pcol1 = col1;
pcol2 = col2;
pcol3 = col3;

drop pcol:;
run;
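
For instance, with a hypothetical three-column HAVE like the one below, the missing col1 and col2 on the second row are filled with the values carried over from the first row, so the second observation of WANT comes out as 1 2 6:

data have;
input col1 col2 col3;
datalines;
1 2 3
. . 6
;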

Two arrays can be used for the case of many columns:

  • A variable-based array for referencing values from the data set
  • A temporary array for tracking prior-row values. Temporary arrays are not affected by the standard DATA step behavior that resets the PDV to missing.

data want;
set have;

array priors(1000) _temporary_;
array values col1-col40;

do _n_ = 1 to dim(values); * repurpose _n_ for loop indexing;

if missing(values(_n_))
then values(_n_) = priors(_n_); * repeat prior value into missing value;
else priors(_n_) = values(_n_); * track most recent non-missing value;

end;
run;

Reproducing SAS merge in R

For the sample data, a SAS merge matches every row of data in the first data set at least once with the data in the second data set, based on the sort order of the BY groups.

After writing the sample data to CSV and reading it back into SAS, the data step that merges the data results in 35 observations.

filename data_a "/folders/myfolders/data/dataA.csv";
filename data_b "/folders/myfolders/data/dataB.csv";

data dataA;
infile data_a dlm="," firstobs = 2;
input facility $ facilityuid $ totalPVLS_NQ;
run;

data dataB;
/* read starting second row */
infile data_b dlm="," firstobs = 2;
input facility $ facilityuid $ totalPVLS_DQ;
run;

/* sort data */
proc sort data = dataA;
by facility facilityuid;
run;

proc sort data = dataB;
by facility facilityuid;
run;

/*
* merge the two data sets
*/

data both;
merge dataA(in=inA) dataB(in=inB);
by facility facilityuid;
in_a = inA;
in_b = inB;
run;

For the first BY group, since there are three observations in dataA and four observations in dataB with the key 1 de Mai uBjjsyie, the output data set both contains 4 rows with this key. We've assigned two new variables to capture the values of the IN= data set options, to show which data set contributed to each output row, and printed the result.

proc print data = both;
run;

[PROC PRINT output of data set both]

From the printout we can see that the third row from dataA was duplicated to match it with the fourth row from dataB.

The default behavior of the R merge() function is to form the cartesian product of the rows in dataA and dataB whose by variables match. Since there are three rows in dataA and four rows in dataB with this key, we print the first 12 rows of the output to illustrate the cartesian product of the data.

dataC <- merge(dataA, dataB, by=c("facility","facilityuid"))
head(dataC,n = 12)

   facility     facilityuid totalPVLS_NQ4 totalPVLS_DQ4
1  1 de Maio CS uBjjsyieqtr          2713          3186
2  1 de Maio CS uBjjsyieqtr          2713            NA
3  1 de Maio CS uBjjsyieqtr          2713            NA
4  1 de Maio CS uBjjsyieqtr          2713          3562
5  1 de Maio CS uBjjsyieqtr            NA          3186
6  1 de Maio CS uBjjsyieqtr            NA            NA
7  1 de Maio CS uBjjsyieqtr            NA            NA
8  1 de Maio CS uBjjsyieqtr            NA          3562
9  1 de Maio CS uBjjsyieqtr          2019          3186
10 1 de Maio CS uBjjsyieqtr          2019            NA
11 1 de Maio CS uBjjsyieqtr          2019            NA
12 1 de Maio CS uBjjsyieqtr          2019          3562
>

Even dplyr join functions use a cartesian product to resolve multiple matches between dataA and dataB, so we are unable to exactly replicate the SAS output in R solely with an "out of the box" R function. When we read the help for mutate-joins, we find the following.

mutate-joins {dplyr} R Documentation
Mutating joins
Description
The mutating joins add columns from y to x, matching rows based on the keys:

inner_join(): includes all rows in x and y.

left_join(): includes all rows in x.

right_join(): includes all rows in y.

full_join(): includes all rows in x or y.

If a row in x matches multiple rows in y, all the rows in y will be returned once for each matching row in x

The cartesian product effect is consistent across left_join(), inner_join(), full_join() and right_join().

library(dplyr)
dataA %>% right_join(.,dataB) -> dataD
head(dataD,n = 12)

...and the output:

> head(dataD,n = 12)
   facility     facilityuid totalPVLS_NQ4 totalPVLS_DQ4
1  1 de Maio CS uBjjsyieqtr            NA          3186
2  1 de Maio CS uBjjsyieqtr            NA            NA
3  1 de Maio CS uBjjsyieqtr            NA            NA
4  1 de Maio CS uBjjsyieqtr            NA          3562
5  1 de Maio CS uBjjsyieqtr          2019          3186
6  1 de Maio CS uBjjsyieqtr          2019            NA
7  1 de Maio CS uBjjsyieqtr          2019            NA
8  1 de Maio CS uBjjsyieqtr          2019          3562
9  1 de Maio CS uBjjsyieqtr          2713          3186
10 1 de Maio CS uBjjsyieqtr          2713            NA
11 1 de Maio CS uBjjsyieqtr          2713            NA
12 1 de Maio CS uBjjsyieqtr          2713          3562
>

Bottom line: one needs to clean the data to eliminate or resolve the duplicates for facilityuid in both dataA and dataB.

For example, is it acceptable to eliminate rows where totalPVLS_DQ4 is NA in dataB, or where totalPVLS_NQ4 is NA in dataA? Can we add a sequential identifier to each file, assuming that the multiple observations for a facility were taken in the order they appear in the file?

If we take those questions as constraints, we can join the data from the two files as follows:

library(dplyr)
dataA %>% filter(!is.na(totalPVLS_NQ4)) %>% group_by(.,facility,facilityuid) %>%
mutate(., n = seq_along(facilityuid)) -> dataA
dataB %>% filter(!is.na(totalPVLS_DQ4)) %>% group_by(.,facility,facilityuid) %>%
mutate(., n = seq_along(facilityuid)) %>%
full_join(.,dataA) -> resultData
resultData

...and the output:

# A tibble: 19 x 5
# Groups:   facility, facilityuid [11]
   facility          facilityuid totalPVLS_DQ4     n totalPVLS_NQ4
   <chr>             <chr>               <dbl> <int>         <dbl>
 1 1 de Maio CS      uBjjsyieqtr          3186     1          2019
 2 1 de Maio CS      uBjjsyieqtr          3562     2          2713
 3 16 de Junho CS    rpmWXDmHjsq           746     1           556
 4 16 de Junho CS    rpmWXDmHjsq          1050     2           856
 5 17 de Setembro CS fJy2O3xh1fV          3535     1          2805
 6 17 de Setembro CS fJy2O3xh1fV          2957     2          2381
 7 1º Maio CS        l7dBO27dhA6          2405     1          1919
 8 1º Maio CS        l7dBO27dhA6          4096     2          3352
 9 1º de Junho CS    wXnoTA2MDy9          1604     1          1299
10 1º de Junho CS    wXnoTA2MDy9          5404     2          4769
11 1º de Maio CS     oDTFc2FqImi            NA     1          2576
12 1º de Maio CS     oDTFc2FqImi            NA     2          6268
13 24 de Julho CS    vniKeY4S1Ru            NA     1          1512
14 24 de Julho CS    vniKeY4S1Ru            NA     2          1617
15 25 de Setembro CS WUFgPdTN02g            NA     1          2292
16 25 de Setembro CS WUFgPdTN02g            NA     2          4389
17 25 de Setembro CS cGvhZR5kDxQ            NA     1            89
18 25 de Setembro CS j8DdqpZn2qc            NA     1           197
19 3 de Fevereiro CS rnGH1Q5IyIU            NA     1            76
>

Merge two datasets with duplicate BY variables

To get the desired result you should do a cartesian product (cross join), which returns all the rows in all tables: each row in table1 is paired with every row in table2. I have used PROC SQL to do this, and I am eager to see how it can be done using a DATA step. Here's what I know:

Proc Sql;
create table test_merge as
select a.*, b.type_rhs, b.rhs1, b.rhs2
from test a, test11 b
where a.yearmonth=b.yearmonth
;
quit;

Again, I am new to SAS as well and I think this is one of the ways to create the desired output.

When working with huge data, you will see a note in the log that says "The execution of this query involves performing one or more Cartesian product joins that can not be optimized."
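
As for the DATA step question above, one possible sketch (not from the original answer; the data set name test_merge_ds and the renamed variable ym2 are my own) loops over test11 with POINT= for every row of test, which forms the cartesian product, and outputs only the pairs whose yearmonth values match - the same restriction as the WHERE clause in the PROC SQL version:

data test_merge_ds;
set test; /* one row of test per iteration */
do i = 1 to n; /* walk every row of test11 */
  set test11 (keep=yearmonth type_rhs rhs1 rhs2 rename=(yearmonth=ym2)) point=i nobs=n;
  if ym2 = yearmonth then output; /* keep only the matching pairs */
end;
run;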

Merging multiple datasets together

I tried to convert your logic into classical SQL.

proc sql;
create table qs as
select a.*
,b.qnam as SOFASCS
,qval as avalc_qs
from trans.qs as a
left join trans.suppqs as b
on a.usubjid = b.usubjid and a.qsseq = input(b.idvarval,best.) and qnam='SOFASCS'
where qscat="SOFA" ;
quit;

proc sql;
create table qs01 as
select qs.*, a.*
from qs
full /* left? */ join adb.adsofa as a
on a.usubjid = qs.usubjid and a.qsseq = qs.qsseq and mittfl = "Y"
;
quit;

I assume that you did not really want a full join but a simple left join in the last one.

Repeat the value for duplicate records in SAS

I cannot see when this would be a reasonable thing to do, but it is pretty easy. You could make a new variable and use RETAIN.

data want ;
set have ;
by id ;
if first.id then new=status ;
if not missing(status) then new=status ;
retain new ;
rename new=status status=old_status;
run;

Or you could do it using merge.

data want ;
merge have(drop=status) have(keep=id status where=(not missing(status)));
by id;
run;

Merge two datasets with identical column names

Chapter 4 of the SAS Language Reference explains the process in the section "DATA Step Processing during a One-to-One Reading":

DATA Step Processing during a One-to-One Reading


  • Compilation phase

    • SAS reads the descriptor information of each data set named in the SET statement and then creates a program data vector that contains all the variables from all data sets as well as variables created by the DATA step.
  • Execution — Step 1

    • When SAS executes the first SET statement, SAS reads the first observation from the first data set into the program data vector. The second SET statement reads the first observation from the second data set into the program data vector. If both data sets contain the same variables, the values from the second data set replace the values from the first data set, even if the value is missing. After reading the first observation from the last data set and executing any other statements in the DATA step, SAS writes the contents of the program data vector to the new data set. The SET statement does not reset the values in the program data vector to missing, except for those variables that were created or assigned values during the DATA step.
  • Execution — Step 2

    • SAS continues reading from one data set and then the other until it detects an end-of-file indicator in one of the data sets. SAS stops processing with the last observation of the shortest data set and does not read the remaining observations from the longer data set.
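
A minimal sketch of the one-to-one reading described above (the data set names one, two and combined are mine): combined ends up with only two observations because processing stops at the end of the shorter data set, and its x values are missing and 200 because the shared variables read by the second SET statement replace those from the first, even when they are missing.

data one;
input id x;
datalines;
1 10
2 20
3 30
;

data two;
input id x;
datalines;
1 .
2 200
;

data combined;
set one; /* reads the next observation from one */
set two; /* overwrites id and x with values from two, even missing ones */
run;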

One important difference between MERGE and UPDATE statements is that in UPDATE the values from the second data set replace the values from the first data set, ONLY if the value is NOT missing.

The concept is extended when more than two data sets are listed:

  • For a MERGE statement the outcome is that the rightmost, or last read, value of a variable is placed in the PDV
  • For an UPDATE statement the outcome is that the rightmost, or last read, NON-MISSING value of a variable is placed in the PDV
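
A small sketch of that difference (the data set names m1, m2, via_merge and via_update are mine): with MERGE, x ends up missing because the rightmost value wins; with UPDATE, x stays 10 because the missing value in the transaction data set is ignored.

data m1;
input id x;
datalines;
1 10
;

data m2;
input id x;
datalines;
1 .
;

data via_merge;
merge m1 m2;
by id;
run;

data via_update;
update m1 m2;
by id;
run;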

NOTE: The MODIFY statement replaces values in place in the master data set and does not write a new output data set, as MERGE and UPDATE do.


