How to replicate a SAS merge

Gordon's answer is close, but it misses one point. Here's its output:

person  visit  code1  type1  seqnum  person  visit  code2  type2  seqnum
1       1      1      1      1       1       1      1      1      1
1       1      2      2      2       1       1      2      2      2
NULL    NULL   NULL   NULL   NULL    1       1      3      3      3
1       2      1      3      1       NULL    NULL   NULL   NULL   NULL

The third row's nulls are incorrect, while the fourth's are correct.

As far as I know, in SQL there's not a really good way to do this other than splitting things up into a few queries. I think there are five possibilities:

  • Matching person/visit, matching seqnums
  • Matching person/visit, left has more seqnums
  • Matching person/visit, right has more seqnums
  • Left has unmatched person/visit
  • Right has unmatched person/visit

I think the last two could be combined into one query, but the second and third probably have to be separate queries. You can UNION everything together, of course.

So here's an example, using some temporary tables that make it a little easier to see what's going on. Note that the third row is now filled in for code1 and type1, even though those are 'extra'. I've only added three of the five criteria - the three you had in your initial example - but the other two aren't too hard.

Note that this is an example of something that is far faster in SAS, because SAS has a row-wise concept, i.e., it's capable of going one row at a time. SQL tends to take a lot longer at these with large tables, unless things can be partitioned very neatly and the indexes are very good - and even then I've never seen a SQL DBA do anywhere near as well as SAS at some of these types of problems. That's something you'll have to accept, of course - SQL has its own advantages, one of which is probably price...

Here's my example code. I'm sure it's not terribly elegant; hopefully one of the SQL folk can improve it. This is written to work in SQL Server (using table variables); the same thing should work, with some changes (to use temporary tables), in other variants, assuming they implement windowing. (SAS of course can't do this particular thing, as even FedSQL implements ANSI 1999, not ANSI 2008.) This is based on Gordon's initial query, then modified with the additional bits at the end. Anyone who wants to improve this, please feel free to edit and/or copy any bit you wish to a new or existing answer.

declare @t1 table (person INT, visit INT, code1 INT, type1 INT);
declare @t2 table (person INT, visit INT, code2 INT, type2 INT);

insert into @t1 values (1,1,1,1)
insert into @t1 values (1,1,2,2)
insert into @t1 values (1,2,1,3)

insert into @t2 values (1,1,1,1)
insert into @t2 values (1,1,2,2)
insert into @t2 values (1,1,3,3)

-- Matching person/visit, matching seqnums
select coalesce(t1.person, t2.person) as person,
       coalesce(t1.visit, t2.visit) as visit,
       t1.code1, t1.type1, t2.code2, t2.type2
from (select *,
             row_number() over (partition by person, visit order by type1) as seqnum
      from @t1
     ) t1
     inner join
     (select *,
             row_number() over (partition by person, visit order by type2) as seqnum
      from @t2
     ) t2
     on t1.person = t2.person and t1.visit = t2.visit
        and t1.seqnum = t2.seqnum

union all

-- Matching person/visit, right has more seqnums:
-- join the last row of each @t1 group to the extra @t2 rows
select coalesce(t1.person, t2.person) as person,
       coalesce(t1.visit, t2.visit) as visit,
       t1.code1, t1.type1, t2.code2, t2.type2
from (
     (select person, visit, MAX(seqnum) as max_rownum
      from (select person, visit,
                   row_number() over (partition by person, visit order by type1) as seqnum
            from @t1) t1_f
      group by person, visit
     ) t1_m
     inner join
     (select *,
             row_number() over (partition by person, visit order by type1) as seqnum
      from @t1
     ) t1
     on t1.person = t1_m.person and t1.visit = t1_m.visit
        and t1.seqnum = t1_m.max_rownum
     inner join
     (select *,
             row_number() over (partition by person, visit order by type2) as seqnum
      from @t2
     ) t2
     on t1.person = t2.person and t1.visit = t2.visit
        and t1.seqnum < t2.seqnum
     )

union all

-- Left has unmatched person/visit
select t1.person, t1.visit, t1.code1, t1.type1, t2.code2, t2.type2
from @t1 t1
     left join @t2 t2
     on t2.person = t1.person and t2.visit = t1.visit
where t2.code2 is null
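
For reference, here is a minimal sketch of the SAS match-merge that the SQL above is trying to reproduce (the data set names t1, t2 and want are mine, not from the original question). Within the 1/1 BY group, once t1 runs out of rows, SAS retains its last values of code1 and type1 onto the extra row contributed by t2, which is why the third row should be filled in rather than NULL:

data t1;
input person visit code1 type1;
datalines;
1 1 1 1
1 1 2 2
1 2 1 3
;

data t2;
input person visit code2 type2;
datalines;
1 1 1 1
1 1 2 2
1 1 3 3
;

data want;
merge t1 t2; /* within the 1/1 BY group, t1's last code1/type1 values */
by person visit; /* are carried onto the third row contributed by t2 */
run;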

How to repeat or merge rows in SAS

Use RETAIN to carry forward and track the values of the variables retrieved or assigned in the prior row.

data want;
set have;

retain pcol1 pcol2 pcol3;

if missing(col1) then col1 = pcol1;
if missing(col2) then col2 = pcol2;
if missing(col3) then col3 = pcol3;

pcol1 = col1;
pcol2 = col2;
pcol3 = col3;

drop pcol:;
run;
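
For instance, with a hypothetical three-column HAVE like the one below, the missing col1 and col2 on the second row are filled with the values carried over from the first row, so the second observation of WANT comes out as 1 2 6:

data have;
input col1 col2 col3;
datalines;
1 2 3
. . 6
;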

Two arrays can be used for the case of many columns:

  • A variable-based array for referencing values from the data set
  • A temporary array for tracking prior-row values. Temporary arrays are not affected by the standard DATA step behavior that resets the PDV to missing.

data want;
set have;

array priors(1000) _temporary_;
array values col1-col40;

do _n_ = 1 to dim(values); * repurpose _n_ for loop indexing;

if missing(values(_n_))
then values(_n_) = priors(_n_); * repeat prior value into missing value;
else priors(_n_) = values(_n_); * track most recent non-missing value;

end;
run;

Reproducing SAS merge in R

For the sample data, a SAS merge matches every row of data in the first data set at least once with the data in the second data set, based on the sort order of the BY groups.

After writing the sample data to CSV and reading it back into SAS, the data step that merges the data results in 35 observations.

filename data_a "/folders/myfolders/data/dataA.csv";
filename data_b "/folders/myfolders/data/dataB.csv";

data dataA;
infile data_a dlm="," firstobs = 2;
input facility $ facilityuid $ totalPVLS_NQ;
run;

data dataB;
/* read starting second row */
infile data_b dlm="," firstobs = 2;
input facility $ facilityuid $ totalPVLS_DQ;
run;

/* sort data */
proc sort data = dataA;
by facility facilityuid;
run;

proc sort data = dataB;
by facility facilityuid;
run;

/*
* merge the two data sets
*/

data both;
merge dataA(in=inA) dataB(in=inB);
by facility facilityuid;
in_a = inA;
in_b = inB;
run;

For the first BY group, since there are three observations in dataA and four observations in dataB with the key 1 de Mai uBjjsyie, the output data set both contains 4 rows with this key. We've assigned two new variables to capture the values of the IN= data set options, to show which data set contributed to each output row, and printed the result.

proc print data = both;
run;

[PROC PRINT output of data set both]

From the printout we can see that the third row from dataA was duplicated to match it with the fourth row from dataB.

The default behavior of the R merge() function is to form the cartesian product of the rows in dataA and dataB whose by variables match. Since there are three rows in dataA and four rows in dataB with this key, we print the first 12 rows of the output to illustrate the cartesian product of the data.

dataC <- merge(dataA, dataB, by=c("facility","facilityuid"))
head(dataC,n = 12)

   facility     facilityuid totalPVLS_NQ4 totalPVLS_DQ4
1  1 de Maio CS uBjjsyieqtr          2713          3186
2  1 de Maio CS uBjjsyieqtr          2713            NA
3  1 de Maio CS uBjjsyieqtr          2713            NA
4  1 de Maio CS uBjjsyieqtr          2713          3562
5  1 de Maio CS uBjjsyieqtr            NA          3186
6  1 de Maio CS uBjjsyieqtr            NA            NA
7  1 de Maio CS uBjjsyieqtr            NA            NA
8  1 de Maio CS uBjjsyieqtr            NA          3562
9  1 de Maio CS uBjjsyieqtr          2019          3186
10 1 de Maio CS uBjjsyieqtr          2019            NA
11 1 de Maio CS uBjjsyieqtr          2019            NA
12 1 de Maio CS uBjjsyieqtr          2019          3562
>

Even dplyr join functions use a cartesian product to resolve multiple matches between dataA and dataB, so we are unable to exactly replicate the SAS output in R solely with an "out of the box" R function. When we read the help for mutate-joins, we find the following.

mutate-joins {dplyr} R Documentation
Mutating joins
Description
The mutating joins add columns from y to x, matching rows based on the keys:

inner_join(): includes all rows in x and y.

left_join(): includes all rows in x.

right_join(): includes all rows in y.

full_join(): includes all rows in x or y.

If a row in x matches multiple rows in y, all the rows in y will be returned once for each matching row in x

The cartesian product effect is consistent across left_join(), inner_join(), full_join() and right_join().

library(dplyr)
dataA %>% right_join(.,dataB) -> dataD
head(dataD,n = 12)

...and the output:

> head(dataD,n = 12)
   facility     facilityuid totalPVLS_NQ4 totalPVLS_DQ4
1  1 de Maio CS uBjjsyieqtr            NA          3186
2  1 de Maio CS uBjjsyieqtr            NA            NA
3  1 de Maio CS uBjjsyieqtr            NA            NA
4  1 de Maio CS uBjjsyieqtr            NA          3562
5  1 de Maio CS uBjjsyieqtr          2019          3186
6  1 de Maio CS uBjjsyieqtr          2019            NA
7  1 de Maio CS uBjjsyieqtr          2019            NA
8  1 de Maio CS uBjjsyieqtr          2019          3562
9  1 de Maio CS uBjjsyieqtr          2713          3186
10 1 de Maio CS uBjjsyieqtr          2713            NA
11 1 de Maio CS uBjjsyieqtr          2713            NA
12 1 de Maio CS uBjjsyieqtr          2713          3562
>

Bottom line: one needs to clean the data to eliminate or resolve the duplicates for facilityuid in both dataA and dataB.

For example, is it acceptable to eliminate rows where totalPVLS_DQ4 is NA in dataB, or where totalPVLS_NQ4 is NA in dataA? Can we add a sequential identifier to each file, assuming that the multiple observations for a facility were taken in the order they appear in the file?

If we take those questions as constraints, we can join the data from the two files as follows:

library(dplyr)
dataA %>% filter(!is.na(totalPVLS_NQ4)) %>% group_by(.,facility,facilityuid) %>%
mutate(., n = seq_along(facilityuid)) -> dataA
dataB %>% filter(!is.na(totalPVLS_DQ4)) %>% group_by(.,facility,facilityuid) %>%
mutate(., n = seq_along(facilityuid)) %>%
full_join(.,dataA) -> resultData
resultData

...and the output:

# A tibble: 19 x 5
# Groups:   facility, facilityuid [11]
   facility          facilityuid totalPVLS_DQ4     n totalPVLS_NQ4
   <chr>             <chr>               <dbl> <int>         <dbl>
 1 1 de Maio CS      uBjjsyieqtr          3186     1          2019
 2 1 de Maio CS      uBjjsyieqtr          3562     2          2713
 3 16 de Junho CS    rpmWXDmHjsq           746     1           556
 4 16 de Junho CS    rpmWXDmHjsq          1050     2           856
 5 17 de Setembro CS fJy2O3xh1fV          3535     1          2805
 6 17 de Setembro CS fJy2O3xh1fV          2957     2          2381
 7 1º Maio CS        l7dBO27dhA6          2405     1          1919
 8 1º Maio CS        l7dBO27dhA6          4096     2          3352
 9 1º de Junho CS    wXnoTA2MDy9          1604     1          1299
10 1º de Junho CS    wXnoTA2MDy9          5404     2          4769
11 1º de Maio CS     oDTFc2FqImi            NA     1          2576
12 1º de Maio CS     oDTFc2FqImi            NA     2          6268
13 24 de Julho CS    vniKeY4S1Ru            NA     1          1512
14 24 de Julho CS    vniKeY4S1Ru            NA     2          1617
15 25 de Setembro CS WUFgPdTN02g            NA     1          2292
16 25 de Setembro CS WUFgPdTN02g            NA     2          4389
17 25 de Setembro CS cGvhZR5kDxQ            NA     1            89
18 25 de Setembro CS j8DdqpZn2qc            NA     1           197
19 3 de Fevereiro CS rnGH1Q5IyIU            NA     1            76
>

Merge two datasets with duplicate BY variables

To get the desired result you should do a cartesian product (cross join), which returns all the rows in all tables: each row in table1 is paired with every row in table2. I have used PROC SQL to do this, and I am eager to see how it can be done using a DATA step. Here's what I know:

Proc Sql;
create table test_merge as
select a.*, b.type_rhs, b.rhs1, b.rhs2
from test a, test11 b
where a.yearmonth=b.yearmonth
;
quit;

Again, I am new to SAS as well and I think this is one of the ways to create the desired output.

When working with huge data, you will see a note in the log that says "The execution of this query involves performing one or more Cartesian product joins that can not be optimized."
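
As for the DATA step question above, one possible sketch (not from the original answer; the data set name test_merge_ds and the renamed variable ym2 are my own) loops over test11 with POINT= for every row of test, which forms the cartesian product, and outputs only the pairs whose yearmonth values match - the same restriction as the WHERE clause in the PROC SQL version:

data test_merge_ds;
set test; /* one row of test per iteration */
do i = 1 to n; /* walk every row of test11 */
  set test11 (keep=yearmonth type_rhs rhs1 rhs2 rename=(yearmonth=ym2)) point=i nobs=n;
  if ym2 = yearmonth then output; /* keep only the matching pairs */
end;
run;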

Merging multiple datasets together

I tried to convert your logic into classical SQL.

proc sql;
create table qs as
select a.*
,b.qnam as SOFASCS
,qval as avalc_qs
from trans.qs as a
left join trans.suppqs as b
on a.usubjid = b.usubjid and a.qsseq = input(b.idvarval,best.) and qnam='SOFASCS'
where qscat="SOFA" ;
quit;

proc sql;
create table qs01 as
select qs.*, a.*
from qs
full /* left? */ join adb.adsofa as a
on a.usubjid = qs.usubjid and a.qsseq = qs.qsseq and mittfl = "Y"
;
quit;

I assume that you did not really want a full join but a simple left join in the last one.

Repeat the value for duplicate records in SAS

I cannot see when this would be a reasonable thing to do, but it is pretty easy. You could make a new variable and use RETAIN.

data want ;
set have ;
by id ;
if first.id then new=status ;
if not missing(status) then new=status ;
retain new ;
rename new=status status=old_status;
run;

Or you could do it using merge.

data want ;
merge have(drop=status) have(keep=id status where=(not missing(status)));
by id;
run;

Merge two datasets with identical column names

Chapter 4 of the SAS Language Reference explains the process in the section "DATA Step Processing during a One-to-One Reading":

DATA Step Processing during a One-to-One Reading


  • Compilation phase

    • SAS reads the descriptor information of each data set named in the SET statement and then creates a program data vector that contains all the variables from all data sets as well as variables created by the DATA step.
  • Execution — Step 1

    • When SAS executes the first SET statement, SAS reads the first observation from the first data set into the program data vector. The second SET statement reads the first observation from the second data set into the program data vector. If both data sets contain the same variables, the values from the second data set replace the values from the first data set, even if the value is missing. After reading the first observation from the last data set and executing any other statements in the DATA step, SAS writes the contents of the program data vector to the new data set. The SET statement does not reset the values in the program data vector to missing, except for those variables that were created or assigned values during the DATA step.
  • Execution — Step 2

    • SAS continues reading from one data set and then the other until it detects an end-of-file indicator in one of the data sets. SAS stops processing with the last observation of the shortest data set and does not read the remaining observations from the longer data set.
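
A minimal sketch of the one-to-one reading described above (the data set names one, two and combined are mine): combined ends up with only two observations because processing stops at the end of the shorter data set, and its x values are missing and 200 because the shared variables read by the second SET statement replace those from the first, even when they are missing.

data one;
input id x;
datalines;
1 10
2 20
3 30
;

data two;
input id x;
datalines;
1 .
2 200
;

data combined;
set one; /* reads the next observation from one */
set two; /* overwrites id and x with values from two, even missing ones */
run;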

One important difference between MERGE and UPDATE statements is that in UPDATE the values from the second data set replace the values from the first data set, ONLY if the value is NOT missing.

The concept is extended when more than two data sets are listed:

  • For a MERGE statement the outcome is that the rightmost, or last read, value of a variable is placed in the PDV
  • For an UPDATE statement the outcome is that the rightmost, or last read, NON-MISSING value of a variable is placed in the PDV
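
A small sketch of that difference (the data set names m1, m2, via_merge and via_update are mine): with MERGE, x ends up missing because the rightmost value wins; with UPDATE, x stays 10 because the missing value in the transaction data set is ignored.

data m1;
input id x;
datalines;
1 10
;

data m2;
input id x;
datalines;
1 .
;

data via_merge;
merge m1 m2;
by id;
run;

data via_update;
update m1 m2;
by id;
run;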

NOTE: The MODIFY statement replaces values in place in the master data set and does not write a new output data set, as MERGE and UPDATE do.


