Subset Panel Data by Group

How do I subset a panel data set with three criteria in Stata?

There might be an even more succinct way to do this, but I would split it up into these three steps:

clear
input int person_id str6 year int cash
222 "2020q4" 6000
222 "2021q1" 7000
222 "2021q2" 8000
321 "2020q4" 4000
321 "2021q4" 11000
321 "2021q2" 15000
end

*Test if obs has cash>10000 in 2021 q2
gen subset_obs = (cash > 10000 & year == "2021q2")

*By ID, get the max value in subset_obs to copy 1s to all rows for the ID
bysort person_id : egen subset_id = max(subset_obs)

*Keep only IDs where subset_id is 1
keep if subset_id == 1
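
For comparison, the same keep-the-whole-group logic can be sketched in R with dplyr (assuming a data frame named panel with the same person_id, year, and cash columns):

library(dplyr)

panel %>%
  group_by(person_id) %>%
  # any() plays the role of egen max(): keep every row of an ID
  # if any one of its rows meets the condition
  filter(any(cash > 10000 & year == "2021q2")) %>%
  ungroup()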

Subset panel data with more than one match for a criteria over multiple columns

Try this with dplyr 1.0.0 or later (cur_data() was introduced in 1.0.0):

library(dplyr)

cols <- c('Col2', 'Col3')

df %>%
  group_by(Col1) %>%
  filter(sum(colSums(cur_data()[cols] == criteria) >= 1) > 1)

#  Col1  Col2  Col3
#  <chr> <chr> <chr>
#1 A     A50   A40
#2 A     C50   A50

cur_data()[cols] restricts the group's data to just the columns in cols; colSums counts the matches to criteria in each of those columns; and the outer sum(... >= 1) > 1 keeps a group only when matches occur in more than one distinct column. (In dplyr 1.1.0 and later, cur_data() is superseded by pick(all_of(cols)).)
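
Neither df nor criteria is shown in the question; a minimal, hypothetical reconstruction that reproduces the output above would be:

# hypothetical input, inferred from the printed result
df <- tibble(
  Col1 = c("A", "A", "B"),
  Col2 = c("A50", "C50", "A50"),
  Col3 = c("A40", "A50", "B20")
)
criteria <- "A50"

# group A matches "A50" in both Col2 and Col3, so both of its rows are kept;
# group B matches only in Col2, so it is dropped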

Subsetting panel data

Using your picture, here's a stab at what you want. It may be long-winded and others may have a more elegant way of doing it, but it gets the job done:

library("reshape2")

con <- textConnection("Person Year Income Age Sex
1 2003 1500 15 1
1 2004 1700 16 1
1 2005 2000 17 1
2 2003 1400 25 0
2 2004 1900 NA 0
2 2005 2000 27 0
3 2003 NA 25 0
3 2004 1900 NA 0
3 2005 2000 27 0")
pnls <- read.table(con, header=TRUE)

# reformat table for easier processing
pnls2 <- melt(pnls, id=c("Person"))
# and select those rows that relate to values
# of income and age
pnls2 <- subset(pnls2,
                variable == "Income" | variable == "Age")

# create a column of names in the desired format (e.g. Person1Age)
pnls2$name <- paste("Person", pnls2$Person, pnls2$variable, sep="")

# Collect full set of unique names
name.set <- unique(pnls2$name)
# find the incomplete set
incomplete <- unique( pnls2$name[ is.na(pnls2$value) ])
# then find the complement of the incomplete set
complete <- setdiff(name.set, incomplete)

# These two now contain list of complete and incomplete variables
complete
incomplete

If you are not familiar with melting and the reshape2 package, you may want to run it line by line, and examine the value of pnls2 at different stages to see how this works.

EDIT: adding code to compile the values, as requested by @bstockton. I am sure there is a more appropriate R idiom for this, but once again, in the absence of better answers: this works.

# use these lists of complete and incomplete variable names
# as keys to collect lists of values for each variable name
compile <- function(keys) {
  holder <- list()
  for (n in keys) {
    # column 3 of pnls2 is the melted value column
    holder[[n]] <- subset(pnls2, name == n)[, 3]
  }
  return(as.data.frame(holder))
}

complete.recs <- compile(complete)
incomplete.recs <- compile(incomplete)
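
For readers on a current tidyverse, the same melt-and-compare logic can be written with tidyr::pivot_longer (a sketch equivalent to the reshape2 steps above):

library(tidyr)

# long format: one row per Person/variable/value triple
pnls_long <- pivot_longer(pnls, c(Income, Age),
                          names_to = "variable", values_to = "value")
pnls_long$name <- paste0("Person", pnls_long$Person, pnls_long$variable)

# names touching an NA value are incomplete; the rest are complete
incomplete <- unique(pnls_long$name[is.na(pnls_long$value)])
complete <- setdiff(unique(pnls_long$name), incomplete)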

R: subset/stratify a dataframe by group; take the maximum number of observations per group while keeping the TRUE/FALSE boolean balanced

In Python, the code looks like this:

  1. Make a new empty dataframe and write a for loop that groups by restaurant_id and finds the minimum count n between the two Restaurant_Bool subgroups.
  2. Create a catch where, if n is 0, the loop skips to the next restaurant_id.
  3. Merge the recommended and not-recommended samples into a temporary group_reviews dataframe and append group_reviews to the balanced_reviews dataframe, asserting that the mean of Restaurant_Bool is 0.5.
  4. After the loop, assert that the mean of Restaurant_Bool is 0.5 for the whole balanced_reviews dataframe.

import pandas as pd

balanced_reviews = pd.DataFrame()
for restaurant_id, group in reviews.groupby('restaurant_id'):
    take_n = min((group['Restaurant_Bool'] == 0).sum(),
                 (group['Restaurant_Bool'] == 1).sum())
    if take_n == 0:
        continue
    reg_reviews = group[group['Restaurant_Bool'] == 1].sample(n=take_n, random_state=0)
    not_reviews = group[group['Restaurant_Bool'] == 0].sample(n=take_n, random_state=0)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    group_reviews = pd.concat([reg_reviews, not_reviews])

    assert group_reviews['Restaurant_Bool'].mean() == .5
    balanced_reviews = pd.concat([balanced_reviews, group_reviews])

assert balanced_reviews['Restaurant_Bool'].mean() == .5
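
For completeness, the same balancing logic can be sketched in R with dplyr (assuming a reviews data frame with the same columns; slice_sample() requires dplyr 1.0.0 or later):

library(dplyr)

set.seed(0)
balanced_reviews <- reviews %>%
  group_by(restaurant_id) %>%
  # the smaller of the two subgroup counts per restaurant
  mutate(take_n = min(sum(Restaurant_Bool == 0), sum(Restaurant_Bool == 1))) %>%
  filter(take_n > 0) %>%
  group_by(restaurant_id, Restaurant_Bool) %>%
  slice_sample(n = take_n[1]) %>%
  ungroup() %>%
  select(-take_n)

stopifnot(mean(balanced_reviews$Restaurant_Bool) == 0.5)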

Create a panel data frame in R - where one of the values within a variable changes through time

After grouping by 'id', we expand the rows with complete(), using the sequence from the minimum 'year' to the maximum 'year' + 1 over the entire data, and then take the cumulative sum of 'y' within each 'id'.
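
The input data is not shown in the question; a minimal, hypothetical data frame consistent with the output below would be:

# hypothetical input, inferred from the printed result
data <- data.frame(
  id   = c(10420, 10740, 12060, 12060, 12060),
  year = c(2004, 2002, 2003, 2003, 2006),
  y    = c(1, 1, 1, 1, 1)
)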

library(dplyr)
library(tidyr)
data %>%
  group_by(id, year) %>%
  mutate(y = cumsum(y)) %>%
  group_by(id) %>%
  filter(!duplicated(year, fromLast = TRUE)) %>%
  complete(year = min(.$year):(max(.$year) + 1), fill = list(y = 0)) %>%
  mutate(y = cumsum(y))
# A tibble: 18 x 3
# Groups: id [3]
# id year y
# <dbl> <dbl> <dbl>
# 1 10420 2002 0
# 2 10420 2003 0
# 3 10420 2004 1
# 4 10420 2005 1
# 5 10420 2006 1
# 6 10420 2007 1
# 7 10740 2002 1
# 8 10740 2003 1
# 9 10740 2004 1
#10 10740 2005 1
#11 10740 2006 1
#12 10740 2007 1
#13 12060 2002 0
#14 12060 2003 2
#15 12060 2004 2
#16 12060 2005 2
#17 12060 2006 3
#18 12060 2007 3

Beta estimation over panel data by group

As I mentioned in the comment above, this solution might be a bit off since I'm not able to reproduce your desired output 100%. Still, the functionality of what you're trying to accomplish is there. Have a look at it and let me know if this is something you could use or I could develop further.

EDIT: The code below does not reproduce the desired output as specified above, but turned out to be what the OP was looking for after all.

Here goes:

# Datasource
fund <- as.numeric(c(1,1,1,1,1,1,1,1,3,3,3,3,3,3,2,2,2,2,2,2,2))
return <- as.numeric(1:21)
benchmark <- as.numeric(c(1,13,14,20,14,32,4,1,5,7,1,0,7,1,-2,1,6,-7,9,10,9))
riskfree <- as.numeric(c(1,5,1,2,1,6,4,7,5,-5,10,0,3,1,2,1,6,7,8,9,10))
date <- as.Date(c("2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
                  "2011-02-28","2010-07-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
                  "2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30"))
funddata <- data.frame(date, fund, return, benchmark, riskfree)

# Creating variables of interest
funddata["ret_riskfree"]<-as.numeric(funddata$return-funddata$riskfree)
funddata["benchmark_riskfree"]<-as.numeric(funddata$benchmark-funddata$riskfree)

# Target check #################################################################
# Subset last three rows in original dataframe
df_check <- funddata[funddata$fund == 1,]
df_check <- tail(df_check,3)

# Run regression check
mod_check <- lm(df_check$ret_riskfree~df_check$benchmark_riskfree)
coef(mod_check)

# My suggestion ################################################################
# The following function takes five arguments:
# 1. a dataframe, myDf
# 2. a column that you'd like to subset myDf on, subCol
# 3. the dependent and independent variable names, varY and varX
# 4. a window length for a sliding window, myWin

fun_rollreg <- function(myDf, subCol, varY, varX, myWin){
  df_main <- myDf

  # Make an empty data frame to store results in
  df_data <- data.frame()

  # Identify unique funds
  unFunds <- unique(unlist(df_main[subCol]))

  # Loop through your subset
  for (fundx in unFunds){

    # Subset on the grouping column (subCol) rather than
    # a hard-coded column name
    df <- df_main[df_main[[subCol]] == fundx, ]

    # Keep a copy of the original until later
    df_new <- df

    # Specify a container for your beta estimates
    betas <- c()

    # Specify window length
    wlength <- myWin

    # Retrieve some data dimensions to loop on
    rows <- dim(df)[1]
    periods <- rows - wlength

    # Loop through each subset of the data
    # and run regression
    for (i in rows:(rows - periods)){

      # Split dataframe in subsets
      # according to the window length
      df1 <- df[(i - (wlength - 1)):i, ]

      # Run regression
      beta <- coef(lm(df1[[varY]] ~ df1[[varX]]))[2]

      # Keep regression results
      betas[[i]] <- beta
    }
    # Add regression data to dataframe
    df_new <- data.frame(df, betas)

    # Keep the new dataset for later concatenation
    df_data <- rbind(df_data, df_new)
  }
  return(df_data)
}

# Run the function:
df_roll <- fun_rollreg(myDf = funddata, subCol = 'fund',
                       varY = 'ret_riskfree', varX = 'benchmark_riskfree',
                       myWin = 3)
# Show the results
print(head(df_roll,8))

For the first 8 rows in the new dataframe (fund = 1), this is the result:

        date fund return benchmark riskfree ret_riskfree benchmark_riskfree       betas
1 2010-07-30    1      1         1        1            0                  0          NA
2 2010-08-31    1      2        13        5           -3                  8          NA
3 2010-09-30    1      3        14        1            2                 13  0.10465116
4 2010-10-31    1      4        20        2            2                 18  0.50000000
5 2010-11-30    1      5        14        1            4                 13 -0.20000000
6 2010-12-31    1      6        32        6            0                 26 -0.30232558
7 2011-01-30    1      7         4        4            3                  0 -0.11538462
8 2011-02-28    1      8         1        7            1                 -6 -0.05645161
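
For what it's worth, the same rolling betas can be computed more compactly with zoo::rollapply. This is a sketch, not the OP's approach; it assumes the zoo package and the funddata frame built above:

library(zoo)

# slope coefficient of a y ~ x regression over one window
win_beta <- function(w) coef(lm(w[, "ret_riskfree"] ~ w[, "benchmark_riskfree"]))[2]

# right-aligned windows of length 3 within each fund;
# the first two positions are padded with NA, as above
betas_by_fund <- lapply(split(funddata, funddata$fund), function(d) {
  rollapplyr(as.matrix(d[, c("ret_riskfree", "benchmark_riskfree")]),
             width = 3, FUN = win_beta, by.column = FALSE, fill = NA)
})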

Extracting specific countries in Panel Data

This is very basic subsetting, and you can find several answers on SO and in any introductory manual.

Assuming you have read your csv file in as an object named "df", something like this should do the job:

df[df$country %in% c("United States", "Albania"), ]

In the future:

  1. Screenshots of your data are of little use. Please use something like dput(head(yourdata)) instead.
  2. Show what you have tried. Don't simply write "I've been toying with the subset function". If you want to use the subset function in particular but haven't had success, it is helpful to show what you have done to help others troubleshoot.

A minimal example

Sample data:

set.seed(1)
df <- data.frame(country = sample(letters[1:5], 15, replace = TRUE),
                 somerandomvalue = rnorm(15),
                 anotherrandomvalue = rnorm(15))

Some summary data about the "country" column shows that there are five unique countries and 15 cases (rows) overall. (In R 4.0 and later, data.frame() no longer converts strings to factors, so table() rather than summary() gives the counts:)

> table(df$country)

a b c d e
2 5 1 4 3

Take just a subset:

> df[df$country %in% c("a", "b"), ]
   country somerandomvalue anotherrandomvalue
1        b    -0.005767173         0.80418951
2        b     2.404653389        -0.05710677
5        b    -1.147657009        -0.69095384
10       a    -0.891921127        -0.43331032
11       b     0.435683299        -0.64947165
12       a    -1.237538422         0.72675075
14       b     0.377395646         0.99216037

Or, using the subset function:

subset(df, country %in% c("a", "b"))
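
Or, with dplyr, which several other answers on this page already use:

library(dplyr)
filter(df, country %in% c("a", "b"))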

How to filter panel data to only include people who participated in all years?

You can try this:

library(tidyverse)

df <- tribble(
  ~syear, ~pid,
  2008, 201,
  2008, 203,
  2008, 602,
  2012, 602,
  2008, 604,
  2008, 901,
  2012, 901,
  2016, 901,
  2008, 1501,
  2012, 1501,
  2016, 1501
)

df %>%
  group_by(pid) %>%
  mutate(cnt = n()) %>%
  filter(cnt == 3)

# alternatively, the cnt column can be dropped
df %>%
  group_by(pid) %>%
  mutate(cnt = n()) %>%
  filter(cnt == 3) %>%
  select(-cnt)
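
The hard-coded 3 is the number of survey years in the data (2008, 2012, 2016). A variant that derives that number from the data instead, sketched with n_distinct():

df %>%
  group_by(pid) %>%
  filter(n_distinct(syear) == n_distinct(df$syear)) %>%
  ungroup()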

