How do I subset a panel data set with three criteria in Stata?
There might be an even more succinct way to do this, but I would split it up in these three steps:
clear
input int person_id str6 year int cash
222 "2020q4" 6000
222 "2021q1" 7000
222 "2021q2" 8000
321 "2020q4" 4000
321 "2021q4" 11000
321 "2021q2" 15000
end
*Test if obs has cash>10000 in 2021 q2
gen subset_obs = (cash > 10000 & year == "2021q2")
*By ID, get the max value in subset_obs to copy 1s to all rows for the ID
bysort person_id : egen subset_id = max(subset_obs)
*Keep only IDs with subset_id is 1
keep if subset_id == 1
Subset panel data with more than one match for a criteria over multiple columns
Try using this with dplyr
1.0.0
library(dplyr)
cols <- c('Col2', 'Col3')
df %>%
group_by(Col1) %>%
filter(sum(colSums(cur_data()[cols] == criteria) >= 1) > 1)
# Col1 Col2 Col3
# <chr> <chr> <chr>
#1 A A50 A40
#2 A C50 A50
cur_data()[cols]
selects only cols
column, colSums
counts number of matches in each column and sum
ensures that the match is in different columns.
Sub setting panel data
Using your picture, here's a stab at what you want. It may be long-winded and others may have a more elegant way of doing it, but it gets the job done:
library("reshape2")
con <- textConnection("Person Year Income Age Sex
1 2003 1500 15 1
1 2004 1700 16 1
1 2005 2000 17 1
2 2003 1400 25 0
2 2004 1900 NA 0
2 2005 2000 27 0
3 2003 NA 25 0
3 2004 1900 NA 0
3 2005 2000 27 0")
pnls <- read.table(con, header=TRUE)
# reformat table for easier processing
pnls2 <- melt(pnls, id=c("Person"))
# and select those rows that relate to values
# of income and age
pnls2 <- subset(pnls2,
variable == "Income" | variable == "Age")
# create column of names in desired format (e.g Person1Age etc)
pnls2$name <- paste("Person", pnls2$Person, pnls2$variable, sep="")
# Collect full set of unique names
name.set <- unique(pnls2$name)
# find the incomplete set
incomplete <- unique( pnls2$name[ is.na(pnls2$value) ])
# then find the complement of the incomplete set
complete <- setdiff(name.set, incomplete)
# These two now contain list of complete and incomplete variables
complete
incomplete
If you are not familiar with melt
ing and the reshape2
package, you may want to run it line by line, and examine the value of pnls2
at different stages to see how this works.
EDIT: adding code to compile the values as requested by @bstockton. I am sure there is a much more appropriate R idiom to do this, but once again, in the absence of better answers: this works
# use these lists of complete and incomplete variable names
# as keys to collect lists of values for each variable name
compile <- function(keys) {
holder = list()
for (n in keys) {
holder[[ n ]] <- subset(pnls2, pnls2$name == n)[,3]
}
return( as.data.frame(holder) )
}
complete.recs <- compile(complete)
incomplete.recs <- compile(incomplete)
r subset stratify dataframe by group; subset the max amount of observations per group as long as true and false boolean are balanced
In python, the code looks like this
Make a new empty dataset and write a for loop that groups by restaurant_id and find the minimum amount of n per subgroup Restaurant_Bool
create a catch where if n is 0, then go to the next restaurant_id
merge recommended and not recommended into a temporary group_reviews dataframe and append group_reviews reviews into the balanced_reviews dataframe while asserting that the mean of Restaurant_Bool is 0.5
after the loop is over per group, assert that the mean of Restaurant_Bool is 0.5 for the whole dataframe balanced_reviews
balanced_reviews = pd.DataFrame()
for restaurant_id, group in reviews.groupby('restaurant_id'):
take_n = min((group['Restaurant_Bool'] == 0).sum(), (group['Restaurant_Bool'] == 1).sum())
if take_n == 0:
continue
reg_reviews = group[group['Restaurant_Bool'] == 1].sample(n=take_n, random_state=0)
not_reviews = group[group['Restaurant_Bool'] == 0].sample(n=take_n, random_state=0)
group_reviews = reg_reviews.append(not_reviews)
assert group_reviews['Restaurant_Bool'].mean() == .5
balanced_reviews = balanced_reviews.append(group_reviews)
assert balanced_reviews['Restaurant_Bool'].mean() == .5
Create a panel data frame in R - where one of the values within a variable changes through time
We expand the rows with complete
(after doing a group by 'id') by getting the seq
uence of min
to max
+ 1 of 'year' for entire data and get the cumulative sum of 'y' after grouping by 'id'
library(dplyr)
library(tidyr)
data %>%
group_by(id, year) %>%
mutate(y = cumsum(y)) %>%
group_by(id) %>%
filter(!duplicated(year, fromLast = TRUE)) %>%
complete(year = min(.$year):(max(.$year) + 1), fill = list(y = 0)) %>%
mutate(y = cumsum(y))
# A tibble: 18 x 3
# Groups: id [3]
# id year y
# <dbl> <dbl> <dbl>
# 1 10420 2002 0
# 2 10420 2003 0
# 3 10420 2004 1
# 4 10420 2005 1
# 5 10420 2006 1
# 6 10420 2007 1
# 7 10740 2002 1
# 8 10740 2003 1
# 9 10740 2004 1
#10 10740 2005 1
#11 10740 2006 1
#12 10740 2007 1
#13 12060 2002 0
#14 12060 2003 2
#15 12060 2004 2
#16 12060 2005 2
#17 12060 2006 3
#18 12060 2007 3
Beta estimation over panel data by group
As I mentioned in the comment above, this solution might be a bit off since I'm not able to reproduce your desired output 100%. Still, the functionality of what you're trying to accomplish is there. Have a look at it and let me know if this is something you could use or I could develop further.
EDIT: The code below does not reproduce the desired output as specified above, but turned out to be what the OP was looking for after all.
Here goes:
# Datasource
fund <- as.numeric(c(1,1,1,1,1,1,1,1,3,3,3,3,3,3,2,2,2,2,2,2,2))
return<- as.numeric(c(1:21))
benchmark <- as.numeric(c(1,13,14,20,14,32,4,1,5,7,1,0,7,1,-2,1,6,-7,9,10,9))
riskfree<-as.numeric(c(1,5,1,2,1,6,4,7,5,-5,10,0,3,1,2,1,6,7,8,9,10))
date <- as.Date(c("2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2011-02-28","2010-07-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30"))
funddata<-data.frame(date,fund,return,benchmark,riskfree)
# Creating variables of interest
funddata["ret_riskfree"]<-as.numeric(funddata$return-funddata$riskfree)
funddata["benchmark_riskfree"]<-as.numeric(funddata$benchmark-funddata$riskfree)
# Target check #################################################################
# Subset last three rows in original dataframe
df_check <- funddata[funddata$fund == 1,]
df_check <- tail(df_check,3)
# Run regression check
mod_check <- lm(df_check$ret_riskfree~df_check$benchmark_riskfree)
coef(mod_check)
# My suggestion ################################################################
# The following function takes three arguments:
# 1. a dataframe, myDf
# 2. a column that you'd like to myDf on
# 3. a window length for a sliding window, myWin
fun_rollreg <- function(myDf, subCol, varY, varX, myWin){
df_main <- myDf
# Make an empty data frame to store results in
df_data <- data.frame()
# Identify unique funds
unFunds <- unique(unlist(df_main[subCol]))
# Loop through your subset
for (fundx in unFunds){
# Subset
df <- df_main
df <- df[df$fund == fundx,]
# Keep a copy of the original until later
df_new <- df
# Specify a container for your beta estimates
betas <- c()
# Specify window length
wlength <- myWin
# Retrieve some data dimensions to loop on
rows = dim(df)[1]
periods <- rows - wlength
# Loop through each subset of the data
# and run regression
for (i in rows:(rows - periods)){
# Split dataframe in subsets
# according to the window length
df1 <- df[(i-(wlength-1)):i,]
# Run regression
beta <- coef(lm(df1[[varY]]~df1[[varX]]))[2]
# Keep regression ressults
betas[[i]] <- beta
}
# Add regression data to dataframe
df_new <- data.frame(df, betas)
# Keep the new dataset for later concatenation
df_data <- rbind(df_data, df_new)
}
return(df_data)
}
# Run the function:
df_roll <- fun_rollreg(myDf = funddata, subCol = 'fund',
varY <- 'ret_riskfree', varX <- 'benchmark_riskfree',
myWin = 3)
# Show the results
print(head(df_roll,8))
For the first 8 rows in the new dataframe (fund = 1), this is the result:
date fund return benchmark riskfree ret_riskfree benchmark_riskfree betas
1 2010-07-30 1 1 1 1 0 0 NA
2 2010-08-31 1 2 13 5 -3 8 NA
3 2010-09-30 1 3 14 1 2 13 0.10465116
4 2010-10-31 1 4 20 2 2 18 0.50000000
5 2010-11-30 1 5 14 1 4 13 -0.20000000
6 2010-12-31 1 6 32 6 0 26 -0.30232558
7 2011-01-30 1 7 4 4 3 0 -0.11538462
8 2011-02-28 1 8 1 7 1 -6 -0.05645161
Extracting specific countries in Panel Data
This is very basic subsetting, and you can find several answers on SO and in any introductory manual.
Assuming you have read your csv file in as an object named "df", something like this should do the job:
df[df$country %in% c("United States", "Albania"), ]
In the future:
- Screenshots of your data are of little use. Please use something like
dput(head(yourdata))
instead. - Show what you have tried. Don't simply write "I've been toying with the subset function". If you want to use the
subset
function in particular but haven't had success, it is helpful to show what you have done to help others troubleshoot.
A minimal example
Sample data:
set.seed(1)
df <- data.frame(country = sample(letters[1:5], 15, replace = TRUE),
somerandomvalue = rnorm(15),
anotherrandomvalue = rnorm(15))
Some summary data about the "country" column. Shows us that there are five unique countries, and there are 15 cases (rows) overall.
> summary(df$country)
a b c d e
2 5 1 4 3
Take just a subset:
> df[df$country %in% c("a", "b"), ]
country somerandomvalue anotherrandomvalue
1 b -0.005767173 0.80418951
2 b 2.404653389 -0.05710677
5 b -1.147657009 -0.69095384
10 a -0.891921127 -0.43331032
11 b 0.435683299 -0.64947165
12 a -1.237538422 0.72675075
14 b 0.377395646 0.99216037
Or, using the subset function:
subset(df, country %in% c("a", "b"))
How to filter panel data to only include people who participated in all years?
You can try this:
library(tidyverse)
df <- tribble(
~syear, ~pid,
2008,201,
2008,203,
2008,602,
2012,602,
2008,604,
2008,901,
2012,901,
2016,901,
2008,1501,
2012,1501,
2016,1501
)
df %>%
group_by(pid) %>%
mutate(cnt = n()) %>%
filter(cnt == 3)
# alternatively, the cnt column can be dropped
df %>%
group_by(pid) %>%
mutate(cnt = n()) %>%
filter(cnt == 3) %>%
select(-cnt)
Related Topics
Remove the Columns with the Colsums=0
Removing Everything After First 'Backslash' in a String
Shiny - Custom Warning/Error Messages
Difference of Two Character Vectors with Substring
Stopping the Script Until a Value Is Entred from Keyboard in R
Calculating Inter-Purchase Time in R
R - Pivoting Duplicate Rows into Multiple Column with Unknown Number of Columns
Non-Equi-Joins in R with Data.Table - Backticked Column Name Trouble
Reshape Data for Values in One Column
Rvest Not Recognizing CSS Selector
Http Error 400 on Google_Elevation() Call
Stacked Bar Chart with Group by and Facet
Filter Group of Rows Based on Sum of Values from Different Column
Finding If Boolean Is Ever True by Groups in R