How to Plot a Subset of a Data Frame in R

How to plot a subset of a data frame in R?

with(dfr[dfr$var3 < 155,], plot(var1, var2)) should do the trick.

Edit regarding multiple conditions:

with(dfr[(dfr$var3 < 155) & (dfr$var4 > 27),], plot(var1, var2))
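
One caveat worth sketching: bracket subsetting and subset() treat missing values differently. The toy dfr below is a hypothetical stand-in for your data (the values are made up for illustration); a row whose condition evaluates to NA comes back as an all-NA row with brackets, while subset() drops it.

```r
# Hypothetical stand-in for dfr; the values are invented for illustration
dfr <- data.frame(
  var1 = 1:5,
  var2 = c(2, 4, 6, 8, 10),
  var3 = c(100, 200, NA, 150, 160)
)

# Bracket subsetting keeps an all-NA row where the condition is NA
by_bracket <- dfr[dfr$var3 < 155, ]

# subset() treats an NA condition as FALSE and drops that row
by_subset <- subset(dfr, var3 < 155)

nrow(by_bracket)  # 3
nrow(by_subset)   # 2

with(by_subset, plot(var1, var2))
```

If your filter column may contain NA, subset() or wrapping the condition in which() is probably the safer choice before plotting.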

Plotting a subset of a dataframe with R?

Without a sample of your data, I can't test the answers below, but you have some errors in your code, which I've tried to fix:

  1. When you use with or subset you don't need to restate the name
    of the data frame when you refer to individual columns.

    Original code:

    with(subset(fin,fin$Species == "TRAT"), plot(fin$FR.CoYear, fin$Young /fin$Sample))

    Change to:

    with(subset(fin, Species == "TRAT"), plot(FR.CoYear, Young/Sample))
  2. Here you misplaced a parenthesis in addition to not needing to restate the name of the data frame in the call to plot:

    Original code:

    with(fin[fin$Species == "TRAT",], plot((fin$FR.CoYear, fin$Young / fin$Sample))
    ##gives the error: unexpected ',' in "with(fin[fin$Species == "TRAT",], plot((fin$FR.CoYear,"

    Change to:

    with(fin[fin$Species == "TRAT",], plot(FR.CoYear, Young / Sample))
  3. fin$Young must also be indexed by Species

    Original code:

        plot(fin$FR.CoYear[fin$Species == "BLKI"],fin$Young / fin$Sample[fin$Species == "BLKI"])
    ##Error in xy.coords(x, y, xlabel, ylabel, log) :
    'x' and 'y' lengths differ

    Change to:

        plot(fin$FR.CoYear[fin$Species == "BLKI"], 
    fin$Young[fin$Species == "BLKI"]/ fin$Sample[fin$Species == "BLKI"])
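
To see why point 3 fails, here is a small sketch with a made-up fin data frame (the values are invented; only the column names come from the question). Computing the logical index once and reusing it for every column keeps x and y the same length.

```r
# Hypothetical stand-in for fin; values are invented for illustration
fin <- data.frame(
  Species   = c("TRAT", "TRAT", "BLKI", "BLKI", "BLKI"),
  FR.CoYear = c(2001, 2002, 2001, 2002, 2003),
  Young     = c(10, 12, 5, 8, 9),
  Sample    = c(20, 20, 10, 16, 12)
)

is_blki <- fin$Species == "BLKI"
x <- fin$FR.CoYear[is_blki]

# Wrong: only Sample is subset, so all 5 Young values are divided by
# 3 recycled Sample values and the result has length 5
y_wrong <- suppressWarnings(fin$Young / fin$Sample[is_blki])

# Right: index both columns with the same logical vector
y <- fin$Young[is_blki] / fin$Sample[is_blki]

c(length(x), length(y_wrong), length(y))  # 3 5 3
plot(x, y)
```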

If you're willing to learn ggplot2, you can easily create separate plots for each value of Species. For example (once again, I couldn't test this without a sample of your data):

library(ggplot2)

# One panel, separate lines for each species
ggplot(fin, aes(FR.CoYear, Young/Sample, group=Species, colour=Species)) +
  geom_point() + geom_line()

# One panel for each species
ggplot(fin, aes(FR.CoYear, Young/Sample)) +
  geom_point() + geom_line() +
  facet_grid(Species ~ .)

Spatial subset of data frame in R

The point data needs to be a two-column matrix of x and y coordinates, both for plot and for getpoly to recognize the coordinates.

library(splancs)
library(tidyverse)
library(sf)

set.seed(543)
xy <-
  cbind(x = runif(n = 25, min = -118, max = -117),
        y = runif(n = 25, min = 40, max = 42))

plot(xy)

# Draw a polygon for study area.
poly <- getpoly()

# Convert to sf objects.
polysf <- st_as_sf(as.data.frame(poly), coords = c("V1", "V2"), crs = 4326) %>%
  dplyr::summarise() %>%
  st_cast("POLYGON") %>%
  st_convex_hull()

xysf <- st_as_sf(as.data.frame(xy), coords = c("x", "y"), crs = 4326)

# Do an intersection to keep only points inside the drawn polygon.
xy_intersect <- st_intersection(polysf, xysf)

Output

Simple feature collection with 9 features and 0 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -117.7913 ymin: 40.82405 xmax: -117.4264 ymax: 41.7448
Geodetic CRS: WGS 84
geometry
1 POINT (-117.4264 41.18712)
2 POINT (-117.5756 41.7448)
3 POINT (-117.7913 40.82405)
4 POINT (-117.7032 41.15077)
5 POINT (-117.5634 41.23936)
6 POINT (-117.7441 40.84163)
7 POINT (-117.692 41.27514)
8 POINT (-117.6864 40.98462)
9 POINT (-117.5759 40.88477)

The result can be plotted with mapview::mapview(xy_intersect), which requires library(mapview).

However, if you want to extract rows from your original data frame, here is another approach for extracting the points that fall within a drawn polygon (for example, when the polygon coordinates look like 0.003456).

library(splancs)
library(tidyverse)

set.seed(543)
xy <-
  cbind(x = runif(n = 25, min = -118, max = -117),
        y = runif(n = 25, min = 40, max = 42))

plot(xy)

# Draw a polygon for study area.
poly <- getpoly()

# Plot the results.
plot(xy)
polygon(poly)

# This will return a logical vector for points in the polygon
io <- inout(xy, poly)
points(xy[io,], pch = 16, col = "blue")

# Then, can use the index from io to extract the points that
# are inside the polygon from the original set of points.
extract_points <- as.data.frame(xy)[which(io == TRUE),]

extract_points

Output

           x        y
2 -117.4506 41.17794
3 -117.4829 40.71030
8 -117.4679 40.71702
19 -117.3354 40.53687
21 -117.5219 40.47077
22 -117.4876 40.18188
25 -117.2015 40.86243


subset dataframe and plot all the subsets with a loop [R]

Here's how to plot the charts in a loop. In the example you gave, we only have one file number. However, it should create a chart for every number in the file column. On Windows, you can use savePlot to save to your drive. I simplified your example because I was getting errors.

DataOzono <- read.table(text="pressure    height  Temperature RH  Ozone   file    LogP
753.6 2541 16.8 76 0 80131 0.3475673
748.0 2604 17.7 32 0 80131 0.347959
743.5 2656 15.9 38 0 80131 0.3482766
739.8 2697 15.4 39 0 80131 0.3485396
736.6 2734 15.0 41 0 80131 0.3487685
731.8 2790 14.5 42 0 80131 0.3491142", header=TRUE, stringsAsFactors=FALSE)

original_par <- par(no.readonly = TRUE) #save only the parameters that can be restored
par(mar=c(5.1, 8.1, 4.1, 3.1))

for (i in unique(DataOzono$file)){
  DataOzono_subset <- DataOzono[DataOzono$file==i,] #keep only rows for that file number

  plot(DataOzono_subset$LogP, DataOzono_subset$Temperature, axes= F,type="l",col="red", ylab = "", xlab = 'LogP',xaxt="n",yaxt="n" )
  axis(2,col="red",col.axis="red")
  mtext(text = 'T',line = 2,side = 2,col="red",col.lab="red")
  par(new=TRUE)
  plot(DataOzono_subset$LogP, DataOzono_subset$RH,type="l",col="blue",xaxt="n",yaxt="n",xlab="",ylab="")
  axis(4,col="blue",col.axis="blue")
  mtext("RH",side=4,line=2,col="blue",col.lab="blue" )
  par(new=TRUE)
  plot(DataOzono_subset$LogP, DataOzono_subset$Ozone,type="l",col="darkgreen",xaxt="n",yaxt="n",xlab="",ylab="")
  mtext("O3",side=2,line=6,col="darkgreen",col.lab="darkgreen")
  axis(2, line = 4,col="darkgreen",col.axis="darkgreen")

  savePlot(filename=paste0("c:/temp/",i,".png"),type="png") #one PNG per file number
}

par(original_par) #restore par to initial value.
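
If savePlot is not available on your platform, a possible alternative is to open a png() device before plotting and close it with dev.off() afterwards. This is only a sketch under assumptions: the two-file toy data, the out_dir folder (tempdir() here) and the saved vector are illustrative and not part of the original question.

```r
# Toy data with two hypothetical file numbers; values are invented
DataOzono <- data.frame(
  LogP        = c(0.3476, 0.3480, 0.3483, 0.3475, 0.3479, 0.3484),
  Temperature = c(16.8, 17.7, 15.9, 14.2, 13.8, 13.1),
  file        = c(80131, 80131, 80131, 80132, 80132, 80132)
)

out_dir <- tempdir()    # assumption: replace with your own output folder
saved <- character(0)   # collects the paths of the files written

for (i in unique(DataOzono$file)) {
  DataOzono_subset <- DataOzono[DataOzono$file == i, ]
  out_file <- file.path(out_dir, paste0(i, ".png"))

  png(filename = out_file, width = 800, height = 600)  # open the file device
  plot(DataOzono_subset$LogP, DataOzono_subset$Temperature,
       type = "l", col = "red", xlab = "LogP", ylab = "T")
  dev.off()                                            # write and close the file

  saved <- c(saved, out_file)
}

basename(saved)  # "80131.png" "80132.png"
```

Because each plot is drawn straight into its own file, nothing appears on screen, and the same loop should run unchanged on Windows, macOS, and Linux.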


Subsetting data for ggplot2

For your specific case, the problem is that you are not quoting Male/Female and Weighted Average Income. Also, your data and basic aesthetics should likely be passed to ggplot rather than geom_line. Passing them to geom_line isolates them to that single layer, so you would have to repeat them in every layer you add to the plot, for example geom_smooth.

So to fix your problem you could do

library(tidyverse)
# dt is assumed to be a data.table, hence the dt[Country == 'Germany'] row filter
plot <- ggplot(data = dt[Country == 'Germany'],
               aes(x = Birthyear,
                   y = `Weighted Average Income`,
                   col = `Weighted Average Income`)) + # could use !!sym("x") instead of `x`
  geom_line() +
  facet_grid(Country ~ `Male/Female`) # backticks quote the non-syntactic column name
plot

Now ggplot2 actually has a (lesser known) builtin functionality for changing your data, so if you wanted to compare this to the plot with all of your countries included you could do:

plot %+% dt # `%+%` is used to change the data used by one or more layers. See help("+.gg")

Subset and plot data frames with the same column names in ggplot in R

We get the unique column names from all the list elements ('un1') and loop over them. For each name, a nested lapply extracts the matching column from each dataset in 'samp', and cbind.fill from rowr binds those columns together (filling with NA for the datasets that have fewer rows) to create 'lst1'. A second list ('lst2') stores the index of the list elements that each column name comes from. Map then uses both lists: for each element, it extracts the corresponding 'h' columns based on the index from 'lst2' and cbinds them to the matching dataset of 'lst1'.

library(rowr)
un1 <- setdiff(unique(unlist(lapply(samp, names))), "h")
lst1 <- lapply(un1, function(nm) do.call(cbind.fill,
    c(Filter(length, lapply(samp, function(x)
        x[colnames(x) == nm])), fill = NA)))
lst2 <- lapply(un1, function(nm) which(do.call(c,
    lapply(samp, function(x) any(names(x) == nm)))))
out <- Map(function(dat1, ind) {
    tmp <- do.call(cbind.fill, c(lapply(samp[ind], `[[`, 'h'), fill = NA))
    names(tmp) <- paste0("h", seq_along(tmp))
    cbind(dat1, tmp)
}, lst1, lst2)

length(out)
#[1] 22

Checking the output:

lapply(out, head, 2)
#[[1]]
# DLC12s h1
#1 86.19998 -52.500
#2 83.16610 -43.375

#[[2]]
# DLC17p h1
#1 0.5184452 -52.500
#2 1.5012423 -43.375

#[[3]]
# DLC17q h1
#1 0.2929875 -52.500
#2 0.3105346 -43.375

#[[4]]
# DLC21gs h1
#1 12.7175189 -52.500
#2 0.1544069 -43.375

#[[5]]
# DLC24as h1
#1 0.2228264 -52.500
#2 0.2411541 -43.375

#[[6]]
# DLC24bs h1
#1 0.02773543 -52.500
#2 0.04170485 -43.375

#[[7]]
# DLC31s h1
#1 0.001799534 -52.500
#2 0.451788609 -43.375

#[[8]]
# DLC41es h1
#1 0.0003281455 -52.500
#2 0.0094817520 -43.375

#[[9]]
# DLC41is h1
#1 0.001144196 -52.500
#2 0.369375492 -43.375

#[[10]]
# DLC41ms h1
#1 0.003163386 -52.500
#2 0.121520955 -43.375

#[[11]]
# DLC64h DLC64h DLC64h h1 h2 h3
#1 0.003437833 0.01828710 0.0682039 -52.500 -69.3 -75.4
#2 1.063494100 0.08393471 0.3838715 -43.375 -65.0 -66.0

#[[12]]
# DLC64l DLC64l DLC64l h1 h2 h3
#1 2.456927e-16 0.07751714 0.0491324765 -52.500 -69.3 -75.4
#2 1.902683e+00 0.13670254 0.0006464645 -43.375 -65.0 -66.0

#[[13]]
# DLC72 DLC72 DLC72 h1 h2 h3
#1 0.01063255 12.82851 8.336495 -52.500 -69.3 -75.4
#2 10.66651137 27.71747 36.174530 -43.375 -65.0 -66.0

#[[14]]
# DLC12 DLC12 h1 h2
#1 86.53149 54.44353 -69.3 -75.4
#2 70.64820 60.40582 -65.0 -66.0

#[[15]]
# DLC24a DLC24a h1 h2
#1 0.2187664 0.1598862 -69.3 -75.4
#2 0.1533400 0.1716777 -65.0 -66.0

#[[16]]
# DLC24b DLC24b h1 h2
#1 0.04532141 0.01841368 -69.3 -75.4
#2 0.04852150 0.02924072 -65.0 -66.0

#[[17]]
# DLC31 DLC31 h1 h2
#1 0.1142758 0.1051915 -69.3 -75.4
#2 0.4196964 0.3760683 -65.0 -66.0

#[[18]]
# DLC41e DLC41e h1 h2
#1 0.001120229 0.001992596 -69.3 -75.4
#2 0.005298573 0.009939579 -65.0 -66.0

#[[19]]
# DLC41i DLC41i h1 h2
#1 0.1384648 0.0763053 -69.3 -75.4
#2 0.6957711 0.4806988 -65.0 -66.0

#[[20]]
# DLC41m DLC41m h1 h2
#1 0.02624807 0.1084238 -69.3 -75.4
#2 0.09105723 0.2136423 -65.0 -66.0

#[[21]]
# DLCE4 h1
#1 31.8570262 -75.4
#2 0.2500975 -66.0

#[[22]]
# DLCE7 h1
#1 4.775404 -75.4
#2 1.503764 -66.0

If we don't have rowr, an option is to pad the list elements that have fewer rows with NA rows:

un1 <- setdiff(unique(unlist(lapply(samp, names))), "h")   
lst1 <- lapply(un1, function(nm) {
    tmplst <- Filter(length, lapply(samp, function(x)
        x[colnames(x) == nm]))
    mx <- max(sapply(tmplst, nrow))
    do.call(cbind, lapply(tmplst, function(x) {
        # append NA rows after the last real row (nrow(x):mx would overwrite it)
        if(mx > nrow(x)) x[(nrow(x) + 1):mx, ] <- NA
        x
    }))
})

lst2 <- lapply(un1, function(nm) which(do.call(c,
    lapply(samp, function(x) any(names(x) == nm)))))

out <- Map(function(dat1, ind) {
    tmplst <- lapply(samp[ind], `[[`, 'h')
    mx <- max(lengths(tmplst))
    tmplst1 <- do.call(cbind, lapply(tmplst, `length<-`, mx))
    colnames(tmplst1) <- paste0('h', seq_len(ncol(tmplst1)))
    cbind(dat1, tmplst1)
}, lst1, lst2)

sapply(out, dim)
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
#[1,]   22   22   22   22   22   22   22   22   22    22    38    38    38    38    38    38    38    38    38    38
#[2,]    2    2    2    2    2    2    2    2    2     2     6     6     6     4     4     4     4     4     4     4
#     [,21] [,22]
#[1,]    24    24
#[2,]     2     2

Update

With the named list, we can change the

 colnames(tmplst1) <- paste0('h', seq_len(ncol(tmplst1)))

to

colnames(tmplst1) <- paste0('h', colnames(tmplst1))

ie.

out <- Map(function(dat1, ind) {
    tmplst <- lapply(samp[ind], `[[`, 'h')
    mx <- max(lengths(tmplst))
    tmplst1 <- do.call(cbind, lapply(tmplst, `length<-`, mx))
    colnames(tmplst1) <- paste0('h', colnames(tmplst1))
    cbind(dat1, tmplst1)
}, lst1, lst2)
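
The padding step that replaces cbind.fill can be hard to read inside Map, so here it is in isolation. This is a minimal sketch with a made-up named list (h_list and its values are hypothetical): `length<-` extends each vector to the longest length, filling with NA, and cbind then lines them up as columns.

```r
# Hypothetical named list of 'h' vectors with unequal lengths
h_list <- list(a = c(-52.5, -43.375, -30), b = c(-69.3, -65))

mx <- max(lengths(h_list))                       # longest vector: 3

# `length<-` pads the shorter vectors with NA up to mx
padded <- do.call(cbind, lapply(h_list, `length<-`, mx))

# With a named list, the column names carry over and can be prefixed
colnames(padded) <- paste0("h", colnames(padded))

padded
```

The result is a 3 x 2 matrix with columns ha and hb, where the third entry of hb is NA.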

Plotly Express - plot subset of dataframe columns by default and the rest as option

You can use the visible property of each trace to show it only in the legend. The code below adds all columns to the figure, sets the first two columns as visible, and leaves all other columns legend-only.

import plotly.express as px
import pandas as pd
import numpy as np

# simulate dataframe
df = pd.DataFrame(
    {c: np.random.uniform(0, 1, 100) + cn for cn, c in enumerate("ABCDEF")}
)

fig = px.line(df, x=df.index, y=df.columns)

# for example only display first two columns of data frame, all others can be displayed
# by clicking on legend item
fig.for_each_trace(
    lambda t: t.update(visible=True if t.name in df.columns[:2] else "legendonly")
)


Use index to subset dataframe based on unique values in a column

You should subset with a logical vector:

df[df$ID %in% unique(df$ID)[1:5], ]
df[df$ID %in% unique(df$ID)[6:10], ]

You can also use split with cut to split your dataframe into n datasets (here, 2) by group.

split(df, cut(as.numeric(as.factor(df$ID)), 2))
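
As a minimal sketch of the %in% approach generalized to any number of groups, assuming a toy data frame (df, ids, id_chunks, and the chunk size of 5 are illustrative, not from the question), the unique IDs can be cut into chunks and each chunk used to subset the rows:

```r
# Toy data: 10 hypothetical IDs with two rows each
df <- data.frame(ID = rep(letters[1:10], each = 2), value = 1:20)

ids <- unique(df$ID)

# The direct form from above, for the first five unique IDs
first_five <- df[df$ID %in% ids[1:5], ]

# Split the unique IDs into chunks of 5, then subset once per chunk
id_chunks <- split(ids, ceiling(seq_along(ids) / 5))
subsets <- lapply(id_chunks, function(chunk) df[df$ID %in% chunk, ])

length(subsets)        # 2
nrow(subsets[[1]])     # 10
```

Each element of subsets is an ordinary data frame, so it can be passed straight to plot or ggplot inside a loop.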

Basic bar plot with subset of data frame in R

There are a few typos in your code... but if I'm interpreting correctly what you are trying to accomplish then this is what you want:

library(dplyr)
library(ggplot2)

Df1 <- data.frame(
education = c("high", "high", "high", "high", "high", "college", "college", "college", "college", "grad", "grad", "grad", "grad", "grad"),
salary = c("65", "65", "65", "90", "65", "65", "65", "90", "90", "90", "90", "65", "75", "75")
)

Df2 <- Df1 %>%
  filter(education == "high") %>%
  group_by(education, salary) %>%
  summarise(SCount = n())

ggplot(Df2, aes(x = salary, y = SCount)) +
  geom_bar(stat = "identity") +
  coord_flip()

This produces a horizontal bar chart of the counts per salary value for the "high" education group.
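
If you would rather avoid extra packages, roughly the same chart can be sketched in base R with table and barplot. This is an alternative under the same assumptions about the data, not the method used above; high_salaries and counts are names introduced here for illustration.

```r
Df1 <- data.frame(
  education = c("high", "high", "high", "high", "high", "college", "college",
                "college", "college", "grad", "grad", "grad", "grad", "grad"),
  salary = c("65", "65", "65", "90", "65", "65", "65", "90", "90", "90",
             "90", "65", "75", "75")
)

# Subset first, then count how often each salary value occurs
high_salaries <- Df1$salary[Df1$education == "high"]
counts <- table(high_salaries)

counts  # 65 occurs 4 times, 90 occurs once

# horiz = TRUE plays the role of coord_flip()
barplot(counts, horiz = TRUE, xlab = "Count", ylab = "Salary")
```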


