Why am I getting X. in my column names when reading a data frame?
read.csv() is a wrapper around the more general read.table() function. The latter has an argument, check.names, which is documented as:
check.names: logical. If ‘TRUE’ then the names of the variables in the
data frame are checked to ensure that they are syntactically
valid variable names. If necessary they are adjusted (by
‘make.names’) so that they are, and also to ensure that there
are no duplicates.
If your header contains labels that are not syntactically valid, make.names() replaces each one with a valid name based on the invalid one, translating invalid characters to "." and prepending X where necessary:
R> make.names("$Foo")
[1] "X.Foo"
This is documented in ?make.names:
Details:
A syntactically valid name consists of letters, numbers and the
dot or underline characters and starts with a letter or the dot
not followed by a number. Names such as ‘".2way"’ are not valid,
and neither are the reserved words.
The definition of a _letter_ depends on the current locale, but
only ASCII digits are considered to be digits.
The character ‘"X"’ is prepended if necessary. All invalid
characters are translated to ‘"."’. A missing value is translated
to ‘"NA"’. Names which match R keywords have a dot appended to
them. Duplicated values are altered by ‘make.unique’.
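As a quick illustration of those rules, here is a minimal sketch that runs make.names() on a few labels. The labels are invented for the example and are not taken from your file.

```r
# Hypothetical header labels, chosen to trigger each documented rule
labels <- c("$Foo", "2way", "if", "a b", "ok", "ok")

# unique = TRUE also passes the result through make.unique()
fixed <- make.names(labels, unique = TRUE)

# Show each original label next to its adjusted name
print(data.frame(original = labels, adjusted = fixed))
```

Comparing the two columns of the printed table should make it easier to work out which rule produced a given name in your own data.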
The behaviour you are seeing is entirely consistent with the documented way read.table() loads your data. That suggests you have syntactically invalid labels in the header row of your CSV file. Note the point above from ?make.names that what counts as a letter depends on the locale of your system: the CSV file might include a character that your text editor displays without complaint, but if R is not running in the same locale, that character may not be valid there.
I would look at the CSV file and identify any non-ASCII characters in the header line; there may also be non-visible characters (or escape sequences such as \t) in the header row. A lot may be going on between reading in the file with the invalid names and displaying it in the console, and that could mask the invalid characters, so don't take the fact that nothing looks wrong when check.names is turned off as an indication that the file is OK.
Posting the output of sessionInfo() would also be useful.
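One invisible character that often ends up at the start of a header is a UTF-8 byte-order mark. Whether that applies to your file is only a guess, so the sketch below builds its own temporary file with such a mark, inspects the raw bytes of the header line, and then reads the file with the BOM-aware encoding name that file() documents.

```r
# Hypothetical file: a small CSV whose header begins with a UTF-8 byte-order mark
path <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("id,value\n1,2\n")), path)

# Read only the header line and look at its raw bytes
header <- readLines(path, n = 1, warn = FALSE)
bytes <- charToRaw(header)
non_ascii <- bytes[as.integer(bytes) > 127L]
print(non_ascii)  # bytes outside the ASCII range, if any

# If the culprit is a BOM, declaring it in the encoding lets R strip it
fixed <- read.csv(path, fileEncoding = "UTF-8-BOM")
print(names(fixed))
```

The same byte inspection works on a real file: replace path with your own file name and check whether anything above 0x7F appears in the header.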
Remove the X letter from the column names of a new dataframe
While not really advisable, what you want is check.names = FALSE in the data.frame call:
data.frame(YG %>% group_by(year) %>%
             summarise(n = round(sum(weight)), g = n()) %>%
             select(-g) %>% spread(year, n, fill = 0),
           check.names = FALSE)
#   2000 2001 2002
# 1    2    1    2
R: Why am I getting an extra column titled X.1 in my dataframe after reading my .txt file?
If all other column names are correct, you probably have a trailing \t in the text file. R tries to include it and gives it the generic column name X.1.
You could try reading the file first as plain text, removing the trailing \t, and only then using read.csv:
file_connection <- file("Objects_Population - AllCells.txt")
content <- readLines(file_connection )
close(file_connection)
Now we try to get rid of these trailing \t (this might need some testing to fit your needs):
sanitized <- gsub("\\t$", "", content)
And then we read this sanitized string as if it were a file (using the argument text):
df <- read.csv(text = paste0(sanitized, collapse = "\n"), sep = "\t", skip = 9, header = TRUE, fill = TRUE)
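To see the effect on a small scale, here is a sketch with two made-up lines that each end in a tab. The contents are invented, and the skip argument is left out because this toy input has no preamble to skip.

```r
# Hypothetical file contents: every line ends with a stray tab
content <- c("a\tb\t", "1\t2\t")

# Reading as-is: the trailing tab is treated as the start of an extra field
raw_df <- read.csv(text = paste0(content, collapse = "\n"), sep = "\t")
print(names(raw_df))

# Strip one trailing tab per line, then read again
sanitized <- gsub("\\t$", "", content)
clean_df <- read.csv(text = paste0(sanitized, collapse = "\n"), sep = "\t")
print(names(clean_df))
```

Comparing the two printed name vectors shows whether the extra column disappears once the trailing tab is gone.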
How to Drop X in Column names after Merge
It is better not to have column names that start with numbers. By default, make.names adds the X prefix to any name that starts with a number. To remove it, one option is sub:
names(z) <- sub("^X", "", names(z))
z
#      ID         x     y V1 198101 198102 198103 198104 198105 198106
#1 410320 -122.5417 37.75 NA 119.45  33.15 104.23   5.61   4.85      0
#2 410321 -122.5000 37.75 NA 129.49  37.76 114.94   5.28   5.24      0
#3 410322 -122.4583 37.75 NA 163.68  42.80 131.22   7.25   6.94      0
#4 410323 -122.4167 37.75 NA 141.14  32.26 110.45   7.77   4.62      0
#5 410324 -122.3750 37.75 NA 130.87  25.87 102.15   8.38   4.13      0
#6 410325 -122.3333 37.75 NA 129.03  25.21 102.37   9.42   4.35      0
If we apply make.names:
make.names(names(z))
#[1] "ID" "x" "y" "V1" "X198101" "X198102"
#[7] "X198103" "X198104" "X198105" "X198106"
The 'X' prefix is returned. So, in general, it is safer to have column names with a 'character' prefix instead of just numbers. Also, if we wanted to extract, say, the '198104' column, we would need backticks:
z$198104
#Error: unexpected numeric constant in "z$198104"
z$`198104`
#[1] 5.61 5.28 7.25 7.77 8.38 9.42
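One caveat with the sub() call used above: the pattern "^X" strips a leading X from every name, including names that legitimately start with X. A possible refinement, sketched here on invented names, is to remove the X only when a digit follows it.

```r
# Hypothetical column names: some adjusted by make.names, one that really starts with X
nms <- c("ID", "x", "X198101", "X198102", "Xray")

# Lookahead: match a leading X only when the next character is a digit
restored <- sub("^X(?=[0-9])", "", nms, perl = TRUE)
print(restored)
```

The perl = TRUE argument is needed because lookahead assertions are not supported by the default regular expression engine.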
Why are Xs added to data frame variable names when using read.csv?
read.table and read.csv have a check.names= argument that you can set to FALSE.
For example, try it with this input consisting of just a header:
> read.csv(text = "a,1,b")
[1] a X1 b
<0 rows> (or 0-length row.names)
versus
> read.csv(text = "a,1,b", check.names = FALSE)
[1] a 1 b
<0 rows> (or 0-length row.names)
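Once check.names is off, the unadjusted names are no longer valid R identifiers, so they need quoting when you access them. A small sketch, using the same header plus one made-up data row:

```r
# Hypothetical input: the header from above plus one invented data row
df <- read.csv(text = "a,1,b\n10,20,30", check.names = FALSE)

print(names(df))

# A non-syntactic name needs backticks or string indexing
print(df$`1`)
print(df[["1"]])
```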
Join dataframes, retaining column names
You need to include drop = FALSE in the indexing step so that the things you're binding together retain all of their dimensions. I couldn't figure out a way to do this by passing drop = FALSE as an extra argument to [, so I resorted to using an anonymous function instead.
NEW <- do.call(cbind, lapply(list_datf, function(x) x[n_r, , drop = FALSE]))
Alternatively, you could convert your components to tibbles, which (unlike data frames) never drop "unneeded" dimensions:
NEW <- do.call(cbind, lapply(list_datf, function(x) tibble::as_tibble(x)[n_r, ]))
If you want to go full tidyverse:
library(dplyr)
list_datf %>% purrr::map(~ tibble::as_tibble(.)[n_r, ]) %>% bind_cols()
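To make the role of drop = FALSE concrete, here is a base R sketch with invented stand-ins for list_datf and n_r. It assumes, as the question seems to imply, that each component has a single column.

```r
# Hypothetical stand-ins for the question's objects
list_datf <- list(data.frame(a = 1:3), data.frame(b = 4:6))
n_r <- 2

# Without drop = FALSE each one-column data frame collapses to a bare vector,
# so cbind() has no column names to work with
lost <- do.call(cbind, lapply(list_datf, function(x) x[n_r, ]))
print(colnames(lost))

# With drop = FALSE each piece stays a one-row data frame and keeps its name
kept <- do.call(cbind, lapply(list_datf, function(x) x[n_r, , drop = FALSE]))
print(names(kept))
```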
Select specific columns, where the column names are in another df in r
The problem is that Y.variable.names is a data.frame, which you cannot use to subset another data.frame. You can check by typing class(Y.variable.names).
So the solution to your problem is to subset Y.variable.names:
Y.Data = data %>% select(Y.variable.names[,1])
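The sketch below walks through the same idea on invented data. It uses base R indexing rather than select() so that it runs without dplyr; that substitution is a choice made for this example, not part of the original answer.

```r
# Hypothetical stand-ins for the question's objects
data <- data.frame(a = 1:2, b = 3:4, c = 5:6)
Y.variable.names <- data.frame(name = c("a", "c"), stringsAsFactors = FALSE)

print(class(Y.variable.names))  # a data.frame, not a character vector
cols <- Y.variable.names[, 1]   # extracting the column gives a character vector
print(class(cols))

# Base R counterpart of select(): index the columns by name
Y.Data <- data[, cols, drop = FALSE]
print(names(Y.Data))
```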
How do I keep the column names in a data frame when I am trying to drop all of the rows that don't start with specific names?
tickers = ['GOOG', 'AAPL', 'AMZN', 'NFLX']
first = True
for ticker in tickers:
    df1 = df[df.ticker == ticker]
    if first:
        # Write the column names only with the first chunk
        df1.to_csv("20CompanyAnalysisData1.csv", mode='a', header=True)
        first = False
    else:
        df1.to_csv("20CompanyAnalysisData1.csv", mode='a', header=False)
or more compactly
tickers = ['GOOG', 'AAPL', 'AMZN', 'NFLX']
needheader = True
for ticker in tickers:
    df1 = df[df.ticker == ticker]
    # header is True only on the first pass through the loop
    df1.to_csv("20CompanyAnalysisData1.csv", mode='a', header=needheader)
    needheader = False