Techniques for Finding Bad Data with read.csv in R

Data cleaning with read.csv

While the exact form you'll need will depend heavily on the particulars of the file in question, for what you've presented here you can hack the data out without too much insanity:

library(tidyverse)

df <- read_csv2(file, col_names = FALSE) %>%
  filter(rowSums(!is.na(.)) > 0) %>%        # drop rows that are entirely empty
  magrittr::set_rownames(.[[1]]) %>%        # first column becomes row names
  select(-1) %>%
  t() %>%                                   # transpose so variables become columns
  as_data_frame() %>%
  type_convert(col_types = cols(Date = col_date('%d.%m.%Y')),
               locale = locale(decimal_mark = ','))

df
#> # A tibble: 3 x 12
#> Name Correction Date Time `T_int [ms]` `Ev [lx]`
#> <chr> <chr> <date> <time> <int> <dbl>
#> 1 #1 <NA> 2016-09-19 12:05:03 806 1310
#> 2 #2 <NA> 2016-09-19 12:06:01 800 1350
#> 3 #3 <NA> 2016-09-19 12:07:00 884 1270
#> # ... with 6 more variables: `Ee [W/sqm] (380-780nm)` <dbl>, `Chrom.
#> # Coord.` <chr>, x <dbl>, y <dbl>, `u'` <dbl>, `v'` <dbl>

Data

file <- ";;;
;;;
;;;
Name;#1;#2;#3
Correction;;;
Date;19.09.2016;19.09.2016;19.09.2016
Time;12:05:03;12:06:01;12:07:00
T_int [ms];806;800;884
Ev [lx];1,31E+03;1,35E+03;1,27E+03
Ee [W/sqm] (380-780nm);4,22E+00;4,38E+00;4,17E+00
;;;
;;;
Chrom. Coord.;;;
x;0,3657;0,3642;0,3643
y;0,3842;0,3831;0,3833
u';0,2126;0,2121;0,2121
v';0,5026;0,502;0,5021
;;;"
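The same reshape can also be attempted in base R, without the tidyverse. Here is a rough sketch of the idea; a shortened copy of the data above is inlined so the snippet runs on its own, and only two of the numeric columns are converted, as illustration:

```r
file <- ";;;
Name;#1;#2;#3
Correction;;;
Date;19.09.2016;19.09.2016;19.09.2016
T_int [ms];806;800;884
Ev [lx];1,31E+03;1,35E+03;1,27E+03
;;;"

# Semicolon-separated, no header; empty fields become NA
raw <- read.csv2(text = file, header = FALSE, stringsAsFactors = FALSE,
                 na.strings = "")
raw <- raw[rowSums(!is.na(raw)) > 0, ]   # drop rows that are entirely empty

# First column holds the variable names; transpose the rest
out <- as.data.frame(t(raw[, -1]), stringsAsFactors = FALSE)
names(out) <- raw[[1]]

# Convert the numeric columns by hand (comma decimal mark)
out$`T_int [ms]` <- as.integer(out$`T_int [ms]`)
out$`Ev [lx]`    <- as.numeric(sub(",", ".", out$`Ev [lx]`, fixed = TRUE))
```

This trades the automatic `type_convert()` step for explicit per-column conversions, which can be easier to debug on messy files.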

In R, how to read a special CSV in which some rows skip the first value?

Updated

You can use the na.strings parameter to replace the empty dates ("") with missing values (NA):

data = read.csv(your_file, header = TRUE, na.strings = c(""))

then,

data$Date = as.Date(data$Date)
data$Date = zoo::na.locf(data$Date)

to carry the last observed date forward into the missing values.

However, credit to @Taran, who commented on your initial question, as I wasn't aware of the zoo::na.locf function.
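For reference, here is a small self-contained sketch of those two steps; the file contents and column names are made up for illustration:

```r
library(zoo)

csv <- "Date,Value
2020-01-01,1
,2
,3
2020-01-02,4"

# Empty strings become NA on read-in
data <- read.csv(text = csv, header = TRUE, na.strings = c(""))
data$Date <- as.Date(data$Date)

# Carry the last observed date forward into the blank rows
data$Date <- zoo::na.locf(data$Date)
```

After this, rows 2 and 3 carry 2020-01-01 and no NA dates remain.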

Reading broken CSV lines from R

There are probably several ways to do this.

UPDATE: Try this then. With the skip = argument in scan() you can specify how many rows to skip.


file <- scan("C:/Users/skupfer/Documents/bisher.txt", strip.white = TRUE, sep = ",",
             what = list("character"), skip = 1)

file_mat <- matrix(file[[1]][file[[1]] != ""], ncol = 5, byrow = TRUE)

file_df <- as.data.frame(file_mat, stringsAsFactors = FALSE)

file_df$Quantity <- as.integer(file_mat[,3])

> file_df
Product Date Quantity Categorie sector
1 ABC 01052019 4510 Food Dry
2 CDE 01052019 222 Drink Cold
3 FGH 01052019 345 Food Dry
4 IJK 01052019 234 Food Cold
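Since the path above is local, here is a self-contained sketch of the same scan-and-reshape step on an inline string; the five-column layout and the trailing empty field on each line are assumptions based on the output above:

```r
txt <- "header line to skip
ABC,01052019,4510,Food,Dry,
CDE,01052019,222,Drink,Cold,"

# what = list("character") reads every field into one character vector
vals <- scan(text = txt, strip.white = TRUE, sep = ",",
             what = list("character"), skip = 1)

# Drop the empty trailing fields, then reshape into five columns
mat <- matrix(vals[[1]][vals[[1]] != ""], ncol = 5, byrow = TRUE)
res <- as.data.frame(mat, stringsAsFactors = FALSE)
res$V3 <- as.integer(res$V3)
```

The key point is that scan() flattens the broken lines into a single stream of fields, so the matrix() call can re-impose the intended row structure.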

Reading in Poor CSV File Structure

Using pandas.read_csv with a regex negative lookahead as the separator. The same regex should work in R as well.

import pandas as pd

df = pd.read_csv(filename, sep=r',(?!\s)')

Filter df for rows in which LOC has a comma, to verify that we've parsed correctly:

df[df.LOC.str.contains(',')]
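In R, read.csv does not accept a regex separator, but the same negative-lookahead pattern can be applied with strsplit before assembling the data frame. A sketch with invented sample lines (only the LOC column name comes from the answer above):

```r
lines <- c("ID,LOC,VAL",
           "1,Paris, France,10",
           "2,Berlin, Germany,20")

# Split on commas NOT followed by whitespace, mirroring the ,(?!\s) pattern
parts <- strsplit(lines, ",(?!\\s)", perl = TRUE)

# First element is the header; the rest become rows
df <- as.data.frame(do.call(rbind, parts[-1]), stringsAsFactors = FALSE)
names(df) <- parts[[1]]
```

As in the pandas version, a comma followed by a space is treated as part of the field, so "Paris, France" survives as a single LOC value.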


Error reading csv as a zoo object - certain lines with 'bad entries'

Your csv has empty values. You can read them in as NA (via na.strings) and then turn the result into a zoo object. You could try this:

x <- read.csv("OakParkR.csv", header = TRUE, na.strings = "")
x <- zoo(x)
x[33:35]
#date imax Tmax imin Tmin irain rain cbl wdsp ihm hm iddhm ddhm ihg hg soil
#33 02-Feb-07 0 9.1 0 -1.7 0 0.1 1026.2 3.9 0 10 0 340 0 14 5.970
#34 03-Feb-07 0 9.2 0 -3.0 0 0.0 <NA> 2.4 0 7 0 130 0 11 3.101
#35 04-Feb-07 0 7.7 0 -3.7 0 0.0 1031.8 3.3 0 8 0 330 0 12 2.668

