How to Validate a CSV File

Are there any known services for validating CSV files?

I recently came across Google Refine (now OpenRefine). It isn't a service for validating CSV files but a tool you download and run locally, and it provides a lot of facilities for working with data and detecting anomalies.

As mentioned in a reply, "CSV" has become an ill-defined term, principally because people don't follow the One True Way when using delimiter-separated data:

http://www.catb.org/~esr/writings/taoup/html/ch05s02.html

EDIT/UPDATE (2016-08-09):

CSV is currently becoming a well-defined term, thanks to the W3C CSV Working Group.

How to more completely validate a CSV file when uploading to a Shiny app?

I've created a minimal version of your app (without interpolation or downloads) that I think addresses (1) and (2), as well as your desire for the existing matrix and plot to be preserved if an invalid upload occurs. You should be able to rebuild your app by modifying this skeleton, but before doing that, try to understand how this app works.

Note that I've added a dependency on package shinyFeedback, which places warning messages near the appropriate input panels. Let me know if that's a problem...

library("shiny")
library("shinyFeedback")
library("shinyMatrix")

## Your variable names
nms <- c("X", "Y")

ui <- fluidPage(
  useShinyFeedback(),
  sidebarLayout(
    sidebarPanel(
      fileInput("file", label = "CSV file", accept = ".csv"),
      matrixInput("mat", label = "Matrix",
                  value = matrix(rnorm(12L), 6L, 2L, dimnames = list(NULL, nms)),
                  class = "numeric", rows = list(names = FALSE))
    ),
    mainPanel(
      plotOutput("plot"),
      verbatimTextOutput("verb")
    )
  )
)

server <- function(input, output, session) {
  rawdata <- reactive({
    req(input$file)
    try(read.csv(input$file$datapath, header = TRUE))
  })

  observeEvent(rawdata(), {
    ## If 'rawdata()' is a data frame with numeric variables named 'nms'
    if (is.data.frame(rawdata()) &&
        all(nms %in% names(rawdata())) &&
        all(vapply(rawdata()[nms], is.numeric, NA))) {
      ## Then update matrix by extracting those variables, ignoring the rest (if any)
      updateMatrixInput(session, "mat", as.matrix(rawdata()[nms]))
      ## And suppress warning if visible
      hideFeedback("file")
    } else {
      ## Otherwise show warning
      showFeedbackWarning("file", "Invalid upload.")
    }
  })

  ## Plots matrix rows as points
  output$plot <- renderPlot(plot(input$mat))
  ## Prints "try-error" if 'read.csv' threw an error, "data.frame" otherwise
  output$verb <- renderPrint(class(rawdata()))
}

shinyApp(ui, server)

Here is code that you can use to create test files. Each one tests a different behaviour of the app.

## OK: 'X' and 'Y' are present and numeric ('read.csv' treats the first
## column as row names, since the header has one fewer field than the rows)
cat("X,Y,Z\na,1,3,5\nb,2,4,6\n", file = "test1.csv")
## OK: file contents matter, not file extension
cat("X,Y,Z\na,1,3,5\nb,2,4,6\n", file = "test2.txt")
## Missing 'X'
cat("W,Y,Z\na,1,3,5\nb,2,4,6\n", file = "test3.csv")
## 'X' is not numeric
cat("X,Y,Z\na,hello,3,5\nb,world,4,6\n", file = "test4.csv")
## Not a valid CSV file
cat("read.csv\nwill,not,like,this,file\n", file = "test5.csv")

Validate CSV file columns with Spark

From what I understand, you want to validate the schema of the CSV you read. The problem with the schema option is that its goal is to tell Spark what the schema of your data is, not to check that the data actually matches it.
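As a quick illustration of that pitfall, here is a sketch (the file bad.csv and its contents are hypothetical): reading with a declared schema does not raise an error when the data does not conform.

import org.apache.spark.sql.types._

// Hypothetical bad.csv:
//   a,b
//   x,abcd
val declared = StructType(Array(StructField("a", IntegerType),
                                StructField("b", StringType)))

val dfBad = spark.read
  .schema(declared)            // declares the schema; nothing is checked
  .option("header", true)
  .csv(".../bad.csv")

// In the default PERMISSIVE mode, the value in 'a' that cannot be parsed
// as an integer silently becomes null instead of triggering an error:
dfBad.show()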

There is, however, an option that infers said schema when reading a CSV (inferSchema), and it could be very useful in your situation. You can then either compare that inferred schema with the one you expect using equals, or apply the small workaround introduced below to be a little more permissive.

Let's see how it works with the following file:

a,b
1,abcd
2,efgh

Then, let's read the data. I used the Scala REPL, but you should be able to convert all of this to Java very easily.

import org.apache.spark.sql.types._

val df = spark.read
  .option("header", true)      // reading the header
  .option("inferSchema", true) // inferring the schema
  .csv(".../file.csv")

// Then let's define the schema you would expect
val schema = StructType(Array(StructField("a", IntegerType),
                              StructField("b", StringType)))

// And we can check that the schema Spark inferred is the same as the
// one we expect:
schema.equals(df.schema)
// res14: Boolean = true

Going further

That works in a perfect world. Indeed, if your schema contains non-nullable columns, for instance, or other small differences, this solution based on strict object equality will not work.

val schema2 = StructType(Array(StructField("a", IntegerType, false),
                               StructField("b", StringType, true)))
// The first column is non-nullable, so this does not work, because all
// columns are nullable when inferred by Spark:
schema2.equals(df.schema)
// res15: Boolean = false

In that case you may need to implement a schema comparison method that suits you, like:

// Compare the field count, then the names (case-insensitively) and the
// types of the columns, in order
def equalSchemas(s1: StructType, s2: StructType): Boolean =
  s1.length == s2.length &&
    s1.indices.forall { i =>
      s1(i).name.equalsIgnoreCase(s2(i).name) &&
      s1(i).dataType == s2(i).dataType
    }
equalSchemas(schema2, df.schema)
// res23: Boolean = true

I am checking that the column count, the names (ignoring case), and the types all match, and that the columns appear in the same order. You might need different logic depending on what you want; for example, an order-insensitive variant like the sketch below.
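If column order should not matter, one possible variant (a sketch, with the hypothetical name equalSchemasAnyOrder) is to sort the fields by lower-cased name before comparing:

def equalSchemasAnyOrder(s1: StructType, s2: StructType): Boolean = {
  // Pair each lower-cased column name with its type, sorted by name,
  // so that column order and nullability are both ignored
  def normalized(s: StructType) =
    s.fields.map(f => (f.name.toLowerCase, f.dataType)).sortBy(_._1)
  normalized(s1).sameElements(normalized(s2))
}

equalSchemasAnyOrder(schema2, df.schema)
// true: same columns and types, regardless of order and nullability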

How can I validate CSV-only file uploads using the pattern attribute in HTML5?

You can use the HTML5 input validation pattern attribute:

pattern="^.+\.(xlsx|xls|csv)$"

Note, however, that pattern only applies to text-based inputs; for <input type="file"> the standard mechanism is the accept attribute, optionally combined with a script-side check of the file name (see the sketch after the list below).

Accept types for CSV and other files (Reference: HTML5 Documentation):

For CSV:

<input type="file" accept=".csv" />

For Excel files, 2003-2007 (.xls):

<input type="file" accept="application/vnd.ms-excel" />

For Excel files, 2010 (.xlsx):

<input type="file" accept="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" />

For text files (.txt):

<input type="file" accept="text/plain" />

For image files (.png, .jpg, etc.):

<input type="file" accept="image/*" />

For HTML files (.htm, .html):

<input type="file" accept="text/html" />

For video files (.avi, .mpg, .mpeg, .mp4):

<input type="file" accept="video/*" />

For audio files (.mp3, .wav, etc.):

<input type="file" accept="audio/*" />

For PDF files, use:

<input type="file" accept=".pdf" /> 
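Since accept only filters the browser's file picker and does not guarantee the chosen file's type, here is a minimal sketch (with a hypothetical id, csv-upload) that also checks the selected file's extension in JavaScript:

<input type="file" id="csv-upload" accept=".csv" />
<script>
  // Reject any selected file whose name does not end in .csv
  document.getElementById("csv-upload").addEventListener("change", function () {
    var file = this.files[0];
    if (file && !/\.csv$/i.test(file.name)) {
      alert("Please select a .csv file.");
      this.value = ""; // clear the invalid selection
    }
  });
</script>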

