Cannot Read File with "#" and Space Using Read.Table or Read.CSV in R


From the documentation (?read.csv):

comment.char character: a character vector of length one containing a single character or an empty string. Use "" to turn off the interpretation of comments altogether.

The default is comment.char = "#", which is what is causing you trouble. Following the documentation, you should use comment.char = "".

Spaces in the header is another issue which, as mrdwab kindly pointed out, can be addressed by setting check.names = FALSE.

chromosomes <- read.csv(chromFile, sep = "\t", skip = 0, header = TRUE,
                        comment.char = "", check.names = FALSE)

Read.table while using '#' as delimiter does not work?

The comment character is also #, so you need something like:

read.table(file = 'tmp.txt', check.names = FALSE, sep = '#',
           header = TRUE, comment.char = "@")
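For instance, with a small '#'-delimited file (invented here), the default comment handling drops everything after the first '#':

```r
tmp <- tempfile()
writeLines(c("a#b#c", "1#2#3"), tmp)

# Misbehaves: '#' is both the separator and the default comment character
try(read.table(tmp, sep = "#", header = TRUE))

# Reassigning (or disabling) the comment character reads all three columns
read.table(tmp, sep = "#", header = TRUE, comment.char = "")
```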

Issue when importing dataset: `Error in scan(...): line 1 did not have 145 elements`

This error is pretty self-explanatory: there seems to be data missing in the first line of your data file (or the second line, as the case may be, since you're using header = TRUE).

Here's a mini example:

## Create a small dataset to play with
cat("V1 V2\nFirst 1 2\nSecond 2\nThird 3 8\n", file="test.txt")

R automatically detects that it should expect rownames plus two columns (3 elements), but it doesn't find 3 elements on line 2, so you get an error:

read.table("test.txt", header = TRUE)
# Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
# line 2 did not have 3 elements

Look at the data file and see if there is indeed a problem:

cat(readLines("test.txt"), sep = "\n")
# V1 V2
# First 1 2
# Second 2
# Third 3 8

Manual correction might be needed, or we can assume that the first value in the "Second" row belongs in the first column and the other values should be NA. If that is the case, fill = TRUE is enough to solve your problem.

read.table("test.txt", header = TRUE, fill = TRUE)
# V1 V2
# First 1 2
# Second 2 NA
# Third 3 8

R is also smart enough to figure out how many elements it needs even when rownames are missing:

cat("V1 V2\n1\n2 5\n3 8\n", file="test2.txt")
cat(readLines("test2.txt"), sep = "\n")
# V1 V2
# 1
# 2 5
# 3 8
read.table("test2.txt", header = TRUE)
# Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
# line 1 did not have 2 elements
read.table("test2.txt", header = TRUE, fill = TRUE)
# V1 V2
# 1 1 NA
# 2 2 5
# 3 3 8

Read txt file with hash tag (#) delimiter

From ?read.table:

comment.char
character: a character vector of length one containing a single character or an empty string. Use "" to turn off the interpretation of comments altogether.

So you want something like read.table(*, sep="#", comment.char="")

Reading in data with unusual characters

What I'm seeing in my text editor is the use of "|" as a separator rather than "\n" and in line 1724 this sequence:

kibbled [ìGrtzeî or ìgruttenî], pearl...     

There are two different accented characters that appear to be enclosing Grtze and grutten, but the character you are seeing was not displayed.

When I read it in on a Mac with:

read.table("~/Downloads/lines/1720-1730.txt", sep="|")

The characters in question appear as:

[\x93Gr\032tze\x94 or \x93grutten\x94]

So the 'arrow' you are seeing is \032. I have found that it is rather difficult to decipher what is meant by R's various 'escaped' output. The best place to look is the ?Quotes page, and there we learn that this is octal 32, i.e. decimal 26. You might want to try this as your input strategy and see how it goes:

x <- read.table("yourpath/filename.txt", sep="|", stringsAsFactors=FALSE, allowEscapes = TRUE)

If that is insufficient, then try adding one of the encoding options "latin1", "UTF-8" or "UTF-16"; if unsuccessful, there are other Windows encodings yet to try.
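As a minimal sketch of the encoding option, using fileEncoding with a made-up one-line latin1 file (the byte 0xFC is "ü" in latin1):

```r
# A one-line, latin1-encoded file containing "Grütze|x"
tmp <- tempfile()
writeBin(c(charToRaw("Gr"), as.raw(0xFC), charToRaw("tze|x\n")), tmp)

# Declaring the encoding on input converts the byte correctly
x <- read.table(tmp, sep = "|", stringsAsFactors = FALSE,
                fileEncoding = "latin1")
x$V1  # "Grütze"
```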

When you get a message about a lower number of elements, it usually means there is an unmatched quote or an embedded hash ("#"). You can add the parameters quote = "" and comment.char = "". To see the effect of those additional arguments you can use this:

table(count.fields("yourpath/filename.txt", sep = "|",
                   quote = "", comment.char = ""))

There are further inspection maneuvers that let you see which lines are problems:

which(count.fields("yourpath/filename.txt", sep = "|",
                   quote = "", comment.char = "") == 28)
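A self-contained illustration of these diagnostics, using an invented three-column file:

```r
# A '|'-separated file in which line 2 is missing a field
tmp <- tempfile()
writeLines(c("a|b|c", "d|e", "f|g|h"), tmp)

# Tabulate fields-per-line to spot the anomaly...
table(count.fields(tmp, sep = "|", quote = "", comment.char = ""))
# 2 3
# 1 2

# ...then locate the offending line
which(count.fields(tmp, sep = "|", quote = "", comment.char = "") != 3)
# [1] 2
```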

There can be a mismatch between your locale and the file's encoding. You should report the results of sessionInfo().

Encodings I have seen mentioned as solving weird problems include "CP1252" and "Latin2" (which is ISO-8859-2), but I have discovered that the list of encodings is larger than I expected:

 iconvlist()  # 419 encodings

If you know the organization that created the file then why not ask them?

From the first of the multiple zip files included in that "master" zip file, we see this result from my suggestion to use count.fields:

table(count.fields("~/Downloads/SMKA12_2012archive/SMKA121212", quote = "",
                   sep = "|", comment.char = ""))
#    15    27    28
#     1 10228     1
which(count.fields("~/Downloads/SMKA12_2012archive/SMKA121212", quote = "",
                   sep = "|", comment.char = "") == 15)
# [1] 1
which(count.fields("~/Downloads/SMKA12_2012archive/SMKA121212", quote = "",
                   sep = "|", comment.char = "") == 28)
# [1] 10230

Reading these files on a Mac with R 3.0.1 and TextEdit.app, the first record appears to be not really a header but rather a notation, perhaps signifying the month of data recording:

000000000|||||||||||||||||||||||||||HMCUSTOMS CONTROL DATA|2012|12

The last record is a non-data trailer with a final record count appended to it:
999999999| | | | | | | | | | | | | | | | | | | | | | | | | | |0010228

So using skip = 1 and fill = TRUE should allow error-free input.

dat <- read.table("~/Downloads/SMKA12_2012archive/SMKA121212", quote = "",
                  sep = "|", comment.char = "", fill = TRUE, skip = 1,
                  colClasses = c(rep("integer", 2), rep("character", 4),
                                 rep("integer", 24 - 7 + 1), rep("character", 3)))
str(dat)
'data.frame': 10230 obs. of 27 variables:
$ V1 : int 10110100 10110900 10121000 10129100 10129900 10130000 10190000 10190110 10190190 10190300 ...
$ V2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ V3 : chr "00/00" "00/00" "01/12" "01/12" ...
$ V4 : chr "12/11" "12/11" "00/00" "00/00" ...
$ V5 : chr "00/00" "00/00" "01/12" "01/12" ...
$ V6 : chr "12/11" "12/11" "00/00" "00/00" ...
$ V7 : int 0 0 0 0 0 0 0 0 0 0 ...
$ V8 : int 150 150 150 150 150 150 150 150 150 150 ...
$ V9 : int 2 2 2 2 2 2 2 2 2 2 ...
$ V10: int 13 13 13 13 13 13 13 13 13 13 ...
$ V11: int 0 0 0 0 0 0 0 0 0 0 ...
$ V12: int 200 200 200 200 200 200 200 200 200 200 ...
$ V13: int 0 0 0 0 0 0 0 0 0 0 ...
$ V14: int 0 0 0 0 0 0 0 0 0 0 ...
$ V15: int 0 0 0 0 0 0 0 0 0 0 ...
$ V16: int 0 0 0 0 0 0 0 0 0 0 ...
$ V17: int 0 0 0 0 0 0 0 0 0 0 ...
$ V18: int 0 0 0 0 0 0 0 0 0 0 ...
$ V19: int 0 0 0 0 0 0 0 0 0 0 ...
$ V20: int 0 0 0 0 0 0 0 0 0 0 ...
$ V21: int 0 0 0 0 0 0 0 0 0 0 ...
$ V22: int 0 0 0 0 0 0 0 0 0 0 ...
$ V23: int 0 0 0 0 0 0 0 0 0 0 ...
$ V24: int 0 0 0 0 0 0 0 0 0 0 ...
$ V25: chr "KG " "KG " "KG " "KG " ...
$ V26: chr "NO " "NO " "NO " "NO " ...
$ V27: chr "Pure-bred breeding horses "| __truncated__ "Pure-bred breeding asses "| __truncated__ "Pure-bred breeding horses "| __truncated__ "Horses for slaughter "| __truncated__ ...

As far as encoding issues go, I cannot offer further insight:

Encoding(readLines("~/Downloads/SMKA12_2012archive/SMKA121212", n = 1))
#[1] "unknown"

How to properly import a semicolon-separated file

Here is what I tried (edited based on David Arenburg's comment).

Read and process the header line first; then read the remaining lines while skipping the first line:

library(data.table)

header <- strsplit(readLines('test.txt', n = 1), '\\s+')[[1]][-1]
res <- fread('test.txt', skip = 1, header = FALSE)
setnames(res, 1:3, header)

res[, strsplit(header[2], ';')[[1]] :=
      tstrsplit(get(header[2]), ';', type.convert = TRUE, fixed = TRUE)[-10]]

res[, (header[2]) := NULL]

# word WN000000 len freq mean sens npos u orthon freqn bgp
# 1: fiber 10000000 5 8.671 1 5 1 0.0000000 5 6.10 0
# 2: clad 10000000 4 6.780 2 2 1 1.0000000 8 7.84 2026
# 3: tucker 10000000 6 8.103 2 3 2 0.9182958 7 5.50 4547

It should be noted that there are 9 items separated by ";" in the second column of the first row of the input file, but the following rows have 10 ";"-separated items.
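As a self-contained base-R sketch of the same idea (the file contents and column names here are invented, and base strsplit stands in for data.table's tstrsplit):

```r
# A space-separated table whose second column packs several ';'-separated values
tmp <- tempfile()
writeLines(c("word info len",
             "fiber 8.671;1;5 5",
             "clad 6.780;2;2 4"), tmp)

res <- read.table(tmp, header = TRUE, stringsAsFactors = FALSE)

# Split the packed column into its parts and convert them to numeric
parts <- do.call(rbind, strsplit(res$info, ";", fixed = TRUE))
res[c("mean", "sens", "npos")] <-
  lapply(as.data.frame(parts, stringsAsFactors = FALSE), as.numeric)
res$info <- NULL
res
```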

What's the best way to read file rows and write them to various database tables using Perl?

Instead of different variables for different tables, use a hash to keep all the information:

my %tables = (
    R1 => [ [ 'Field_R1_C1', 11, 40 ],
            [ 'Field_R1_C2', 41, 72 ],
            [ 'Field_R1_C3', 73, 80 ],
            ...,
            [ 'Field_R1_Cn', 250, 300 ] ],
    R2 => [ [ ... ] ],
);

Then modify the writing sub to use this structure:

sub write_to_db {
    my ($row) = @_;

    my $record_type = substr $row, 0, 2;
    my %table_names = ( 10 => 'R1',
                        20 => 'R2',
                        30 => 'R3' );
    my $table = $table_names{$record_type};
    die "Unknown table for record type $record_type\n"
        unless defined $table;

    my @insert_data;
    for my $column (@{ $tables{$table} }) {
        my ($column_name, $column_start, $column_end) = @$column;
        my $record = substr $row, $column_start - 1,
                            $column_end - $column_start + 1;
        push @insert_data, $record;
    }
    my $sql_columns  = join ',', map $_->[0], @{ $tables{$table} };
    my $placeholders = join ',', ('?') x @insert_data;
    my $insert = $dbh->prepare(
        "INSERT INTO $table ($sql_columns) VALUES ($placeholders)");
    $insert->execute(@insert_data);
}

And call it for each row:

write_to_db($row);
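For comparison with the R material above: the same fixed-width, start/end-position extraction the Perl code performs with substr maps naturally onto R's read.fwf (the column positions and file contents below are made up for illustration):

```r
# Two fixed-width records: type in cols 1-2, a 5-char code, a 5-digit value
tmp <- tempfile()
writeLines(c("10ABCDE12345", "10FGHIJ67890"), tmp)

# widths plays the role of the (start, end) pairs in the Perl table
dat <- read.fwf(tmp, widths = c(2, 5, 5),
                col.names = c("type", "code", "value"))
dat
```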

