How to Read a Text File into GNU R with a Multiple-Byte Separator

How to read a text file into GNU R with a multiple-byte separator?

Providing example data would help. However, you might be able to adapt the following to your needs.

I created an example data file, which is just a text file containing the following:

1sep2sep3
1sep2sep3
1sep2sep3
1sep2sep3
1sep2sep3
1sep2sep3
1sep2sep3

I saved it as 'test.csv'. The separator is the string 'sep'. read.csv() is a wrapper for read.table(), which uses scan(), and scan() only accepts a single character for sep. To get around this, consider the following:

dat <- readLines('test.csv')
dat <- gsub("sep", " ", dat)
dat <- textConnection(dat)
dat <- read.table(dat)

readLines() just reads the lines in. gsub() substitutes a single ' ' (or whatever is convenient for your data) for the multi-character separation string. Then textConnection() and read.table() read everything back in conveniently. For smaller datasets, this should be fine. If you have very large data, consider preprocessing with something like AWK to substitute the multi-character separation string. The above is from http://tolstoy.newcastle.edu.au/R/e4/help/08/04/9296.html .
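For large files, the AWK preprocessing step might look something like this (a sketch; `test.csv` and the literal separator 'sep' are taken from the example above):

```shell
# Replace every literal "sep" with a single space before R reads the file.
awk '{ gsub(/sep/, " "); print }' test.csv > test_clean.txt
```

The cleaned file can then be read with a plain read.table('test_clean.txt'), with no substitution needed inside R.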

Update
Regarding your comment, if you have spaces in your data, use a different replacement separator. Consider changing test.csv to:

1sep2 2sep3
1sep2 2sep3
1sep2 2sep3
1sep2 2sep3
1sep2 2sep3
1sep2 2sep3
1sep2 2sep3

Then, with the following function:

readMulti <- function(x, sep, replace, as.is = TRUE)
{
    dat <- readLines(x)
    dat <- gsub(sep, replace, dat)
    dat <- textConnection(dat)
    dat <- read.table(dat, sep = replace, as.is = as.is)
    return(dat)
}

Try:

readMulti('test.csv', sep = "sep", replace = "\t", as.is = T)

Here, you replace the original separator with tabs (\t). The as.is argument is passed to read.table() to prevent strings from being read in as factors, but that's your call. If you have more complicated white space within your data, you might find the quote argument of read.table() helpful, or pre-process with AWK, perl, etc.

Something similar with crippledlambda's strsplit() approach is most likely equivalent for moderately sized data. If performance becomes an issue, try both and see which works for you.

Read data separated with two colons in R

Let's say we have:

A::B::C
23::34::56
12::56::87
90::43::74

in a txt file. Then we can do:

lines <- readLines("doublesep.txt")
> lines
[1] "A::B::C" "23::34::56" "12::56::87" "90::43::74"

lines <- gsub("::", ",", lines)
> lines
[1] "A,B,C" "23,34,56" "12,56,87" "90,43,74"

Now, you can either write to a file or convert to a data.frame object:

> read.table(text=lines, sep=",", header=T)
A B C
1 23 34 56
2 12 56 87
3 90 43 74

> writeLines(lines, "doubletosingle.csv")

How to read the content of a file to a string in C?

I tend to just load the entire buffer as a raw memory chunk into memory and do the parsing on my own. That way I have best control over what the standard lib does on multiple platforms.

This is a stub I use for this. You may also want to check the error codes of fseek, ftell and fread (omitted for clarity).

char * buffer = 0;
long length;
FILE * f = fopen (filename, "rb");

if (f)
{
  fseek (f, 0, SEEK_END);
  length = ftell (f);
  fseek (f, 0, SEEK_SET);
  buffer = malloc (length + 1);   /* + 1 for the terminating NUL */
  if (buffer)
  {
    fread (buffer, 1, length, f);
    buffer[length] = '\0';        /* make it a valid C string */
  }
  fclose (f);
}

if (buffer)
{
  /* start to process your data / extract strings here... */
}

Reading text file with varying column width but fixed delimiter in R

I don't know of good tools that look for multi-character delimiters, and you aren't the first to ask about this. Most readers (including read.table, read.delim, and readr::read_delim) require a single-byte separator.

One method, though certainly not efficient for large files, is to load them in line-wise and do the splitting yourself.

(Consumable data at the bottom.)

x <- readLines(textConnection(file1))
x <- x[x != 'header'] # or x <- x[-(1:5)]

(I'm guessing it isn't always the literal header, so I'm assuming it's either a fixed count or you can easily "know" which is which.)

spl <- strsplit(x, '   ')
str(spl)
# List of 3
# $ : chr [1:31] "01130009.JPG" "JPEG" "" "" ...
# $ : chr [1:20] "01130009.JPG" "JPEG" "" "" ...
# $ : chr [1:7] "01130009.JPG" "JPEG" "" "" ...

This seems ok, except that in your examples, there are lots of blanks on the right ...

spl[[1]]
# [1] "01130009.JPG"
# [2] "JPEG"
# [3] ""
# [4] ""
# [5] "2/5/2018 3:53:44 PM"
# [6] "G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg"
# [7] "Gray Fox"
# [8] ""
# [9] ""
# [10] ""
# [11] ""
# [12] ""
# [13] ""
# [14] ""
# [15] ""
# [16] ""
# [17] ""
# [18] ""
# [19] ""
# [20] ""
# [21] ""
# [22] ""
# [23] ""
# [24] ""
# [25] ""
# [26] ""
# [27] ""
# [28] ""
# [29] ""
# [30] ""
# [31] ""

So if you know how many columns there are, then you can easily remove extras:

spl <- lapply(spl, `[`, 1:7)

and then check the output:

as.data.frame(do.call(rbind, spl), stringsAsFactors = FALSE)
# V1 V2 V3 V4 V5
# 1 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# 2 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# 3 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# V6
# 1 G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
# 2 G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
# 3 G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
# V7
# 1 Gray Fox
# 2 Direct Register Walk, Gait, Gray Fox, Stop
# 3 Gray Fox

This works equally well with your second example:

x <- readLines(textConnection(file2))
x <- x[x != 'header'] # or x <- x[-(1:5)]
spl <- lapply(strsplit(x, ' '), `[`, 1:7)
as.data.frame(do.call(rbind, spl), stringsAsFactors = FALSE)
# V1 V2 V3 V4 V5
# 1 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# 2 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# 3 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# V6
# 1 G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB
# 2 G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB
# 3 G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB
# V7
# 1 Gray Fox
# 2 Direct Register Walk, Gait, Gray Fox, Stop
# 3 Gray Fox

Consumable data:

# note: replaced single '\' with double '\\' for R string-handling only
file1 <- 'header
header
header
header
header
01130009.JPG JPEG 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg Gray Fox
01130009.JPG JPEG 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg Direct Register Walk, Gait, Gray Fox, Stop
01130009.JPG JPEG 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg Gray Fox '
file2 <- 'header
header
header
header
header
01130009.JPG JPEG 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB Gray Fox
01130009.JPG JPEG 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB Direct Register Walk, Gait, Gray Fox, Stop
01130009.JPG JPEG 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB Gray Fox '

Split a text file using gsplit on a delimiter on OSX Mojave

How about:

awk -F'abc' 'BEGIN { RS = "^$" } { for (i=1;i<NF;i++) { system("echo \""$i"\" > "i"-abc.txt") } }' abc.txt

We set the record separator to ^$ so the whole file is processed as one record. Then we set 'abc' as the field delimiter, loop over the fields, and use the system command to echo each field out to its own file, named with the field number and an -abc.txt suffix (1-abc.txt, 2-abc.txt, ...).

abc.txt holds the original data
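If the quoting inside system() gets awkward, a variant is to let awk write the files directly (a sketch assuming an awk that supports a multi-character RS, such as GNU awk, and the same abc.txt input):

```shell
# Split abc.txt on the literal delimiter "abc"; piece N goes to N-abc.txt.
awk 'BEGIN { RS = "abc" } { printf "%s", $0 > (NR "-abc.txt") }' abc.txt
```

This avoids spawning a shell per piece and keeps each piece's contents byte-for-byte intact.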

How to use indirect reference to read contents into a data table in R

R doesn't really have references like that, but you can use strings to retrieve/create variables of that name.

But first let me say this is generally not a good practice. If you're looking to do this type of thing, it's generally a sign that you're not doing it "the R way".

Nevertheless

assign(refToMyTable, read.csv(inputFile, sep=",", header=T))

should do the trick. And the complement to assign is get, which retrieves a variable's value using its name.

#include a text file in a C program as a char[]

I'd suggest using the Unix utility xxd for this. You can use it like so:

$ echo hello world > a
$ xxd -i a

outputs:

unsigned char a[] = {
0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x0a
};
unsigned int a_len = 12;

How to split a file into equal parts, without breaking individual lines?

If you mean an equal number of lines, split has an option for this:

split --lines=75

If you need to know what that 75 should really be for N equal parts, it's:

lines_per_part = (total_lines + N - 1) / N    # integer division; this rounds up

where total lines can be obtained with wc -l.

See the following script for an example:

#!/usr/bin/bash

# Configuration stuff

fspec=qq.c
num_files=6

# Work out lines per file.

total_lines=$(wc -l <${fspec})
((lines_per_file = (total_lines + num_files - 1) / num_files))

# Split the actual file, maintaining lines.

split --lines=${lines_per_file} ${fspec} xyzzy.

# Debug information

echo "Total lines = ${total_lines}"
echo "Lines per file = ${lines_per_file}"
wc -l xyzzy.*

This outputs:

Total lines     = 70
Lines per file = 12
12 xyzzy.aa
12 xyzzy.ab
12 xyzzy.ac
12 xyzzy.ad
12 xyzzy.ae
10 xyzzy.af
70 total

More recent versions of split allow you to specify a number of CHUNKS with the -n/--number option. You can therefore use something like:

split --number=l/6 ${fspec} xyzzy.

(that's ell-slash-six, meaning lines, not one-slash-six).

That will give you roughly equal files in terms of size, with no mid-line splits.

I mention that last point because --number=l/N gives each file roughly the same number of characters, not the same number of lines.

So, if you have one 20-character line and 19 1-character lines (twenty lines in total) and split to five files, you most likely won't get four lines in every file.
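The chunk mode can be tried out on a throwaway file (a sketch assuming GNU coreutils split; input.txt and the xyzzy. prefix are just placeholders):

```shell
# Split a 20-line file into 5 line-aware chunks of roughly equal byte size.
seq 1 20 > input.txt
split --number=l/5 input.txt xyzzy.
wc -l xyzzy.*
```

Concatenating the chunks back together reproduces the original file exactly, since no line is ever split mid-way.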


