Get the Number of Lines in a Text File Using R

If you:

  • still want to avoid the system call that a system2("wc"… will cause
  • are on BSD/Linux or OS X (I didn't test the following on Windows)
  • don't mind using a full file path
  • are comfortable using the inline package

then the following should be about as fast as you can get (it's pretty much the 'line count' portion of wc in an inline R C function):

library(inline)

wc.code <- "
  uintmax_t linect = 0;   /* newlines counted in this file */
  uintmax_t tlinect = 0;  /* total returned to R */

  int fd, len;
  u_char *p;

  struct statfs fsb;

  static off_t buf_size = SMALL_BUF_SIZE;
  static u_char small_buf[SMALL_BUF_SIZE];
  static u_char *buf = small_buf;

  PROTECT(f = AS_CHARACTER(f));

  if ((fd = open(CHAR(STRING_ELT(f, 0)), O_RDONLY, 0)) >= 0) {

    /* fall back to the small buffer size if fstatfs fails */
    if (fstatfs(fd, &fsb)) {
      fsb.f_iosize = SMALL_BUF_SIZE;
    }

    /* size the read buffer to the preferred I/O size of the filesystem */
    if (fsb.f_iosize != buf_size) {
      if (buf != small_buf) {
        free(buf);
      }
      if (fsb.f_iosize == SMALL_BUF_SIZE || !(buf = malloc(fsb.f_iosize))) {
        buf = small_buf;
        buf_size = SMALL_BUF_SIZE;
      } else {
        buf_size = fsb.f_iosize;
      }
    }

    while ((len = read(fd, buf, buf_size))) {

      /* stop on a read error; fd is closed once, after the loop */
      if (len == -1) {
        break;
      }

      /* count newline bytes in this chunk */
      for (p = buf; len--; ++p)
        if (*p == '\\n')
          ++linect;
    }

    tlinect += linect;

    (void)close(fd);

  }

  SEXP result;
  PROTECT(result = NEW_INTEGER(1));
  INTEGER(result)[0] = tlinect;
  UNPROTECT(2);
  return(result);
";

setCMethod("wc",
           signature(f = "character"),
           wc.code,
           includes = c("#include <stdlib.h>",
                        "#include <stdio.h>",
                        "#include <sys/param.h>",
                        "#include <sys/mount.h>",
                        "#include <sys/stat.h>",
                        "#include <ctype.h>",
                        "#include <err.h>",
                        "#include <errno.h>",
                        "#include <fcntl.h>",
                        "#include <locale.h>",
                        "#include <stdint.h>",
                        "#include <string.h>",
                        "#include <unistd.h>",
                        "#include <wchar.h>",
                        "#include <wctype.h>",
                        "#define SMALL_BUF_SIZE (1024 * 8)"),
           language = "C",
           convention = ".Call")

wc("FULLPATHTOFILE")

It'd be better as a package since it actually has to compile the first time through. But, it's here for reference if you really do need "speed". For a 189,955 line file I had lying around, I get (mean values from a bunch of runs):

   user  system elapsed 
  0.007   0.003   0.010 

Getting the number of lines in a text file in R

As I understand it, you are simply trying to figure out how many lines are in your txt file. Is that correct? If so, something like this will work:

dat <- readLines("path/to/txtfile.txt")
length(dat)
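
readLines holds every line in memory, which can be costly for very large files. A minimal sketch of an alternative, assuming a "line" means a newline byte (the same definition wc -l uses): read the file in binary chunks and count the 0x0A bytes. The function name count_newlines and the chunk size are hypothetical choices for illustration.

```r
# Hypothetical helper: counts newline bytes (0x0A) chunk by chunk,
# so the whole file never has to fit in memory at once.
count_newlines <- function(path, chunk_size = 65536L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  n <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break          # end of file
    n <- n + sum(chunk == as.raw(10L))      # 10 is the byte value of "\n"
  }
  n
}

# Usage on a small temporary file with three lines
tmp <- tempfile(fileext = ".txt")
writeLines(c("a", "b", "c"), tmp)
count_newlines(tmp)
```

Like wc -l, this does not count a final line that lacks a trailing newline.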

How to get the exact count of lines in a very large text file in R?

1) wc This should be quite fast. First determine the filenames; here we assume all files in the current directory with the extension .txt, so change the pattern as needed. Then run wc -l on each file and form a data frame from the output.

(If you are on Windows then install Rtools and ensure that \Rtools\bin is on your PATH.)

filenames <- dir(pattern = "[.]txt$")
# system() is available on every platform; shell() exists only on Windows
wc <- function(x) system(paste("wc -l", x), intern = TRUE)
DF <- read.table(text = sapply(filenames, wc), col.names = c("count", "filename"))

2) count.fields An alternative approach is to use count.fields. This does not make use of any external commands. filenames is from above.

sapply(filenames, function(x) length(count.fields(x, sep = "\1")))
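
One thing to watch for: with its default arguments count.fields skips blank lines and also applies quote and comment handling, so the result can differ from wc -l. The sketch below, which uses a made-up three-line temporary file, shows the difference and one way to switch those behaviors off.

```r
# Small demo file with one blank line in the middle
tmp <- tempfile(fileext = ".txt")
writeLines(c("a", "", "b"), tmp)

# Default settings: the blank line is skipped
n_default <- length(count.fields(tmp, sep = "\1"))

# Count every line: keep blank lines, disable quote and comment handling
n_all <- length(count.fields(tmp, sep = "\1", quote = "",
                             comment.char = "", blank.lines.skip = FALSE))

c(default = n_default, all = n_all)
```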

Extracting numbers from text file in R

This should work:

# pattern is a regular expression, not a glob, so match the extension explicitly
files <- list.files(path = "directory/info/", pattern = "\\.txt$", full.names = TRUE)
data <- lapply(files, function(x) {
  # the data we're interested in doesn't seem to be a table
  # easier to read it in as a character vector
  datxt <- readLines(x)

  # keep only the line with the text we're looking for
  datxt <- datxt[grepl(pattern = "never classified (0)", x = datxt, fixed = TRUE)]

  # get the number from that line
  n <- sub(pattern = "never classified (0)", replacement = "", x = datxt, fixed = TRUE)
  n <- as.numeric(trimws(n))

  return(data.frame(file = x, NoOfReturn = n))
})
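
The call above returns a list with one data frame per file. As a self-contained sketch of the same steps, the example below builds two made-up files in a temporary directory (the file names and numbers are invented for illustration) and then binds the per-file results into a single data frame with do.call(rbind, ...).

```r
# Made-up input files, only for illustration
dir_path <- tempfile("info")
dir.create(dir_path)
writeLines(c("header", "never classified (0)     120", "other"),
           file.path(dir_path, "a.txt"))
writeLines("never classified (0)  7", file.path(dir_path, "b.txt"))

files <- list.files(dir_path, pattern = "\\.txt$", full.names = TRUE)
data <- lapply(files, function(x) {
  datxt <- readLines(x)
  # keep the matching line, strip the label, convert what is left to a number
  datxt <- datxt[grepl("never classified (0)", datxt, fixed = TRUE)]
  n <- as.numeric(trimws(sub("never classified (0)", "", datxt, fixed = TRUE)))
  data.frame(file = basename(x), NoOfReturn = n)
})

# Combine the list of one-row data frames into one table
result <- do.call(rbind, data)
result
```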

Read lines by number from a large file

The trick is to use a connection and open it before calling read.table:

con <- file('filename')
open(con)

read.table(con, skip = 5, nrow = 1)   # 6th line
read.table(con, skip = 20, nrow = 1)  # 27th line (skip counts from the current position)
...
close(con)

You may also try scan; it is faster and gives more control.
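
A minimal sketch of the scan variant, assuming each line should come back as a single string. The 30-line temporary file is made up for the demonstration; as with read.table, skip is counted from the connection's current position because the connection is already open.

```r
# Made-up file with 30 numbered lines
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:30), tmp)

con <- file(tmp, open = "r")
# sep = "\n" makes scan treat each whole line as one field
sixth <- scan(con, what = "", sep = "\n", skip = 5, nlines = 1, quiet = TRUE)
# 20 more lines are skipped from where the previous call stopped
twenty_seventh <- scan(con, what = "", sep = "\n", skip = 20, nlines = 1, quiet = TRUE)
close(con)

sixth
twenty_seventh
```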

Different number of lines when loading a file into R

Based on the data you have provided, try using sep = "\n". By using sep = "\n" we ensure that each line is read as a single column value. Additionally, quote does not need to be used at all. There is no header in your example data, so I would remove that argument as well.

All that said, the following code should get the job done.

table <- read.table(file.choose(), sep = "\n")

Text Line Count in R

This is raised in the comments, but it really bears being its own answer:

You cannot "count lines" without defining what a "line" is. A line is a very vague concept and can vary by the program being used.

Unless of course the data contains some indicator of a line break, such as \n. But even then, you would not be counting lines, you would be counting linebreaks. You would then have to ask yourself if the hardcoded line break is in accord with what you are hoping to analyze.
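
To make the distinction concrete, here is a small sketch using a made-up file whose last line has no trailing line break. Counting lines with readLines and counting newline bytes give different answers.

```r
# Two lines of text, but only one line break (no newline after "second")
tmp <- tempfile()
cat("first\nsecond", file = tmp)

# Lines as readLines sees them (warn = FALSE silences the incomplete-line warning)
n_lines <- length(readLines(tmp, warn = FALSE))

# Line breaks: count the newline bytes (value 10) in the raw file contents
bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
n_breaks <- sum(bytes == as.raw(10L))

c(lines = n_lines, linebreaks = n_breaks)
```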

--

If your data does not contain linebreaks, but you still want to count the number of lines, then we're back to the question of "how do you define a line"? The most basic way, as @flodel suggests, is to use character length. For example, you can define a line as 76 characters long, and then take

ceiling(nchar(X) / 76)

This of course assumes that you can cut words. (If you need words to remain whole, then you have to get craftier.)
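
One possible way to keep words whole, offered as a sketch rather than the only approach, is base R's strwrap, which breaks text at whitespace. The sample string X below is made up; on other inputs the two counts can differ, because strwrap never splits a word.

```r
# Made-up text: 40 four-letter words separated by single spaces
X <- paste(rep("word", 40), collapse = " ")

# Fixed-width definition: cut every 76 characters, even mid-word
n_fixed <- ceiling(nchar(X) / 76)

# Word-preserving definition: wrap at whitespace, then count the pieces
n_wrapped <- length(strwrap(X, width = 76))

c(fixed = n_fixed, wrapped = n_wrapped)
```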

Print first few lines of a text file in R

Use writeLines with readLines:

writeLines(readLines("file.txt", 2))

giving:

some text on line 1
some more text on line 2

This could alternately be written as the following pipeline. It gives the same output:

library(magrittr)

"file.txt" %>% readLines(2) %>% writeLines

Can I skip until I reach a certain line when iterating through text in RStudio?

You cannot do that in base R functions, and I don't know of a package that directly provides that. However, here are two ways to get the effect.

First, a file named file.txt:

I want to skip this
and this too
Absolute Irradiance
I need this line
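
Read the whole file, then keep everything from the first matching line onward (cumany comes from the dplyr package):

library(dplyr)
txt <- readLines("file.txt")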
txt[cumany(grepl("Absolute Irradiance", txt))]
# [1] "Absolute Irradiance" "I need this line"

If you don't want the "Irradiance" line but want everything after it, then add [-1] to remove the first of the returned lines:

txt[cumany(grepl("Absolute Irradiance", txt))][-1]
# [1] "I need this line"
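
If you would rather not load a package for this, a base R sketch of the same filter uses cumsum: the running count of matches is zero before the first match and positive from it onward. The vector txt below stands in for the result of readLines("file.txt").

```r
# Stand-in for readLines("file.txt")
txt <- c("I want to skip this", "and this too",
         "Absolute Irradiance", "I need this line")

# TRUE from the first matching line to the end
keep <- cumsum(grepl("Absolute Irradiance", txt, fixed = TRUE)) > 0

from_marker <- txt[keep]        # marker line and everything after it
after_marker <- txt[keep][-1]   # everything after the marker line
```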

If the file is relatively large and you do not want to read all of it into R, then

system2("sed", c("-ne", "'/Absolute Irradiance/,$p'", "file.txt"), stdout = TRUE)
# [1] "Absolute Irradiance" "I need this line"

This second technique is really not that great ... it might be better to run that from file.txt into a second (temp) file, then just readLines("tempfile.txt") directly.
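
Another option that avoids both sed and a temporary file is to read through an open connection in chunks until the marker appears, then read the remainder. This is only a sketch; the helper name read_from_marker and the chunk size are hypothetical.

```r
# Hypothetical helper: returns the marker line and all lines after it,
# discarding earlier lines chunk by chunk instead of keeping them in memory.
read_from_marker <- function(path, marker, chunk = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  repeat {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) return(character(0))   # marker never found
    hit <- which(grepl(marker, lines, fixed = TRUE))
    if (length(hit) > 0L) {
      # rest of this chunk from the first match, plus everything still unread
      return(c(lines[hit[1]:length(lines)], readLines(con)))
    }
  }
}

# Usage with a made-up copy of the example file
tmp <- tempfile(fileext = ".txt")
writeLines(c("I want to skip this", "and this too",
             "Absolute Irradiance", "I need this line"), tmp)
res <- read_from_marker(tmp, "Absolute Irradiance")
res
```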
