Difference Between Parsing a Text File in R and Rb Mode

Difference between parsing a text file in r and rb mode

This depends a little bit on what version of Python you're using. In Python 2, Chris Drappier's answer applies.

In Python 3, it's a different (and more consistent) story: in text mode ('r'), Python decodes the file using the text encoding you give it (or, if you don't give one, a platform-dependent default), and read() returns a str. In binary mode ('rb'), Python does not assume that the file contains anything that can reasonably be decoded as characters, and read() returns a bytes object.
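As a rough illustration of the str versus bytes distinction in Python 3, here is a minimal sketch. The scratch file name sample.txt and its contents are made up for the example, and UTF-8 is an assumed encoding rather than anything the question specified.

```python
import os
import tempfile

# Write a few UTF-8 encoded bytes to a scratch file (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write("café\n".encode("utf-8"))

# Text mode: the bytes are decoded with the given encoding, read() returns str.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: nothing is decoded, read() returns the raw bytes.
with open(path, "rb") as f:
    raw = f.read()

print(type(text).__name__, len(text))  # str, counted in characters
print(type(raw).__name__, len(raw))    # bytes, counted in bytes
```

The two lengths differ because "é" is one character but two bytes in UTF-8.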

Also, in Python 3, universal newlines (the translation between '\n' and platform-specific newline conventions, so that you don't have to care about them) are available for text-mode files on every platform, not just Windows.
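The newline translation can be seen with a small experiment. This is only a sketch: the file name newlines.txt and the sample bytes are invented for the example. It relies on the documented behaviour of the newline parameter of open(), where the default (None) translates line endings on read and newline="" leaves them untouched.

```python
import os
import tempfile

# A scratch file mixing the three common line-ending conventions.
path = os.path.join(tempfile.mkdtemp(), "newlines.txt")
with open(path, "wb") as f:
    f.write(b"unix\nwindows\r\nold-mac\rend")

# Default text mode: '\r\n' and '\r' are both translated to '\n'.
with open(path, "r", encoding="ascii") as f:
    translated = f.read()

# newline="" keeps text mode (str results) but disables the translation.
with open(path, "r", encoding="ascii", newline="") as f:
    untranslated = f.read()

# Binary mode never translates anything.
with open(path, "rb") as f:
    raw = f.read()

print(repr(translated))
print(repr(untranslated))
print(repr(raw))
```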

What is the difference between rb and r+b modes in file objects

r+ opens the file for both reading and writing. b selects binary mode.
r+b therefore opens an existing binary file for both reading and writing.
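A short sketch of what r+b allows that rb does not: reading some bytes and then overwriting part of the same file in place. The file name data.bin and the byte values are placeholders chosen for the example.

```python
import os
import tempfile

# Create a small binary file to work on (hypothetical name and contents).
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"\x00\x01\x02\x03")

# r+b: read and write the same binary file without truncating it.
with open(path, "r+b") as f:
    first_two = f.read(2)   # reading works, as it would with rb
    f.seek(0)               # move back to the start
    f.write(b"\xff")        # overwrite the first byte in place

with open(path, "rb") as f:
    result = f.read()

print(first_two, result)
```

With plain rb the write() call would raise io.UnsupportedOperation, since the file is opened read-only.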

You can read more here.

What's the difference between r and rb in fopen

You should use "r" for opening text files. Different operating systems have slightly different ways of storing text, and this mode performs the correct translations so that you don't need to know about the idiosyncrasies of the local operating system. For example, you know that newlines will always appear as a simple "\n", regardless of where the code runs.

You should use "rb" if you're opening non-text files, because in this case, the translations are not appropriate.

Parsing a text file by a delimiter and outputting multiple files with R

This isn't the most elegant answer but this got me what I needed. I'll try out the other answer, it's a good idea to keep the data in my R environment so I can run all my metrics without reading in unnecessary files. Thanks @Till

#~~~~~~~~~~~~~~~~~~~~~~#
#~~ Parse Server Log ~~#
#~~~~~~~~~~~~~~~~~~~~~~#

# Read the whole log into memory, one element per line
serverLog <- "server-out.min"
conn <- file(serverLog, open = "r")
linn <- readLines(conn)
close(conn)
num <- 1

# Loop through the lines (seq_along() is safe for an empty file)
for (i in seq_along(linn)) {
  # current output file
  outFile <- paste0("server-log-", num)
  # append the line to the current output file
  write(linn[i], file = outFile, append = TRUE)

  # Check for Monthly delimiter; later lines go to the next file
  if (grepl("Monthly", linn[i])) {
    print("Found Monthly Breakpoint")
    num <- num + 1
  }
}

Difference between modes a, a+, w, w+, and r+ in built-in open function?

The opening modes are exactly the same as those for the C standard library function fopen().

The BSD fopen manpage defines them as follows:

 The argument mode points to a string beginning with one of the following
 sequences (Additional characters may follow these sequences.):

 ``r''   Open text file for reading.  The stream is positioned at the
         beginning of the file.

 ``r+''  Open for reading and writing.  The stream is positioned at the
         beginning of the file.

 ``w''   Truncate file to zero length or create text file for writing.
         The stream is positioned at the beginning of the file.

 ``w+''  Open for reading and writing.  The file is created if it does not
         exist, otherwise it is truncated.  The stream is positioned at
         the beginning of the file.

 ``a''   Open for writing.  The file is created if it does not exist.  The
         stream is positioned at the end of the file.  Subsequent writes
         to the file will always end up at the then current end of file,
         irrespective of any intervening fseek(3) or similar.

 ``a+''  Open for reading and writing.  The file is created if it does not
         exist.  The stream is positioned at the end of the file.  Subse-
         quent writes to the file will always end up at the then current
         end of file, irrespective of any intervening fseek(3) or similar.
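To see some of these modes through Python's built-in open(), here is a minimal sketch. The file name modes.txt is made up for the example. It only checks behaviour that should be the same across platforms: "w" creates and truncates, "a" appends, and "a+" starts positioned at the end of the file, so a seek is needed before reading.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")

# "w": the file does not exist yet, so it is created.
with open(path, "w") as f:
    f.write("one\n")

# "a": existing content is kept, new data lands at the end.
with open(path, "a") as f:
    f.write("two\n")

# "a+": readable as well, but the stream starts at the end of the file.
with open(path, "a+") as f:
    at_open = f.read()      # nothing left to read from the end
    f.seek(0)
    after_seek = f.read()   # the whole file

# "w" on an existing file truncates it to zero length.
with open(path, "w") as f:
    pass
size_after_w = os.path.getsize(path)

print(repr(at_open), repr(after_seek), size_after_w)
```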

Parsing Speech Transcripts Using R

It's hard to know exactly what your input format is, since the example is not fully reproducible, but let's assume that the text printed in the question consists of lines from a single text file. Here, I saved it (without the double quotes) as such a text file, example.txt.

We designed corpus_segment() for this use case.

library("quanteda")
## Package version: 1.3.14

example_corpus <- readtext::readtext("example.txt") %>%
  corpus()
summary(example_corpus)
## Corpus consisting of 1 document:
## 
##         Text Types Tokens Sentences
##  example.txt    93    141         8
## 
## Source: /private/var/folders/1v/ps2x_tvd0yg0lypdlshg_vwc0000gp/T/RtmpXk3YHc/reprex1325b73a1073d/* on x86_64 by kbenoit
## Created: Wed Jan  9 19:09:55 2019
## Notes:

example_corpus2 <-
  corpus_segment(example_corpus, pattern = "sr\\..*-", valuetype = "regex")
summary(example_corpus2)
## Corpus consisting of 2 documents:
## 
##           Text Types Tokens Sentences                        pattern
##  example.txt.1    10     10         1     sr. presidente domínguez.-
##  example.txt.2    80    117         7 sr. ATANASOF, ALFREDO NESTOR.-
## 
## Source: /private/var/folders/1v/ps2x_tvd0yg0lypdlshg_vwc0000gp/T/RtmpXk3YHc/reprex1325b73a1073d/* on x86_64 by kbenoit
## Created: Wed Jan  9 19:09:55 2019
## Notes: corpus_segment.corpus(example_corpus, pattern = "sr\\..*-", valuetype = "regex")

We can tidy that up a bit.

# clean up pattern by removing unneeded elements
docvars(example_corpus2, "pattern") <-
  stringi::stri_replace_all_fixed(docvars(example_corpus2, "pattern"),
    c("sr. ", ".-"), "",
    vectorize_all = FALSE
  )

names(docvars(example_corpus2))[1] <- "speaker"

summary(example_corpus2)
## Corpus consisting of 2 documents:
## 
##           Text Types Tokens Sentences                  speaker
##  example.txt.1    10     10         1     presidente domínguez
##  example.txt.2    80    117         7 ATANASOF, ALFREDO NESTOR
## 
## Source: /private/var/folders/1v/ps2x_tvd0yg0lypdlshg_vwc0000gp/T/RtmpXk3YHc/reprex1325b73a1073d/* on x86_64 by kbenoit
## Created: Wed Jan  9 19:09:55 2019
## Notes: corpus_segment.corpus(example_corpus, pattern = "sr\\..*-", valuetype = "regex")

Difference between r+ and w+ in fopen()

The main difference is that w+ truncates the file to zero length if it exists, or creates a new file if it doesn't, while r+ neither deletes the content nor creates a new file if it doesn't exist.

Try these two programs and you will understand:

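#include <stdio.h>

int main(void)
{
   FILE *fp;

   /* "w+" creates test.txt, or truncates it if it already exists */
   fp = fopen("test.txt", "w+");
   if (fp == NULL) {
      perror("fopen");
      return 1;
   }
   fprintf(fp, "This is testing for fprintf...\n");
   fputs("This is testing for fputs...\n", fp);
   fclose(fp);
   return 0;
}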

and then this

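#include <stdio.h>

int main(void)
{
   FILE *fp;

   /* reopening with "w+" truncates the existing file to zero length */
   fp = fopen("test.txt", "w+");
   if (fp == NULL) {
      perror("fopen");
      return 1;
   }
   fclose(fp);
   return 0;
}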

If you open test.txt now, you will see that all the data written by the first program has been erased.

Repeat this for r+ and see the result.
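The same experiment can be run quickly from Python, whose open() accepts the same r+ and w+ mode strings. This is only a sketch of the comparison, not a replacement for the C programs; the test.txt path inside a temporary directory is an assumption made so that the example does not touch existing files.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.txt")

# r+ does not create a missing file: opening it fails instead.
try:
    with open(path, "r+"):
        created_by_r_plus = True
except FileNotFoundError:
    created_by_r_plus = False

# w+ creates the file and allows reading back what was written.
with open(path, "w+") as f:
    f.write("This is testing...\n")
    f.seek(0)
    written = f.read()

# r+ on the existing file keeps its content.
with open(path, "r+") as f:
    kept = f.read()

# w+ on the existing file erases its content.
with open(path, "w+") as f:
    erased = f.read()

print(created_by_r_plus, repr(written), repr(kept), repr(erased))
```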

Here is the summary of different file modes:

[Image: summary of the file modes]
