Java - Read File and Split into Multiple Files

Because the source file can be very large, each split file can be large as well.

Example:

Source file size: 5GB
Number of splits: 5
Destination file size: 1GB each (5 files)

Reading such a large split in one go is not practical, even if that much memory is available. Instead, for each split we read through a fixed-size byte array, which is known to be feasible in terms of both performance and memory.

NumSplits: 10, MaxReadBytes: 8KB

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public static void main(String[] args) throws Exception
{
    RandomAccessFile raf = new RandomAccessFile("test.csv", "r");
    long numSplits = 10; //from user input, extract it from args
    long sourceSize = raf.length();
    long bytesPerSplit = sourceSize/numSplits;
    long remainingBytes = sourceSize % numSplits;

    int maxReadBufferSize = 8 * 1024; //8KB
    for(int destIx=1; destIx <= numSplits; destIx++) {
        BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split."+destIx));
        if(bytesPerSplit > maxReadBufferSize) {
            //copy the split in 8KB pieces, then the piece that is left over
            long numReads = bytesPerSplit/maxReadBufferSize;
            long numRemainingRead = bytesPerSplit % maxReadBufferSize;
            for(long i=0; i<numReads; i++) {
                readWrite(raf, bw, maxReadBufferSize);
            }
            if(numRemainingRead > 0) {
                readWrite(raf, bw, numRemainingRead);
            }
        }else {
            readWrite(raf, bw, bytesPerSplit);
        }
        bw.close();
    }
    //bytes left over when sourceSize is not a multiple of numSplits go into one extra file
    if(remainingBytes > 0) {
        BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split."+(numSplits+1)));
        readWrite(raf, bw, remainingBytes);
        bw.close();
    }
    raf.close();
}

static void readWrite(RandomAccessFile raf, BufferedOutputStream bw, long numBytes) throws IOException {
    byte[] buf = new byte[(int) numBytes];
    //read() may return fewer bytes than requested; readFully() fills the buffer or throws EOFException
    raf.readFully(buf);
    bw.write(buf);
}
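
The fixed-size-buffer idea can also be sketched with plain streams. The helper below is only an illustration: the class name ChunkCopy and the method copyBytes are made up for this example, and in-memory streams stand in for the real files. It shows one way to copy an exact number of bytes through a small buffer while coping with reads that return less than was asked for.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class ChunkCopy {
    /** Copies exactly count bytes from in to out through a buffer of bufSize bytes. */
    static void copyBytes(InputStream in, OutputStream out, long count, int bufSize) throws IOException {
        if (bufSize <= 0) {
            throw new IllegalArgumentException("bufSize must be positive");
        }
        byte[] buf = new byte[bufSize];
        long left = count;
        while (left > 0) {
            int want = (int) Math.min(buf.length, left);
            int got = in.read(buf, 0, want);   // may return fewer bytes than requested
            if (got == -1) {
                throw new EOFException("source ended with " + left + " bytes still expected");
            }
            out.write(buf, 0, got);            // write only what was actually read
            left -= got;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] source = new byte[20_000];      // stand-in for the large source file
        InputStream in = new ByteArrayInputStream(source);
        ByteArrayOutputStream first = new ByteArrayOutputStream();
        ByteArrayOutputStream second = new ByteArrayOutputStream();
        copyBytes(in, first, 12_000, 8 * 1024);
        copyBytes(in, second, 8_000, 8 * 1024);
        System.out.println(first.size() + " + " + second.size()); // prints "12000 + 8000"
    }
}
```

Because the same input stream is passed to both calls, the second split starts where the first one stopped, which is the same role the RandomAccessFile position plays in the code above.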

How can I split a CSV file into different CSV files by line in Java?

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.Scanner;

public static void splitLargeFile(final String fileName,
                                  final String extension,
                                  final int maxLines,
                                  final boolean deleteOriginalFile) {

    try (Scanner s = new Scanner(new FileReader(String.format("%s.%s", fileName, extension)))) {
        int file = 0;
        int cnt = 0;
        BufferedWriter writer = new BufferedWriter(new FileWriter(String.format("%s_%d.%s", fileName, file, extension)));

        while (s.hasNextLine()) {
            // nextLine() returns a whole line; next() would split on any whitespace
            writer.write(s.nextLine() + System.lineSeparator());
            if (++cnt == maxLines && s.hasNextLine()) {
                // current part is full and there is more input: start the next part
                writer.close();
                writer = new BufferedWriter(new FileWriter(String.format("%s_%d.%s", fileName, ++file, extension)));
                cnt = 0;
            }
        }
        writer.close();
    } catch (Exception e) {
        e.printStackTrace();
        return; // keep the original file if the split failed
    }

    if (deleteOriginalFile) {
        try {
            File f = new File(String.format("%s.%s", fileName, extension));
            f.delete();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Splitting a text file into multiple files by specific character sequence

You want to find the lines which match ".I n", where n is a digit.

The regex you need is : ^.I \d$

  • ^ indicates the beginning of the line. Hence, if there is any whitespace or text before the first character, the line will not match the regex.
  • . matches any single character, not only a literal dot. Write \. if only a literal dot should match.
  • \d indicates any digit. For the sake of simplicity, I allow only one digit in this regex.
  • $ indicates the end of the line. Hence, if there are some characters after the digit, the line will not match the expression.
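
To see what the anchors and the digit class accept, here is a small sketch that runs the regex against a few sample lines. The class name, the sample strings and the escaped-dot variant are assumptions added for illustration; only the first pattern comes from the answer.

```java
import java.util.regex.Pattern;

final class MarkerRegexDemo {
    // The regex from the text.
    static final Pattern AS_WRITTEN = Pattern.compile("^.I \\d$");
    // Hypothetical stricter variant: the dot is escaped so only a literal '.' matches.
    static final Pattern LITERAL_DOT = Pattern.compile("^\\.I \\d$");

    static boolean matches(Pattern p, String line) {
        return p.matcher(line).matches();
    }

    public static void main(String[] args) {
        // marker line, leading space, two digits, trailing space, non-dot first character
        String[] samples = { ".I 3", " .I 3", ".I 12", ".I 3 ", "XI 3" };
        for (String s : samples) {
            System.out.println("[" + s + "] as written: " + matches(AS_WRITTEN, s)
                    + ", literal dot: " + matches(LITERAL_DOT, s));
        }
    }
}
```

Only ".I 3" matches both patterns. "XI 3" matches the regex as written but not the escaped variant, because an unescaped dot accepts any character.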

Now, you need to read the file line by line and keep a reference to the file to which you write the current line.

Reading a file line by line is much easier in Java 8 with Files.lines():

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

private String currentFile = "root.txt";

public static final String REGEX = "^.I \\d$";

public void foo() throws Exception{

    Path path = Paths.get("path/to/your/input/file.txt");
    //try-with-resources closes the underlying file once the stream is consumed
    try (Stream<String> lines = Files.lines(path)) {
        lines.forEach(line -> {
            if(line.matches(REGEX)) {
                //Extract the digit and update currentFile
                currentFile = "File DOC_ID_"+line.substring(3, line.length())+".txt";
                System.out.println("Current file is now : "+currentFile);
            } else {
                System.out.println("Writing this line to "+currentFile + " :" + line);
                //Files.write(...);
            }
        });
    }
}

Note: To extract the digit, I use a raw substring(), which I consider evil but which is easier to understand. You can do it in a better way with a Pattern and a Matcher.

With this regex: ".I (\\d)" (the same idea as before, without the anchors and with parentheses that mark what you want to capture). Then:

Pattern pattern = Pattern.compile(".I (\\d)");
Matcher matcher = pattern.matcher(".I 3");
if(matcher.find()) {
System.out.println(matcher.group(1));//display "3"
}
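
Putting the pieces together, here is a minimal sketch of a splitter that actually writes the lines out. It is one possible arrangement, not the only one: the class name MarkerSplitter, the outDir parameter and the output file names (root.txt, DOC_ID_n.txt) are assumptions for illustration. It uses the capturing regex with matches(), so the whole line has to be a marker.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MarkerSplitter {
    // Capturing regex from the text; matches() below requires the whole line to match.
    private static final Pattern MARKER = Pattern.compile(".I (\\d)");

    /** Copies the lines of input into files under outDir, switching files at every marker line. */
    static void split(Path input, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        // Lines that appear before the first marker go to root.txt (assumed name).
        BufferedWriter writer = Files.newBufferedWriter(outDir.resolve("root.txt"), StandardCharsets.UTF_8);
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = MARKER.matcher(line);
                if (m.matches()) {
                    // Marker line: close the current file and open the one named by the digit.
                    writer.close();
                    Path next = outDir.resolve("DOC_ID_" + m.group(1) + ".txt");
                    writer = Files.newBufferedWriter(next, StandardCharsets.UTF_8);
                } else {
                    writer.write(line);
                    writer.newLine();
                }
            }
        } finally {
            writer.close(); // closing an already closed writer has no effect
        }
    }
}
```

A call such as MarkerSplitter.split(Paths.get("input.txt"), Paths.get("out")) would create out/root.txt plus one file per marker. If the same digit appears twice, the second occurrence overwrites the first file, so real data may need a different naming scheme.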

how to split one text into multiple text files

In addition to my comment: since there isn't a space between the leading integer and the first word, taking the substring up to the first space doesn't work.

The question "Given a string find the first embedded occurrence of an integer" has a few options that should help. The one using the regex \d+ is the simplest imo, and is copied below.

Matcher matcher = Pattern.compile("\\d+").matcher(arrOfStr[i]);
if (matcher.find()) { //group() throws IllegalStateException if find() did not succeed
    int yourNumber = Integer.parseInt(matcher.group());
}
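
As a self-contained illustration of the \d+ approach, the sketch below wraps it in a helper that also covers strings without any digits. The class and method names are made up for this example.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstInt {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Returns the first run of digits in s as an int, or empty if s contains no digit. */
    static OptionalInt firstInt(String s) {
        Matcher m = DIGITS.matcher(s);
        if (!m.find()) {
            return OptionalInt.empty();
        }
        // Throws NumberFormatException if the digit run does not fit in an int.
        return OptionalInt.of(Integer.parseInt(m.group()));
    }

    public static void main(String[] args) {
        System.out.println(firstInt("12Hello world"));  // prints "OptionalInt[12]"
        System.out.println(firstInt("no digits here")); // prints "OptionalInt.empty"
    }
}
```

Returning an OptionalInt makes the "no number found" case explicit instead of relying on an exception from group().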

Java: Read a File & Split a line into separate string

Check how your file ends. Scanner is clever enough to throw away a single line break at the end. However, if there is anything after it (like a space or another line break), that is going to be a new line to be read.

In such cases

String descritpion = scan.nextLine();

will read an empty-ish string, then

String []temp = descritpion.split(":");

splits it into a single-item array, where

String name = temp[0];

contains the entire string (empty, or a single space, or something similar), which is why that line passes

but

String surname = temp[1];

does not exist, and that's why it throws an exception.

However in such cases a line should appear on screen prior to the exception. See test (with strings instead of files) here: https://ideone.com/ixo0kd - the no-line-break, and single-line-break cases work fine, the space-after-line-break and double-line-break cases throw the exception, but have an empty line displayed before.
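
One way to make the parsing tolerant of such trailing lines is to skip blank lines and check the array length before indexing. The following is a minimal sketch under assumed names (NameLines, parse), reading from a string rather than a file, as in the linked test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class NameLines {
    /** Parses "name:surname" lines, skipping blank lines and lines without both parts. */
    static List<String[]> parse(String text) {
        List<String[]> people = new ArrayList<>();
        try (Scanner scan = new Scanner(text)) {
            while (scan.hasNextLine()) {
                String line = scan.nextLine();
                if (line.trim().isEmpty()) {
                    continue; // empty or whitespace-only line, e.g. at the end of the input
                }
                String[] parts = line.split(":");
                if (parts.length < 2) {
                    continue; // no separator, or nothing after it
                }
                people.add(new String[] { parts[0], parts[1] });
            }
        }
        return people;
    }

    public static void main(String[] args) {
        // Two valid lines followed by an extra line break and a space.
        List<String[]> people = parse("John:Smith\nJane:Doe\n\n ");
        for (String[] p : people) {
            System.out.println(p[0] + " " + p[1]);
        }
    }
}
```

Whether malformed lines should be skipped silently or reported depends on the data. The sketch skips them to keep the example short.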

JAVA Code to split CSV file into different CSV files and extracting a single column data from parent file to child files

Your logic is very bad here. I rewrote the whole code for you:

import java.io.*;  
import java.util.Scanner;

public class FileSplit {

public static void myFunction(int lines, int files) throws FileNotFoundException, IOException{

    String inputfile = "file.csv";
    //reader for input file initialized only once; try-with-resources closes it at the end
    try (BufferedReader br = new BufferedReader(new FileReader(inputfile))) {
        String strLine = null;
        for (int i=1;i<=files;i++) {
            FileWriter fstream1 = new FileWriter("FileNumber_"+i+".csv"); //creating a new file writer.
            BufferedWriter out = new BufferedWriter(fstream1);
            for(int j=0;j<lines;j++){ //reading at most `lines` lines of the csv for this output file
                strLine = br.readLine();
                if (strLine!= null) {
                    String strar[] = strLine.split(",");
                    out.write(strar[0]); //acquiring the first column
                    out.newLine();
                }
            }
            out.close();
        }
    }
}

public static void main(String args[])
{
    try{
        int lines = 2; //set this to whatever number of lines you need in each file
        int count = 0;
        String inputfile = "file.csv";
        File file = new File(inputfile);
        Scanner scanner = new Scanner(file);
        while (scanner.hasNextLine()) { //counting the lines in the input file
            scanner.nextLine();
            count++;
        }
        scanner.close();
        System.out.println(count);
        int files=0;
        if((count%lines)==0){
            files= (count/lines);
        }
        else{
            files=(count/lines)+1;
        }
        System.out.println(files); //number of files that shall be created

        myFunction(lines,files);
    }

    catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    catch (IOException e) {
        e.printStackTrace();
    }
}

}
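
The code above reads the input twice, once to count the lines and once to split. As an alternative sketch (not the answer's method), the same output can be produced in a single pass by opening a new file whenever the current one is full. The class name, the outDir parameter and the use of java.nio are assumptions for illustration. Like the code above, it takes everything before the first comma as the first column, so quoted fields that contain commas are not handled.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class FirstColumnSplitter {
    /** Writes the first column of input into FileNumber_1.csv, FileNumber_2.csv, ... and returns the file count. */
    static int split(Path input, Path outDir, int linesPerFile) throws IOException {
        if (linesPerFile <= 0) {
            throw new IllegalArgumentException("linesPerFile must be positive");
        }
        Files.createDirectories(outDir);
        int files = 0;
        int written = 0;
        BufferedWriter out = null;
        try (BufferedReader br = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (out == null) {
                    // Open the next output file only when there is a line to put in it.
                    files++;
                    out = Files.newBufferedWriter(outDir.resolve("FileNumber_" + files + ".csv"), StandardCharsets.UTF_8);
                }
                int comma = line.indexOf(',');
                out.write(comma < 0 ? line : line.substring(0, comma)); // first column only
                out.newLine();
                if (++written == linesPerFile) {
                    out.close();
                    out = null;
                    written = 0;
                }
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return files;
    }
}
```

Since a file is created only when a line is available for it, no empty trailing file appears and the number of files never has to be computed up front.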

