Java - Read file and split into multiple files
Since one file can be very large, each split file could be large as well.
Example:
Source File Size: 5GB
Num Splits: 5
Destination File Size: 1GB each (5 files)
There is no way to read such a large split in one go, even if we had that much memory. Instead, for each split we can read a fixed-size byte array,
which we know is feasible in terms of both performance and memory.
NumSplits: 10 MaxReadBytes: 8KB
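import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
public static void main(String[] args) throws Exception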
{
RandomAccessFile raf = new RandomAccessFile("test.csv", "r");
long numSplits = 10; //from user input, extract it from args
long sourceSize = raf.length();
long bytesPerSplit = sourceSize/numSplits ;
long remainingBytes = sourceSize % numSplits;
int maxReadBufferSize = 8 * 1024; //8KB
for(int destIx=1; destIx <= numSplits; destIx++) {
BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split."+destIx));
if(bytesPerSplit > maxReadBufferSize) {
long numReads = bytesPerSplit/maxReadBufferSize;
long numRemainingRead = bytesPerSplit % maxReadBufferSize;
for(int i=0; i<numReads; i++) {
readWrite(raf, bw, maxReadBufferSize);
}
if(numRemainingRead > 0) {
readWrite(raf, bw, numRemainingRead);
}
}else {
readWrite(raf, bw, bytesPerSplit);
}
bw.close();
}
if(remainingBytes > 0) {
BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split."+(numSplits+1)));
readWrite(raf, bw, remainingBytes);
bw.close();
}
raf.close();
}
static void readWrite(RandomAccessFile raf, BufferedOutputStream bw, long numBytes) throws IOException {
byte[] buf = new byte[(int) numBytes];
raf.readFully(buf); //read() may return fewer bytes than requested; readFully() fills the whole buffer
bw.write(buf);
}
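The fixed-size-buffer idea can also be sketched as a single copy loop that never asks for more than the buffer holds and writes only the bytes actually read. This is a minimal sketch, not the answer's own code: the class name `ByteSplitter`, the helper `copyExactly`, and the output naming (`<source>.1` to `<source>.N`) are hypothetical, and it assumes the last part absorbs the leftover bytes instead of writing them to an extra file.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class ByteSplitter {
    private static final int BUFFER_SIZE = 8 * 1024; // 8KB, matching MaxReadBytes above

    /** Copies exactly count bytes from in to out, reading at most buf.length bytes at a time. */
    static void copyExactly(InputStream in, OutputStream out, long count, byte[] buf) throws IOException {
        long remaining = count;
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n == -1) {
                throw new EOFException("source ended " + remaining + " bytes early");
            }
            out.write(buf, 0, n); // write only what was actually read
            remaining -= n;
        }
    }

    /** Splits source into numSplits sibling files named <source>.1 ... <source>.N (hypothetical naming). */
    static List<Path> split(Path source, int numSplits) throws IOException {
        long size = Files.size(source);
        long bytesPerSplit = size / numSplits;
        long leftover = size % numSplits;
        byte[] buf = new byte[BUFFER_SIZE];
        List<Path> parts = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(source))) {
            for (int i = 1; i <= numSplits; i++) {
                // Assumption: the last part absorbs the leftover bytes.
                long partSize = bytesPerSplit + (i == numSplits ? leftover : 0);
                Path part = source.resolveSibling(source.getFileName() + "." + i);
                try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(part))) {
                    copyExactly(in, out, partSize, buf);
                }
                parts.add(part);
            }
        }
        return parts;
    }
}
```

Calling `ByteSplitter.split(Paths.get("test.csv"), 10)` would write `test.csv.1` through `test.csv.10` next to the source. Memory use stays at one 8KB buffer regardless of how large each part is.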
How can I split a CSV file into different CSV files by line in Java?
public static void splitLargeFile(final String fileName,
final String extension,
final int maxLines,
final boolean deleteOriginalFile) {
try (Scanner s = new Scanner(new FileReader(String.format("%s.%s", fileName, extension)))) {
int file = 0;
int cnt = 0;
BufferedWriter writer = new BufferedWriter(new FileWriter(String.format("%s_%d.%s", fileName, file, extension)));
while (s.hasNextLine()) { // hasNext()/next() would split on whitespace, not on line breaks
writer.write(s.nextLine() + System.lineSeparator());
if (++cnt == maxLines && s.hasNextLine()) {
writer.close();
writer = new BufferedWriter(new FileWriter(String.format("%s_%d.%s", fileName, ++file, extension)));
cnt = 0;
}
}
writer.close();
} catch (Exception e) {
e.printStackTrace();
}
if (deleteOriginalFile) {
try {
File f = new File(String.format("%s.%s", fileName, extension));
f.delete();
} catch (Exception e) {
e.printStackTrace();
}
}
}
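The same line-based split can be sketched with `BufferedReader.readLine()`, which reads whole lines by definition and avoids Scanner's tokenizing. This is a minimal sketch under stated assumptions: the class name `LineSplitter`, the method `splitByLines`, and UTF-8 as the file encoding are hypothetical choices, and the output files follow the `prefix_N.extension` naming used above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class LineSplitter {
    /**
     * Writes at most maxLines lines into each of prefix_0.extension, prefix_1.extension, ...
     * next to the source file, and returns the number of files written.
     */
    static int splitByLines(Path source, String prefix, String extension, int maxLines) throws IOException {
        if (maxLines < 1) {
            throw new IllegalArgumentException("maxLines must be at least 1");
        }
        int fileCount = 0;
        int linesInCurrent = 0;
        BufferedWriter writer = null;
        // Assumption: the input is UTF-8 encoded.
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (writer == null) {
                    // Open the next output lazily so no empty trailing file is created.
                    Path target = source.resolveSibling(prefix + "_" + fileCount + "." + extension);
                    writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8);
                    fileCount++;
                }
                writer.write(line);
                writer.newLine();
                if (++linesInCurrent == maxLines) {
                    writer.close();
                    writer = null;
                    linesInCurrent = 0;
                }
            }
        } finally {
            if (writer != null) {
                writer.close(); // close a partially filled last file, also on errors
            }
        }
        return fileCount;
    }
}
```

Note that splitting by physical lines assumes no CSV field contains an embedded line break; quoted multi-line fields would need a real CSV parser.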
Splitting a text file into multiple files by specific character sequence
You want to find the lines which match ".I n", where n is a digit.
The regex you need is : ^.I \d$
^ indicates the beginning of the line. Hence, if there is whitespace or text before .I, the line will not match the regex.
\d indicates any digit. For the sake of simplicity, I allow only one digit in this regex.
$ indicates the end of the line. Hence, if there are any characters after the digit, the line will not match the expression.
Now, you need to read the file line by line and keep a reference to the file in which you write the current line.
Reading a file line by line is much easier in Java 8 with Files.lines();
private String currentFile = "root.txt";
public static final String REGEX = "^.I \\d$";
public void foo() throws Exception{
Path path = Paths.get("path/to/your/input/file.txt");
Files.lines(path).forEach(line -> {
if(line.matches(REGEX)) {
//Extract the digit and update currentFile
currentFile = "File DOC_ID_"+line.substring(3, line.length())+".txt";
System.out.println("Current file is now : " + currentFile);
} else {
System.out.println("Writing this line to "+currentFile + " :" + line);
//Files.write(...);
}
});
}
Note: To extract the digit, I use a raw "".substring(), which I consider evil, but it is easier to understand. You can do it in a better way with a Pattern and a Matcher.
With this regex: ".I (\\d)" (the same as before, but with parentheses indicating what you want to capture). Then:
Pattern pattern = Pattern.compile(".I (\\d)");
Matcher matcher = pattern.matcher(".I 3");
if(matcher.find()) {
System.out.println(matcher.group(1));//display "3"
}
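Putting the pieces together, the line-by-line routing with a capturing group might look like the following. This is a minimal sketch, not the answer's own code: the class `MarkerSplitter`, the method `split`, and the `outDir` parameter are hypothetical, and two details differ from the regex above by assumption: the dot is escaped so it matches a literal ".", and `\d+` accepts multi-digit ids.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkerSplitter {
    // Assumption: marker lines look like ".I 3"; the dot is escaped and several digits are allowed.
    private static final Pattern MARKER = Pattern.compile("^\\.I (\\d+)$");

    /** Routes each line of input to outDir/DOC_ID_<n>.txt, switching files at every marker line. */
    static void split(Path input, Path outDir) throws IOException {
        Path current = outDir.resolve("root.txt"); // lines before the first marker go here
        BufferedWriter writer = null;
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = MARKER.matcher(line);
                if (m.matches()) {
                    // Marker line: close the current output and switch to the next file.
                    if (writer != null) {
                        writer.close();
                        writer = null;
                    }
                    current = outDir.resolve("DOC_ID_" + m.group(1) + ".txt");
                } else {
                    if (writer == null) {
                        // Opened lazily; an existing file with the same name is overwritten.
                        writer = Files.newBufferedWriter(current);
                    }
                    writer.write(line);
                    writer.newLine();
                }
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
    }
}
```

Keeping one open writer per section avoids reopening the output file for every line, which a per-line `Files.write(...)` call would do.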
How to split one text into multiple text files
In addition to my comment, considering there isn't a space between the leading integer and the first word, the substring at the first space doesn't work.
This question/answer has a few options that should help, the one using regex (\d+) being the simplest one imo, and copied below.
Matcher matcher = Pattern.compile("\\d+").matcher(arrOfStr[i]);
matcher.find();
int yourNumber = Integer.valueOf(matcher.group());
Given a string find the first embedded occurrence of an integer
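Since `matcher.group()` throws an `IllegalStateException` when `find()` did not succeed, it may help to check the result of `find()` first. The sketch below wraps the same `\d+` approach in a small helper; the class name `FirstInteger` and the use of `OptionalInt` for the "no digits" case are my own assumptions, not part of the linked answer.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstInteger {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Returns the first run of digits in text as an int, or empty if there is none. */
    static OptionalInt find(String text) {
        Matcher matcher = DIGITS.matcher(text);
        if (!matcher.find()) {
            return OptionalInt.empty();
        }
        // Note: parseInt throws NumberFormatException if the digit run exceeds the int range.
        return OptionalInt.of(Integer.parseInt(matcher.group()));
    }
}
```

For an input such as `"12Hello world"`, where no space separates the number from the first word, `FirstInteger.find` yields 12; for a string without digits it yields an empty `OptionalInt`.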
Java: Read a File & Split a line into separate string
Check how your file ends. Scanner is clever enough to throw away a single line break at the end. However, if there is anything (like a space or another line break) afterwards, that is going to be a new line to be read.
In such cases
String descritpion = scan.nextLine();
will read an empty-ish string, then
String []temp = descritpion.split(":");
splits it into a single-item array, where
String name = temp[0];
contains the entire string (being empty or containing a single space or something), that's how it passes
but
String surname = temp[1];
does not exist, and that's why it throws an exception.
However, in such cases a line should appear on screen prior to the exception. See the test (with strings instead of files) here: https://ideone.com/ixo0kd - the no-line-break and single-line-break cases work fine, while the space-after-line-break and double-line-break cases throw the exception, but have an empty line displayed before.
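One way to guard against such trailing blank lines is to skip empty-ish lines and to check the array length before touching `temp[1]`. This is a minimal sketch assuming a `name:surname` line format; the class `NameParser` and the method `parse` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class NameParser {
    /** Reads "name:surname" lines, skipping blank lines and lines without a colon. */
    static List<String[]> parse(Scanner scan) {
        List<String[]> people = new ArrayList<>();
        while (scan.hasNextLine()) {
            String line = scan.nextLine().trim();
            if (line.isEmpty()) {
                continue; // trailing space or extra line break at the end of the file
            }
            String[] temp = line.split(":", 2);
            if (temp.length < 2) {
                continue; // no colon, so temp[1] would not exist
            }
            people.add(new String[] { temp[0], temp[1] });
        }
        return people;
    }
}
```

Whether malformed lines should be skipped silently, as here, or reported is a design choice that depends on how much the input can be trusted.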
JAVA Code to split CSV file into different CSV files and extracting a single column data from parent file to child files
Your logic is very bad here. I rewrote the whole code for you:
import java.io.*;
import java.util.Scanner;
public class FileSplit {
public static void myFunction(int lines, int files) throws FileNotFoundException, IOException{
String inputfile = "file.csv";
BufferedReader br = new BufferedReader(new FileReader(inputfile)); //reader for input file initialized only once
String strLine = null;
for (int i=1;i<=files;i++) {
FileWriter fstream1 = new FileWriter("FileNumber_"+i+".csv"); //creating a new file writer.
BufferedWriter out = new BufferedWriter(fstream1);
for(int j=0;j<lines;j++){ //iterating the reader to read only the first few lines of the csv as defined earlier
strLine = br.readLine();
if (strLine!= null) {
String strar[] = strLine.split(",");
out.write(strar[0]); //acquiring the first column
out.newLine();
}
}
out.close();
}
br.close(); //closing the input reader once all files are written
}
public static void main(String args[])
{
try{
int lines = 2; //set this to whatever number of lines you need in each file
int count = 0;
String inputfile = "file.csv";
File file = new File(inputfile);
Scanner scanner = new Scanner(file);
while (scanner.hasNextLine()) { //counting the lines in the input file
scanner.nextLine();
count++;
}
scanner.close(); //closing the scanner used for counting lines
System.out.println(count);
int files=0;
if((count%lines)==0){
files= (count/lines);
}
else{
files=(count/lines)+1;
}
System.out.println(files); //number of files that shall be created
myFunction(lines,files);
}
catch (FileNotFoundException e) {
e.printStackTrace();
}
catch (IOException e) {
e.printStackTrace();
}
}
}