Reading PSV (Pipe-Separated) File or String

Parsing a pipe-delimited file in Python

If you're parsing a very simple file that won't contain any | characters in the actual field values, you can use split:

fileHandle = open('file', 'r')

for line in fileHandle:
    fields = line.rstrip('\n').split('|')  # strip the newline so the last field is clean

    print(fields[0])  # prints the first field's value
    print(fields[1])  # prints the second field's value

fileHandle.close()

A more robust way to parse tabular data would be to use the csv library as mentioned in Spencer Rathbun's answer.
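As a minimal sketch of that approach, the standard-library csv module accepts a custom delimiter and, unlike a plain split, honors quoted fields that contain a literal | (the sample data below is made up for illustration):

```python
import csv
import io

# Sample pipe-delimited record; the quoted field contains a literal "|"
# that a naive split('|') would break into two pieces.
data = 'Ada|"Engineer|Lead"|ok\n'

for fields in csv.reader(io.StringIO(data), delimiter='|'):
    print(fields)  # ['Ada', 'Engineer|Lead', 'ok']
```

To read from a file instead of a string, pass csv.reader a file object opened with newline=''.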

Reading PSV (pipe-separated) file or string

We can use read.table to read a *.psv file.

read.table("myfile.psv", sep = "|", header = FALSE, stringsAsFactors = FALSE)

There may be different representations of a PSV file, but in data mining it usually means a "pipe-separated" file: the fields in each record are separated by "|".

How can Spark read a pipe-delimited text file which doesn't have a file extension

Instead of .textFile use .csv.


Using .csv:

spark.read.option("delimiter","|").option("header","false").csv("books").show()
//+---+-------------+--------------------+----------+--------------+------+
//|_c0| _c1| _c2| _c3| _c4| _c5|
//+---+-------------+--------------------+----------+--------------+------+
//| 0|3-88623-803-7| GARDENING|2003-11-07| Editora FTD|174.99|
//| 1|5-72448-672-4|TECHNOLOGY-ENGINE...|2012-08-08|Wolters Kluwer|140.99|
//| 2|7-64433-458-3| SOCIAL-SCIENCE|2015-11-14| Bungeishunju| 7.99|
//| 3|1-18243-251-3| MATHEMATICS|1997-02-22|Hachette Livre| 34.99|
//+---+-------------+--------------------+----------+--------------+------+

Using .textFile:

.textFile returns an RDD[String], so we need to split each line on "|" with .split and then convert the RDD to a DataFrame with .map and .toDF.

spark.read.textFile("books").map(x => x.split("\\|")).
map(x =>(x(0),x(1),x(2),x(3),x(4),x(5))).
toDF().
show()
//+---+-------------+--------------------+----------+--------------+------+
//| _1| _2| _3| _4| _5| _6|
//+---+-------------+--------------------+----------+--------------+------+
//| 0|3-88623-803-7| GARDENING|2003-11-07| Editora FTD|174.99|
//| 1|5-72448-672-4|TECHNOLOGY-ENGINE...|2012-08-08|Wolters Kluwer|140.99|
//| 2|7-64433-458-3| SOCIAL-SCIENCE|2015-11-14| Bungeishunju| 7.99|
//| 3|1-18243-251-3| MATHEMATICS|1997-02-22|Hachette Livre| 34.99|
//+---+-------------+--------------------+----------+--------------+------+

How to read both comma-separated and pipe-separated CSV files in a single item reader in Spring Batch

Take a look at the PatternMatchingCompositeLineTokenizer. There, you can use a Pattern to identify what records get parsed by what LineTokenizer. In your case, you'd have one Pattern that identifies comma-delimited records and maps them to the tokenizer that splits on commas. You'd also have a Pattern that identifies pipe-delimited records and maps those to the appropriate LineTokenizer. It would look something like this:

@Bean
public LineTokenizer compositeLineTokenizer() throws Exception {
    DelimitedLineTokenizer commaTokenizer = new DelimitedLineTokenizer();

    commaTokenizer.setNames("a", "b", "c");
    commaTokenizer.setDelimiter(",");
    commaTokenizer.afterPropertiesSet();

    DelimitedLineTokenizer pipeTokenizer = new DelimitedLineTokenizer();

    pipeTokenizer.setNames("a", "b", "c");
    pipeTokenizer.setDelimiter("|");
    pipeTokenizer.afterPropertiesSet();

    // I have not tested the patterns here so they may need to be adjusted
    Map<String, LineTokenizer> tokenizers = new HashMap<>(2);
    tokenizers.put("*,*", commaTokenizer);
    tokenizers.put("*|*", pipeTokenizer);

    PatternMatchingCompositeLineTokenizer lineTokenizer = new PatternMatchingCompositeLineTokenizer();

    lineTokenizer.setTokenizers(tokenizers);

    return lineTokenizer;
}

Hive - Load pipe delimited data starting with pipe

You can solve this in two ways.

  1. Remove the first column before processing the file. This is the clean and preferable solution.
cut -d "|" -f 2- input_filename > output_filename

Then use this output_filename as the input to your load process.

-d "|" - use pipe as the delimiter.
-f 2- - extract everything from the second field onward.
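If a shell step isn't an option, the same cut transformation can be sketched in Python (the sample lines here are made up):

```python
lines = ['|1|GARDENING|174.99', '|2|SOCIAL-SCIENCE|7.99']

# Equivalent of `cut -d "|" -f 2-`: drop everything up to and including the
# first "|"; lines without a delimiter pass through unchanged, as cut does.
stripped = [line.split('|', 1)[1] if '|' in line else line for line in lines]

print(stripped)  # ['1|GARDENING|174.99', '2|SOCIAL-SCIENCE|7.99']
```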


  2. Add a dummy column at the beginning of the table, like this:
CREATE table TEST_1 (
dummy string,
COL1 string,
COL2 string,
COL3 string,
COL4 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

Then proceed with loading the data.
You can then ignore this dummy column, store the data into a final table without it, or create a view on top of this table that excludes the dummy column.
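The view from the last option might look like this (the view name is an assumption; TEST_1 is the table defined above):

```sql
-- Hide the dummy column behind a view; query the view instead of TEST_1
CREATE VIEW test_1_view AS
SELECT COL1, COL2, COL3, COL4
FROM TEST_1;
```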

Convert csv file to pipe delimited file in Python

This does what I think you want:

import csv

with open('C:/Path/InputFile.csv', 'r', newline='') as fin, \
     open('C:/Path/OutputFile.txt', 'w', newline='') as fout:
    reader = csv.DictReader(fin)
    writer = csv.DictWriter(fout, reader.fieldnames, delimiter='|')
    writer.writeheader()
    writer.writerows(reader)

