Parsing a pipe-delimited file in Python
If you're parsing a very simple file that won't contain any | characters in the actual field values, you can use split():
fileHandle = open('file', 'r')
for line in fileHandle:
    fields = line.rstrip('\n').split('|')  # strip the trailing newline so it doesn't end up in the last field
    print(fields[0])  # prints the first field's value
    print(fields[1])  # prints the second field's value
fileHandle.close()
A more robust way to parse tabular data is to use the csv module, as mentioned in Spencer Rathbun's answer.
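For example, a minimal sketch using the standard csv module with a pipe delimiter (the file name data.psv and the two-column layout are assumptions):
import csv

with open('data.psv', newline='') as f:  # hypothetical input file
    reader = csv.reader(f, delimiter='|')
    for fields in reader:
        print(fields[0])  # first field's value
        print(fields[1])  # second field's value
Unlike a plain split, the csv reader also copes with quoted field values that contain the delimiter.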
Reading PSV (pipe-separated) file or string
We can use read.table to read a *.psv file:
read.table("myfile.psv", sep = "|", header = FALSE, stringsAsFactors = FALSE)
There may be different representations of a .psv file, but in a data-mining context it usually means a "pipe-separated" file, i.e. one whose fields are separated by "|".
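Since the question title also mentions strings, note that read.table can parse a pipe-separated string directly through its text argument; a small sketch with made-up sample data:
read.table(text = "a|b|c\n1|2|3", sep = "|", header = TRUE, stringsAsFactors = FALSE)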
How can Spark read a pipe-delimited text file which doesn't have a file extension
Instead of .textFile, use .csv.
Using .csv:
spark.read.option("delimiter","|").option("header","false").csv("books").show()
//+---+-------------+--------------------+----------+--------------+------+
//|_c0| _c1| _c2| _c3| _c4| _c5|
//+---+-------------+--------------------+----------+--------------+------+
//| 0|3-88623-803-7| GARDENING|2003-11-07| Editora FTD|174.99|
//| 1|5-72448-672-4|TECHNOLOGY-ENGINE...|2012-08-08|Wolters Kluwer|140.99|
//| 2|7-64433-458-3| SOCIAL-SCIENCE|2015-11-14| Bungeishunju| 7.99|
//| 3|1-18243-251-3| MATHEMATICS|1997-02-22|Hachette Livre| 34.99|
//+---+-------------+--------------------+----------+--------------+------+
Using .textFile:
spark.read.textFile returns a Dataset[String], so we need to split each line on "|" and map the fields into a tuple before converting to a DataFrame (in a standalone application this also requires import spark.implicits._ for toDF; the spark-shell imports it automatically):
spark.read.textFile("books").
  map(x => x.split("\\|")).
  map(x => (x(0), x(1), x(2), x(3), x(4), x(5))).
  toDF().
  show()
//+---+-------------+--------------------+----------+--------------+------+
//| _1| _2| _3| _4| _5| _6|
//+---+-------------+--------------------+----------+--------------+------+
//| 0|3-88623-803-7| GARDENING|2003-11-07| Editora FTD|174.99|
//| 1|5-72448-672-4|TECHNOLOGY-ENGINE...|2012-08-08|Wolters Kluwer|140.99|
//| 2|7-64433-458-3| SOCIAL-SCIENCE|2015-11-14| Bungeishunju| 7.99|
//| 3|1-18243-251-3| MATHEMATICS|1997-02-22|Hachette Livre| 34.99|
//+---+-------------+--------------------+----------+--------------+------+
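Since the .csv reader selects files by path rather than by extension, the missing extension is not a problem; you can also rename the generic _c0.._c5 columns afterwards with toDF. A small sketch, where the column names are only guesses based on the sample output:
spark.read.option("delimiter","|").csv("books").
  toDF("id", "isbn", "category", "published", "publisher", "price").
  show()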
How to read both comma-separated and pipe-separated CSV files in a single item reader in Spring Batch
Take a look at the PatternMatchingCompositeLineTokenizer. There, you can use a Pattern to identify which records get parsed by which LineTokenizer. In your case, you'd have one Pattern that identifies comma-delimited records and maps them to the tokenizer that parses via commas. You'd also have a Pattern that identifies records delimited by pipes and maps those to the appropriate LineTokenizer. It would look something like this:
@Bean
public LineTokenizer compositeLineTokenizer() throws Exception {
    DelimitedLineTokenizer commaTokenizer = new DelimitedLineTokenizer();
    commaTokenizer.setNames("a", "b", "c");
    commaTokenizer.setDelimiter(",");
    commaTokenizer.afterPropertiesSet();

    DelimitedLineTokenizer pipeTokenizer = new DelimitedLineTokenizer();
    pipeTokenizer.setNames("a", "b", "c");
    pipeTokenizer.setDelimiter("|");
    pipeTokenizer.afterPropertiesSet();

    // I have not tested the patterns here so they may need to be adjusted
    Map<String, LineTokenizer> tokenizers = new HashMap<>(2);
    tokenizers.put("*,*", commaTokenizer);
    tokenizers.put("*|*", pipeTokenizer);

    PatternMatchingCompositeLineTokenizer lineTokenizer = new PatternMatchingCompositeLineTokenizer();
    lineTokenizer.setTokenizers(tokenizers);
    return lineTokenizer;
}
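To actually use this tokenizer, it has to be wired into a line mapper and reader. A minimal, untested sketch; the resource path is a placeholder, and PassThroughFieldSetMapper simply hands the raw FieldSet to your processor:
@Bean
public FlatFileItemReader<FieldSet> itemReader(LineTokenizer compositeLineTokenizer) {
    DefaultLineMapper<FieldSet> lineMapper = new DefaultLineMapper<>();
    lineMapper.setLineTokenizer(compositeLineTokenizer);
    lineMapper.setFieldSetMapper(new PassThroughFieldSetMapper());

    FlatFileItemReader<FieldSet> reader = new FlatFileItemReader<>();
    reader.setResource(new FileSystemResource("input.txt")); // placeholder path
    reader.setLineMapper(lineMapper);
    return reader;
}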
Hive - Load pipe-delimited data starting with a pipe
You can solve this in two ways.
- Remove the first column before processing the file. This is the cleaner, preferable solution:
cut -d "|" -f 2- input_filename > output_filename
Then use this output_filename as the input to your load process.
-d "|" - use pipe as the delimiter.
-f 2- - extract everything from the second field onward (the leading pipe makes the first field empty).
- Add a dummy column at the beginning of the table, like this:
CREATE TABLE TEST_1 (
    dummy STRING,
    COL1 STRING,
    COL2 STRING,
    COL3 STRING,
    COL4 STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';
Then proceed with loading the data. Afterwards you can ignore the dummy column, store the data into a final table without it, or create a view on top of this table that excludes it, as sketched below.
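For example, a view that hides the dummy column might look like this (the view name is a placeholder):
CREATE VIEW test_1_no_dummy AS
SELECT col1, col2, col3, col4
FROM test_1;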
Convert a CSV file to a pipe-delimited file in Python
This does what I think you want:
import csv
# Python 3: open csv files in text mode with newline=''
with open('C:/Path/InputFile.csv', newline='') as fin, \
        open('C:/Path/OutputFile.txt', 'w', newline='') as fout:
    reader = csv.DictReader(fin)
    writer = csv.DictWriter(fout, reader.fieldnames, delimiter='|')
    writer.writeheader()
    writer.writerows(reader)
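One advantage of going through csv.DictWriter rather than a plain string replace is that any field value which itself contains a | is quoted on output, so the converted file stays parseable.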