Parsing a JSON string which was loaded from a CSV using Pandas
There is a slightly easier way, but ultimately you'll have to call json.loads. pandas.read_csv has the notion of a converter:
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can either be integers or column labels
So first define your custom parser. In this case the below should work:
def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1
In your case you'll have something like:
df = pandas.read_csv(f1, converters={'stats':CustomParser},header=0)
We are telling read_csv to read the data in the standard way, but to use our custom parser for the stats column. Each value in the stats column will then be a dict.
From here, we can use a little hack to directly append these columns in one step with the appropriate column names. This will only work for regular data (the json object needs to have 3 values or at least missing values need to be handled in our CustomParser)
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
On the Left Hand Side, we get the new column names from the keys of the element of the stats column. Each element in the stats column is a dictionary. So we are doing a bulk assign. On the Right Hand Side, we break up the 'stats' column using apply to make a data frame out of each key/value pair.
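The bulk assignment above relies on the sorted key list lining up with the column order that apply(pandas.Series) returns, and depending on the pandas version that order may follow the dicts' own keys instead. A minimal sketch of an alternative that takes the column names straight from the keys, assuming a stats column of flat dicts (the sample rows and key names below are made up for illustration):

```python
import io
import json

import pandas as pd

# Hypothetical sample data: the JSON cell is quoted and its inner quotes are doubled.
csv_text = (
    'name,stats\n'
    'a,"{""wins"": 3, ""losses"": 1, ""draws"": 0}"\n'
    'b,"{""wins"": 5, ""losses"": 2}"\n'
)

df = pd.read_csv(io.StringIO(csv_text), converters={"stats": json.loads})

# One column per key, named after the key; keys missing from a row become NaN.
expanded = pd.DataFrame(df["stats"].tolist(), index=df.index)
result = df.drop(columns=["stats"]).join(expanded)
print(result)
```

Because the names come from the dicts, a row with a missing key no longer shifts values into the wrong column.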
Read CSV with JSON feature
The problem here is that the commas inside your json string are being treated as delimiters. You should modify the input data (if you don't have direct access to the file, you can always read the contents into a list of strings using open first).
Here are a few modification options that you can try:
Option 1: Quote json string with single quote
Use a single quote (or another character that doesn't otherwise appear in your data) as a quote character for your json string.
>> cat data.csv
Time,location,labelA,labelB
2019-09-10,'{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}',nan,nan
Then use quotechar="'" when you read the data:
import pandas as pd
import json
df=pd.read_csv('data.csv', converters={'location':json.loads}, header=0, quotechar="'")
Option 2: Quote json string with double quote and escape
If the single quote can't be used, you can actually use the double quote as the quotechar, as long as you escape the quotes inside the json string:
>> cat data.csv
Time,location,labelA,labelB
2019-09-10,"{""lng"":12.9,""alt"":413.0,""time"":""2019-09-10"",""error"":7.0,""lat"":17.8}",nan,nan
Notice that this now matches the format of the question you linked.
df=pd.read_csv('data.csv', converters={'location':json.loads}, header=0, quotechar='"')
Option 3: Change the delimiter
Use a different character, for example the |, as the delimiter:
>> cat data.csv
Time|location|labelA|labelB
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
Now use the sep argument to specify the new delimiter:
df=pd.read_csv('data.csv', converters={'location':json.loads}, header=0, sep="|")
Each of these methods produces the same output:
print(df)
# Time location labelA labelB
#0 2019-09-10 {u'lat': 17.8, u'lng': 12.9, u'error': 7.0, u'... NaN NaN
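Once you have that, you can expand the location column using one of the methods described in Flatten JSON column in a Pandas DataFrame (json_normalize is available as pd.json_normalize since pandas 1.0; older releases expose it as pd.io.json.json_normalize):
new_df = df.join(pd.json_normalize(df["location"])).drop(["location"], axis=1)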
print(new_df)
# Time labelA labelB alt error lat lng time
#0 2019-09-10 NaN NaN 413.0 7.0 17.8 12.9 2019-09-10
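To check that the three options really are interchangeable, here is a small sketch that builds each variant of the sample row in memory (io.StringIO stands in for data.csv, which is an assumption made only so the example is self-contained) and then flattens the parsed column:

```python
import io
import json

import pandas as pd

location = '{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}'
header = "Time,location,labelA,labelB\n"

# Option 1: JSON cell wrapped in single quotes.
opt1 = header + "2019-09-10,'" + location + "',nan,nan\n"
# Option 2: JSON cell wrapped in double quotes, inner quotes doubled.
opt2 = header + '2019-09-10,"' + location.replace('"', '""') + '",nan,nan\n'
# Option 3: pipe-delimited, JSON cell left unquoted.
opt3 = header.replace(",", "|") + "2019-09-10|" + location + "|nan|nan\n"

conv = {"location": json.loads}
df1 = pd.read_csv(io.StringIO(opt1), converters=conv, header=0, quotechar="'")
df2 = pd.read_csv(io.StringIO(opt2), converters=conv, header=0, quotechar='"')
df3 = pd.read_csv(io.StringIO(opt3), converters=conv, header=0, sep="|")

# All three reads yield the same dict in the location column.
same = df1.loc[0, "location"] == df2.loc[0, "location"] == df3.loc[0, "location"]
print(same)

# Flatten the dict column into regular columns.
flat = df2.drop(columns=["location"]).join(pd.json_normalize(df2["location"].tolist()))
print(flat)
```

Passing a plain list of dicts to pd.json_normalize is a deliberate choice here, since that input type is accepted by every pandas release that has the function.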
How do I parse a json string in a csv column and break it down into multiple columns?
You should load the CSV file ignoring the JSON string at this time.
Then you convert the column to a json list and normalize it:
tmp = pd.json_normalize(InterimReport['REQUEST_RE'].apply(json.loads).tolist()).rename(
    columns=lambda x: x.replace('Attributes.', ''))
You should get something like:
Fruit Cost ID Country
0 Apple 1.5 001 America
1 Orange 2.0 002 China
That you can easily concat to the original dataframe:
InterimReport = pd.concat([InterimReport.drop(columns=['REQUEST_RE']), tmp], axis=1)
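Put together, the steps above might look like the following sketch. The ORDER column and the exact shape of the JSON (a nested Attributes object holding ID and Country) are assumptions chosen to reproduce the sample output; the real file may differ.

```python
import json

import pandas as pd

# Hypothetical stand-in for the frame loaded from the CSV file.
InterimReport = pd.DataFrame({
    "ORDER": [10, 11],
    "REQUEST_RE": [
        '{"Fruit": "Apple", "Cost": 1.5, "Attributes": {"ID": "001", "Country": "America"}}',
        '{"Fruit": "Orange", "Cost": 2.0, "Attributes": {"ID": "002", "Country": "China"}}',
    ],
})

# Decode every cell, flatten the nested dicts, then strip the "Attributes." prefix.
parsed = InterimReport["REQUEST_RE"].apply(json.loads).tolist()
tmp = pd.json_normalize(parsed).rename(columns=lambda x: x.replace("Attributes.", ""))

# Swap the raw JSON column for the flattened columns.
out = pd.concat([InterimReport.drop(columns=["REQUEST_RE"]), tmp], axis=1)
print(out)
```

The concat lines up rows by index, so it assumes InterimReport still has its default 0..n-1 index at this point.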
Parsing a JSON string enclosed with quotation marks from a CSV using Pandas
As an alternative approach you could read the file in manually, parse each row correctly and use the resulting data to construct the dataframe. This works by splitting the row both forwards and backwards to get the non-problematic columns and then taking the remaining part:
import pandas as pd
data = []
with open("e1.csv") as f_input:
    for row in f_input:
        row = row.strip()
        split = row.split(',', 2)
        rsplit = [cell.strip('"') for cell in split[-1].rsplit(',', 1)]
        data.append(split[0:2] + rsplit)
df = pd.DataFrame(data[1:], columns=data[0])
print(df)
This would display your data as:
id employee details createdAt
0 1 John {"Country":"USA","Salary":5000,"Review":null} 2018-09-01
1 2 Sarah {"Country":"Australia", "Salary":6000,"Review"... 2018-09-05
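The details column is still a string at this point. A minimal sketch of the same split wrapped in a helper, followed by json.loads on the extracted cell; split_row is a hypothetical name, and the in-memory lines stand in for e1.csv:

```python
import json

import pandas as pd


def split_row(row):
    """Split 'id,employee,<json>,createdAt' around the JSON cell (hypothetical helper)."""
    head = row.strip().split(",", 2)   # id, employee, everything else
    tail = head[-1].rsplit(",", 1)     # JSON cell (still quoted), createdAt
    return head[:2] + [cell.strip('"') for cell in tail]


# Stand-in for the lines of e1.csv.
lines = [
    'id,employee,details,createdAt',
    '1,John,"{"Country":"USA","Salary":5000,"Review":null}",2018-09-01',
]

rows = [split_row(line) for line in lines]
df = pd.DataFrame(rows[1:], columns=rows[0])

# Turn the extracted JSON text into real dicts.
df["details"] = df["details"].apply(json.loads)
print(df)
```

This only works while the JSON cell is the single column that may contain commas, which is the same assumption the split itself makes.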
Parse json string in csv file
The data you have is using \" to escape a double quote within each cell. This behaviour can be specified by setting both doublequote=True and escapechar='\\' as parameters as follows:
df = pd.read_csv('input.json', doublequote=True, escapechar='\\')
print(df)
Giving you something like:
0 1 2
0 file1 {"A1": {"a": "123"}, "B1": {"b1": "456", "b2":...
1 file2 {"A2": {"a": "321"}, "B2": {"b1": "654", "b2":... None
file1 {"A1": {"a": "123"}, "B1": {"b1": "456", "b2": "789", "b3": "000"}} \
0 file2 {"A2": {"a": "321"}, "B2": {"b1": "654", "b2":...
Unnamed: 2
0 NaN
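For a self-contained check of the escapechar behaviour, here is a small sketch with two made-up rows in the same style (a file name followed by a backslash-escaped JSON cell). header=None is an assumption, since the sample has no header row, and io.StringIO replaces the input file:

```python
import io
import json

import pandas as pd

# Hypothetical rows: double quotes inside the quoted cell are escaped as \"
raw = (
    r'file1,"{\"A1\": {\"a\": \"123\"}}"' + "\n"
    r'file2,"{\"A2\": {\"a\": \"321\"}}"' + "\n"
)

df = pd.read_csv(io.StringIO(raw), header=None, doublequote=True, escapechar="\\")

# The backslashes are consumed by the parser, leaving valid JSON text in column 1.
df[1] = df[1].apply(json.loads)
print(df)
```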
Using Pandas to parse a JSON column w/nested values in a huge CSV
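Just use json_normalize() on the series directly, and then use pandas.concat() to merge the new dataframe with the existing dataframe. Pass axis=1 so the new columns are placed next to the existing ones (the default, axis=0, would stack the rows instead):
pd.concat([df, json_normalize(df['Metadata'])], axis=1)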
You can add a .drop('Metadata', axis=1) if you no longer need the old column with the JSON datastructure in it.
The columns produced for the my_custom_data nested dictionary will have my_custom_data. prefixed. If all the names in that nested dictionary are unique, you could drop that prefix with a DataFrame.rename() operation:
json_normalize(df['Metadata']).rename(
    columns=lambda n: n[15:] if n.startswith('my_custom_data.') else n)
If you are using some other means to convert each dictionary value to a flattened structure (say, with flatten_json), then you want to use Series.apply() to process each value and then return each resulting dictionary as a pandas.Series() object:
def some_conversion_function(dictionary):
    result = something_that_processes_dictionary_into_a_flat_dict(dictionary)
    return pd.Series(result)
You can then concatenate the result of the Series.apply() call (which will be a dataframe) back onto your original dataframe, along the column axis:
pd.concat([df, df['Metadata'].apply(some_conversion_function)], axis=1)
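As a rough illustration of the custom-conversion route, the sketch below uses a hand-written flatten function (hypothetical, not part of pandas or flatten_json) and builds the new frame from the list of flat dicts rather than from one Series per row, which tends to be cheaper on large inputs. The id and Metadata values are invented sample data.

```python
import pandas as pd


def flatten(dictionary, prefix=""):
    """Hypothetical flattener: joins nested keys with '.'."""
    flat = {}
    for key, value in dictionary.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


# Made-up frame whose Metadata column already holds parsed dicts.
df = pd.DataFrame({
    "id": [1, 2],
    "Metadata": [
        {"source": "a", "my_custom_data": {"x": 1}},
        {"source": "b", "my_custom_data": {"x": 2, "y": 3}},
    ],
})

# Flatten each dict, then build all new columns in one go, keeping the row index.
flat_dicts = df["Metadata"].apply(flatten)
wide = pd.DataFrame(flat_dicts.tolist(), index=df.index)

# axis=1 puts the new columns beside the existing ones.
combined = pd.concat([df.drop("Metadata", axis=1), wide], axis=1)
print(combined)
```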
Pandas read CSV with embedded JSON into dataframe
- The issue is, the 'visits' column is str type (e.g. '{"ABCD":9,"DEFG":8,"ASDF":6}').
- When loading the csv with .read_csv, use the converters parameter to apply ast.literal_eval to the 'visits' column, which will convert the str to a dict.
- converters: Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
from ast import literal_eval
import pandas as pd
# load the csv using the converters parameter with literal_eval
df2 = pd.read_csv('test_visits.csv', converters={'visits': literal_eval})
# normalize the visits, join it to location_id and drop the visits column
df2 = df2.join(pd.json_normalize(df2.visits)).drop(columns=['visits'])
# display(df2)
location_id ABCD DEFG ASDF XYZR
0 1 9.0 8.0 6.0 NaN
1 2 4.0 NaN NaN 4.0
2 3 NaN NaN 4.0 NaN
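One caveat worth knowing: literal_eval parses Python literals, so it works for cells like the ones above but fails on JSON-only tokens such as null, true, or false. A short sketch of that difference, using made-up rows and json.loads as the converter instead:

```python
import io
import json
from ast import literal_eval

import pandas as pd

# literal_eval rejects JSON's null because it is not a Python literal.
cell = '{"ABCD": 9, "DEFG": null}'
try:
    literal_eval(cell)
    literal_eval_failed = False
except ValueError:
    literal_eval_failed = True
print(literal_eval_failed)

# Hypothetical file contents containing null and true.
csv_text = (
    'location_id,visits\n'
    '1,"{""ABCD"": 9, ""DEFG"": null}"\n'
    '2,"{""ABCD"": 4, ""XYZR"": true}"\n'
)

# json.loads maps null to None and true to True.
df2 = pd.read_csv(io.StringIO(csv_text), converters={"visits": json.loads})
flat = df2.drop(columns=["visits"]).join(pd.json_normalize(df2["visits"].tolist()))
print(flat)
```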
Parse CSV column contains mixed values as string and JSON using python pandas
I would just build a dataframe containing the new columns by hand and join it to the first one. Unfortunately you have not provided copyable data so I just used mine.
Original df:
df = pd.DataFrame({'ref': ['Outcomes', 'API-TEST', '{"from":"abc", "to": "def"}',
                           'Manual(add)', '{"from": "gh", "to": "ij"}', 'Migration']})
Giving:
ref
0 Outcomes
1 API-TEST
2 {"from":"abc", "to": "def"}
3 Manual(add)
4 {"from": "gh", "to": "ij"}
5 Migration
Extract only json data from ref column:
import json
data = []     # future data of the dataframe
ix = []       # future index
cols = set()  # future columns
for name, s in df[['ref']].iterrows():
    try:
        d = json.loads(s['ref'])
        ix.append(name)             # if we could decode, feed the future dataframe
        cols.update(set(d.keys()))
        data.append(d)
    except json.JSONDecodeError:
        pass                        # else ignore the line
# newer pandas versions may reject a set for columns, so pass a list
df = df.join(pd.DataFrame(data, ix, list(cols)), how='left')
gives:
ref to from
0 Outcomes NaN NaN
1 API-TEST NaN NaN
2 {"from":"abc", "to": "def"} def abc
3 Manual(add) NaN NaN
4 {"from": "gh", "to": "ij"} ij gh
5 Migration NaN NaN
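The same idea can be written with a small helper and Series.apply instead of an explicit loop. This is only a sketch: parse_or_none is a hypothetical name, and treating anything that is not a JSON object as "no data" is an assumption about what the mixed column should yield.

```python
import json

import pandas as pd


def parse_or_none(value):
    """Return the decoded dict, or None when the cell is not a JSON object."""
    try:
        parsed = json.loads(value)
    except (TypeError, json.JSONDecodeError):
        return None
    return parsed if isinstance(parsed, dict) else None


df = pd.DataFrame({'ref': ['Outcomes', 'API-TEST', '{"from":"abc", "to": "def"}',
                           'Manual(add)', '{"from": "gh", "to": "ij"}', 'Migration']})

# Keep only the rows that decoded, preserving their original index labels.
parsed = df["ref"].apply(parse_or_none).dropna()
extra = pd.DataFrame(parsed.tolist(), index=parsed.index)

# A left join fills the non-JSON rows with NaN.
out = df.join(extra, how="left")
print(out)
```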
Python - Function for parsing key-value pairs into DataFrame columns
It looks like someone scraped JavaScript code and saved it as a CSV string.
"1, {""key"": ""construction_year"", ""value"": 1900}, {""key"": ""available_date"", ""value"": ""Vereinbarung""}"
The CSV string needs to be converted back to a normal string and then parsed.
Alternatively, the text in each line can be changed into correct JSON data
[1, {"key": "construction_year", "value": 1900}, {"key": "available_date", "value": "Vereinbarung"}]
which can be converted to 3 columns.
And later you can convert dictionaries to one dictionary
[1, {'construction_year': 1900, 'available_date': 'Vereinbarung'}]
which can be converted to columns using pandas and .apply(pd.Series)
I use text as a string here, but you could read it from a file
text = '''"1, {""key"": ""construction_year"", ""value"": 1900}, {""key"": ""available_date"", ""value"": ""Vereinbarung""}"
"2, {""key"": ""available_date"", ""value"": ""01.04.2022""}, {""key"": ""useful_area"", ""value"": 60.0}"
"3, {""key"": ""construction_year"", ""value"": 2020}, {""key"": ""available_date"", ""value"": ""sofort""}"
"4, {""key"": ""available_date"", ""value"": ""Vereinbarung""}, {""key"": ""wheelchair_accessible"", ""value"": true}"
'''
import json
import pandas as pd
#text = open('data.csv').read()
rows = []
for line in text.splitlines():
    line = line.replace('""', '"')         # undo the CSV quote doubling
    line = '[' + line[1:-1] + ']'          # drop outer quotes, wrap as a JSON list
    line = json.loads(line)
    item = {}
    for d in line[1:]:                     # merge the key/value dicts into one dict
        key = d['key']
        val = d['value']
        item[key] = val
    rows.append( [line[0], item] )
df = pd.DataFrame(rows, columns=['id', 'data'])
# convert dictionaries to columns
df = df.join(df['data'].apply(pd.Series))
# remove column with dictionaries
del df['data']
print(df.to_string())
Result:
id construction_year available_date useful_area wheelchair_accessible
0 1 1900.0 Vereinbarung NaN NaN
1 2 NaN 01.04.2022 60.0 NaN
2 3 2020.0 sofort NaN NaN
3 4 NaN Vereinbarung NaN True
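Since the question asks for a function, here is one possible sketch of the same steps. It lets the csv module undo the quoting instead of replacing '""' by hand; parse_pairs is a hypothetical name, and the two sample lines are taken from the text above.

```python
import csv
import io
import json

import pandas as pd


def parse_pairs(text):
    """Turn lines of 'id, {key/value}, {key/value}, ...' into one row per id."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        # Each record is a single CSV field; wrapping it in [] makes it a JSON list.
        first, *pairs = json.loads("[" + record[0] + "]")
        row = {"id": first}
        row.update({pair["key"]: pair["value"] for pair in pairs})
        rows.append(row)
    return pd.DataFrame(rows)


sample = (
    '"1, {""key"": ""construction_year"", ""value"": 1900}, '
    '{""key"": ""available_date"", ""value"": ""Vereinbarung""}"\n'
    '"2, {""key"": ""available_date"", ""value"": ""01.04.2022""}, '
    '{""key"": ""useful_area"", ""value"": 60.0}"\n'
)

df = parse_pairs(sample)
print(df.to_string())
```

Building a list of flat dicts and passing it to pd.DataFrame avoids the intermediate data column altogether.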