Parsing a JSON String Which Was Loaded from a CSV Using Pandas

There is a slightly easier way, but ultimately you'll have to call json.loads. pandas.read_csv has the notion of a converter:

converters : dict, optional

Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

So first define your custom parser. In this case the below should work:

import json

def CustomParser(data):
    # Parse the JSON string in a single cell into a Python object
    return json.loads(data)

In your case you'll have something like:

df = pandas.read_csv(f1, converters={'stats':CustomParser},header=0)

We are telling read_csv to read the data in the standard way, but to use our custom parser for the stats column. Each value in the stats column will then be a dict.

From here, we can use a little hack to directly append these columns in one step with the appropriate column names. This will only work for regular data (the json object needs to have 3 values or at least missing values need to be handled in our CustomParser)

df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)

On the left-hand side, we get the new column names from the keys of the first element of the stats column. Each element in the stats column is a dictionary, so we are doing a bulk assignment. On the right-hand side, we break up the 'stats' column using apply to make a data frame with one column per key.
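
The bulk assignment depends on the sorted key names lining up with the column order that apply(pandas.Series) produces. A minimal sketch that sidesteps that assumption by letting pandas take the column names from the dicts themselves, assuming every value in stats is already a dict (the sample frame and its key names are made up for illustration):

```python
import pandas as pd

# Hypothetical data standing in for the parsed 'stats' column.
df = pd.DataFrame({
    "name": ["a", "b"],
    "stats": [
        {"wins": 3, "losses": 1, "draws": 0},
        {"wins": 5, "losses": 2, "draws": 4},
    ],
})

# Build one column per key; reuse df's index so the join lines up row by row.
expanded = pd.DataFrame(df["stats"].tolist(), index=df.index)
df = df.join(expanded)
```

Because the column labels come from the dict keys, each value ends up under its own name regardless of key order.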

Read CSV with JSON feature

The problem here is that the commas inside your json string are being treated as delimiters. You should modify the input data (if you don't have direct access to the file, you can always read the contents into a list of strings using open first).

Here are a few modification options that you can try:

Option 1: Quote json string with single quote

Use a single quote (or another character that doesn't otherwise appear in your data) as a quote character for your json string.

>> cat data.csv
Time,location,labelA,labelB
2019-09-10,'{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}',nan,nan

Then use quotechar="'" when you read the data:

import pandas as pd
import json

df=pd.read_csv('data.csv', converters={'location':json.loads}, header=0, quotechar="'")

Option 2: Quote json string with double quote and escape

If the single quote can't be used, you can use the double quote as the quotechar, as long as you escape the quotes inside the json string by doubling them:

>> cat data.csv
Time,location,labelA,labelB
2019-09-10,"{""lng"":12.9,""alt"":413.0,""time"":""2019-09-10"",""error"":7.0,""lat"":17.8}",nan,nan

Notice that this now matches the format of the question you linked.

df=pd.read_csv('data.csv', converters={'location':json.loads}, header=0, quotechar='"')
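
If you control how the file is written, Python's csv module produces this doubled-quote escaping by default. A small sketch, using an in-memory buffer instead of data.csv (the buffer and variable names are only for illustration):

```python
import csv
import io
import json

import pandas as pd

location = {"lng": 12.9, "alt": 413.0, "time": "2019-09-10", "error": 7.0, "lat": 17.8}

# csv.writer quotes any field that contains the delimiter or the quote
# character, and doubles the embedded double quotes.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Time", "location", "labelA", "labelB"])
writer.writerow(["2019-09-10", json.dumps(location), "nan", "nan"])
csv_text = buffer.getvalue()

# Read it back; the converter turns the JSON string into a dict again.
df = pd.read_csv(io.StringIO(csv_text), converters={"location": json.loads})
```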

Option 3: Change the delimiter

Use a different character as the delimiter, for example |, which does not appear inside the json string.

>> cat data.csv
Time|location|labelA|labelB
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan

Now use the sep argument to specify the new delimiter:

df=pd.read_csv('data.csv', converters={'location':json.loads}, header=0, sep="|")

Each of these methods produces the same output:

print(df)
# Time location labelA labelB
#0 2019-09-10 {u'lat': 17.8, u'lng': 12.9, u'error': 7.0, u'... NaN NaN

Once you have that, you can expand the location column using one of the methods described in Flatten JSON column in a Pandas DataFrame:

# pd.json_normalize replaces the deprecated pd.io.json.json_normalize (pandas 1.0+)
new_df = df.join(pd.json_normalize(df["location"])).drop(["location"], axis=1)
print(new_df)
# Time labelA labelB alt error lat lng time
#0 2019-09-10 NaN NaN 413.0 7.0 17.8 12.9 2019-09-10

How do I parse a json string in a csv column and break it down into multiple columns?

First load the CSV file without parsing the JSON string.

Then convert the column to a list of parsed JSON objects and normalize it:

tmp = pd.json_normalize(InterimReport['REQUEST_RE'].apply(json.loads).tolist()).rename(
    columns=lambda x: x.replace('Attributes.', ''))

You should get something like:

    Fruit  Cost   ID  Country
0   Apple   1.5  001  America
1  Orange   2.0  002    China

You can then easily concatenate this to the original dataframe:

InterimReport = pd.concat([InterimReport.drop(columns=['REQUEST_RE']), tmp], axis=1)
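
Put together, the steps might look like the sketch below. The REQUEST_RE contents and the ORDER column are made up for illustration; only the column name REQUEST_RE and the 'Attributes.' prefix come from the snippets above.

```python
import json

import pandas as pd

# Hypothetical stand-in for the frame loaded from the CSV file.
InterimReport = pd.DataFrame({
    "ORDER": [1, 2],
    "REQUEST_RE": [
        '{"Attributes": {"Fruit": "Apple", "Cost": 1.5, "ID": "001", "Country": "America"}}',
        '{"Attributes": {"Fruit": "Orange", "Cost": 2.0, "ID": "002", "Country": "China"}}',
    ],
})

# Parse every cell, flatten the nested dicts, and drop the common prefix.
parsed = InterimReport["REQUEST_RE"].apply(json.loads).tolist()
tmp = pd.json_normalize(parsed).rename(columns=lambda x: x.replace("Attributes.", ""))

# Swap the raw JSON column for the flattened columns.
InterimReport = pd.concat([InterimReport.drop(columns=["REQUEST_RE"]), tmp], axis=1)
```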

Parsing a JSON string enclosed with quotation marks from a CSV using Pandas

As an alternative approach, you could read the file in manually, parse each row yourself, and use the resulting data to construct the dataframe. This works by splitting each row both forwards and backwards to get the non-problematic columns and then taking the remaining part as the JSON column:

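import pandas as pd

data = []

with open("e1.csv") as f_input:
    for row in f_input:
        row = row.strip()
        # Split off the first two columns from the left
        split = row.split(',', 2)
        # Split off the last column from the right and strip the enclosing quotes
        rsplit = [cell.strip('"') for cell in split[-1].rsplit(',', 1)]
        data.append(split[0:2] + rsplit)

# The first row holds the column names
df = pd.DataFrame(data[1:], columns=data[0])
print(df)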

This would display your data as:

  id employee                                            details   createdAt
0  1     John      {"Country":"USA","Salary":5000,"Review":null}  2018-09-01
1  2    Sarah  {"Country":"Australia", "Salary":6000,"Review"...  2018-09-05
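
The details column is still a plain string at this point. A sketch of the full round trip, assuming rows shaped like the output above with the JSON wrapped in one pair of unescaped double quotes (the sample text is a guess at the file's layout, not the actual e1.csv):

```python
import io
import json

import pandas as pd

# Assumed layout of the file; the real e1.csv may differ.
sample = io.StringIO(
    'id,employee,details,createdAt\n'
    '1,John,"{"Country":"USA","Salary":5000,"Review":null}",2018-09-01\n'
)

data = []
for row in sample:
    head = row.strip().split(',', 2)  # id, employee, everything else
    tail = [cell.strip('"') for cell in head[-1].rsplit(',', 1)]  # details, createdAt
    data.append(head[:2] + tail)

df = pd.DataFrame(data[1:], columns=data[0])

# Turn the JSON text into dicts so individual fields can be accessed.
df['details'] = df['details'].apply(json.loads)
```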

Parse json string in csv file

The data you have is using \" to escape a double quote within each cell. This behaviour can be specified by setting both doublequote=True and escapechar='\\' as parameters as follows:

df = pd.read_csv('input.json', doublequote=True, escapechar='\\')
print(df)

Giving you something like:

       0                                                  1     2
0  file1  {"A1": {"a": "123"}, "B1": {"b1": "456", "b2":...
1  file2  {"A2": {"a": "321"}, "B2": {"b1": "654", "b2":...  None
file1 {"A1": {"a": "123"}, "B1": {"b1": "456", "b2": "789", "b3": "000"}} \
0 file2 {"A2": {"a": "321"}, "B2": {"b1": "654", "b2":...

Unnamed: 2
0 NaN

Using Pandas to parse a JSON column w/nested values in a huge CSV

Just use json_normalize() on the series directly, and then use pandas.concat() to merge the new dataframe with the existing dataframe:

pd.concat([df, pd.json_normalize(df['Metadata'])], axis=1)

You can add a .drop('Metadata', axis=1) if you no longer need the old column with the JSON datastructure in it.

The columns produced for the my_custom_data nested dictionary will have my_custom_data. prefixed. If all the names in that nested dictionary are unique, you could drop that prefix with a DataFrame.rename() operation:

pd.json_normalize(df['Metadata']).rename(
    columns=lambda n: n[15:] if n.startswith('my_custom_data.') else n)

If you are using some other means to convert each dictionary value to a flattened structure (say, with flatten_json), then you want to use Series.apply() to process each value and return each resulting dictionary as a pandas.Series() object:

def some_conversion_function(dictionary):
    result = something_that_processes_dictionary_into_a_flat_dict(dictionary)
    return pd.Series(result)

You can then concatenate the result of the Series.apply() call (which will be a dataframe) back onto your original dataframe:

pd.concat([df, df['Metadata'].apply(some_conversion_function)], axis=1)
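
As a concrete illustration of the json_normalize route, here is a sketch on a made-up frame. The id, source, color, and size names are hypothetical; only Metadata and my_custom_data come from the text above.

```python
import pandas as pd

# Hypothetical frame; 'Metadata' already holds dicts with one nested level.
df = pd.DataFrame({
    "id": [1, 2],
    "Metadata": [
        {"source": "web", "my_custom_data": {"color": "red", "size": 3}},
        {"source": "app", "my_custom_data": {"color": "blue", "size": 5}},
    ],
})

prefix = "my_custom_data."
flat = pd.json_normalize(df["Metadata"].tolist()).rename(
    columns=lambda n: n[len(prefix):] if n.startswith(prefix) else n
)

# axis=1 places the new columns next to the existing ones instead of appending rows.
result = pd.concat([df.drop("Metadata", axis=1), flat], axis=1)
```

Passing a plain list to json_normalize is a conservative choice here; it avoids relying on how a given pandas version handles a Series argument.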

Pandas read CSV with embedded JSON into dataframe

  • The issue is, the 'visits' column is str type (e.g. '{"ABCD":9,"DEFG":8,"ASDF":6}').
  • When loading the csv with .read_csv, use the converters parameter to apply ast.literal_eval to the 'visits' column, which will convert the str to a dict.
    • converters: Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
from ast import literal_eval
import pandas as pd

# load the csv using the converters parameter with literal_eval
df2 = pd.read_csv('test_visits.csv', converters={'visits': literal_eval})

# normalize the visits, join it to location_id and drop the visits column
df2 = df2.join(pd.json_normalize(df2.visits)).drop(columns=['visits'])

# display(df2)
   location_id  ABCD  DEFG  ASDF  XYZR
0            1   9.0   8.0   6.0   NaN
1            2   4.0   NaN   NaN   4.0
2            3   NaN   NaN   4.0   NaN
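
ast.literal_eval works here because these values also happen to be valid Python literals. JSON that contains true, false, or null is not, so json.loads is the safer converter in general. A quick illustration with a made-up value:

```python
import json
from ast import literal_eval

value = '{"ABCD": 9, "flag": true, "note": null}'

# json.loads maps JSON true/null to Python True/None.
parsed = json.loads(value)

# literal_eval rejects the same text: 'true' and 'null' are not Python literals.
try:
    literal_eval(value)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False
```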

Parse CSV column contains mixed values as string and JSON using python pandas

I would just build a dataframe containing the new columns by hand and join it to the first one. Unfortunately you have not provided copyable data so I just used mine.

Original df:

df = pd.DataFrame({'ref': ['Outcomes', 'API-TEST', '{"from":"abc", "to": "def"}',
                           'Manual(add)', '{"from": "gh", "to": "ij"}', 'Migration']})

Giving:

                           ref
0                     Outcomes
1                     API-TEST
2  {"from":"abc", "to": "def"}
3                  Manual(add)
4   {"from": "gh", "to": "ij"}
5                    Migration

Extract only json data from ref column:

import json

data = []     # future data of the dataframe
ix = []       # future index
cols = set()  # future columns
for name, s in df[['ref']].iterrows():
    try:
        d = json.loads(s['ref'])
        ix.append(name)  # if we could decode, feed the future dataframe
        cols.update(set(d.keys()))
        data.append(d)
    except json.JSONDecodeError:
        pass  # else ignore the line

# pass the columns as a list; their order is arbitrary because cols is a set
df = df.join(pd.DataFrame(data, ix, list(cols)), how='left')

gives:

                           ref   to from
0                     Outcomes  NaN  NaN
1                     API-TEST  NaN  NaN
2  {"from":"abc", "to": "def"}  def  abc
3                  Manual(add)  NaN  NaN
4   {"from": "gh", "to": "ij"}   ij   gh
5                    Migration  NaN  NaN

Python - Function for parsing key-value pairs into DataFrame columns

It looks like someone scraped JavaScript code and saved it as a CSV string.

"1, {""key"": ""construction_year"", ""value"": 1900}, {""key"": ""available_date"", ""value"": ""Vereinbarung""}"

You need to convert the CSV string back to a normal string and then parse it.

Put differently, the text in each line has to be changed into valid JSON data

[1, {"key": "construction_year", "value": 1900}, {"key": "available_date", "value": "Vereinbarung"}]

which can be converted to 3 columns.

Later you can merge the dictionaries into one dictionary

[1, {'construction_year': 1900, 'available_date': 'Vereinbarung'}]

which can be converted to columns using pandas and .apply(pd.Series)


I use the text as a string here, but you could read it from a file.

text = '''"1, {""key"": ""construction_year"", ""value"": 1900}, {""key"": ""available_date"", ""value"": ""Vereinbarung""}"
"2, {""key"": ""available_date"", ""value"": ""01.04.2022""}, {""key"": ""useful_area"", ""value"": 60.0}"
"3, {""key"": ""construction_year"", ""value"": 2020}, {""key"": ""available_date"", ""value"": ""sofort""}"
"4, {""key"": ""available_date"", ""value"": ""Vereinbarung""}, {""key"": ""wheelchair_accessible"", ""value"": true}"
'''

import json
import pandas as pd

#text = open('data.csv').read()

rows = []
for line in text.splitlines():
    line = line.replace('""', '"')   # undo the CSV quote doubling
    line = '[' + line[1:-1] + ']'    # swap the outer quotes for list brackets
    line = json.loads(line)

    # merge the {"key": ..., "value": ...} dictionaries into one dictionary
    item = {}
    for d in line[1:]:
        key = d['key']
        val = d['value']
        item[key] = val

    rows.append([line[0], item])

df = pd.DataFrame(rows, columns=['id', 'data'])

# convert dictionaries to columns
df = df.join(df['data'].apply(pd.Series))

# remove column with dictionaries
del df['data']

print(df.to_string())

Result:

   id  construction_year available_date  useful_area wheelchair_accessible
0   1             1900.0   Vereinbarung          NaN                   NaN
1   2                NaN     01.04.2022         60.0                   NaN
2   3             2020.0         sofort          NaN                   NaN
3   4                NaN   Vereinbarung          NaN                  True

