How to Recursively Flatten a Yaml File into a JSON Object Where Keys Are Dot Separated Strings

How do I recursively flatten a YAML file into a JSON object where keys are dot separated strings?

Since the question is about using YAML files for i18n on a Rails app, it's worth noting that the i18n gem provides a helper module I18n::Backend::Flatten that flattens translations exactly like this:

test.rb:

require 'yaml'
require 'json'
require 'i18n'

yaml = YAML.load <<YML
en:
questions:
new: 'New Question'
other:
recent: 'Recent'
old: 'Old'
YML
include I18n::Backend::Flatten
puts JSON.pretty_generate flatten_translations(nil, yaml, nil, false)

Output:

$ ruby test.rb
{
"en.questions.new": "New Question",
"en.questions.other.recent": "Recent",
"en.questions.other.old": "Old"
}

How to flatten multilevel/nested JSON?

I used the following function (details can be found here):

def flatten_data(y):
out = {}

def flatten(x, name=''):
if type(x) is dict:
for a in x:
flatten(x[a], name + a + '_')
elif type(x) is list:
i = 0
for a in x:
flatten(a, name + str(i) + '_')
i += 1
else:
out[name[:-1]] = x

flatten(y)
return out

This unfortunately completely flattens whole JSON, meaning that if you have multi-level JSON (many nested dictionaries), it might flatten everything into single line with tons of columns.

What I used, in the end, was json_normalize() and specified structure that I required. A nice example of how to do it that way can be found here.

Fastest way to flatten / un-flatten nested JavaScript objects

Here's my much shorter implementation:

Object.unflatten = function(data) {
"use strict";
if (Object(data) !== data || Array.isArray(data))
return data;
var regex = /\.?([^.\[\]]+)|\[(\d+)\]/g,
resultholder = {};
for (var p in data) {
var cur = resultholder,
prop = "",
m;
while (m = regex.exec(p)) {
cur = cur[prop] || (cur[prop] = (m[2] ? [] : {}));
prop = m[2] || m[1];
}
cur[prop] = data[p];
}
return resultholder[""] || resultholder;
};

flatten hasn't changed much (and I'm not sure whether you really need those isEmpty cases):

Object.flatten = function(data) {
var result = {};
function recurse (cur, prop) {
if (Object(cur) !== cur) {
result[prop] = cur;
} else if (Array.isArray(cur)) {
for(var i=0, l=cur.length; i<l; i++)
recurse(cur[i], prop + "[" + i + "]");
if (l == 0)
result[prop] = [];
} else {
var isEmpty = true;
for (var p in cur) {
isEmpty = false;
recurse(cur[p], prop ? prop+"."+p : p);
}
if (isEmpty && prop)
result[prop] = {};
}
}
recurse(data, "");
return result;
}

Together, they run your benchmark in about the half of the time (Opera 12.16: ~900ms instead of ~ 1900ms, Chrome 29: ~800ms instead of ~1600ms).

Note: This and most other solutions answered here focus on speed and are susceptible to prototype pollution and shold not be used on untrusted objects.

Denormalize/flatten list of nested objects into dot separated key value pairs

code.py:

from itertools import product
from pprint import pprint as pp

all_objs = [{
"a": 1,
"b": [{"ba": 2, "bb": 3}, {"ba": 21, "bb": 31}],
"c": 4,
#"d": [{"da": 2}, {"da": 5}],
}, {
"a": 11,
"b": [{"ba": 22, "bb": 33, "bc": [{"h": 1, "e": 2}]}],
"c": 44,
}]

def flatten_dict(obj, parent_key=None):
base_dict = dict()
complex_items = list()
very_complex_items = list()
for key, val in obj.items():
new_key = ".".join((parent_key, key)) if parent_key is not None else key
if isinstance(val, list):
if len(val) > 1:
very_complex_items.append((key, val))
else:
complex_items.append((key, val))
else:
base_dict[new_key] = val
if not complex_items and not very_complex_items:
return [base_dict]
base_dicts = list()
partial_dicts = list()
for key, val in complex_items:
partial_dicts.append(flatten_dict(val[0], parent_key=new_key))
for product_tuple in product(*tuple(partial_dicts)):
new_base_dict = base_dict.copy()
for new_dict in product_tuple:
new_base_dict.update(new_dict)
base_dicts.append(new_base_dict)
if not very_complex_items:
return base_dicts
ret = list()
very_complex_keys = [item[0] for item in very_complex_items]
very_complex_vals = tuple([item[1] for item in very_complex_items])
for product_tuple in product(*very_complex_vals):
for base_dict in base_dicts:
new_dict = base_dict.copy()
new_items = zip(very_complex_keys, product_tuple)
for key, val in new_items:
new_key = ".".join((parent_key, key)) if parent_key is not None else key
new_dict.update(flatten_dict(val, parent_key=new_key)[0])
ret.append(new_dict)
return ret

def main():
flatten = list()
for obj in all_objs:
flatten.extend(flatten_dict(obj))
pp(flatten)

if __name__ == "__main__":
main()

Notes:

  • As expected, recursion is used
  • It's general, it also works for the case that I mentioned in my 2nd comment (for one input dict having more than one key with a value consisting of a list with more than one element), that can be tested by decommenting the "d" key in all_objs. Also, theoretically it should support any depth
  • flatten_dict: takes an input dictionary and outputs a list of dictionaries (as the input dictionary might yield more than one output dictionary):

    • Every key having a "simple" (not list) value, goes into the output dictionar(y/ies) unchanged
    • At this point, a base output dictionary is complete (if the input dictionary will generate more than output dictionary, all will have the base dictionary keys/values, if it only generates one output dictionary, then that will be the base one)
    • Next, the keys with "problematic" values - that may generate more than output dictionary - (if any) are processed:

      • Keys having a list with a single element ("problematic") - each might generate more than one output dictionary:

        • Each of the values will be flattened (might yield more than one output dictionary); the corresponding key will be used in the process
        • Then, the cartesian product will be computed on all the flatten dictionary lists (for current input, there will only be one list with one element)
        • Now, each product item needs to be in a distinct output dictionary, so the base dictionary is duplicated and updated with the keys / values of every element in the product item (for current input, there will be only one element per product item)
        • The new dictionary is appended to a list
      • At this point a list of base dictionaries (might only be one) is complete, if no values consisting of lists with more than one element are present, this is the return list, else everything below has to be done for each base dictionary in the list
      • Keys having a list with a more elements ("very problematic") - each will generate more than one output dictionaries:

        • First, the cartesian product will be computed against all the values (lists with more than one element). In the current case, since since it's only one such list, each product item will only contain an element from that list
        • Then, for each product item element, its key will need to be established based on the lists order (for the current input, the product item will only contain one element, and also, there will only be one key)
        • Again, each product item needs to be in a distinct output dictionary, so the base dictionary is duplicated and updated with the keys / values, of the flattened product item
      • The new dictionary is appended to the output dictionaries list
  • Works with Python 3 and Python 2
  • Might be slow (especially for big input objects), as performance was not the goal. Also since it was built bottom-up (adding functionality as new cases were handled), it is pretty twisted (RO: intortocheated :) ), there might be a simpler implementation that I missed.

Output:

c:\Work\Dev\StackOverflow\q046341856>c:\Work\Dev\VEnvs\py35x64_test\Scripts\python.exe code.py
[{'a': 1, 'b.ba': 2, 'b.bb': 3, 'c': 4},
{'a': 1, 'b.ba': 21, 'b.bb': 31, 'c': 4},
{'a': 11, 'b.ba': 22, 'b.bb': 33, 'b.bc.e': 2, 'b.bc.h': 1, 'c': 44}]

@EDIT0:

  • Made it more general (although it's not visible for the current input): values containing only one element can yield more than output dictionary (when flattened), addressed that case (before I was only considering the 1st output dictionary, simply ignoring the rest)
  • Corrected a logical error that was masked out tuple unpacking combined with cartesian product: if not complex_items ... part

@EDIT1:

  • Modified the code to match a requirement change: the key in the flattened dictionary must have the full nesting path in the input dictionary

How to iterate over deep nested hash without known depth in Ruby

So, you want to parse recursively until there are no more levels to parse into.

It’s super common in software and referred to as “recursion”. Have a search around to learn more about it - it’ll come up again and again in your journey. Welcome to ruby btw!

As for your actual current problem. Have a read of https://mrxpalmeiras.wordpress.com/2017/03/30/how-to-parse-a-nested-yaml-config-file-in-python-and-ruby/

But also, consider the i18n gem. See this answer https://stackoverflow.com/a/51216931/1777331 and the docs for the gem https://github.com/ruby-i18n/i18n This might fix your problem of handling internationalisation without you having to get into the details of handling yaml files.

How to deserialize JSON into flat, Map-like structure?

You can do this to traverse the tree and keep track of how deep you are to figure out dot notation property names:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import com.fasterxml.jackson.databind.node.ValueNode;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.junit.Test;

public class FlattenJson {
String json = "{\n" +
" \"Port\":\n" +
" {\n" +
" \"@alias\": \"defaultHttp\",\n" +
" \"Enabled\": \"true\",\n" +
" \"Number\": \"10092\",\n" +
" \"Protocol\": \"http\",\n" +
" \"KeepAliveTimeout\": \"20000\",\n" +
" \"ThreadPool\":\n" +
" {\n" +
" \"@enabled\": \"false\",\n" +
" \"Max\": \"150\",\n" +
" \"ThreadPriority\": \"5\"\n" +
" },\n" +
" \"ExtendedProperties\":\n" +
" {\n" +
" \"Property\":\n" +
" [ \n" +
" {\n" +
" \"@name\": \"connectionTimeout\",\n" +
" \"$\": \"20000\"\n" +
" }\n" +
" ]\n" +
" }\n" +
" }\n" +
"}";

@Test
public void testCreatingKeyValues() {
Map<String, String> map = new HashMap<String, String>();
try {
addKeys("", new ObjectMapper().readTree(json), map);
} catch (IOException e) {
e.printStackTrace();
}
System.out.println(map);
}

private void addKeys(String currentPath, JsonNode jsonNode, Map<String, String> map) {
if (jsonNode.isObject()) {
ObjectNode objectNode = (ObjectNode) jsonNode;
Iterator<Map.Entry<String, JsonNode>> iter = objectNode.fields();
String pathPrefix = currentPath.isEmpty() ? "" : currentPath + ".";

while (iter.hasNext()) {
Map.Entry<String, JsonNode> entry = iter.next();
addKeys(pathPrefix + entry.getKey(), entry.getValue(), map);
}
} else if (jsonNode.isArray()) {
ArrayNode arrayNode = (ArrayNode) jsonNode;
for (int i = 0; i < arrayNode.size(); i++) {
addKeys(currentPath + "[" + i + "]", arrayNode.get(i), map);
}
} else if (jsonNode.isValueNode()) {
ValueNode valueNode = (ValueNode) jsonNode;
map.put(currentPath, valueNode.asText());
}
}
}

It produces the following map:

Port.ThreadPool.Max=150, 
Port.ThreadPool.@enabled=false,
Port.Number=10092,
Port.ExtendedProperties.Property[0].@name=connectionTimeout,
Port.ThreadPool.ThreadPriority=5,
Port.Protocol=http,
Port.KeepAliveTimeout=20000,
Port.ExtendedProperties.Property[0].$=20000,
Port.@alias=defaultHttp,
Port.Enabled=true

It should be easy enough to strip out @ and $ in the property names, although you could end up with collisions in key names since you said the JSON was arbitrary.

Why and when do we need to flatten JSON objects?

There are many situations where you get JSON text that was automatically built by some library. Throughout the programming languages, there are many libraries that build JSON text (one example is here).

Whenever libraries add some additional object or array wrappings, you might want to get rid of them maybe because you send the JSON to the server and your code there crashes because it expects a primitive value instead of an object (or an array). Or, if your JSON is a server response, you don't want the resulting Javascript code having to differ between object/array or not object/array. In all these cases, flattening is helpful as it will save you time. You will have to implement lesser if/elses, and you can reliably expect your data structure to be as flat as possible.

The other approach to improve code for the scenario mentioned is to write the code in a maximal robust way so there is no way for it to crash by superfluous wrappings ever. So always expect some wrappers and get it's contents. Then, flattening is not needed.

You see, it depends on what is building the JSON and what is parsing it. The building may be out of your scope.

This leads also to data model questions. I've worked with XML code that needed to be parsed quiet a different way if there where 0 entries of some XY, or if there were >0 entries of some XY. Having a wrapper that is allowed to have 0 or more entries of some XY will make live easier. These are data model desicions.

In all cases where the JSON represents an object structure that I've combined manually, I expect it not to change. So flattening something I've designed in detail would be disturbing. Standard operations as far I've seen them do not need flattening (e.g. JSON.stringify(), json_encode() etc.)

One liner to flatten nested object

Here you go:

Object.assign({}, ...function _flatten(o) { return [].concat(...Object.keys(o).map(k => typeof o[k] === 'object' ? _flatten(o[k]) : ({[k]: o[k]})))}(yourObject))

Summary: recursively create an array of one-property objects, then combine them all with Object.assign.

This uses ES6 features including Object.assign or the spread operator, but it should be easy enough to rewrite not to require them.

For those who don't care about the one-line craziness and would prefer to be able to actually read it (depending on your definition of readability):

Object.assign(
{},
...function _flatten(o) {
return [].concat(...Object.keys(o)
.map(k =>
typeof o[k] === 'object' ?
_flatten(o[k]) :
({[k]: o[k]})
)
);
}(yourObject)
)

Flatten nested JSON using jq

You can also use the following jq command to flatten nested JSON objects in this manner:

[leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries

The way it works is: leaf_paths returns a stream of arrays which represent the paths on the given JSON document at which "leaf elements" appear, that is, elements which do not have child elements, such as numbers, strings and booleans. We pipe that stream into objects with key and value properties, where key contains the elements of the path array as a string joined by dots and value contains the element at that path. Finally, we put the entire thing in an array and run from_entries on it, which transforms an array of {key, value} objects into an object containing those key-value pairs.



Related Topics



Leave a reply



Submit