Read and Write Yaml Files Without Destroying Anchors and Aliases

The problem here is that anchors and aliases in YAML are a serialization detail. They aren’t part of the data after it’s been parsed, so the original anchor name isn’t known when writing the data back out to YAML. To keep the anchor names when round-tripping, you need to store them somewhere when parsing so that they are available later when serializing. In Ruby any object can have instance variables associated with it, so an easy way to achieve this is to store the anchor name in an instance variable of the object in question.
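To make the distinction concrete, here is a minimal sketch comparing what Psych's parse tree knows with what the resulting Ruby data knows. The sample document and variable names are made up for illustration.

```ruby
require 'psych'

yaml = <<~YAML
  defaults: &defaults
    foo: bar
  node: *defaults
YAML

# The parse tree (AST) still carries the anchor name on the node.
root     = Psych.parse(yaml).root   # top-level mapping node
anchored = root.children[1]         # value node for the "defaults" key
aliased  = root.children[3]         # value node for the "node" key
puts anchored.anchor                # the anchor name as written in the file
puts aliased.class                  # an alias node that refers to that anchor

# Converting the tree to Ruby data yields a plain Hash with no record of the name.
data = Psych.parse(yaml).to_ruby
p data['defaults'].instance_variables
```

The name is only available on the node objects, which is why it has to be copied onto the Ruby object during conversion if it is to survive.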

Continuing from the example in the earlier question, for hashes we can change our redefined revive_hash method so that, if the hash has an anchor, we not only record the anchor name in the @st variable (so that later aliases can be recognised) but also add it as an instance variable on the hash.

class ToRubyNoMerge < Psych::Visitors::ToRuby
  def revive_hash hash, o
    if o.anchor
      @st[o.anchor] = hash
      hash.instance_variable_set "@_yaml_anchor_name", o.anchor
    end

    o.children.each_slice(2) { |k,v|
      key = accept(k)
      hash[key] = accept(v)
    }
    hash
  end
end

Note that this only affects YAML mappings that have anchors. If you want other types to keep their anchor names too, you’ll need to look at psych/visitors/to_ruby.rb and make sure the name is added in all cases. Most types can be covered by overriding register, but there are a couple of others; search for @st.
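As a rough sketch of the register approach: the class name and sample document below are hypothetical, and the code assumes that ToRuby#register(node, object) is the private hook Psych calls for most node types and that visitors are built with create, which seems to be the case in recent Psych releases but should be checked against your version.

```ruby
require 'psych'

# Hypothetical subclass: tags every object Psych registers with its anchor name.
# Assumes the private ToRuby#register(node, object) hook (verify for your Psych).
class ToRubyKeepAnchors < Psych::Visitors::ToRuby
  def register node, object
    # Frozen values (integers, symbols, nil, ...) cannot hold instance variables.
    if node.anchor && !object.frozen?
      object.instance_variable_set :@_yaml_anchor_name, node.anchor
    end
    super
  end
end

yaml = <<~YAML
  items: &items
    - a
    - b
  again: *items
YAML

data = ToRubyKeepAnchors.create.accept(Psych.parse(yaml))
p data['items'].instance_variable_get(:@_yaml_anchor_name)
```

Values that are frozen when they are registered are skipped here, so their anchor names would still be lost with this sketch.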

Now that the hash has the desired anchor name associated with it, you need to make Psych use it instead of the object id when serializing it. This can be done by subclassing YAMLTree. When YAMLTree processes an object, it first checks to see if that object has been seen already, and emits an alias for it if it has. For any new objects, it records that it has seen the object in case it needs to create an alias later. The object_id is used as the key in this, so you need to override those two methods to check for the instance variable, and use that instead if it exists:

class MyYAMLTree < Psych::Visitors::YAMLTree

  # check to see if this object has been seen before
  def accept target
    if anchor_name = target.instance_variable_get('@_yaml_anchor_name')
      if @st.key? anchor_name
        oid         = anchor_name
        node        = @st[oid]
        anchor      = oid.to_s
        node.anchor = anchor
        return @emitter.alias anchor
      end
    end

    # accept is a pretty big method, call super to avoid copying
    # it all here. super will handle the cases when it's an object
    # that's been seen but doesn't have '@_yaml_anchor_name' set
    super
  end

  # record object for future, using '@_yaml_anchor_name' rather
  # than object_id if it exists
  def register target, yaml_obj
    anchor_name = target.instance_variable_get('@_yaml_anchor_name') || target.object_id
    @st[anchor_name] = yaml_obj
    yaml_obj
  end
end

Now you can use it like this (unlike the previous question, you don’t need to create a custom emitter in this case):

builder = MyYAMLTree.new
builder << data

tree = builder.tree

puts tree.yaml # returns a string

# alternatively write directly to a file ('w' creates or truncates it):
File.open('a_file.yml', 'w') do |f|
  tree.yaml f
end
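The classes above were written against the Psych internals of the time. In more recent Psych releases, @st inside YAMLTree appears to be a small registrar object (responding to key? and node_for) rather than a Hash, and visitors are built with create rather than new. Under those assumptions, which concern private internals and may not hold for your version, a minimal sketch of the same idea could look like this (the class name and sample data are hypothetical):

```ruby
require 'psych'

# Hypothetical subclass for recent Psych versions. Assumes @st responds to
# key? and node_for (private internals; check them against your Psych version).
class NamedAnchorYAMLTree < Psych::Visitors::YAMLTree
  def accept target
    name = target.instance_variable_get(:@_yaml_anchor_name)
    if name && @st.key?(target)
      node        = @st.node_for target  # node built for the first occurrence
      node.anchor = name                 # use the stored name as the anchor
      return @emitter.alias name         # emit an alias for this occurrence
    end
    super
  end
end

shared = { 'foo' => 'bar', 'zip' => 'button' }
shared.instance_variable_set :@_yaml_anchor_name, 'defaults'
data = { 'defaults' => shared, 'copy' => shared }

builder = NamedAnchorYAMLTree.create
builder << data
output = builder.tree.yaml
puts output
```

Because the registrar tracks objects by identity, register does not need to be overridden in this variant; only the alias name is swapped.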

Read and write YAML files without destroying anchors and aliases?

The use of << to indicate that an aliased mapping should be merged into the current mapping isn’t part of the core YAML spec, but it is part of the tag repository.

The current YAML library provided by Ruby, Psych, provides the dump and load methods, which allow easy serialization and deserialization of Ruby objects and use the various implicit type conversions in the tag repository, including << to merge hashes. It also provides tools for lower-level YAML processing if you need it. Unfortunately it doesn’t easily allow selectively disabling or enabling specific parts of the tag repository; it’s an all-or-nothing affair. In particular, the handling of << is pretty much baked into the handling of hashes.
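A small sketch of that default behaviour, using a made-up document. It goes through Psych.parse and to_ruby rather than Psych.load because recent Psych versions make load refuse aliases unless told otherwise.

```ruby
require 'psych'

yaml = <<~YAML
  defaults: &defaults
    foo: bar
    zip: button
  node:
    <<: *defaults
    foo: other
YAML

# The standard conversion applies the merge: the << key itself disappears
# and the aliased mapping's entries are folded into the node hash.
data = Psych.parse(yaml).to_ruby
p data['node']
p data['node'].key?('<<')
```

After loading, nothing in the node hash records that some of its entries came from a merge, which is why writing the data back out cannot recreate the << line.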

One way to achieve what you want is to provide your own subclass of Psych’s ToRuby class and override its revive_hash method so that it just treats mapping keys of << as literals. This involves overriding a private method in Psych, so you need to be a little careful:

require 'psych'

class ToRubyNoMerge < Psych::Visitors::ToRuby
  def revive_hash hash, o
    @st[o.anchor] = hash if o.anchor

    o.children.each_slice(2) { |k,v|
      key = accept(k)
      hash[key] = accept(v)
    }
    hash
  end
end

You would then use it like this:

tree = Psych.parse your_data
# recent Psych versions build visitors with create; older ones used new
data = ToRubyNoMerge.create.accept tree

With the Yaml from your example, data would then look something like

{"defaults"=>{"foo"=>"bar", "zip"=>"button"},
 "node"=>{"<<"=>{"foo"=>"bar", "zip"=>"button"}, "foo"=>"other"}}

Note the << as a literal key. Also the hash under the data["defaults"] key is the same hash as the one under the data["node"]["<<"] key, i.e. they have the same object_id. You can now manipulate the data as you want, and when you write it out as Yaml the anchors and aliases will still be in place, although the anchor names will have changed:

data['node']['foo'] = "yet another"
puts Psych.dump data

produces (at the time, Psych used the object_id of the hash to ensure unique anchor names; the current version of Psych uses sequential numbers instead):

---
defaults: &2151922820
  foo: bar
  zip: button
node:
  <<: *2151922820
  foo: yet another

If you want to have control over the anchor names, you can provide your own Psych::Visitors::Emitter. Here’s a simple example based on your example and assuming there’s only the one anchor:

class MyEmitter < Psych::Visitors::Emitter
  def visit_Psych_Nodes_Mapping o
    o.anchor = 'defaults' if o.anchor
    super
  end

  def visit_Psych_Nodes_Alias o
    o.anchor = 'defaults' if o.anchor
    super
  end
end

When used with the modified data hash from above:

# create an AST based on the Ruby data structure
# (recent Psych versions use YAMLTree.create; older ones used YAMLTree.new)
builder = Psych::Visitors::YAMLTree.create
builder << data
ast = builder.tree

# write out the tree using the custom emitter
MyEmitter.new($stdout).accept ast

the output is:

---
defaults: &defaults
  foo: bar
  zip: button
node:
  <<: *defaults
  foo: yet another

(Update: another question asked how to do this with more than one anchor, where I came up with a possibly better way to keep anchor names when serializing.)
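The emitter above hard-codes a single anchor name. A different technique, sketched here, is to post-process the node tree instead and rename each generated anchor after the mapping key it first appears under. The helper name is hypothetical, and the sketch assumes those keys are unique and are valid anchor names.

```ruby
require 'psych'

# Hypothetical helper: renames each anchor after the mapping key whose value
# carries it, then points the aliases at the new names.
# Assumes those keys are unique and usable as anchor names.
def name_anchors_after_keys(ast)
  renamed = {}
  ast.each do |node|
    next unless node.is_a?(Psych::Nodes::Mapping)
    node.children.each_slice(2) do |key, value|
      next unless key.is_a?(Psych::Nodes::Scalar)
      next if value.is_a?(Psych::Nodes::Alias) || value.anchor.nil?
      renamed[value.anchor] = key.value
      value.anchor = key.value
    end
  end
  # Second pass: update aliases to use the new anchor names.
  ast.each do |node|
    next unless node.is_a?(Psych::Nodes::Alias)
    node.anchor = renamed.fetch(node.anchor, node.anchor)
  end
  ast
end

shared = { 'foo' => 'bar', 'zip' => 'button' }
data   = { 'defaults' => shared, 'copy' => shared }

# Dump normally, re-parse into a node tree, rename, and emit again.
ast    = Psych.parse_stream(Psych.dump(data))
output = name_anchors_after_keys(ast).to_yaml
puts output
```

This only touches the public node classes, at the cost of an extra dump and parse.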

How to change an anchored scalar in a sequence without destroying the anchor in ruamel.yaml?

If you read in an anchored scalar, like your "This is unencrypted",
using ruamel.yaml, you get a PlainScalarString object (or one of the other ScalarString
subclasses), which is an extremely thin layer around the basic string
type. That layer has an attribute to store an anchor, if applicable (its other uses are primarily to
maintain quoting/literal/folding style information). Any aliases using that anchor refer to the same ScalarString instance.

When dumping, the anchor attribute is not used to create aliases; that
is done in the normal way, by having multiple references to the same
object. The attribute is only used to write the anchor id, and it is written
even if there is an attribute but no further references (i.e. an anchor without aliases).

So it is not surprising that if you replace such an object, which has
multiple references, at one spot (either the anchor spot or any of the alias
spots), the reference disappears. If you then also force the same
anchor name on some other object, you get duplicate anchors: contrary
to the normal anchor/alias generation, there is no check done on
"forced" anchors.

Since the ScalarString is such a thin wrapper, its instances are essentially
immutable objects, just like the string itself. Unlike aliased
dicts and lists, which are collection objects that can be emptied and
then filled (instead of replaced by a new instance), you cannot do
that with a string.

The implementation of ScalarString can of course be changed, so you
can have your set_values() method, but that involves creating alternative
classes for all the objects (PlainScalarString,
FoldedScalarString). You would have to make sure
these get used for constructing and for representing, and then
preferably also behave like normal strings as far as you need it, so
that at least you can print them.
That is relatively easy to do but requires copying and slightly modifying several
tens of lines of code.

I think it is easier to leave the ScalarStrings in place as they are (i.e.
immutable) and do what you need to do if you want to change all
occurrences (i.e. references): update all the references to the
original. If your data structure contained millions of nodes, that
might be prohibitively time consuming, but it would still take a fraction of what
loading and dumping the YAML itself would take:

import sys
from pathlib import Path
import ruamel.yaml

in_file = Path('test.yaml')

def update_aliased_scalar(data, obj, val):
    def recurse(d, ref, nv):
        if isinstance(d, dict):
            for i, k in [(idx, key) for idx, key in enumerate(d.keys()) if key is ref]:
                d.insert(i, nv, d.pop(k))
            for k, v in d.non_merged_items():
                if v is ref:
                    d[k] = nv
                else:
                    recurse(v, ref, nv)
        elif isinstance(d, list):
            for idx, item in enumerate(d):
                if item is ref:
                    d[idx] = nv
                else:
                    recurse(item, ref, nv)

    if hasattr(obj, 'anchor'):
        recurse(data, obj, type(obj)(val, anchor=obj.anchor.value))
    else:
        recurse(data, obj, type(obj)(val))

yaml = ruamel.yaml.YAML()
yaml.indent(mapping=2, sequence=4, offset=2)
yaml.preserve_quotes = True
data = yaml.load(in_file)

update_aliased_scalar(data, data['aliases'][1], "New string password")
update_aliased_scalar(data, data['top::hash']['sub']['blocked_alias'], "New block password\n")

yaml.dump(data, sys.stdout)

which gives:

# Post-header comment

# Reusable aliases
aliases:
  - &plain_value This is unencrypted
  - &string_password New string password
  - &block_password >
    New block password

top_key: unencrypted value
top_alias: *plain_value

top::hash:
  ignore: more
  # This pulls its string-form value from above
  stringified_alias: *string_password
  sub:
    ignore: value
    key: unencrypted subbed-value
    # This pulls its block-form value from above
    blocked_alias: *block_password
  sub_more:
    # This is a stringified EYAML value, NOT an alias
    inline_string: ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEwDQYJKoZIhvcNAQEBBQAEggEAafmyrrae2kx8HdyPmn/RHQRcTPhqpx5Idm12hCDCIbwVM++H+c620z4EN2wlugz/GcLaiGsybaVWzAZ+3r+1+EwXn5ec4dJ5TTqo7oxThwUMa+SHliipDJwGoGii/H+y2I+3+irhDYmACL2nyJ4dv4IUXwqkv6nh1J9MwcOkGES2SKiDm/WwfkbPIZc3ccp1FI9AX/m3SVqEcvsrAfw6HtkolM22csfuJREHkTp7nBapDvOkWn4plzfOw9VhPKhq1x9DUCVFqqG/HAKv++v4osClK6k1MmSJWaMHrW1z3n7LftV9ZZ60E0Cgro2xSaD+itRwBp07H0GeWuoKB4+44TBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBCRv9r2lvQ1GJMoD064EtdigCCw43EAKZWOc41yEjknjRaWDm1VUug6I90lxCsUrxoaMA==]
    # Also NOT an alias, in block form
    block_string: >
      ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEw
      DQYJKoZIhvcNAQEBBQAEggEAafmyrrae2kx8HdyPmn/RHQRcTPhqpx5Idm12
      hCDCIbwVM++H+c620z4EN2wlugz/GcLaiGsybaVWzAZ+3r+1+EwXn5ec4dJ5
      TTqo7oxThwUMa+SHliipDJwGoGii/H+y2I+3+irhDYmACL2nyJ4dv4IUXwqk
      v6nh1J9MwcOkGES2SKiDm/WwfkbPIZc3ccp1FI9AX/m3SVqEcvsrAfw6Htko
      lM22csfuJREHkTp7nBapDvOkWn4plzfOw9VhPKhq1x9DUCVFqqG/HAKv++v4
      osClK6k1MmSJWaMHrW1z3n7LftV9ZZ60E0Cgro2xSaD+itRwBp07H0GeWuoK
      B4+44TBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBCRv9r2lvQ1GJMoD064
      EtdigCCw43EAKZWOc41yEjknjRaWDm1VUug6I90lxCsUrxoaMA==]

# Signature line

As you can see the anchors are preserved and it doesn't matter for update_aliased_scalar if you
provide the anchored "place" or one of the aliased places as a reference.

The above recurse also handles keys that are aliased, as it is perfectly fine for a key in a YAML mapping to have an anchor or to be an alias. You can even have an anchored key with a value that is an alias to the corresponding key.

Annotating Ruby structures to include anchors/references on #to_yaml

If you use the same Ruby object, the YAML library will set up references for you:

> common = {"ohai" => "I am common"}
> doc = {"parent1" => {"id" => 1, "stuff" => common}, "parent2" => {"id" => 2, "stuff" => common}}
> puts doc.to_yaml
---
parent1:
  id: 1
  stuff: &70133422893680
    ohai: I am common
parent2:
  id: 2
  stuff: *70133422893680

I'm not sure there's a straightforward way of defining Hashes that are subsets of each other, though. Perhaps tweaking your structure a bit would be warranted?
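Note that it is object identity, not equal content, that triggers the reference. A minimal sketch with made-up data, comparing a shared hash with an equal but separate one:

```ruby
require 'psych'

common = { 'ohai' => 'I am common' }

# Same object used twice versus an equal but separate hash.
shared = { 'parent1' => { 'stuff' => common }, 'parent2' => { 'stuff' => common } }
copied = { 'parent1' => { 'stuff' => common },
           'parent2' => { 'stuff' => { 'ohai' => 'I am common' } } }

shared_yaml = Psych.dump(shared)
copied_yaml = Psych.dump(copied)

puts shared_yaml   # the second occurrence is written as an alias
puts copied_yaml   # both occurrences are written out in full
```

So to get a reference in the output, build the structure so that the very same Hash instance appears in each place.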

Override YAML subkey

If I've understood the question correctly, I don't think the spec supports overriding elements of anchored nodes.

On reading the spec (version 1.2, but 1.1 says the same), section 7.1 Alias Nodes states (emphasis mine):

Subsequent occurrences of a previously serialized node are presented as alias nodes. The first occurrence of the node must be marked by an anchor to allow subsequent occurrences to be presented as alias nodes.
An alias node is denoted by the “*” indicator. The alias refers to the most recent preceding node having the same anchor. It is an error for an alias node to use an anchor that does not previously occur in the document. It is not an error to specify an anchor that is not used by any alias node.
Note that an alias node must not specify any properties or content, as these were already specified at the first occurrence of the node.

Two points here:

"Previously serialized node" - this wording suggests that the alias is meant to represent another occurrence of the original node, not just the data in the original node. In other words, it represents the same object, not a copy.
If an alias cannot have any content (the "must not specify any properties or content" sentence in the passage above), then you cannot specify the override in the fashion suggested in the question.

So my interpretation of the spec is that you cannot do this according to the spec.

However, if you paste the example (second code block) from the original into this online tool (you may want to uncheck 'canonical'), that tool interprets it as intended in the question, copying the original content but overriding subkey100. The same goes for this YAML Lint Tool and for this online parser.

So it seems to work in practice, but I can't find support for it within the spec.
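For what it's worth, the "same object, not a copy" reading matches how at least one implementation behaves. The sketch below uses Ruby's Psych on a made-up document; it shows that library's behaviour only, not something the spec requires of every parser.

```ruby
require 'psych'

yaml = <<~YAML
  base: &base
    subkey: 1
  other: *base
YAML

data = Psych.parse(yaml).to_ruby

# Both keys refer to one Hash object, so a change made through one
# of them is visible through the other.
data['other']['subkey'] = 2
p data['base']['subkey']
p data['base'].equal?(data['other'])
```

With such a parser, changing a subkey through the alias changes the anchored node too, so any per-alias override has to come from something layered on top of plain aliases.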

PyYaml include file and yaml aliases (anchors/references)

Crucial for the handling of anchors and aliases in PyYAML is the dict anchors that is part of the Composer. It maps anchors to nodes so that aliases can be looked up. Its existence is limited by the existence of the Composer, which is a composite element of the Loader that you use.

That Loader class only exists during the time of the call to yaml.load() so there is no trivial way to extract this afterwards: first you would have to make the instance of the Loader() persist and then make sure that the normal compose_document() method is not called (which among other things does self.anchors = {}, to be clean for the next document (in a single stream)).

To further complicate things if you would have warehouse.yaml:

warehouse:
  obj1: &obj1
    key1: 1
    key2: 2

and specific.yaml:

warehouse: !include warehouse.yaml
specific:
  spec1:
    <<: *obj1
  spec2:
    <<: *obj1
    key1: 10

you would never get this to work with your snippet, even if you could preserve, extract and pass on the anchor information, because the composer handling specific.yaml will encounter an undefined alias much earlier than the tag !include gets used for construction (and for filling anchors).

What you can do to circumvent this problem is to include specific.yaml

specific:
  spec1:
    <<: *obj1
  spec2:
    <<: *obj1
    key1: 10

from warehouse.yaml:

warehouse:
  obj1: &obj1
    key1: 1
    key2: 2
specific: !include specific.yaml

Alternatively, include both in a third file. Please note that the key specific is in both files.

With those two files run:

import sys
from ruamel import yaml

def my_compose_document(self):
    self.get_event()
    node = self.compose_node(None, None)
    self.get_event()
    # self.anchors = {}    # <<<< commented out
    return node

yaml.SafeLoader.compose_document = my_compose_document

# adapted from http://code.activestate.com/recipes/577613-yaml-include-support/
def yaml_include(loader, node):
    with open(node.value) as inputfile:
        return list(my_safe_load(inputfile, master=loader).values())[0]
#              leave out the [0] if your include file drops the key ^^^

yaml.add_constructor("!include", yaml_include, Loader=yaml.SafeLoader)


def my_safe_load(stream, Loader=yaml.SafeLoader, master=None):
    loader = Loader(stream)
    if master is not None:
        loader.anchors = master.anchors
    try:
        return loader.get_single_data()
    finally:
        loader.dispose()

with open('warehouse.yaml') as fp:
    data = my_safe_load(fp)
yaml.safe_dump(data, sys.stdout, default_flow_style=False)

which gives:

specific:
  spec1:
    key1: 1
    key2: 2
  spec2:
    key1: 10
    key2: 2
warehouse:
  obj1:
    key1: 1
    key2: 2

If your specific.yaml would not have the top-level key specific:

spec1:
  <<: *obj1
spec2:
  <<: *obj1
  key1: 10

then replace the last line of yaml_include() with:

return my_safe_load(inputfile, master=loader)

The above was done with ruamel.yaml (disclaimer: I am the author of that package) and tested on Python 2.7 and 3.6. By changing the import it will work with PyYAML as well.

With the new ruamel.yaml API the above can be much simplified, because the loader handed to the yaml_include() constructor knows about the YAML instance, but of course you still need an adapted compose_document that doesn't destroy anchors. Assuming the specific.yaml without top-level key specific, the following gives the same output as before.

import sys
from pathlib import Path
from ruamel.yaml import YAML

yaml = YAML(typ='safe', pure=True)
yaml.default_flow_style = False


def my_compose_document(self):
    self.parser.get_event()
    node = self.compose_node(None, None)
    self.parser.get_event()
    # self.anchors = {}    # <<<< commented out
    return node

yaml.Composer.compose_document = my_compose_document

# adapted from http://code.activestate.com/recipes/577613-yaml-include-support/
def yaml_include(loader, node):
    y = loader.loader
    yaml = YAML(typ=y.typ, pure=y.pure)  # same values as including YAML
    yaml.composer.anchors = loader.composer.anchors
    return yaml.load(Path(node.value))

yaml.Constructor.add_constructor("!include", yaml_include)

data = yaml.load(Path('warehouse.yaml'))
yaml.dump(data, sys.stdout)