How to Dump Strings in YAML Using Literal Scalar Style

How to dump strings in YAML using literal scalar style?


require 'psych'

# Construct an AST
visitor = Psych::Visitors::YAMLTree.new({})
visitor << DATA.read
ast = visitor.tree

# Find all scalars and modify their formatting
ast.grep(Psych::Nodes::Scalar).each do |node|
  node.plain  = false
  node.quoted = true
  node.style  = Psych::Nodes::Scalar::LITERAL
end

begin
  # Call the `yaml` method on the ast to convert to yaml
  puts ast.yaml
rescue
  # The `yaml` method was introduced in later versions, so fall back to
  # constructing a visitor
  Psych::Visitors::Emitter.new($stdout).accept ast
end

__END__
{
  "page": 1,
  "results": [
    "item", "another"
  ],
  "total_pages": 0
}

Any yaml libraries in Python that support dumping of long strings as block literals or folded blocks?


import yaml

class folded_unicode(unicode): pass
class literal_unicode(unicode): pass

def folded_unicode_representer(dumper, data):
    return dumper.represent_scalar(u'tag:yaml.org,2002:str', data, style='>')
def literal_unicode_representer(dumper, data):
    return dumper.represent_scalar(u'tag:yaml.org,2002:str', data, style='|')

yaml.add_representer(folded_unicode, folded_unicode_representer)
yaml.add_representer(literal_unicode, literal_unicode_representer)

data = {
    'literal':literal_unicode(
        u'by hjw ___\n'
        ' __ /.-.\\\n'
        ' / )_____________\\\\ Y\n'
        ' /_ /=== == === === =\\ _\\_\n'
        '( /)=== == === === == Y \\\n'
        ' `-------------------( o )\n'
        ' \\___/\n'),
    'folded': folded_unicode(
        u'It removes all ordinary curses from all equipped items. '
        'Heavy or permanent curses are unaffected.\n')}

print yaml.dump(data)

The result:

folded: >
  It removes all ordinary curses from all equipped items. Heavy or permanent curses
  are unaffected.
literal: |
  by hjw ___
   __ /.-.\
   / )_____________\\ Y
   /_ /=== == === === =\ _\_
  ( /)=== == === === == Y \
   `-------------------( o )
   \___/

This is Python 2 code. For completeness, one should also have str implementations, but I'm going to be lazy :-)
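
For Python 3, where `unicode` no longer exists, the same approach can be sketched as follows. This is a minimal sketch that assumes PyYAML is installed; the class names `FoldedStr` and `LiteralStr` and the shortened sample data are illustrative and not part of the original answer.

```python
import yaml


class FoldedStr(str):
    """Marker type: dump as a folded block scalar ('>')."""


class LiteralStr(str):
    """Marker type: dump as a literal block scalar ('|')."""


def folded_representer(dumper, data):
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='>')


def literal_representer(dumper, data):
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')


# Registers the representers on the default yaml.Dumper used by yaml.dump().
yaml.add_representer(FoldedStr, folded_representer)
yaml.add_representer(LiteralStr, literal_representer)

data = {
    'literal': LiteralStr('line one\nline two\n'),
    'folded': FoldedStr(
        'It removes all ordinary curses from all equipped items. '
        'Heavy or permanent curses are unaffected.\n'),
}

output = yaml.dump(data)
print(output)
```

Plain `str` values in the same structure are unaffected, because the representers are registered only for the two marker subclasses.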

How can I control what scalar form PyYAML uses for my data?

Based on the answer to "Any yaml libraries in Python that support dumping of long strings as block literals or folded blocks?":

import yaml
from collections import OrderedDict

class quoted(str):
    pass

def quoted_presenter(dumper, data):
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='"')
yaml.add_representer(quoted, quoted_presenter)

class literal(str):
    pass

def literal_presenter(dumper, data):
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
yaml.add_representer(literal, literal_presenter)

def ordered_dict_presenter(dumper, data):
    return dumper.represent_dict(data.items())
yaml.add_representer(OrderedDict, ordered_dict_presenter)

d = OrderedDict(short=quoted("Hello"), long=literal("Line1\nLine2\nLine3\n"))

print(yaml.dump(d))

Output

short: "Hello"
long: |
  Line1
  Line2
  Line3

How to format a string in YAML dump?

First of all, what you present as your desired output
is not a representation of the data that you provide. Since
the multi-line string in that data starts with a newline, the block
style literal scalar for it requires a block indentation indicator and a newline at the start:

address_pattern_template: |2

  ^ #the beginning of the address string (e.g. interface number)
  .
  .
  .

But it doesn't make sense (to me at least) to have these patterns
start with a newline, so I'll leave that out in the following.
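
The effect of a leading newline can be checked with a small experiment. The sketch below uses PyYAML rather than ruamel.yaml purely for illustration (that choice is an assumption, as are the `Literal` helper class and the shortened pattern). It dumps a string that starts with a newline as a literal block scalar and loads it back.

```python
import yaml


class Literal(str):
    """Hypothetical marker type: dump as a literal block scalar."""


def literal_representer(dumper, data):
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')


yaml.add_representer(Literal, literal_representer)

# The value starts with a newline, so the first content line is empty and
# the emitter has to state the indentation explicitly after the '|'.
value = Literal('\n^ #the beginning\n')
output = yaml.dump({'address_pattern_template': value})
print(output)

# Loading the dumped text restores the leading newline.
restored = yaml.safe_load(output)['address_pattern_template']
print(restored == value)
```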


If you don't know where the multi-line strings are in your data structure, but you can
convert it in-place before dumping, then you can use ruamel.yaml.scalarstring:walk_tree

import sys
import ruamel.yaml

data = dict(a=[1, 2, 3, dict(
    address_pattern_template="""\
^ #the beginning of the address string (e.g. interface number)
(?P<junkbefore> #capturing the junk before the address
\D? #an optional non-digit character
.*? #any characters (non-greedy) up to the address
)
(?P<address> #capturing the pure address
{pure_address_pattern}
)
(?P<junkafter> #capturing the junk after the address
\D? #an optional non-digit character
.* #any characters (greedy) up to the end of the string
)
$ #the end of the input address string
"""
)])


yaml = ruamel.yaml.YAML()
ruamel.yaml.scalarstring.walk_tree(data)
yaml.dump(data, sys.stdout)

which gives:

a:
- 1
- 2
- 3
- address_pattern_template: |
    ^ #the beginning of the address string (e.g. interface number)
    (?P<junkbefore> #capturing the junk before the address
    \D? #an optional non-digit character
    .*? #any characters (non-greedy) up to the address
    )
    (?P<address> #capturing the pure address
    {pure_address_pattern}
    )
    (?P<junkafter> #capturing the junk after the address
    \D? #an optional non-digit character
    .* #any characters (greedy) up to the end of the string
    )
    $ #the end of the input address string

walk_tree will replace the multi-line string with a
LiteralScalarString, which behaves for most purposes like a normal
string.
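
To make the in-place replacement concrete, here is a rough pure-Python sketch of what such a tree walk does. It is not ruamel.yaml's implementation: `MultiLine` stands in for `LiteralScalarString`, and `mark_multiline` is a hypothetical name.

```python
class MultiLine(str):
    """Stand-in for a string type that a dumper would emit in literal style."""


def mark_multiline(node):
    """Recursively replace multi-line strings in dicts and lists, in place."""
    if isinstance(node, dict):
        items = list(node.items())
    elif isinstance(node, list):
        items = list(enumerate(node))
    else:
        return  # scalars and other types have nothing to walk into
    for key, value in items:
        if isinstance(value, str) and '\n' in value:
            node[key] = MultiLine(value)
        else:
            mark_multiline(value)


data = dict(a=[1, 2, 3, dict(pattern='^\n.*\n$\n', name='short')])
mark_multiline(data)
print(type(data['a'][3]['pattern']).__name__)  # MultiLine
print(type(data['a'][3]['name']).__name__)     # str
```

Because the marker type subclasses `str`, code that only reads the values keeps working after the walk.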

If that in-place transform is not acceptable, you can do a deepcopy of
data first and then apply walk_tree on the copy. If that is not acceptable
because of memory constraints, then you have to provide an alternative representer for strings
that checks during representation whether you have a multi-line string. Preferably you do that
in a subclass of the Representer:

import sys
import ruamel.yaml

# data defined as before

class MyRepresenter(ruamel.yaml.representer.RoundTripRepresenter):
    def represent_str(self, data):
        style = '|' if '\n' in data else None
        return self.represent_scalar(u'tag:yaml.org,2002:str', data, style=style)


MyRepresenter.add_representer(str, MyRepresenter.represent_str)

yaml = ruamel.yaml.YAML()
yaml.Representer = MyRepresenter
yaml.dump(data, sys.stdout)

which gives the same output as the previous example.

yaml.dump adding unwanted newlines in multiline strings

If that is the only thing going into your YAML file then you can dump with the option default_style='|' which gives you block style literal for all of your scalars (probably not what you want).

Your string contains no special characters (ones that need \ escaping and therefore double quotes), but because of the newlines PyYAML decides to represent it single quoted. In single quoted style, a double newline is the way to represent a single newline that occurred in the string being represented. This gets "undone" on loading, but is indeed not very readable.
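
Both behaviours are easy to reproduce in a small experiment. This is a sketch that assumes PyYAML; the sample data is made up.

```python
import yaml

data = {'c': 'line1\nline2\n'}

# Default: the multi-line string is emitted single quoted, and each newline
# in the value shows up as a blank line between the quotes.
quoted = yaml.dump(data)
print(quoted)

# default_style='|' requests literal block style for every scalar.
block = yaml.dump(data, default_style='|')
print(block)

# Either way, loading gives back the original data.
print(yaml.safe_load(quoted) == data)
print(yaml.safe_load(block) == data)
```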

If you want to get the block style literals on an individual basis, you can do multiple things:

  • adapt the Representer to output all strings with embedded newlines using the literal scalar block style (assuming they don't need \ escaping of special characters, which will force double quotes)

    import sys
    import yaml

    x = u"""\
    -----BEGIN RSA PRIVATE KEY-----
    MIIEogIBAAKCAQEA6oySC+8/N9VNpk0gJS7Gk8vn9sYN7FhjpAQnoHRqTN/Oaiyx
    xk2AleP2vXpojA/DHldT1JO+o3j56AHD+yfNFFeYvgWKDY35g49HsZZhbyCEAB45
    ...
    """

    yaml.SafeDumper.org_represent_str = yaml.SafeDumper.represent_str

    def repr_str(dumper, data):
        if '\n' in data:
            return dumper.represent_scalar(u'tag:yaml.org,2002:str', data, style='|')
        return dumper.org_represent_str(data)

    yaml.add_representer(str, repr_str, Dumper=yaml.SafeDumper)

    yaml.safe_dump(dict(a=1, b='hello world', c=x), sys.stdout)
  • make a subclass of string that has its own special representer:

    import sys
    import yaml

    class PSS(str):
        pass

    x = PSS("""\
    -----BEGIN RSA PRIVATE KEY-----
    MIIEogIBAAKCAQEA6oySC+8/N9VNpk0gJS7Gk8vn9sYN7FhjpAQnoHRqTN/Oaiyx
    xk2AleP2vXpojA/DHldT1JO+o3j56AHD+yfNFFeYvgWKDY35g49HsZZhbyCEAB45
    ...
    """)

    def pss_representer(dumper, data):
        style = '|'
        # if sys.version_info < (3,) and not isinstance(data, unicode):
        #     data = unicode(data, 'ascii')
        tag = u'tag:yaml.org,2002:str'
        return dumper.represent_scalar(tag, data, style=style)

    yaml.add_representer(PSS, pss_representer, Dumper=yaml.SafeDumper)

    yaml.safe_dump(dict(a=1, b='hello world', c=x), sys.stdout)
  • use ruamel.yaml:

    import sys
    from ruamel.yaml import YAML
    from ruamel.yaml.scalarstring import PreservedScalarString as pss

    x = pss("""\
    -----BEGIN RSA PRIVATE KEY-----
    MIIEogIBAAKCAQEA6oySC+8/N9VNpk0gJS7Gk8vn9sYN7FhjpAQnoHRqTN/Oaiyx
    xk2AleP2vXpojA/DHldT1JO+o3j56AHD+yfNFFeYvgWKDY35g49HsZZhbyCEAB45
    ...
    """)

    yaml = YAML()

    yaml.dump(dict(a=1, b='hello world', c=x), sys.stdout)

All of these give:

a: 1
b: hello world
c: |
  -----BEGIN RSA PRIVATE KEY-----
  MIIEogIBAAKCAQEA6oySC+8/N9VNpk0gJS7Gk8vn9sYN7FhjpAQnoHRqTN/Oaiyx
  xk2AleP2vXpojA/DHldT1JO+o3j56AHD+yfNFFeYvgWKDY35g49HsZZhbyCEAB45
  ...

Please note that it is not necessary to specify default_flow_style=False as the literal scalars can only appear in block style.

Change the scalar style used for all multi-line strings when serialising a dynamic model using YamlDotNet

To answer my own question, I've now worked out how to do this by deriving from the ChainedEventEmitter class and overriding void Emit(ScalarEventInfo eventInfo, IEmitter emitter). See code sample below.

public class MultilineScalarFlowStyleEmitter : ChainedEventEmitter
{
    public MultilineScalarFlowStyleEmitter(IEventEmitter nextEmitter)
        : base(nextEmitter) { }

    public override void Emit(ScalarEventInfo eventInfo, IEmitter emitter)
    {
        if (typeof(string).IsAssignableFrom(eventInfo.Source.Type))
        {
            string value = eventInfo.Source.Value as string;
            if (!string.IsNullOrEmpty(value))
            {
                bool isMultiLine = value.IndexOfAny(new char[] { '\r', '\n', '\x85', '\x2028', '\x2029' }) >= 0;
                if (isMultiLine)
                    eventInfo = new ScalarEventInfo(eventInfo.Source)
                    {
                        Style = ScalarStyle.Literal
                    };
            }
        }

        nextEmitter.Emit(eventInfo, emitter);
    }
}

Can I control the formatting of multiline strings?

If you load, then dump, your expected output, you'll see that ruamel.yaml can actually
preserve the block style literal scalar.

import sys
import ruamel.yaml

yaml_str = """\
hello.py: |
  import sys
  sys.stdout.write("hello world")
"""

yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_str)
yaml.dump(data, sys.stdout)

as this gives again the loaded input:

hello.py: |
  import sys
  sys.stdout.write("hello world")

To find out how it does that you should inspect the type of your multi-line string:

print(type(data['hello.py']))

which prints:

<class 'ruamel.yaml.scalarstring.LiteralScalarString'>

and that should point you in the right direction:

from ruamel.yaml import YAML
from ruamel.yaml.scalarstring import LiteralScalarString
import sys, textwrap

def LS(s):
    # dedent so the string can be indented along with the surrounding code
    return LiteralScalarString(textwrap.dedent(s))


yaml = YAML()
yaml.dump({
    'hello.py': LS("""\
        import sys
        sys.stdout.write("hello world")
    """)
}, sys.stdout)

which also outputs what you want:

hello.py: |
  import sys
  sys.stdout.write("hello world")

Convert YAML multi-line values to folded block scalar style?

The class ScalarString is a base class for LiteralScalarString; it has no representer, as you found out. You should just make/keep this a Python string, as that deals with special characters appropriately (quoting strings that need to be quoted to conform to the YAML specification).

Assuming you have input like this:

- 1
- abc: |
    this is a short string scalar with a newline
    in it
- "there are also a multiline\nsequence element\nin this file\nand it is longer"

You probably want to do something like:

from ruamel.yaml import YAML
from ruamel.yaml.scalarstring import LiteralScalarString, preserve_literal


def walk_tree(base):
    from ruamel.yaml.compat import string_types

    def test_wrap(v):
        # normalize line endings; keep short strings as plain Python strings
        v = v.replace('\r\n', '\n').replace('\r', '\n').strip()
        return v if len(v) < 72 else preserve_literal(v)

    if isinstance(base, dict):
        for k in base:
            v = base[k]
            if isinstance(v, string_types) and '\n' in v:
                base[k] = test_wrap(v)
            else:
                walk_tree(v)
    elif isinstance(base, list):
        for idx, elem in enumerate(base):
            if isinstance(elem, string_types) and '\n' in elem:
                base[idx] = test_wrap(elem)
            else:
                walk_tree(elem)

yaml = YAML()

with open("input.yaml", "r") as fi:
    data = yaml.load(fi)

walk_tree(data)

with open("output.yaml", "w") as fo:
    yaml.dump(data, fo)

to get output:

- 1
- abc: "this is a short string scalar with a newline\nin it"
- |-
  there are also a multiline
  sequence element
  in this file
  and it is longer

Some notes:

  • Use of LiteralScalarString is preferred over PreservedScalarString. The latter name is a remnant from the time it was the only preserved string type.
  • you probably had no sequence elements that were strings, as you did not import preserve_literal, although it was still used in the copied code.
  • I factored out the "wrapping" code into test_wrap, used by both value and element wrapping, the max line length for that was set at 72 characters.
  • the value data[1]['abc'] loads as LiteralScalarString. If you want to preserve existing literal style string scalars, you should test for those before testing on type string_types.
  • I used the new API with an instance of YAML()
  • You might have to set the width attribute to something like 1000, to prevent automatic line wrapping, if you increase 72 in the example to above the default of 80. (yaml.width = 1000)
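
The note about preserving existing literal scalars comes down to the order of the type checks. Below is a pure-Python sketch of that ordering. `LiteralStr` is a stand-in for ruamel.yaml's `LiteralScalarString` (which is a `str` subclass), and `wrap_if_long` is a hypothetical variant of `test_wrap`, not code from the answer.

```python
class LiteralStr(str):
    """Stand-in for LiteralScalarString, which is also a str subclass."""


def wrap_if_long(value, max_len=72):
    """Keep existing literal scalars; wrap other strings only when long."""
    if isinstance(value, LiteralStr):
        # This check has to come first: a LiteralStr is also a str, so the
        # generic string handling below would otherwise process it again.
        return value
    value = value.replace('\r\n', '\n').replace('\r', '\n').strip()
    return value if len(value) < max_len else LiteralStr(value)


short = wrap_if_long('a short string\nwith a newline')
kept = wrap_if_long(LiteralStr('already literal\n'))
long_one = wrap_if_long('x' * 40 + '\n' + 'y' * 40)

print(type(short).__name__)     # str
print(type(kept).__name__)      # LiteralStr
print(type(long_one).__name__)  # LiteralStr
```

With this order, a short value that was loaded as a literal scalar keeps its style and its trailing newline instead of being stripped and turned back into a plain string.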

