Python Parsing Bracketed Blocks

Python parsing bracketed blocks

Pseudocode:

For each string in the array:
Find the first '{'. If there is none, leave that string alone.
Init a counter to 0.
For each character in the string:
If you see a '{', increment the counter.
If you see a '}', decrement the counter.
If the counter reaches 0, break.
Here, if your counter is not 0, you have invalid input (unbalanced brackets)
If it is, then take the string from the first '{' up to the '}' that put the
counter at 0, and that is a new element in your array.

parsing file with curley brakets

Recursivity is the key here. Try something around that:

def parse(it):
result = []
while True:
try:
tk = next(it)
except StopIteration:
break

if tk == '}':
break
val = next(it)
if val == '{':
result.append((tk,parse(it)))
else:
result.append((tk, val))

return result

The use case:

import pprint       

data = """
Continent
{
Name Europe
Country
{
Name UK
Dog
{
Name Fiffi
Colour Gray
}
Dog
{
Name Smut
Colour Black
}
}
}
"""

r = parse(iter(data.split()))
pprint.pprint(r)

... which produce (Python 2.6):

[('Continent',
[('Name', 'Europe'),
('Country',
[('Name', 'UK'),
('Dog', [('Name', 'Fiffi'), ('Colour', 'Gray')]),
('Dog', [('Name', 'Smut'), ('Colour', 'Black')])])])]

Please take this as only starting point, and feel free to improve the code as you need (depending on your data, a dictionary could have been a better choice, maybe). In addition, the sample code does not handle properly ill formed data (notably extra or missing } -- I urge you to do a full test coverage ;)


EDIT: Discovering pyparsing, I tried the following which appears to work (much) better and could be (more) easily tailored for special needs:

import pprint
from pyparsing import Word, Literal, Forward, Group, ZeroOrMore, alphas

def syntax():
lbr = Literal( '{' ).suppress()
rbr = Literal( '}' ).suppress()
key = Word( alphas )
atom = Word ( alphas )
expr = Forward()
pair = atom | (lbr + ZeroOrMore( expr ) + rbr)
expr << Group ( key + pair )

return expr

expr = syntax()
result = expr.parseString(data).asList()
pprint.pprint(result)

Producing:

[['Continent',
['Name', 'Europe'],
['Country',
['Name', 'UK'],
['Dog', ['Name', 'Fiffi'], ['Colour', 'Gray']],
['Dog', ['Name', 'Smut'], ['Colour', 'Black']]]]]

Best way to parse this sequence?

Goal: Turn it into a dict. And then create your output from the dictionary.

>>> string = "EXAMPLE{TEST;ANOTHER{PART1;PART2};UNLIMITED{POSSIBILITIES{LIKE;THIS}}}"
>>> string = string.replace(";", ",").replace("{", ": {")
>>> string
'EXAMPLE: {TEST,ANOTHER: {PART1,PART2},UNLIMITED: {POSSIBILITIES: {LIKE,THIS}}}'

EXAMPLE, TEST, ANOTHER are strings but they aren't wrapped in quotation marks "" or ''.

Using RegEx to solve this problem:

>>> import re
>>> string = re.sub(r"(\w+)", r"'\1'", string)
>>> string
"'EXAMPLE': {'TEST','ANOTHER': {'PART1','PART2'},'UNLIMITED': {'POSSIBILITIES': {'LIKE','THIS'}}}"

This is still not a valid file format. It's not JSON. It's not a dict. It's a mixture of a dict and a set in Python.

Making it look more like a dictionary:

>>> string = re.sub(r"'(\w+)',", r"'\1': None, ", string)
>>> string = re.sub(r"'(\w+)'}", r"'\1': None}", string)
>>> string
"'EXAMPLE': {'TEST': None, 'ANOTHER': {'PART1': None, 'PART2': None},'UNLIMITED': {'POSSIBILITIES': {'LIKE': None, 'THIS': None}}}"

Now turning it into a Python object:

>>> my_dict = eval('{' + string + '}')
>>> my_dict
{'EXAMPLE': {'TEST': None, 'ANOTHER': {'PART1': None, 'PART2': None}, 'UNLIMITED': {'POSSIBILITIES': {'LIKE': None, 'THIS': None}}}}

Now you have a regular Python object which you can iterate through and do string manipulation. You can write a recursive function that concatenates your string:

>>> def create_output(dict_element, result):
... if dict_element == None:
... print(result)
... return
... for key, value in dict_element.items():
... create_output(value, result + key)
...
>>> create_output(my_dict, "")
EXAMPLETEST
EXAMPLEANOTHERPART1
EXAMPLEANOTHERPART2
EXAMPLEUNLIMITEDPOSSIBILITIESLIKE
EXAMPLEUNLIMITEDPOSSIBILITIESTHIS


Related Topics



Leave a reply



Submit