parse

This module gathers parsers that handle the whole input text.

find_first_pattern

class textops.find_first_pattern(patterns)

Fast multiple pattern search, returns on first match

It works like textops.find_patterns except that it stops searching on first match.

Parameters:patterns (list) – a list of patterns.
Returns:matched value if only one capture group otherwise the full groupdict
Return type:str or dict

Examples

>>> s = '''creation: 2015-10-14
... update: 2015-11-16
... access: 2015-11-17'''
>>> s | find_first_pattern([r'^update:\s*(.*)', r'^access:\s*(.*)', r'^creation:\s*(.*)'])
'2015-11-16'
>>> s | find_first_pattern([r'^UPDATE:\s*(.*)'])
NoAttr
>>> s | find_first_pattern([r'^update:\s*(?P<year>.*)-(?P<month>.*)-(?P<day>.*)'])
{'year': '2015', 'day': '16', 'month': '11'}

find_first_patterni

class textops.find_first_patterni(patterns)

Fast multiple pattern search, returns on first match

It works like textops.find_first_pattern except that patterns are case insensitive.

Parameters:patterns (list) – a list of patterns.
Returns:matched value if only one capture group otherwise the full groupdict
Return type:str or dict

Examples

>>> s = '''creation: 2015-10-14
... update: 2015-11-16
... access: 2015-11-17'''
>>> s | find_first_patterni([r'^UPDATE:\s*(.*)'])
'2015-11-16'

find_pattern

class textops.find_pattern(pattern)

Fast pattern search

This operation can be used to find a pattern very fast: it uses re.search() on the whole input text at once. The input text is not read line by line, which means it must fit into memory. It returns the first captured group (named or unnamed).

Parameters:pattern (str) – a regular expression string (case sensitive).
Returns:the first captured group or NoAttr if not found
Return type:str

Examples

>>> s = '''This is data text
... Version: 1.2.3
... Format: json'''
>>> s | find_pattern(r'^Version:\s*(.*)')
'1.2.3'
>>> s | find_pattern(r'^Format:\s*(?P<format>.*)')
'json'
>>> s | find_pattern(r'^version:\s*(.*)') # 'version' : no match because case sensitive
NoAttr

find_patterni

class textops.find_patterni(pattern)

Fast pattern search (case insensitive)

It works like textops.find_pattern except that the pattern is case insensitive.

Parameters:pattern (str) – a regular expression string (case insensitive).
Returns:the first captured group or NoAttr if not found
Return type:str

Examples

>>> s = '''This is data text
... Version: 1.2.3
... Format: json'''
>>> s | find_patterni(r'^version:\s*(.*)')     # case insensitive
'1.2.3'

find_patterns

class textops.find_patterns(patterns)

Fast multiple pattern search

It works like textops.find_pattern except that one can specify a list or a dictionary of patterns. Patterns must contain capture groups. It returns a list or a dictionary of results depending on the patterns argument type. Each result is the re.MatchObject groupdict if there is more than one named capture group in the pattern, otherwise it is directly the value of the unique captured group. It is recommended to use named capture groups; if not, the groups are automatically named 'groupN', with N the capture group order in the pattern.

Parameters:patterns (list or dict) – a list or a dictionary of patterns.
Returns:patterns search result
Return type:dict

Examples

>>> s = '''This is data text
... Version: 1.2.3
... Format: json'''
>>> r = s | find_patterns({
... 'version':r'^Version:\s*(?P<major>\d+)\.(?P<minor>\d+)\.(?P<build>\d+)',
... 'format':r'^Format:\s*(?P<format>.*)',
... })
>>> r
{'version': {'major': '1', 'build': '3', 'minor': '2'}, 'format': 'json'}
>>> r.version.major
'1'
>>> s | find_patterns({
... 'version':r'^Version:\s*(\d+)\.(\d+)\.(\d+)',
... 'format':r'^Format:\s*(.*)',
... })
{'version': {'group1': '2', 'group0': '1', 'group2': '3'}, 'format': 'json'}
>>> s | find_patterns({'version':r'^version:\s*(.*)'}) # lowercase 'version' : no match
{}
>>> s = '''creation: 2015-10-14
... update: 2015-11-16
... access: 2015-11-17'''
>>> s | find_patterns([r'^update:\s*(.*)', r'^access:\s*(.*)', r'^creation:\s*(.*)'])
['2015-11-16', '2015-11-17', '2015-10-14']
>>> s | find_patterns([r'^update:\s*(?P<year>.*)-(?P<month>.*)-(?P<day>.*)',
... r'^access:\s*(.*)', r'^creation:\s*(.*)'])
[{'month': '11', 'day': '16', 'year': '2015'}, '2015-11-17', '2015-10-14']

find_patternsi

class textops.find_patternsi(patterns)

Fast multiple pattern search (case insensitive)

It works like textops.find_patterns except that patterns are case insensitive.

Parameters:patterns (dict) – a dictionary of patterns.
Returns:patterns search result
Return type:dict

Examples

>>> s = '''This is data text
... Version: 1.2.3
... Format: json'''
>>> s | find_patternsi({'version':r'^version:\s*(.*)'})     # case insensitive
{'version': '1.2.3'}

index_normalize

textops.index_normalize(index_val)

Normalize a calculated dictionary key

When parsing, keys within a dictionary may come from the input text. To ensure they contain no spaces or other special characters, one should use this function. This is useful because DictExt dictionaries can be accessed with a dotted notation that only supports A-Za-z0-9_ chars.

Parameters:index_val (str) – The candidate string for a dictionary key.
Returns:A normalized string with only A-Za-z0-9_ chars
Return type:str

Examples

>>> index_normalize('this my key')
'this_my_key'
>>> index_normalize('this -my- %key%')
'this_my_key'
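
index_normalize can also be passed as the key_update callable of parsing operations such as parsekv (documented below). The sketch below is a hedged illustration, assuming key_update is applied to each calculated key; the expected output is inferred from the parsekv documentation rather than verified:

>>> s = '''name: Lapouyade
... first name: Eric'''
>>> s | parsekv(r'(?P<key>.*):\s*(?P<val>.*)',key_update=index_normalize)
{'name': {'val': 'Lapouyade', 'key': 'name'}, 'first_name': {'val': 'Eric', 'key': 'first name'}}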

mgrep

class textops.mgrep(patterns_dict, key=None)

Multiple grep

This works like textops.grep except that it can do several greps in a single command. This way, you can select many patterns in a big file.

Parameters:
  • patterns_dict (dict) – a dictionary where all patterns to search are in values.
  • key (int or str) – test only one column or one key (optional)
Returns:

A dictionary where the keys are the same as for patterns_dict, the values will contain the textops.grep result for each corresponding pattern.

Return type:

dict

Examples

>>> logs = '''
... error 1
... warning 1
... warning 2
... info 1
... error 2
... info 2
... '''
>>> t = logs | mgrep({
... 'errors' : r'^err',
... 'warnings' : r'^warn',
... 'infos' : r'^info',
... })
>>> print t                                         
{'infos': ['info 1', 'info 2'],
'errors': ['error 1', 'error 2'],
'warnings': ['warning 1', 'warning 2']}
>>> s = '''
... Disk states
... -----------
... name: c1t0d0s0
... state: good
... fs: /
... name: c1t0d0s4
... state: failed
... fs: /home
...
... '''
>>> t = s | mgrep({
... 'disks' : r'^name:',
... 'states' : r'^state:',
... 'fss' : r'^fs:',
... })
>>> print t                                         
{'states': ['state: good', 'state: failed'],
'disks': ['name: c1t0d0s0', 'name: c1t0d0s4'],
'fss': ['fs: /', 'fs: /home']}
>>> dict(zip(t.disks.cutre(': *',1),zip(t.states.cutre(': *',1),t.fss.cutre(': *',1))))
{'c1t0d0s0': ('good', '/'), 'c1t0d0s4': ('failed', '/home')}
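
The optional key argument is not demonstrated above. The sketch below is a hedged illustration, assuming it behaves like the key argument of textops.grep, i.e. each pattern is tested against a single column or dictionary key of already-structured input lines; the input data and expected output are illustrative assumptions, not a verified doctest:

>>> procs = [{'proc':'sshd','state':'running'},{'proc':'crond','state':'stopped'}]
>>> procs | mgrep({'running':'^run','stopped':'^stop'}, key='state')
{'running': [{'proc': 'sshd', 'state': 'running'}], 'stopped': [{'proc': 'crond', 'state': 'stopped'}]}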

mgrepi

class textops.mgrepi(patterns_dict, key=None)

Same as mgrep but case insensitive

This works like textops.mgrep, except it is case insensitive.

Parameters:
  • patterns_dict (dict) – a dictionary where all patterns to search are in values.
  • key (int or str) – test only one column or one key (optional)
Returns:

A dictionary where the keys are the same as for patterns_dict, the values will contain the textops.grepi result for each corresponding pattern.

Return type:

dict

Examples

>>> 'error 1' | mgrep({'errors':'ERROR'})
{}
>>> 'error 1' | mgrepi({'errors':'ERROR'})
{'errors': ['error 1']}

mgrepv

class textops.mgrepv(patterns_dict, key=None)

Same as mgrep but exclusive

This works like textops.mgrep, except it searches for lines that DO NOT match the patterns.

Parameters:
  • patterns_dict (dict) – a dictionary where all patterns to exclude are in values().
  • key (int or str) – test only one column or one key (optional)
Returns:

A dictionary where the keys are the same as for patterns_dict, the values will contain the textops.grepv result for each corresponding pattern.

Return type:

dict

Examples

>>> logs = '''error 1
... warning 1
... warning 2
... error 2
... '''
>>> t = logs | mgrepv({
... 'not_errors' : r'^err',
... 'not_warnings' : r'^warn',
... })
>>> print t                                         
{'not_warnings': ['error 1', 'error 2'], 'not_errors': ['warning 1', 'warning 2']}

mgrepvi

class textops.mgrepvi(patterns_dict, key=None)

Same as mgrepv but case insensitive

This works like textops.mgrepv, except it is case insensitive.

Parameters:
  • patterns_dict (dict) – a dictionary where all patterns to exclude are in values().
  • key (int or str) – test only one column or one key (optional)
Returns:

A dictionary where the keys are the same as for patterns_dict, the values will contain the textops.grepvi result for each corresponding pattern.

Return type:

dict

Examples

>>> logs = '''error 1
... WARNING 1
... warning 2
... ERROR 2
... '''
>>> t = logs | mgrepv({
... 'not_errors' : r'^err',
... 'not_warnings' : r'^warn',
... })
>>> print t                                         
{'not_warnings': ['error 1', 'WARNING 1', 'ERROR 2'],
'not_errors': ['WARNING 1', 'warning 2', 'ERROR 2']}
>>> t = logs | mgrepvi({
... 'not_errors' : r'^err',
... 'not_warnings' : r'^warn',
... })
>>> print t                                         
{'not_warnings': ['error 1', 'ERROR 2'], 'not_errors': ['WARNING 1', 'warning 2']}

parse_indented

class textops.parse_indented(sep=':')

Parse key:value indented text

It looks for key:value patterns and stores the found values in a dictionary. Each time a new indent is found, a sub-dictionary is created. The keys are normalized (only A-Za-z0-9_ chars are kept), the values are stripped.

Parameters:sep (str) – key:value separator (Default : ‘:’)
Returns:structured keys:values
Return type:dict

Examples

>>> s = '''
... a:val1
... b:
...     c:val3
...     d:
...         e ... : val5
...         f ... :val6
...     g:val7
... f: val8'''
>>> s | parse_indented()
{'a': 'val1', 'b': {'c': 'val3', 'd': {'e': 'val5', 'f': 'val6'}, 'g': 'val7'}, 'f': 'val8'}
>>> s = '''
... a --> val1
... b --> val2'''
>>> s | parse_indented(r'-->')
{'a': 'val1', 'b': 'val2'}

parseg

class textops.parseg(pattern)

Find all occurrences of one pattern, return MatchObject groupdict

Parameters:pattern (str) – a regular expression string (case sensitive)
Returns:A list of dictionaries (MatchObject groupdict)
Return type:list

Examples

>>> s = '''name: Lapouyade
... first name: Eric
... country: France'''
>>> s | parseg(r'(?P<key>.*):\s*(?P<val>.*)')         
[{'key': 'name', 'val': 'Lapouyade'},
{'key': 'first name', 'val': 'Eric'},
{'key': 'country', 'val': 'France'}]

parsegi

class textops.parsegi(pattern)

Same as parseg but case insensitive

Parameters:pattern (str) – a regular expression string (case insensitive)
Returns:A list of dictionaries (MatchObject groupdict)
Return type:list

Examples

>>> s = '''Error: System will reboot
... Notice: textops rocks
... Warning: Python must be used without moderation'''
>>> s | parsegi(r'(?P<level>error|warning):\s*(?P<msg>.*)')         
[{'msg': 'System will reboot', 'level': 'Error'},
{'msg': 'Python must be used without moderation', 'level': 'Warning'}]

parsek

class textops.parsek(pattern, key_name = 'key', key_update = None)

Find all occurrences of one pattern, return one Key

One has to give a pattern with named capturing parentheses; the function will return a list of values corresponding to the specified key. It works a little like textops.parseg except that it returns, from each groupdict, the value for a specified key ('key' by default)

Parameters:
  • pattern (str) – a regular expression string.
  • key_name (str) – The key to get (‘key’ by default)
  • key_update (callable) – function to convert the found value
Returns:

A list of values corresponding to MatchObject groupdict[key]

Return type:

list

Examples

>>> s = '''Error: System will reboot
... Notice: textops rocks
... Warning: Python must be used without moderation'''
>>> s | parsek(r'(?P<level>Error|Warning):\s*(?P<msg>.*)','msg')
['System will reboot', 'Python must be used without moderation']
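
The key_update argument is not demonstrated above. The sketch below (reusing s from the previous example) is a hedged illustration, assuming key_update is applied to each extracted value as the parameter description suggests; the expected output is inferred rather than verified:

>>> s | parsek(r'(?P<level>Error|Warning):\s*(?P<msg>.*)','msg',key_update=str.upper)
['SYSTEM WILL REBOOT', 'PYTHON MUST BE USED WITHOUT MODERATION']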

parseki

class textops.parseki(pattern, key_name = 'key', key_update = None)

Same as parsek but case insensitive

It works like textops.parsek except the pattern is case insensitive.

Parameters:
  • pattern (str) – a regular expression string.
  • key_name (str) – The key to get (‘key’ by default)
  • key_update (callable) – function to convert the found value
Returns:

A list of values corresponding to MatchObject groupdict[key]

Return type:

list

Examples

>>> s = '''Error: System will reboot
... Notice: textops rocks
... Warning: Python must be used without moderation'''
>>> s | parsek(r'(?P<level>error|warning):\s*(?P<msg>.*)','msg')
[]
>>> s | parseki(r'(?P<level>error|warning):\s*(?P<msg>.*)','msg')
['System will reboot', 'Python must be used without moderation']

parsekv

class textops.parsekv(pattern, key_name = 'key', key_update = None)

Find all occurrences of one pattern, returns a dict of groupdicts

It works a little like textops.parseg except that it returns a dict of dicts: the values are the MatchObject groupdicts, and the keys are taken from each groupdict at a specified key ('key' by default). Note that calculated keys are normalized (spaces are replaced by underscores)

Parameters:
  • pattern (str) – a regular expression string.
  • key_name (str) – The key name to obtain the value that will be used as the key for the groupdict ('key' by default)
  • key_update (callable) – function to convert/normalize the calculated key
Returns:

A dict of MatchObject groupdicts

Return type:

dict

Examples

>>> s = '''name: Lapouyade
... first name: Eric
... country: France'''
>>> s | parsekv(r'(?P<key>.*):\s*(?P<val>.*)')         
{'country': {'val': 'France', 'key': 'country'},
'first_name': {'val': 'Eric', 'key': 'first name'},
'name': {'val': 'Lapouyade', 'key': 'name'}}
>>> s | parsekv(r'(?P<item>.*):\s*(?P<val>.*)','item',str.upper)         
{'FIRST NAME': {'item': 'first name', 'val': 'Eric'},
'NAME': {'item': 'name', 'val': 'Lapouyade'},
'COUNTRY': {'item': 'country', 'val': 'France'}}

parsekvi

class textops.parsekvi(pattern, key_name = 'key', key_update = None)

Find all occurrences of one pattern (case insensitive), returns a dict of groupdicts

It works a little like textops.parsekv except that the pattern is case insensitive.

Parameters:
  • pattern (str) – a regular expression string (case insensitive).
  • key_name (str) – The key name to obtain the value that will be used as the key for the groupdict ('key' by default)
  • key_update (callable) – function to convert/normalize the calculated key
Returns:

A dict of MatchObject groupdicts

Return type:

dict

Examples

>>> s = '''name: Lapouyade
... first name: Eric
... country: France'''
>>> s | parsekvi(r'(?P<key>NAME):\s*(?P<val>.*)')
{'name': {'val': 'Lapouyade', 'key': 'name'}}

state_pattern

class textops.state_pattern(states_patterns_desc, reflags=0, autostrip=True)

States and patterns parser

This is a state-machine parser: the main advantage is that it reads the whole input text line by line only once to collect all the data you want into a multi-level dictionary. It uses patterns to select the rules to be applied, and states to ensure only a subset of rules is used against specific document sections.

Parameters:
  • states_patterns_desc (tuple) – description of states and patterns: see below for explanation
  • reflags – re flags, e.g. re.I or re.M or re.I | re.M (Default : no flag)
  • autostrip – before being stored, groupdict keys and values are stripped (Default : True)
Returns:

parsed data from text

Return type:

dict


The states_patterns_desc :

It looks like this:

((<if state1>,<goto state1>,<pattern1>,<out data path1>,<out filter1>),
...
(<if stateN>,<goto stateN>,<patternN>,<out data pathN>,<out filterN>))
<if state>
is a string telling in what state(s) the pattern must be searched; one can specify several states with a comma-separated string or a tuple (a hedged sketch of this form follows the examples at the end of this section). If <if state> is empty, the pattern will be searched on all lines. Note: at the beginning, the state is 'top'
<goto state>
is a string corresponding to the new state if the pattern matches. Use an empty string to keep the current state unchanged. One can use any string; usually it corresponds to the name of a specific section of the document to parse where specific rules have to be used.
<pattern>
is a string or a compiled regular expression to match a line of text. One should use named groups for selecting data, e.g. (?P<key1>pattern)
<out data path>

is a dot-separated string or a tuple telling where to place the groupdict from the pattern matching process. The syntax is:

'{contextkey1}.{contextkey2}. ... .{contextkeyN}'
or
('{contextkey1}','{contextkey2}', ... ,'{contextkeyN}')
or
'key1.key2.keyN'
or
'key1.key2.keyN[]'
or
'{contextkey1}.{contextkey2}. ... .keyN[]'

The context dict is used to format strings with the {contextkeyN} syntax. Instead of {contextkeyN}, one can use a simple string to put data at a fixed path. Once the path is fully formatted, let's say to key1.key2.keyN, the parser will store the value into the result dictionary at {'key1':{'key2':{'keyN' : thevalue }}}. One can append the string [] to the path: the value will then be appended to a list, i.e. {'key1':{'key2':{'keyN' : [thevalue,...] }}}

<out filter>

is used to build the value to store.

It can be one of:

  • None : no filter is applied, the re.MatchObject.groupdict() is stored
  • a string : used as a format string with the context dict, the formatted string is stored
  • a callable : to calculate the value to be stored, the context dict is given as parameter (a hedged sketch follows the examples below)

How the parser works :

You have a document where the syntax may change from one section to another: you just have to give a name to these kinds of sections; these will be your state names. The parser reads the input text line by line: for each line, it looks for the first matching rule in the states_patterns_desc table, then applies the rule. A rule has two parts: the matching parameters and the action parameters.

Matching parameters:
To match, a rule requires the parser to be in the specified state <if state> AND the line being parsed to match the pattern <pattern>. When the parser is at the first line, it has the default state top. The pattern follows the standard Python re module syntax. It is important to note that you must capture the text you want to collect with the named group syntax, that is (?P<mydata>mypattern). This way, the parser will store the text matching mypattern into a context dict at the key mydata.
Action parameters:
Once the rule matches, the action is to store <out filter> into the final dictionary at a specified <out data path>.

Context dict :

The context dict is used within <out filter> and <out data path>; it is a dictionary that is PERSISTENT during the whole parsing process: it is empty at the beginning of the parsing and accumulates all captured groups. For example, if a first rule pattern contains (?P<key1>.*),(?P<key2>.*) and matches the document line val1,val2, the context dict will be { 'key1' : 'val1', 'key2' : 'val2' }. Then, if a second rule pattern contains (?P<key2>.*):(?P<key3>.*) and matches the document line val4:val5, the context dict will be UPDATED to { 'key1' : 'val1', 'key2' : 'val4', 'key3' : 'val5' }. As you can see, the choice of the key names is VERY IMPORTANT in order to avoid collisions across all the rules.

Examples

>>> s = '''
... first name: Eric
... last name: Lapouyade'''
>>> s | state_pattern( (('',None,'(?P<key>.*):(?P<val>.*)','{key}','{val}'),) )
{'first_name': 'Eric', 'last_name': 'Lapouyade'}
>>> s | state_pattern( (('',None,'(?P<key>.*):(?P<val>.*)','{key}',None),) ) 
{'first_name': {'val': 'Eric', 'key': 'first name'},
'last_name': {'val': 'Lapouyade', 'key': 'last name'}}
>>> s | state_pattern((('',None,'(?P<key>.*):(?P<val>.*)','my.path.{key}','{val}'),))
{'my': {'path': {'first_name': 'Eric', 'last_name': 'Lapouyade'}}}
>>> s = '''Eric
... Guido'''
>>> s | state_pattern( (('',None,'(?P<val>.*)','my.path.info[]','{val}'),) )
{'my': {'path': {'info': ['Eric', 'Guido']}}}
>>> s = '''
... Section 1
... ---------
...   email = ericdupo@gmail.com
...
... Section 2
... ---------
...   first name: Eric
...   last name: Dupont'''
>>> s | state_pattern( (                                    
... ('','section1','^Section 1',None,None),
... ('','section2','^Section 2',None,None),
... ('section1', '', '(?P<key>.*)=(?P<val>.*)', 'section1.{key}', '{val}'),
... ('section2', '', '(?P<key>.*):(?P<val>.*)', 'section2.{key}', '{val}')) )
{'section2': {'first_name': 'Eric', 'last_name': 'Dupont'},
'section1': {'email': 'ericdupo@gmail.com'}}
>>> s = '''
... Disk states
... -----------
... name: c1t0d0s0
... state: good
... fs: /
... name: c1t0d0s4
... state: failed
... fs: /home
...
... '''
>>> s | state_pattern( (                                    
... ('top','disk',r'^Disk states',None,None),
... ('disk','top', r'^\s*$',None,None),
... ('disk', '', r'^name:(?P<diskname>.*)',None, None),
... ('disk', '', r'(?P<key>.*):(?P<val>.*)', 'disks.{diskname}.{key}', '{val}')) )
{'disks': {'c1t0d0s0': {'state': 'good', 'fs': '/'},
'c1t0d0s4': {'state': 'failed', 'fs': '/home'}}}
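
The comma-separated <if state> form and the callable <out filter> form described above are not shown in the examples; the two sketches below are hedged illustrations whose expected outputs are inferred from the descriptions above, not verified doctests.

A single rule shared by two states, assuming both sections use the same key:value syntax:

>>> s = '''
... Section 1
... ---------
... owner: Eric
...
... Section 2
... ---------
... country: France'''
>>> s | state_pattern( (
... ('','section1','^Section 1',None,None),
... ('','section2','^Section 2',None,None),
... ('section1,section2', '', '(?P<key>.*):(?P<val>.*)', '{key}', '{val}')) )
{'owner': 'Eric', 'country': 'France'}

A callable <out filter>, assuming it receives the context dict and that its return value is stored as-is:

>>> s = '''
... first name: Eric
... last name: Lapouyade'''
>>> s | state_pattern( (('',None,'(?P<key>.*):(?P<val>.*)','{key}',lambda ctx: ctx['val'].upper()),) )
{'first_name': 'ERIC', 'last_name': 'LAPOUYADE'}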