
Python exceptions and regular expressions
http://docs.python.org/library/re.html
http://docs.python.org/howto/regex.html#regex-howto

Example 6.1. Opening a non-existent file
>>> fsock = open("/notthere", "r")     
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
IOError: [Errno 2] No such file or directory: '/notthere'
>>> try:
...     fsock = open("/notthere")      
... except IOError:                    
...     print "The file does not exist, exiting gracefully"
...
The file does not exist, exiting gracefully
>>> print "This line will always print"
This line will always print
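
In current Python the same idea might look like the sketch below. It assumes Python 3, where IOError is an alias of OSError; read_or_default is a hypothetical helper name that does not appear in the original notes.

```python
def read_or_default(path, default=""):
    """Return the file's text, or `default` if the file cannot be opened."""
    try:
        fsock = open(path, "r")
    except OSError:  # IOError is an alias of OSError in Python 3
        # Opening failed (missing file, no permission, ...): degrade gracefully.
        return default
    else:
        # Runs only when open() succeeded.
        try:
            return fsock.read()
        finally:
            fsock.close()  # always release the file handle
```

The else clause keeps the "happy path" out of the try block, so only the open() call itself is guarded by the except clause.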


# Bind the name getpass to the appropriate function
try:
    import termios, TERMIOS
except ImportError:
    try:
        import msvcrt
    except ImportError:
        try:
            from EasyDialogs import AskPassword
        except ImportError:
            getpass = default_getpass
        else:
            getpass = AskPassword
    else:
        getpass = win_getpass
else:
    getpass = unix_getpass
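
When the fallback chain only needs "the first module that imports", the nested try/except/else blocks can be flattened into a loop. This is a minimal sketch under that assumption, not the structure the snippet above uses; first_available and the module name no_such_module_xyz are made up for illustration.

```python
import importlib

def first_available(module_names):
    """Return the first module in `module_names` that imports, else None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not available on this platform; try the next one
    return None

# "no_such_module_xyz" is a made-up name that should fail to import.
backend = first_available(["no_such_module_xyz", "json"])
```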

 

Example 6.10. Iterating through a dictionary
>>> import os
>>> for k, v in os.environ.items():      
...     print "%s=%s" % (k, v)
USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM
USERNAME=mpilgrim

[...snip...]
>>> print "\n".join(["%s=%s" % (k, v)
...     for k, v in os.environ.items()])
USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM

 

Example 6.13. Using sys.modules
>>> import sys
>>> import fileinfo
>>> print '\n'.join(sys.modules.keys())
win32api
os.path
os
fileinfo
exceptions

>>> fileinfo
<module 'fileinfo' from 'fileinfo.pyc'>
>>> sys.modules["fileinfo"]
<module 'fileinfo' from 'fileinfo.pyc'>


The next example shows how to combine the __module__ class attribute with the sys.modules dictionary to get a reference to the module in which a class is defined.

Example 6.14. The __module__ class attribute
>>> from fileinfo import MP3FileInfo
>>> MP3FileInfo.__module__             
'fileinfo'
>>> sys.modules[MP3FileInfo.__module__]
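<module 'fileinfo' from 'fileinfo.pyc'>

Every Python class has a built-in class attribute __module__, which holds the name of the module in which the class is defined.
Combining it with the sys.modules dictionary gives you a reference to the module in which a class is defined.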
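
The lookup can be wrapped in a one-line helper. The sketch below uses fractions.Fraction from the standard library in place of MP3FileInfo, since fileinfo is not generally available; module_of is a hypothetical name.

```python
import sys
from fractions import Fraction

def module_of(cls):
    """Return the module object in which `cls` was defined."""
    return sys.modules[cls.__module__]

mod = module_of(Fraction)
```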

 

Example 6.16. Constructing pathnames
>>> import os
>>> os.path.join("c:\\music\\ap\\", "mahadeva.mp3") 
'c:\\music\\ap\\mahadeva.mp3'
>>> os.path.join("c:\\music\\ap", "mahadeva.mp3")  
'c:\\music\\ap\\mahadeva.mp3'
>>> os.path.expanduser("~")                        
'c:\\Documents and Settings\\mpilgrim\\My Documents'
>>> os.path.join(os.path.expanduser("~"), "Python")
'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'
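
os.path is ntpath on Windows and posixpath elsewhere, so the results above depend on the platform. To see both behaviours on any machine, a sketch can call those two modules directly; the paths are illustrative only.

```python
import ntpath      # Windows path rules, importable on any platform
import posixpath   # POSIX path rules, importable on any platform

# join() adds a separator only when the first part does not already end in one.
win_a = ntpath.join("c:\\music\\ap\\", "mahadeva.mp3")
win_b = ntpath.join("c:\\music\\ap", "mahadeva.mp3")
posix = posixpath.join("/music/ap", "mahadeva.mp3")
```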

 

Example 7.2. Matching whole words
>>> import re
>>> s = '100 BROAD'
>>> re.sub('ROAD$', 'RD.', s)
'100 BRD.'
>>> re.sub('\\bROAD$', 'RD.', s) 
'100 BROAD'
>>> re.sub(r'\bROAD$', 'RD.', s) 
'100 BROAD'
>>> s = '100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD$', 'RD.', s) 
'100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD\b', 'RD.', s)
'100 BROAD RD. APT. 3'

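What I really wanted was to match 'ROAD' only when it appears at the end of the string and is a whole word of its own, not part of some longer word. To express this in a regular expression you use \b, which means "a word boundary must occur right here". In Python this is complicated by the fact that the '\' character in a string must itself be escaped. This is sometimes called the backslash plague, and it is one reason regular expressions are somewhat easier in Perl than in Python. On the other hand, Perl mixes regular expressions with its other syntax, so when you hit a bug it can be hard to tell whether it is a syntax error or an error in the regular expression.

To work around the backslash plague, you can use what is called a raw string, by prefixing the string with the letter r. This tells Python that nothing in the string should be escaped: '\t' is a tab character, but r'\t' is a real backslash character '\' followed by the letter 't'. I recommend always using raw strings when dealing with regular expressions; otherwise things get confusing very quickly (and regular expressions get confusing quickly enough all by themselves).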
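
The difference between escaped and raw strings can be checked directly. The following Python 3 sketch asserts the facts stated above and wraps the substitution in a hypothetical helper, abbreviate_road.

```python
import re

assert len("\t") == 1     # one character: a tab
assert len(r"\t") == 2    # two characters: a backslash and the letter t
assert "\\b" == r"\b"     # a doubled backslash and a raw string are the same text
assert "\b" == "\x08"     # with neither, Python reads \b as a backspace

def abbreviate_road(address):
    """Replace a trailing standalone word ROAD with RD."""
    return re.sub(r"\bROAD$", "RD.", address)
```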

 

Example 7.4. Checking for hundreds
>>> import re
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'
>>> re.search(pattern, 'MCM')           
<SRE_Match object at 01070390>
>>> re.search(pattern, 'MD')            
<SRE_Match object at 01073A50>
>>> re.search(pattern, 'MMMCCC')        
<SRE_Match object at 010748A8>
>>> re.search(pattern, 'MCMC')          
>>> re.search(pattern, '')              
<SRE_Match object at 01071D98>

 

Example 7.5. The old way: every character optional
>>> import re
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'M')   
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MM')  
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMM') 
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMMM')
>>>


Example 7.6. The new way: from n to m
>>> pattern = '^M{0,3}$'      
>>> re.search(pattern, 'M')   
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MM')  
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMM') 
<_sre.SRE_Match object at 0x008EEDA8>
>>> re.search(pattern, 'MMMM')
>>>


The expression for the ones place follows the same pattern, so I will skip the details and show the end result.

>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
What does this regular expression look like when written with the alternative {n,m} syntax? This example shows the new syntax.

Example 7.8. Validating Roman numerals with {n,m}
>>> pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
>>> re.search(pattern, 'MDLV')            
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMDCLXVI')        
<_sre.SRE_Match object at 0x008EEB48>
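
Wrapped in a function, the {n,m} pattern becomes a small validator. This is a Python 3 sketch; is_roman_numeral and ROMAN are illustrative names, and the explicit empty-string check is an addition, needed because every part of the pattern is optional.

```python
import re

ROMAN = re.compile(r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")

def is_roman_numeral(s):
    """True if `s` is a well-formed Roman numeral from I to MMMCMXCIX."""
    # The pattern also matches the empty string, so reject that explicitly.
    return bool(s) and ROMAN.search(s) is not None
```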


Example 7.9. Regular expressions with inline comments
>>> pattern = """
    ^                   # beginning of string
    M{0,3}              # thousands - 0 to 3 M's
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                        #            or 500-800 (D, followed by 0 to 3 C's)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                        #        or 50-80 (L, followed by 0 to 3 X's)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                        #        or 5-8 (V, followed by 0 to 3 I's)
    $                   # end of string
    """
>>> re.search(pattern, 'M', re.VERBOSE)               
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXXIX', re.VERBOSE)       
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMMDCCCLXXXVIII', re.VERBOSE) 
<_sre.SRE_Match object at 0x008EEB48>
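>>> re.search(pattern, 'M')

The most important thing to remember when using verbose regular expressions is that you must pass an extra argument, re.VERBOSE. It is a constant defined in the re module that marks the pattern as a verbose regular expression. As you can see, this pattern contains a lot of whitespace (all of which is ignored) and several comments (all of which are ignored too). Once the whitespace and comments are ignored, it is exactly the same regular expression as in the previous section, but much more readable.

The last call, re.search(pattern, 'M') without the flag, does not match. Why? Because without re.VERBOSE, re.search treats the pattern as a compact regular expression, in which the whitespace and the # comments are literal text to be matched. Python cannot auto-detect whether a regular expression is verbose or compact. It assumes every regular expression is compact unless you explicitly mark it as verbose.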
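
The effect of the flag can be reproduced with a shorter pattern. In this Python 3 sketch, the same pattern string is searched once with re.VERBOSE and once without; the variable names are illustrative.

```python
import re

pattern = r"""
    ^          # beginning of string
    M{0,3}     # thousands - 0 to 3 M's
    $          # end of string
"""

# With the flag, whitespace and comments are stripped from the pattern.
verbose_match = re.search(pattern, "MM", re.VERBOSE)
# Without it, the newlines, spaces and '#' text must appear literally.
compact_match = re.search(pattern, "MM")
```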

 

Example 7.16. Parsing phone numbers (final version)
>>> phonePattern = re.compile(r'''
                # don't match beginning of string, number can start anywhere
    (\d{3})     # area code is 3 digits (e.g. '800')
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits (e.g. '555')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '1212')
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    ''', re.VERBOSE)
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()       
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')
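
Calling .groups() directly on the result of search() raises AttributeError when nothing matches, because search() then returns None. A small wrapper can guard against that. This is a Python 3 sketch; parse_phone and PHONE are hypothetical names built around the pattern from the example.

```python
import re

PHONE = re.compile(r"""
    (\d{3})     # area code is 3 digits
    \D*         # optional separator
    (\d{3})     # trunk is 3 digits
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits
    \D*         # optional separator
    (\d*)       # extension is optional
    $           # end of string
    """, re.VERBOSE)

def parse_phone(text):
    """Return (area, trunk, rest, extension), or None if nothing matches."""
    match = PHONE.search(text)
    return match.groups() if match else None
```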

 


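You should now be familiar with the following techniques:

^ matches the beginning of a string.
$ matches the end of a string.
\b matches a word boundary.
\d matches any digit.
\D matches any non-digit character.
x? matches an optional x character (in other words, it matches x zero or one times).
x* matches x zero or more times.
x+ matches x one or more times.
x{n,m} matches x at least n times, but no more than m times.
(a|b|c) matches either a or b or c.
(x) in general is a remembered group. You can get the value it matched from the groups() method of the object returned by re.search.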
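
Each entry in the list can be checked with a one-line search. The following Python 3 sketch does so with made-up sample strings; matches is a hypothetical helper.

```python
import re

def matches(pattern, text):
    """True if `pattern` matches somewhere in `text`."""
    return re.search(pattern, text) is not None

checks = [
    matches(r"^ab?c$", "ac"),          # x? : zero occurrences of b is fine
    matches(r"^ab*c$", "abbbc"),       # x* : several occurrences of b
    not matches(r"^ab+c$", "ac"),      # x+ : at least one b is required
    not matches(r"^a{2,3}$", "aaaa"),  # x{n,m} : four a's is one too many
    matches(r"^(a|b|c)$", "b"),        # (a|b|c) : any one alternative
]

# Remembered groups are returned as a tuple by groups().
groups = re.search(r"(\d+)-(\d+)", "call 555-1212").groups()
```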

http://www.woodpecker.org.cn/diveintopython/regular_expressions/phone_numbers.html

Regular expression pattern syntax

Element       Meaning
.             Matches any character except \n (if DOTALL, also matches \n)
^             Matches start of string (if MULTILINE, also matches after \n)
$             Matches end of string (if MULTILINE, also matches before \n)
*             Matches zero or more cases of the previous regular expression; greedy (match as many as possible)
+             Matches one or more cases of the previous regular expression; greedy (match as many as possible)
?             Matches zero or one case of the previous regular expression; greedy (match one if possible)
*?, +?, ??    Non-greedy versions of *, +, and ? (match as few as possible)
{m,n}         Matches m to n cases of the previous regular expression (greedy)
{m,n}?        Matches m to n cases of the previous regular expression (non-greedy)
[...]         Matches any one of a set of characters contained within the brackets
|             Matches expression either preceding it or following it
(...)         Matches the regular expression within the parentheses and also indicates a group
(?iLmsux)     Alternate way to set optional flags; no effect on match
(?:...)       Like (...), but does not indicate a group
(?P<id>...)   Like (...), but the group also gets the name id
(?P=id)       Matches whatever was previously matched by group named id
(?#...)       Content of parentheses is just a comment; no effect on match
(?=...)       Lookahead assertion; matches if regular expression ... matches what comes next, but does not consume any part of the string
(?!...)       Negative lookahead assertion; matches if regular expression ... does not match what comes next, and does not consume any part of the string
(?<=...)      Lookbehind assertion; matches if there is a match for regular expression ... ending at the current position (... must match a fixed length)
(?<!...)      Negative lookbehind assertion; matches if there is no match for regular expression ... ending at the current position (... must match a fixed length)
\number       Matches whatever was previously matched by group numbered number (groups are automatically numbered from 1 up to 99)
\A            Matches an empty string, but only at the start of the whole string
\b            Matches an empty string, but only at the start or end of a word (a maximal sequence of alphanumeric characters; see also \w)
\B            Matches an empty string, but not at the start or end of a word
\d            Matches one digit, like the set [0-9]
\D            Matches one non-digit, like the set [^0-9]
\s            Matches a whitespace character, like the set [ \t\n\r\f\v]
\S            Matches a non-whitespace character, like the set [^ \t\n\r\f\v]
\w            Matches one alphanumeric character; unless LOCALE or UNICODE is set, \w is like [a-zA-Z0-9_]
\W            Matches one non-alphanumeric character, the reverse of \w
\Z            Matches an empty string, but only at the end of the whole string
\\            Matches one backslash character
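
A few of the less familiar elements in the table, shown as a short Python 3 sketch. All sample strings and variable names are invented for illustration.

```python
import re

# Greedy vs non-greedy: .* grabs as much as possible, .*? as little.
greedy = re.search(r"<.*>", "<a><b>").group(0)
lazy = re.search(r"<.*?>", "<a><b>").group(0)

# Named group plus named backreference: find a doubled word.
doubled = re.search(r"\b(?P<word>\w+) (?P=word)\b", "it is is fine")

# Lookahead: digits that are followed by "px", without consuming "px".
width = re.search(r"\d+(?=px)", "width: 120px").group(0)

# Non-capturing group: groups the alternation but is not remembered.
parts = re.search(r"(?:Mr|Ms)\. (\w+)", "Ms. Lovelace").groups()
```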

posted on 2009-08-22 23:48 by Frank_Fang. Category: Python学习 (Learning Python)
