python 异常、正则表达式
http://docs.python.org/library/re.html
http://docs.python.org/howto/regex.html#regex-howto
例 6.1. 打开一个不存在的文件
>>> fsock = open("/notthere", "r")
Traceback (innermost last):
File "<interactive input>", line 1, in ?
IOError: [Errno 2] No such file or directory: '/notthere'
>>> try:
... fsock = open("/notthere")
... except IOError:
... print "The file does not exist, exiting gracefully"
... print "This line will always print"
The file does not exist, exiting gracefully
This line will always print
# Bind the name getpass to the appropriate function
try:
import termios, TERMIOS
except ImportError:
try:
import msvcrt
except ImportError:
try:
from EasyDialogs import AskPassword
except ImportError:
getpass = default_getpass
else:
getpass = AskPassword
else:
getpass = win_getpass
else:
getpass = unix_getpass
例 6.10. 遍历 dictionary
>>> import os
>>> for k, v in os.environ.items():
... print "%s=%s" % (k, v)
USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM
USERNAME=mpilgrim
[...略...]
>>> print "\n".join(["%s=%s" % (k, v)
... for k, v in os.environ.items()])
USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM
例 6.13. 使用 sys.modules
>>> import fileinfo
>>> print '\n'.join(sys.modules.keys())
win32api
os.path
os
fileinfo
exceptions
>>> fileinfo
<module 'fileinfo' from 'fileinfo.pyc'>
>>> sys.modules["fileinfo"]
<module 'fileinfo' from 'fileinfo.pyc'>
下面的例子将展示通过结合使用 __module__ 类属性和 sys.modules dictionary 来获取已知类所在的模块。
例 6.14. __module__ 类属性
>>> from fileinfo import MP3FileInfo
>>> MP3FileInfo.__module__
'fileinfo'
>>> sys.modules[MP3FileInfo.__module__]
<module 'fileinfo' from 'fileinfo.pyc'> 每个 Python 类都拥有一个内置的类属性 __module__,它定义了这个类的模块的名字。
将它与 sys.modules 字典复合使用,你可以得到定义了某个类的模块的引用。
例 6.16. 构造路径名
>>> import os
>>> os.path.join("c:\\music\\ap\\", "mahadeva.mp3")
'c:\\music\\ap\\mahadeva.mp3'
>>> os.path.join("c:\\music\\ap", "mahadeva.mp3")
'c:\\music\\ap\\mahadeva.mp3'
>>> os.path.expanduser("~")
'c:\\Documents and Settings\\mpilgrim\\My Documents'
>>> os.path.join(os.path.expanduser("~"), "Python")
'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'
例 7.2. 匹配整个单词
>>> s = '100 BROAD'
>>> re.sub('ROAD$', 'RD.', s)
'100 BRD.'
>>> re.sub('\\bROAD$', 'RD.', s)
'100 BROAD'
>>> re.sub(r'\bROAD$', 'RD.', s)
'100 BROAD'
>>> s = '100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD$', 'RD.', s)
'100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD\b', 'RD.', s)
'100 BROAD RD. APT 3'
我真正想要做的是,当 'ROAD' 出现在字符串的末尾,并且是作为一个独立的单词时,而不是一些长单词的一部分,才对他进行匹配。为了在正则表达式中表达这个意思,你利用 \b,它的含义是“单词的边界必须在这里”。在 Python 中,由于字符 '\' 在一个字符串中必须转义,这会变得非常麻烦。有时候,这类问题被称为“反斜线灾难”,这也是 Perl 中正则表达式比 Python 的正则表达式要相对容易的原因之一。另一方面,Perl 也混淆了正则表达式和其他语法,因此,如果你发现一个 bug,很难弄清楚究竟是一个语法错误,还是一个正则表达式错误。
为了避免反斜线灾难,你可以利用所谓的“原始字符串”,只要为字符串添加一个前缀 r 就可以了。这将告诉 Python,字符串中的所有字符都不转义;'\t' 是一个制表符,而 r'\t' 是一个真正的反斜线字符 '\',紧跟着一个字母 't'。我推荐只要处理正则表达式,就使用原始字符串;否则,事情会很快变得混乱 (并且正则表达式自己也会很快被自己搞乱了)。
例 7.4. 检验百位数
>>> import re
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'
>>> re.search(pattern, 'MCM')
<SRE_Match object at 01070390>
>>> re.search(pattern, 'MD')
<SRE_Match object at 01073A50>
>>> re.search(pattern, 'MMMCCC')
<SRE_Match object at 010748A8>
>>> re.search(pattern, 'MCMC')
>>> re.search(pattern, '')
<SRE_Match object at 01071D98>
例 7.5. 老方法:每一个字符都是可选的
>>> import re
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'M')
<_sre.SRE_Match object at 0x008EE090>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MM')
<_sre.SRE_Match object at 0x008EEB48>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MMM')
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMMM')
>>>
例 7.6. 一个新的方法:从 n 到 m
>>> pattern = '^M{0,3}$'
>>> re.search(pattern, 'M')
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MM')
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMM')
<_sre.SRE_Match object at 0x008EEDA8>
>>> re.search(pattern, 'MMMM')
>>>
对于个位数的正则表达式有类似的表达方式,我将省略细节,直接展示结果。
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
用另一种 {n,m} 语法表达这个正则表达式会如何呢?这个例子展示新的语法。
例 7.8. 用 {n,m} 语法确认罗马数字
>>> pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
>>> re.search(pattern, 'MDLV')
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMDCLXVI')
<_sre.SRE_Match object at 0x008EEB48>
例 7.9. 带有内联注释 (Inline Comments) 的正则表达式
>>> pattern = """
^ # beginning of string
M{0,3} # thousands - 0 to 3 M's
(CM|CD|D?C{0,3}) # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
# or 500-800 (D, followed by 0 to 3 C's)
(XC|XL|L?X{0,3}) # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
# or 50-80 (L, followed by 0 to 3 X's)
(IX|IV|V?I{0,3}) # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
# or 5-8 (V, followed by 0 to 3 I's)
$ # end of string
"""
>>> re.search(pattern, 'M', re.VERBOSE)
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXXIX', re.VERBOSE)
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMMDCCCLXXXVIII', re.VERBOSE)
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'M')
当使用松散正则表达式时,最重要的一件事情就是:必须传递一个额外的参数 re.VERBOSE,该参数是定义在 re 模块中的一个常量,标志着待匹配的正则表达式是一个松散正则表达式。正如你看到的,这个模式中,有很多空格 (所有的空格都被忽略),和几个注释 (所有的注释也被忽略)。如果忽略所有的空格和注释,它就和前面章节里的正则表达式完全相同,但是具有更好的可读性。
>>> re.search(pattern, 'M')
这个没有匹配。为什么呢?因为没有 re.VERBOSE 标记,所以 re.search 函数把模式作为一个紧凑正则表达式进行匹配。Python 不能自动检测一个正则表达式是为松散类型还是紧凑类型。Python 默认每一个正则表达式都是紧凑类型的,除非你显式地标明一个正则表达式为松散类型。
例 7.16. 解析电话号码 (最终版本)
>>> phonePattern = re.compile(r'''
# don't match beginning of string, number can start anywhere
(\d{3}) # area code is 3 digits (e.g. '800')
\D* # optional separator is any number of non-digits
(\d{3}) # trunk is 3 digits (e.g. '555')
\D* # optional separator
(\d{4}) # rest of number is 4 digits (e.g. '1212')
\D* # optional separator
(\d*) # extension is optional and can be any number of digits
$ # end of string
''', re.VERBOSE)
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212')
('800', '555', '1212', '')
现在,你应该熟悉下列技巧:
^ 匹配字符串的开始。
$ 匹配字符串的结尾。
\b 匹配一个单词的边界。
\d 匹配任意数字。
\D 匹配任意非数字字符。
x? 匹配一个可选的 x 字符 (换言之,它匹配 1 次或者 0 次 x 字符)。
x* 匹配0次或者多次 x 字符。
x+ 匹配1次或者多次 x 字符。
x{n,m} 匹配 x 字符,至少 n 次,至多 m 次。
(a|b|c) 要么匹配 a,要么匹配 b,要么匹配 c。
(x) 一般情况下表示一个记忆组 (remembered group)。你可以利用 re.search 函数返回对象的 groups() 函数获取它的值。
http://www.woodpecker.org.cn/diveintopython/regular_expressions/phone_numbers.html
Regular expression pattern syntax
.
|
Matches any character except \n (if DOTALL, also matches \n)
|
^
|
Matches start of string (if MULTILINE, also matches after \n)
|
$
|
Matches end of string (if MULTILINE, also matches before \n)
|
*
|
Matches zero or more cases of the previous regular expression; greedy (match as many as possible)
|
+
|
Matches one or more cases of the previous regular expression; greedy (match as many as possible)
|
?
|
Matches zero or one case of the previous regular expression; greedy (match one if possible)
|
*? , +?, ??
|
Non-greedy versions of *, +, and ? (match as few as possible)
|
{m,n}
|
Matches m to n cases of the previous regular expression (greedy)
|
{m,n}?
|
Matches m to n cases of the previous regular expression (non-greedy)
|
[...]
|
Matches any one of a set of characters contained within the brackets
|
|
|
Matches expression either preceding it or following it
|
(...)
|
Matches the regular expression within the parentheses and also indicates a group
|
(?iLmsux)
|
Alternate way to set optional flags; no effect on match
|
(?:...)
|
Like (...), but does not indicate a group
|
(?P<id>...)
|
Like (...), but the group also gets the name id
|
(?P=id)
|
Matches whatever was previously matched by group named id
|
(?#...)
|
Content of parentheses is just a comment; no effect on match
|
(?=...)
|
Lookahead assertion; matches if regular expression ... matches what comes next, but does not consume any part of the string
|
(?!...)
|
Negative lookahead assertion; matches if regular expression ... does not match what comes next, and does not consume any part of the string
|
(?<=...)
|
Lookbehind assertion; matches if there is a match for regular expression ... ending at the current position (... must match a fixed length)
|
(?<!...)
|
Negative lookbehind assertion; matches if there is no match for regular expression ... ending at the current position (... must match a fixed length)
|
\number
|
Matches whatever was previously matched by group numbered number (groups are automatically numbered from 1 up to 99)
|
\A
|
Matches an empty string, but only at the start of the whole string
|
\b
|
Matches an empty string, but only at the start or end of a word (a maximal sequence of alphanumeric characters; see also \w)
|
\B
|
Matches an empty string, but not at the start or end of a word
|
\d
|
Matches one digit, like the set [0-9]
|
\D
|
Matches one non-digit, like the set [^0-9]
|
\s
|
Matches a whitespace character, like the set [ \t\n\r\f\v]
|
\S
|
Matches a non-white character, like the set [^ \t\n\r\f\v]
|
\w
|
Matches one alphanumeric character; unless LOCALE or UNICODE is set, \w is like [a-zA-Z0-9_]
|
\W
|
Matches one non-alphanumeric character, the reverse of \w
|
\Z
|
Matches an empty string, but only at the end of the whole string
|
\\
|
Matches one backslash character
|
posted on 2009-08-22 23:48
Frank_Fang 阅读(1874)
评论(0) 编辑 收藏 所属分类:
Python学习