Some handy regular expressions in Python

You can check what each regex matches in an online regex tester.

Index

1. Chinese characters, English letters, digits
2. Punctuation
3. Email
4. Mobile phone numbers
5. IP addresses
6. Complex requirements

chinese-english-number

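# Keep only the Chinese characters, English letters, and digits in the text (punctuation is dropped)
import re
# pattern = "[^\u4e00-\u9fa5^A-Z^a-z^0-9]"  # nearly equivalent, but the extra ^ are literals inside the class, so '^' itself would be kept
# To keep the spaces between words, add a space at the end of the class: [^\u4e00-\u9fa5A-Za-z0-9 ]
pattern = "[^\u4e00-\u9fa5A-Za-z0-9]"
text = "\\某某，\\你好=+！123【我//""们】abc~————聊/天'吧:：！这.!！_#？?（）个‘’“”￥$主|意()不错......！"
result = re.sub(pattern, "", text)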
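As a minimal sketch of the note about keeping spaces, both variants of the character class can sit behind one helper. The function name `keep_cjk_alnum` and its `keep_spaces` flag are assumptions made for illustration; they are not part of the original snippet.

```python
import re

# Hypothetical helper: the name and the keep_spaces flag are illustrative assumptions.
_STRIP = re.compile(r"[^\u4e00-\u9fa5A-Za-z0-9]")
_STRIP_KEEP_SPACE = re.compile(r"[^\u4e00-\u9fa5A-Za-z0-9 ]")

def keep_cjk_alnum(text: str, keep_spaces: bool = False) -> str:
    """Drop every character that is not a Chinese character, ASCII letter, or digit."""
    pattern = _STRIP_KEEP_SPACE if keep_spaces else _STRIP
    return pattern.sub("", text)

print(keep_cjk_alnum("你好, world! 123"))                    # 你好world123
print(keep_cjk_alnum("你好, world! 123", keep_spaces=True))  # 你好 world 123
```

Compiling the two patterns once avoids re-parsing them on every call when the helper is applied to many strings.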

symbol

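# Punctuation (Chinese and English)
import re
import string

# Chinese punctuation; in this raw string the trailing "\\" is the regex escape for a literal backslash
pattern = r"[！？｡＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘'‛“”„‟…‧﹏.￥’\\]"
# string.punctuation holds the English (ASCII) punctuation:
# !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
print(string.punctuation)
# re.escape keeps \, ], ^ and - literal inside the character class
pattern2 = "[" + re.escape(string.punctuation) + "]"
text = "\\某某，\\你好=+！123【我//""们】abc~————聊/天'吧:：！这.!！_#？?（）个‘’“”￥$主|意()不错......！"
result = re.sub(pattern, "", text)
print(result)
result = re.sub(pattern2, "", result)
print(result)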
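The two substitutions can also be merged into a single pass. The sketch below is one way to do that, under stated assumptions: `CJK_PUNCT` is an abbreviated subset of the full-width list used for brevity, and the names `PUNCT_RE` and `strip_punctuation` are hypothetical.

```python
import re
import string

# Abbreviated subset of the full-width/CJK punctuation list, for illustration only
CJK_PUNCT = "！？，。、；：（）【】《》…—￥"

# One character class covering both sets; re.escape protects \, ], ^ and - inside [...]
PUNCT_RE = re.compile("[" + re.escape(CJK_PUNCT + string.punctuation) + "]")

def strip_punctuation(text: str) -> str:
    """Remove Chinese and ASCII punctuation in a single pass."""
    return PUNCT_RE.sub("", text)

print(strip_punctuation("你好，world！(test) 100%…ok?"))  # 你好worldtest 100ok
```

Spaces are not punctuation, so they survive; add a space to the class if they should be removed as well.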

email

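# ^...$ anchor the pattern to the whole string, so this checks whether text as a whole is an email address
# (?:...) is non-capturing, so findall returns the full match rather than the last sub-domain group
email_pattern = r'^[*#\u4e00-\u9fa5 a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z0-9]{2,6}$'
emails = re.findall(email_pattern, text, flags=0)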
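Because the pattern above is anchored with `^` and `$`, it can only match a string that is an email address from start to finish. To pull addresses out of running text, an unanchored variant is needed. The sketch below uses a simplified, ASCII-only pattern; `EMAIL_RE` and the sample sentence are assumptions for illustration, and the pattern is not a full RFC 5322 validator.

```python
import re

# Simplified, ASCII-only variant without ^...$ so it can match inside longer text.
# Illustrative assumption, not a complete email grammar.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9_.-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,6}"
)

sample = "Contact alice@example.com or bob.smith@mail.example.org for details."
print(EMAIL_RE.findall(sample))
# ['alice@example.com', 'bob.smith@mail.example.org']
```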

phone-number

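# Match mainland China mobile numbers (prefixes 13x, 14x, 15x, 17x, 18x only)
# ^...$ means the whole string must be a number; (?:...) makes findall return the full match instead of group tuples
cellphone_pattern = r'^(?:13[0-9]|14[0-9]|15[0-9]|17[0-9]|18[0-9])\d{8}$'
phoneNumbers = re.findall(cellphone_pattern, text, flags=0)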
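To extract numbers from a longer text rather than validate a single string, the anchors can be swapped for digit lookarounds. This is a hedged sketch: `PHONE_RE` and the sample text are made up for illustration, and the prefix set is copied from the pattern above rather than from any current numbering plan.

```python
import re

# Same prefixes as the pattern above (13x, 14x, 15x, 17x, 18x), without ^...$.
# (?<!\d) and (?!\d) keep a match from starting or ending inside a longer digit run.
PHONE_RE = re.compile(r"(?<!\d)1[34578]\d{9}(?!\d)")

sample = "Call 13812345678 or 18600001111; order id 9138123456789 is not a phone."
print(PHONE_RE.findall(sample))
# ['13812345678', '18600001111']
```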

ip

(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)
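The IP pattern above has no anchors, so whether it validates or merely finds a substring depends on how it is called. A minimal sketch of the difference, assuming the hypothetical names `OCTET`, `IPV4_RE`, and `is_ipv4`:

```python
import re

# One octet: 250-255, 200-249, 000-199 (three digits), or 0-99 (one or two digits)
OCTET = r"(?:25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)"
IPV4_RE = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")

def is_ipv4(s: str) -> bool:
    """True if the whole string is a dotted-quad IPv4 address."""
    return IPV4_RE.fullmatch(s) is not None

print(is_ipv4("192.168.0.1"))                # True
print(is_ipv4("256.1.1.1"))                  # False
print(IPV4_RE.search("256.1.1.1").group())   # 56.1.1.1
```

The last line shows why `fullmatch` (or explicit anchors) matters for validation: an unanchored search happily matches the tail of an invalid address.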

complex

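text = "0023苏宁（中国）ABC123有限公司A平台"
# Strip the leading digits with ^[0-9]* and the fixed trailing string 'A平台' with A平台$; some strings have this prefix or suffix, some do not
# Remove parentheses and everything inside them, half-width or full-width: \(.*?\)|（.*?）
# Inside the string keep only Chinese characters, English letters, and digits: [^\u4e00-\u9fa5A-Za-z0-9]
pattern = r'^[0-9]*|A平台$|\(.*?\)|（.*?）|[^\u4e00-\u9fa5A-Za-z0-9]'
result = re.sub(pattern, "", text)
# result == "苏宁ABC123有限公司"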
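A pattern with this many alternatives is easier to maintain when each branch carries its own comment. The sketch below rewrites the same alternation with `re.VERBOSE`; the name `CLEAN_RE` is an assumption for illustration.

```python
import re

# Same alternation as above, one branch per line; whitespace and # comments
# are ignored by the regex engine in VERBOSE mode.
CLEAN_RE = re.compile(r"""
    ^[0-9]*                      # leading digits
  | A平台$                        # fixed trailing suffix
  | \(.*?\) | （.*?）             # parenthesised text, half- or full-width
  | [^\u4e00-\u9fa5A-Za-z0-9]    # anything that is not Chinese, a letter, or a digit
""", re.VERBOSE)

print(CLEAN_RE.sub("", "0023苏宁（中国）ABC123有限公司A平台"))  # 苏宁ABC123有限公司
```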

For the fixed-suffix requirement above, you can also use the following approach:

# texts: a list of strings; 'A平台' is 3 characters long, hence the [:-3] slice
texts_new = [text[:-3] if text.endswith('A平台') else text for text in texts]
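On Python 3.9 or later, `str.removesuffix` expresses the same idea without a hard-coded slice length. A small sketch, with sample data made up for illustration:

```python
# str.removesuffix requires Python 3.9+; the sample list is an illustrative assumption.
texts = ["苏宁ABC123有限公司A平台", "苏宁ABC123有限公司"]

# Strings without the suffix are returned unchanged
texts_new = [t.removesuffix("A平台") for t in texts]
print(texts_new)  # ['苏宁ABC123有限公司', '苏宁ABC123有限公司']
```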

Reference

1、 https://github.com/cold-eye/funNLP
