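Introduction
A regular expression (regex or regexp for short) is a powerful and flexible text-processing tool that uses a sequence of characters to describe and match string patterns. Whether the task is data cleaning, log analysis, information extraction, or form validation, regular expressions offer a compact and efficient solution. This article covers the core techniques, from basic syntax through performance tuning and advanced features, with most examples written in Python.
Regular Expression Basics
What Is a Regular Expression
A regular expression is a special syntax for describing string patterns. It is made up of ordinary characters (such as letters and digits) and special characters (called "metacharacters"). It can be used to check whether a string contains a pattern, to replace matching substrings, or to extract the substrings that satisfy a condition.
Basic Syntax
The simplest regular expression is an ordinary string, which matches itself exactly:
- import re
- pattern = "hello"
- text = "hello world"
- result = re.search(pattern, text)
- print(result.group())  # Output: hello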
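Metacharacters are characters that have a special meaning inside a regular expression:
• .: matches any character except a newline
• \d: matches any digit; equivalent to [0-9] in ASCII mode (on Python 3 str patterns it also matches other Unicode digits unless re.ASCII is set)
• \D: matches any non-digit character
• \w: matches letters, digits, and the underscore; equivalent to [a-zA-Z0-9_] in ASCII mode
• \W: matches any character that \w does not match
• \s: matches any whitespace character (space, tab, newline, and so on)
• \S: matches any non-whitespace character
- import re
- # Match any single character
- pattern = r"h.llo"
- text = "hallo world"
- result = re.search(pattern, text)
- print(result.group())  # Output: hallo
- # Match digits (a raw string passes \d to the regex engine unchanged)
- pattern = r"\d+"
- text = "I have 2 apples and 5 oranges"
- result = re.findall(pattern, text)
- print(result)  # Output: ['2', '5']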
Quantifiers specify how many times the preceding character or pattern may occur:
• *: zero or more times
• +: one or more times
• ?: zero or one time
• {n}: exactly n times
• {n,}: at least n times
• {n,m}: at least n and at most m times
- import re
- # Zero or more times
- pattern = r"ab*c"
- text = "ac abc abbc abbbc"
- result = re.findall(pattern, text)
- print(result)  # Output: ['ac', 'abc', 'abbc', 'abbbc']
- # One or more times
- pattern = r"ab+c"
- text = "ac abc abbc abbbc"
- result = re.findall(pattern, text)
- print(result)  # Output: ['abc', 'abbc', 'abbbc']
- # Exactly 2 times
- pattern = r"ab{2}c"
- text = "abc abbc abbbc"
- result = re.findall(pattern, text)
- print(result)  # Output: ['abbc']
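The open-ended and ranged forms {n,} and {n,m} are listed above but not demonstrated. The following is a small sketch of both; the sample text and the variable names are made up for illustration, and the \b anchors are there only so that each run of letters is tested as a whole word.

```python
import re

text = "a aa aaa aaaa aaaaa"

# {2,}: two or more consecutive "a" characters forming a whole word
at_least_two = re.findall(r"\ba{2,}\b", text)

# {2,3}: two or three consecutive "a" characters forming a whole word
two_to_three = re.findall(r"\ba{2,3}\b", text)

print(at_least_two)  # ['aa', 'aaa', 'aaaa', 'aaaaa']
print(two_to_three)  # ['aa', 'aaa']
```

Without the \b anchors, a{2,3} would also match the first three letters of "aaaa", which is usually not what a length check intends.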
A character class matches one character out of a set:
• [abc]: matches any one of a, b, or c
• [^abc]: matches any character other than a, b, or c
• [a-z]: matches any lowercase letter from a to z
• [A-Z]: matches any uppercase letter from A to Z
• [0-9]: matches any digit from 0 to 9
- import re
- # Match any one character in the class
- pattern = r"[aeiou]"
- text = "hello world"
- result = re.findall(pattern, text)
- print(result)  # Output: ['e', 'o', 'o']
- # Match any character not in the class
- pattern = r"[^aeiou]"
- text = "hello world"
- result = re.findall(pattern, text)
- print(result)  # Output: ['h', 'l', 'l', ' ', 'w', 'r', 'l', 'd']
Parentheses () create a group. Groups can be used to:
1. Apply a quantifier to a subexpression
2. Limit the scope of alternation
3. Capture matched text for later reference
- import re
- # Apply a quantifier to a subexpression
- pattern = r"(ab)+"
- text = "ab abab ababab"
- result = re.findall(pattern, text)
- # With a capturing group, findall returns the group's last repetition, not the whole match
- print(result)  # Output: ['ab', 'ab', 'ab']
- # Capture matched text
- pattern = r"(\d{4})-(\d{2})-(\d{2})"
- text = "Date: 2023-05-15"
- result = re.search(pattern, text)
- if result:
-     print("Full match:", result.group(0))  # Output: Full match: 2023-05-15
-     print("Year:", result.group(1))  # Output: Year: 2023
-     print("Month:", result.group(2))  # Output: Month: 05
-     print("Day:", result.group(3))  # Output: Day: 15
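Anchors and boundaries specify where a match may occur:
• ^: matches the start of the string
• $: matches the end of the string
• \b: matches a word boundary
• \B: matches a position that is not a word boundary
- import re
- # Match at the start of the string
- pattern = r"^hello"
- text = "hello world"
- result = re.search(pattern, text)
- print(result.group())  # Output: hello
- # Match at the end of the string
- pattern = r"world$"
- text = "hello world"
- result = re.search(pattern, text)
- print(result.group())  # Output: world
- # Match at word boundaries; the raw string is required here,
- # because in an ordinary string "\b" is a backspace character
- pattern = r"\bhello\b"
- text = "hello world helloworld"
- result = re.findall(pattern, text)
- print(result)  # Output: ['hello']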
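Two details of the anchors listed above are easy to miss: by default ^ anchors only at the very start of the string, and \B is the complement of \b. The following sketch illustrates both; the sample strings and variable names are invented for this example.

```python
import re

text = "first line\nsecond line"

# Default mode: ^ matches only at the start of the whole string
default_start = re.findall(r"^\w+", text)

# re.MULTILINE: ^ also matches right after each newline
multiline_start = re.findall(r"^\w+", text, re.MULTILINE)

# \B on both sides: "cat" must sit strictly inside a longer word
inner = re.findall(r"\Bcat\B", "cat concatenate")

print(default_start)    # ['first']
print(multiline_start)  # ['first', 'second']
print(inner)            # ['cat']  (the one inside "concatenate")
```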
The | operator specifies several alternative patterns:
- import re
- # Match any one of several patterns
- pattern = r"cat|dog"
- text = "I have a cat and a dog"
- result = re.findall(pattern, text)
- print(result)  # Output: ['cat', 'dog']
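Alternation has the lowest precedence of any regex operator, which is why groups are needed to limit its scope. A short sketch of the difference, using made-up sample words:

```python
import re

text = "grey gray groy"

# Without a group, | splits the entire pattern: "gra" OR "ey"
loose = [m.group() for m in re.finditer(r"gra|ey", text)]

# Inside a group, | only chooses between "a" and "e"
scoped = [m.group() for m in re.finditer(r"gr(a|e)y", text)]

print(loose)   # ['ey', 'gra']
print(scoped)  # ['grey', 'gray']
```

re.finditer is used here instead of re.findall so that the whole match is reported even though the second pattern contains a capturing group.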
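Common Patterns and Techniques
Email Validation
Email validation is one of the most common uses of regular expressions. A basic pattern looks like this (it is a practical approximation, not a full implementation of the email address standard):
- import re
- # Basic email validation
- pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
- emails = [
-     "user@example.com",
-     "user.name@example.co.uk",
-     "user-name@example.org",
-     "invalid.email",
-     "@example.com",
-     "user@.com"
- ]
- for email in emails:
-     # fullmatch requires the whole string to match, so trailing junk is rejected
-     if re.fullmatch(pattern, email):
-         print(f"{email} is a valid email address")
-     else:
-         print(f"{email} is not a valid email address")
Output:
- user@example.com is a valid email address
- user.name@example.co.uk is a valid email address
- user-name@example.org is a valid email address
- invalid.email is not a valid email address
- @example.com is not a valid email address
- user@.com is not a valid email address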
Phone Number Validation
Phone number formats differ by country and region. Here are two common validation patterns:
- import re
- # US phone number validation
- us_phone_pattern = r"\(\d{3}\)\s\d{3}-\d{4}|\d{3}-\d{3}-\d{4}"
- us_phones = [
-     "(123) 456-7890",
-     "123-456-7890",
-     "123 456 7890",
-     "123.456.7890"
- ]
- for phone in us_phones:
-     if re.fullmatch(us_phone_pattern, phone):
-         print(f"{phone} is a valid US phone number")
-     else:
-         print(f"{phone} is not a valid US phone number")
- # Chinese mobile number validation: 11 digits, starting with 1, second digit 3-9
- cn_phone_pattern = r"1[3-9]\d{9}"
- cn_phones = [
-     "13812345678",
-     "15987654321",
-     "12345678901",
-     "1891234567"
- ]
- for phone in cn_phones:
-     if re.fullmatch(cn_phone_pattern, phone):
-         print(f"{phone} is a valid Chinese mobile number")
-     else:
-         print(f"{phone} is not a valid Chinese mobile number")
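One caveat applies to every validation pattern of this kind: re.match anchors only at the start of the string, so input with extra trailing characters still passes unless the pattern ends with $ or re.fullmatch is used. A minimal sketch of the difference, with a deliberately over-long number invented for the example:

```python
import re

pattern = r"1[3-9]\d{9}"
candidate = "13812345678999"  # 14 digits: three more than a valid mobile number

prefix_ok = re.match(pattern, candidate) is not None     # only the first 11 digits are checked
whole_ok = re.fullmatch(pattern, candidate) is not None  # the entire string must match

print(prefix_ok)  # True
print(whole_ok)   # False
```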
URL Parsing
Parse a URL and extract its parts:
- import re
- url_pattern = r"(https?|ftp)://([^/\r\n]+)(/[^\s]*)?"
- url = "https://www.example.com/path/to/page?query=param#fragment"
- match = re.match(url_pattern, url)
- if match:
-     protocol = match.group(1)
-     domain = match.group(2)
-     path = match.group(3)  # everything after the domain, including query and fragment
-
-     print(f"Protocol: {protocol}")
-     print(f"Domain: {domain}")
-     print(f"Path: {path}")
Output:
- Protocol: https
- Domain: www.example.com
- Path: /path/to/page?query=param#fragment
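The pattern above lumps the query string and fragment into the path. As a sketch of one possible refinement, the version below uses named groups to separate them and compares the result with the standard library's urllib.parse.urlsplit, which is usually the safer choice for real URLs. The group names (scheme, host, path, query, fragment) are chosen for this illustration only.

```python
import re
from urllib.parse import urlsplit

url = "https://www.example.com/path/to/page?query=param#fragment"

# Adjacent raw strings are concatenated into one pattern
url_re = re.compile(
    r"(?P<scheme>https?|ftp)://"
    r"(?P<host>[^/\s?#]+)"
    r"(?P<path>/[^\s?#]*)?"
    r"(?:\?(?P<query>[^\s#]*))?"
    r"(?:#(?P<fragment>\S*))?"
)

parts = url_re.match(url).groupdict()
print(parts["host"], parts["path"], parts["query"], parts["fragment"])

# The standard-library parser yields the same components
split = urlsplit(url)
print(split.netloc, split.path, split.query, split.fragment)
```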
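HTML Tag Processing
Extract or process HTML tags. Regular expressions cope only with small, well-formed fragments; a tag nested inside another tag of the same name defeats this approach, and a real HTML parser is more reliable for full pages:
- import re
- # Extract the content of HTML tags
- html = "<div class='example'><p>Hello <span>world</span></p></div>"
- # (\w+) captures only the tag name; [^>]* skips any attributes,
- # so the backreference \1 can match the closing tag
- tag_pattern = r"<(\w+)[^>]*>(.*?)</\1>"
- def find_html_tags(html):
-     for match in re.finditer(tag_pattern, html, re.DOTALL):
-         tag_name = match.group(1)
-         content = match.group(2)
-         print(f"Tag: {tag_name}, content: {content}")
-         # Recurse into the content to find nested tags
-         find_html_tags(content)
- find_html_tags(html)
Output:
- Tag: div, content: <p>Hello <span>world</span></p>
- Tag: p, content: Hello <span>world</span>
- Tag: span, content: world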
Date Format Matching
Match dates written in different formats:
- import re
- # Match several date formats
- date_pattern = r"(\d{4})[-/](\d{1,2})[-/](\d{1,2})|(\d{1,2})[-/](\d{1,2})[-/](\d{4})"
- dates = [
-     "2023-05-15",
-     "05/15/2023",
-     "2023/5/15",
-     "15-05-2023"
- ]
- for date in dates:
-     match = re.fullmatch(date_pattern, date)
-     if match:
-         if match.group(1):  # YYYY-MM-DD or YYYY/MM/DD
-             year, month, day = match.group(1), match.group(2), match.group(3)
-         else:  # MM-DD-YYYY or DD-MM-YYYY
-             month_or_day, day_or_month, year = match.group(4), match.group(5), match.group(6)
-             # Heuristic: if the first number is greater than 12, treat it as DD-MM-YYYY;
-             # otherwise assume MM-DD-YYYY (dates such as 05-06-2023 remain ambiguous)
-             if int(month_or_day) > 12:
-                 day, month = month_or_day, day_or_month
-             else:
-                 month, day = month_or_day, day_or_month
-
-         print(f"Date: {date}, parsed as: year {year}, month {month}, day {day}")
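A pattern like the one above checks only the shape of a date, so a string such as 2023-13-45 still matches. One common approach, sketched here for the YYYY-MM-DD format only, is to let the regex check the shape and let datetime.strptime check the calendar. The helper name is_real_date is hypothetical.

```python
import re
from datetime import datetime

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def is_real_date(text):
    """Return True only if text looks like YYYY-MM-DD and is a real calendar date."""
    if not ISO_DATE.fullmatch(text):
        return False  # wrong shape
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False  # right shape, impossible date
    return True

for candidate in ["2023-05-15", "2023-13-45", "2023-02-30", "15-05-2023"]:
    print(candidate, is_real_date(candidate))
```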
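Password Strength Validation
Check password strength, which usually means requiring uppercase and lowercase letters, digits, and special characters:
- import re
- def check_password_strength(password):
-     # At least 8 characters
-     if len(password) < 8:
-         return "Weak: the password must be at least 8 characters long"
-
-     # Contains an uppercase letter
-     if not re.search(r"[A-Z]", password):
-         return "Weak: the password must contain at least one uppercase letter"
-
-     # Contains a lowercase letter
-     if not re.search(r"[a-z]", password):
-         return "Weak: the password must contain at least one lowercase letter"
-
-     # Contains a digit
-     if not re.search(r"\d", password):
-         return "Weak: the password must contain at least one digit"
-
-     # Contains a special character (single-quoted so the " inside the class is legal)
-     if not re.search(r'[!@#$%^&*(),.?":{}|<>]', password):
-         return "Medium: consider adding a special character for better security"
-
-     return "Strong: the password meets all security requirements"
- passwords = [
-     "password",
-     "Password",
-     "Password123",
-     "Password123!",
-     "P@ssw0rd"
- ]
- for pwd in passwords:
-     print(f"Password '{pwd}': {check_password_strength(pwd)}")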
Regular Expressions in Different Programming Languages
Regular Expressions in Python
Python provides regular expression support through the re module:
- import re
- # re.match: match at the beginning of the string
- pattern = r"\d+"
- text = "123 abc 456"
- result = re.match(pattern, text)
- if result:
-     print("match:", result.group())  # Output: match: 123
- # re.search: find the first match anywhere in the string
- result = re.search(pattern, text)
- if result:
-     print("search:", result.group())  # Output: search: 123
- # re.findall: find all matches
- result = re.findall(pattern, text)
- print("findall:", result)  # Output: findall: ['123', '456']
- # re.finditer: return an iterator over all match objects
- for match in re.finditer(pattern, text):
-     print(f"finditer: {match.group()} at position {match.span()}")
- # re.sub: replace matching substrings
- result = re.sub(pattern, "NUM", text)
- print("sub:", result)  # Output: sub: NUM abc NUM
- # re.split: split the string at each match
- result = re.split(r"\s+", text)
- print("split:", result)  # Output: split: ['123', 'abc', '456']
- # Compile the regular expression
- compiled_pattern = re.compile(r"\d+")
- result = compiled_pattern.findall(text)
- print("compiled:", result)  # Output: compiled: ['123', '456']
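re.sub accepts more than a fixed replacement string. The sketch below shows three variations that often come up in practice: a function as the replacement, the count argument, and re.subn. The helper name double is made up for this example.

```python
import re

text = "123 abc 456"

def double(match):
    """Replacement callback: receives the match object, returns the new text."""
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, text)

# count=1 replaces only the first match
first_only = re.sub(r"\d+", "NUM", text, count=1)

# subn returns the new string together with the number of replacements
replaced, n = re.subn(r"\d+", "NUM", text)

print(doubled)      # 246 abc 912
print(first_only)   # NUM abc 456
print(replaced, n)  # NUM abc NUM 2
```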
Regular Expressions in JavaScript
JavaScript provides regular expression support through the RegExp object:
- // Create a regular expression
- const pattern1 = /\d+/g; // literal syntax
- const pattern2 = new RegExp("\\d+", "g"); // constructor syntax
- const text = "123 abc 456";
- // test(): check whether the pattern matches
- console.log(pattern1.test(text)); // Output: true
- // A regex with the g flag is stateful: test() advanced lastIndex past "123",
- // so reset it before reusing the same object
- pattern1.lastIndex = 0;
- // exec(): run the match and return the result
- let result;
- while ((result = pattern1.exec(text)) !== null) {
-     console.log(`exec: ${result[0]} at position ${result.index}`);
- }
- // match(): string method, returns the matches
- const matches = text.match(/\d+/g);
- console.log("match:", matches); // Output: match: ['123', '456']
- // search(): string method, returns the position of the first match
- const position = text.search(/\d+/);
- console.log("search:", position); // Output: search: 0
- // replace(): string method, replaces matching substrings
- const replaced = text.replace(/\d+/g, "NUM");
- console.log("replace:", replaced); // Output: replace: NUM abc NUM
- // split(): string method, splits the string at each match
- const parts = text.split(/\s+/);
- console.log("split:", parts); // Output: split: ['123', 'abc', '456']
Regular Expressions in Java
Java provides regular expression support through the java.util.regex package:
- import java.util.Arrays;
- import java.util.regex.*;
- public class RegexExample {
-     public static void main(String[] args) {
-         String text = "123 abc 456";
-
-         // Pattern and Matcher
-         Pattern pattern = Pattern.compile("\\d+");
-         Matcher matcher = pattern.matcher(text);
-
-         // find(): find the next match
-         while (matcher.find()) {
-             System.out.println("find: " + matcher.group() + " at position " + matcher.start());
-         }
-
-         // matches(): try to match the entire region against the pattern
-         matcher.reset();
-         System.out.println("matches: " + matcher.matches()); // Output: matches: false
-
-         // lookingAt(): try to match starting at the beginning of the region
-         matcher.reset();
-         System.out.println("lookingAt: " + matcher.lookingAt()); // Output: lookingAt: true
-
-         // Regex methods on the String class
-         String[] parts = text.split("\\s+");
-         System.out.println("split: " + Arrays.toString(parts)); // Output: split: [123, abc, 456]
-
-         String replaced = text.replaceAll("\\d+", "NUM");
-         System.out.println("replaceAll: " + replaced); // Output: replaceAll: NUM abc NUM
-
-         boolean matches = text.matches("\\d+.*");
-         System.out.println("String.matches(): " + matches); // Output: String.matches(): true
-     }
- }
Case Studies
Data Cleaning and Transformation
Suppose we have raw data containing user information and need to extract and format it:
- import re
- # Raw data
- raw_data = """
- Name: John Doe, Age: 30, Email: john.doe@example.com
- Name: Jane Smith, Age: 25, Email: jane.smith@test.org
- Name: Bob Johnson, Age: 42, Email: bob.johnson@demo.net
- """
- # Extract and format the data
- pattern = r"Name: ([^,]+), Age: (\d+), Email: (\S+)"
- formatted_data = []
- for line in raw_data.strip().split('\n'):
-     match = re.search(pattern, line)
-     if match:
-         name = match.group(1)
-         age = match.group(2)
-         email = match.group(3)
-         formatted_data.append(f"{name} ({age} years old) can be contacted at {email}")
- # Print the formatted data
- for entry in formatted_data:
-     print(entry)
Output:
- John Doe (30 years old) can be contacted at john.doe@example.com
- Jane Smith (25 years old) can be contacted at jane.smith@test.org
- Bob Johnson (42 years old) can be contacted at bob.johnson@demo.net
Log File Analysis
Analyze web server logs and extract key information:
- import re
- from collections import Counter
- # Sample log data
- log_data = """
- 127.0.0.1 - - [10/Oct/2023:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326
- 192.168.1.1 - - [10/Oct/2023:13:56:12 -0700] "GET /about.html HTTP/1.1" 200 3548
- 10.0.0.1 - - [10/Oct/2023:13:57:05 -0700] "GET /contact.html HTTP/1.1" 200 1876
- 127.0.0.1 - - [10/Oct/2023:13:58:22 -0700] "GET /index.html HTTP/1.1" 200 2326
- 192.168.1.1 - - [10/Oct/2023:13:59:15 -0700] "POST /submit-form HTTP/1.1" 302 0
- 10.0.0.1 - - [10/Oct/2023:14:00:33 -0700] "GET /products.html HTTP/1.1" 404 512
- """
- # Log parsing pattern: IP, timestamp, method, page, status code, response size
- log_pattern = r'(\d+\.\d+\.\d+\.\d+).*?\[(.*?)\].*?"(GET|POST) (.*?) HTTP/.*?" (\d+) (\d+)'
- ips = []
- status_codes = []
- pages = []
- for line in log_data.strip().split('\n'):
-     match = re.search(log_pattern, line)
-     if match:
-         ip = match.group(1)
-         timestamp = match.group(2)
-         method = match.group(3)
-         page = match.group(4)
-         status_code = match.group(5)
-         size = match.group(6)
-
-         ips.append(ip)
-         status_codes.append(status_code)
-         pages.append(page)
- # Count requests per IP address
- ip_counter = Counter(ips)
- print("Top IP addresses:")
- for ip, count in ip_counter.most_common():
-     print(f"{ip}: {count} requests")
- # Count status codes
- status_counter = Counter(status_codes)
- print("\nStatus code distribution:")
- for status, count in status_counter.items():
-     print(f"{status}: {count} responses")
- # Count the most requested pages
- page_counter = Counter(pages)
- print("\nMost requested pages:")
- for page, count in page_counter.most_common():
-     print(f"{page}: {count} requests")
Output:
- Top IP addresses:
- 127.0.0.1: 2 requests
- 192.168.1.1: 2 requests
- 10.0.0.1: 2 requests
- Status code distribution:
- 200: 4 responses
- 302: 1 responses
- 404: 1 responses
- Most requested pages:
- /index.html: 2 requests
- /about.html: 1 requests
- /contact.html: 1 requests
- /submit-form: 1 requests
- /products.html: 1 requests
Web Scraping
Extract specific information from HTML:
- import re
- # Sample HTML content
- html = """
- <html>
- <head>
- <title>Example Page</title>
- </head>
- <body>
- <h1>Welcome to the Example Page</h1>
- <div class="content">
- <p>This is the first paragraph.</p>
- <p>This is the second paragraph with <a href="https://example.com/link1">a link</a>.</p>
- <ul>
- <li>Item 1</li>
- <li>Item 2</li>
- <li>Item 3</li>
- </ul>
- </div>
- <div class="footer">
- <p>Copyright © 2023 Example Company</p>
- </div>
- </body>
- </html>
- """
- # Extract the title
- title_pattern = r"<title>(.*?)</title>"
- title_match = re.search(title_pattern, html)
- if title_match:
-     print("Page title:", title_match.group(1))
- # Extract all paragraphs
- paragraph_pattern = r"<p>(.*?)</p>"
- paragraphs = re.findall(paragraph_pattern, html)
- print("\nParagraphs:")
- for i, p in enumerate(paragraphs, 1):
-     print(f"{i}. {p}")
- # Extract all links (single-quoted so the pattern can contain double quotes)
- link_pattern = r'<a href="(.*?)">.*?</a>'
- links = re.findall(link_pattern, html)
- print("\nLinks:")
- for link in links:
-     print(f"- {link}")
- # Extract list items
- list_pattern = r"<li>(.*?)</li>"
- list_items = re.findall(list_pattern, html)
- print("\nList items:")
- for i, item in enumerate(list_items, 1):
-     print(f"{i}. {item}")
Output (the paragraph pattern returns inner HTML as-is, so the link markup is kept):
- Page title: Example Page
- Paragraphs:
- 1. This is the first paragraph.
- 2. This is the second paragraph with <a href="https://example.com/link1">a link</a>.
- 3. Copyright © 2023 Example Company
- Links:
- - https://example.com/link1
- List items:
- 1. Item 1
- 2. Item 2
- 3. Item 3
Text Replacement and Formatting
Replace and format text in bulk:
- import re
- # Original text
- text = """
- In the year 2023, the company revenue was $1,234,567.89.
- By 2024, it is expected to reach $1,500,000.00.
- Contact us at info@example.com or support@example.org for more information.
- """
- # Label four-digit years
- formatted_text = re.sub(r"\b(\d{4})\b", r"Year \1", text)
- print("Formatted dates:")
- print(formatted_text)
- # Format currency amounts
- formatted_text = re.sub(r"\$([0-9,]+\.\d{2})", r"USD \1", formatted_text)
- print("\nFormatted currency:")
- print(formatted_text)
- # Obfuscate email addresses
- formatted_text = re.sub(r"(\w+)@(\w+\.\w+)", r"\1 at \2", formatted_text)
- print("\nObfuscated emails:")
- print(formatted_text)
- # Extract all numbers from the original text
- numbers = re.findall(r"\b\d+(?:,\d+)*(?:\.\d+)?\b", text)
- print("\nExtracted numbers:")
- for num in numbers:
-     print(f"- {num}")
Output:
- Formatted dates:
- In the Year 2023, the company revenue was $1,234,567.89.
- By Year 2024, it is expected to reach $1,500,000.00.
- Contact us at info@example.com or support@example.org for more information.
- Formatted currency:
- In the Year 2023, the company revenue was USD 1,234,567.89.
- By Year 2024, it is expected to reach USD 1,500,000.00.
- Contact us at info@example.com or support@example.org for more information.
- Obfuscated emails:
- In the Year 2023, the company revenue was USD 1,234,567.89.
- By Year 2024, it is expected to reach USD 1,500,000.00.
- Contact us at info at example.com or support at example.org for more information.
- Extracted numbers:
- - 2023
- - 1,234,567.89
- - 2024
- - 1,500,000.00
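Performance Optimization and Best Practices
Avoid Greedy Matching
By default, quantifiers are greedy: they match as many characters as possible. This can produce unexpected matches and extra backtracking. Adding ? after a quantifier makes it non-greedy (lazy), so it matches as few characters as possible:
- import re
- # Greedy matching
- html = "<div>Content 1</div><div>Content 2</div>"
- greedy_pattern = r"<div>.*</div>"
- greedy_match = re.search(greedy_pattern, html)
- print("Greedy match:", greedy_match.group(0))  # matches the whole string
- # Non-greedy matching
- non_greedy_pattern = r"<div>.*?</div>"
- non_greedy_match = re.search(non_greedy_pattern, html)
- print("Non-greedy match:", non_greedy_match.group(0))  # matches only the first div
Output: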
- Greedy match: <div>Content 1</div><div>Content 2</div>
- Non-greedy match: <div>Content 1</div>
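Use Non-Capturing Groups
When you need grouping but not the captured text, use a non-capturing group (?:...). It spares the engine the bookkeeping of a capture, and in Python it also changes what re.findall returns: with a capturing group, findall returns the text of the group (its last repetition); with a non-capturing group, it returns the whole match:
- import re
- # With a capturing group
- capturing_pattern = r"(ab|cd)+"
- text = "ab cd abcd"
- result = re.findall(capturing_pattern, text)
- # For "abcd" the group's last repetition is "cd"
- print("Capturing groups:", result)  # Output: ['ab', 'cd', 'cd']
- # With a non-capturing group
- non_capturing_pattern = r"(?:ab|cd)+"
- result = re.findall(non_capturing_pattern, text)
- print("Non-capturing groups:", result)  # Output: ['ab', 'cd', 'abcd']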
Precompile Regular Expressions
If the same regular expression is used many times, precompiling it skips the per-call pattern lookup. Python's re module already caches recently compiled patterns, so the gain is usually modest and shows up mainly in tight loops:
- import re
- import time
- text = "The quick brown fox jumps over the lazy dog. " * 1000
- pattern = r"\b\w{3}\b"
- # Without precompiling
- start_time = time.perf_counter()
- for _ in range(1000):
-     re.findall(pattern, text)
- end_time = time.perf_counter()
- print(f"Without compilation: {end_time - start_time:.4f} seconds")
- # With precompiling
- compiled_pattern = re.compile(r"\b\w{3}\b")
- start_time = time.perf_counter()
- for _ in range(1000):
-     compiled_pattern.findall(text)
- end_time = time.perf_counter()
- print(f"With compilation: {end_time - start_time:.4f} seconds")
Avoid Backtracking Problems
Complex regular expressions can trigger heavy backtracking and hurt performance. The following techniques help avoid it:
1. Use a more specific character class instead of .
- import re
- import timeit
- html = "<div>Content 1</div><div>Content 2</div>"
- # Less efficient: the dot can match anything, including "<"
- inefficient_pattern = r"<div>.*?</div>"
- # More efficient: [^<] stops at the next tag, so there is nothing to backtrack over
- efficient_pattern = r"<div>[^<]*</div>"
- # Compare timings
- inefficient_time = timeit.timeit(lambda: re.findall(inefficient_pattern, html), number=10000)
- efficient_time = timeit.timeit(lambda: re.findall(efficient_pattern, html), number=10000)
- print(f"Inefficient pattern: {inefficient_time:.4f} seconds")
- print(f"Efficient pattern: {efficient_time:.4f} seconds")
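2. Use atomic groups or possessive quantifiers (where supported)
- import regex  # third-party regex module; the built-in re module also supports these from Python 3.11
- html = "<div>Content 1</div><div>Content 2</div>"
- # Atomic group (?>...): once the group has matched, the engine never backtracks into it.
- # Note that wrapping a lazy .*? in an atomic group would match nothing, because the
- # group settles on the empty string; a negated character class is used instead.
- atomic_pattern = r"<div>(?>[^<]*)</div>"
- # Possessive quantifiers *+, ++, ?+ behave the same way for a single quantifier
- possessive_pattern = r"<div>[^<]*+</div>"
- print("Atomic group:", regex.findall(atomic_pattern, html))
- print("Possessive quantifier:", regex.findall(possessive_pattern, html))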
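To make the cost of backtracking concrete, here is a rough sketch of the classic failure case: a quantifier nested inside another quantifier, applied to input that almost matches. The subject string, the length of 20, and the helper timed_search are all chosen for illustration; actual timings depend on the machine, so none are quoted here.

```python
import re
import time

# 20 "a" characters followed by "b": the pattern almost matches, then fails at the end
subject = "a" * 20 + "b"

def timed_search(pattern, text):
    """Run re.search and return (match_or_None, elapsed_seconds)."""
    start = time.perf_counter()
    found = re.search(pattern, text)
    return found, time.perf_counter() - start

# Nested quantifiers: the engine tries every way of splitting the run of "a"s
nested, nested_time = timed_search(r"^(a+)+$", subject)

# Equivalent flat pattern: a single quantifier, so failure is detected quickly
flat, flat_time = timed_search(r"^a+$", subject)

print(f"nested: match={nested}, {nested_time:.4f} s")
print(f"flat:   match={flat}, {flat_time:.6f} s")
```

Both searches fail, but the work done by the nested pattern roughly doubles with every extra "a", so keep the subject short when experimenting.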
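Advanced Techniques and Recommended Resources
Assertions (Zero-Width Assertions)
Assertions match a position without consuming any characters. There are four kinds:
1. Positive lookahead (?=...): the position must be followed by the pattern
2. Negative lookahead (?!...): the position must not be followed by the pattern
3. Positive lookbehind (?<=...): the position must be preceded by the pattern
4. Negative lookbehind (?<!...): the position must not be preceded by the pattern
- import re
- # Positive lookahead: the word that is followed by " fox"
- text = "The quick brown fox jumps over the lazy dog"
- pattern = r"\w+(?= fox)"
- result = re.search(pattern, text)
- print("Positive lookahead:", result.group())  # Output: brown
- # Negative lookahead: whole words that are not followed by " fox".
- # The \b anchors matter: without them the engine backtracks and returns "brow".
- pattern = r"\b\w+\b(?! fox)"
- result = re.findall(pattern, text)
- print("Negative lookahead:", result)  # Output: ['The', 'quick', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
- # Positive lookbehind: the word that is preceded by "quick "
- pattern = r"(?<=quick )\w+"
- result = re.search(pattern, text)
- print("Positive lookbehind:", result.group())  # Output: brown
- # Negative lookbehind: whole words that are not preceded by "quick ".
- # The leading \b matters: without it the match would restart inside the word as "rown".
- pattern = r"\b(?<!quick )\w+"
- result = re.findall(pattern, text)
- print("Negative lookbehind:", result)  # Output: ['The', 'quick', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']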
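Lookaheads are also a common way to combine several independent requirements in one pattern, for example the password rules discussed earlier. The sketch below is one possible formulation, not the only one; the name is_strong and the definition of a special character as "anything that is neither a word character nor whitespace" are assumptions made for this example.

```python
import re

# Each lookahead checks one requirement from the start of the string
# without consuming anything; .{8,} then enforces the minimum length.
STRONG = re.compile(
    r"(?=.*[a-z])"     # at least one lowercase letter
    r"(?=.*[A-Z])"     # at least one uppercase letter
    r"(?=.*\d)"        # at least one digit
    r"(?=.*[^\w\s])"   # at least one special character (assumed definition)
    r".{8,}"
)

def is_strong(password):
    return STRONG.fullmatch(password) is not None

for pwd in ["password", "Password123", "Password123!", "P@ssw0rd"]:
    print(pwd, is_strong(pwd))
```

A single pattern is compact, but a step-by-step function that checks each rule separately can report which rule failed, so the choice depends on whether that feedback is needed.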
Backreferences
A backreference refers back to the text captured by an earlier group:
- import re
- # Match repeated words
- text = "hello hello world world test"
- pattern = r"\b(\w+)\s+\1\b"
- result = re.findall(pattern, text)
- print("Repeated words:", result)  # Output: ['hello', 'world']
- # Match HTML tags whose opening and closing names are the same
- html = "<div>Content</div><p>Paragraph</p>"
- pattern = r"<([a-z]+)>.*?</\1>"
- result = re.findall(pattern, html, re.DOTALL)
- print("HTML tags:", result)  # Output: ['div', 'p']
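Numbered backreferences become hard to read as patterns grow. Python's re module also supports named groups, which can be referenced inside the pattern with (?P=name) and inside a replacement string with \g<name>. A short sketch, reusing the repeated-word example with an illustrative group name word:

```python
import re

text = "hello hello world world test"

# (?P<word>...) names the group; (?P=word) must match the same text again
repeated = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

words = [m.group("word") for m in repeated.finditer(text)]

# In the replacement, \g<word> inserts the captured text once
collapsed = repeated.sub(r"\g<word>", text)

print(words)      # ['hello', 'world']
print(collapsed)  # hello world test
```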
Conditional Matching
Some regular expression engines support conditional matching. Python's built-in re module does, using the syntax (?(id)yes|no):
- import re
- # Conditional match: if group 1 (a leading number) matched, require "number"; otherwise require "word"
- text1 = "123 number"
- text2 = "abc word"
- pattern = r"(\d+ )?(?(1)number|word)"
- result1 = re.search(pattern, text1)
- result2 = re.search(pattern, text2)
- print("Text 1:", result1.group())  # Output: 123 number
- print("Text 2:", result2.group())  # Output: word
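Regular Expression Debugging Tools
Debugging a complex regular expression can be difficult. These tools help:
1. Regex101 (https://regex101.com/): online regex tester and debugger
2. Debuggex (https://www.debuggex.com/): visual regex debugger
3. RegExr (https://regexr.com/): online tool for learning and testing regular expressions
Recommended Learning Resources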
Books:
• Mastering Regular Expressions, by Jeffrey E.F. Friedl
• Regular Expressions Cookbook, by Jan Goyvaerts and Steven Levithan
Online tutorials:
• MDN Web Docs: Regular expressions (https://developer.mozilla.org/zh-CN/docs/Web/JavaScript/Guide/Regular_Expressions)
• RegexOne: interactive regular expression tutorial (https://regexone.com/)
• Regular-Expressions.info: tutorial and reference (https://www.regular-expressions.info/)
Practice platforms:
• HackerRank: regular expression exercises (https://www.hackerrank.com/domains/regex)
• Codewars: regular expression challenges (https://www.codewars.com/?language=python)
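Summary
Regular expressions are a powerful and flexible text-processing tool. This article covered their core concepts and applications, from basic syntax to advanced techniques such as lookaround assertions, backreferences, and conditional matching.
Learning regular expressions is a gradual process. The syntax can feel dense at first, but it becomes familiar with practice. Start with simple patterns, work up to more complex expressions, and use the online tools above to test and debug them.
Whether you are a data analyst, a software developer, or a system administrator, regular expressions are a valuable part of your toolbox.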