This article explains how to extract a substring from a string in Python. You can get a substring by specifying its position and length, or by using regular expression (regex) patterns.
Contents
- Extract a substring by specifying the position and length
  - Extract a character by index
  - Extract a substring by slicing
  - Extract based on the number of characters
- Extract a substring with regex: re.search(), re.findall()
- Regex pattern examples
  - Wildcard-like patterns
  - Greedy and non-greedy matching
  - Extract part of the pattern with parentheses
  - Match any single character
  - Match the start/end of the string
  - Extract by multiple patterns
  - Case-insensitive matching
To find the position of a substring or replace it with another string, see the following articles.
- Search for a string in Python (Check if a substring is included/Get a substring position)
- Replace strings in Python (replace, translate, re.sub, re.subn)
If you want to extract a substring from a text file, read the file as a string.
- Read, write, and create files in Python (with and open())
Extract a substring by specifying the position and length
Extract a character by index
You can get a character at the desired position by specifying an index in []. Indexes start at 0 (zero-based indexing).
s = 'abcde'
print(s[0])
# a
print(s[4])
# e
source: str_index_slice.py
You can specify a position counted from the end with negative values. -1 represents the last character.
print(s[-1])
# e
print(s[-5])
# a
source: str_index_slice.py
An error is raised if you specify an index that doesn't exist.
# print(s[5])
# IndexError: string index out of range
# print(s[-6])
# IndexError: string index out of range
source: str_index_slice.py
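When the index comes from user input or a calculation, you may prefer a fallback value over an exception. The following is a minimal sketch of that idea; char_at() and its default parameter are hypothetical names introduced here for illustration, not part of the standard library.

```python
def char_at(s, i, default=''):
    """Return s[i], or default if i is out of range (hypothetical helper)."""
    try:
        return s[i]
    except IndexError:
        return default

s = 'abcde'
last = char_at(s, 4)
missing = char_at(s, 5)
missing_none = char_at(s, -6, default=None)

print(last)
# e
print(repr(missing))
# ''
print(missing_none)
# None
```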
Extract a substring by slicing
You can extract a substring in the range start <= x < stop with [start:stop]. If start is omitted, the range begins at the start of the string, and if stop is omitted, the range extends to the end of the string.
s = 'abcde'
print(s[1:3])
# bc
print(s[:3])
# abc
print(s[1:])
# bcde
source: str_index_slice.py
You can also use negative values.
print(s[-4:-2])
# bc
print(s[:-2])
# abc
print(s[-4:])
# bcde
source: str_index_slice.py
If start > stop, no error is raised, and an empty string ('') is extracted.
print(s[3:1])
#
print(s[3:1] == '')
# True
source: str_index_slice.py
Unlike indexing, slicing does not raise an error for out-of-range values; they are clamped to the ends of the string.
print(s[-100:100])
# abcde
source: str_index_slice.py
In addition to the start position start and end position stop, you can also specify an increment step using the syntax [start:stop:step]. If step is negative, the substring will be extracted in reverse order.
print(s[1:4:2])
# bd
print(s[::2])
# ace
print(s[::3])
# ad
print(s[::-1])
# edcba
print(s[::-2])
# eca
source: str_index_slice.py
For more information on slicing, see the following article.
- How to slice a list, string, tuple in Python
Extract based on the number of characters
The built-in len() function returns the number of characters in a string. You can use it to get the central character or to extract the first or second half of the string by slicing.
Note that you can specify only integers (int) for index [] and slice [:]. Division by / in indexing or slicing raises an error because the result is a floating-point number (float).
The following example uses floor division //, which rounds the result down to an integer.
s = 'abcdefghi'
print(len(s))
# 9
# print(s[len(s) / 2])
# TypeError: string indices must be integers
print(s[len(s) // 2])
# e
print(s[:len(s) // 2])
# abcd
print(s[len(s) // 2:])
# efghi
source: str_index_slice.py
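Slicing takes a start and a stop rather than a start and a length, so extracting "length characters from position start" means computing the stop yourself. The sketch below wraps that arithmetic in substr(), a hypothetical helper named here for illustration. It assumes start is non-negative; with a negative start, start + length can reach or cross zero and give an unexpected result.

```python
def substr(s, start, length):
    """Return up to `length` characters of s beginning at index `start`.

    Hypothetical helper; assumes start >= 0.
    """
    return s[start:start + length]

s = 'abcdefghi'
middle = substr(s, 2, 3)
tail = substr(s, 6, 100)  # an out-of-range stop is clamped by slicing
last_three = s[-3:]       # last n characters: use a negative start directly

print(middle)
# cde
print(tail)
# ghi
print(last_three)
# ghi
```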
Extract a substring with regex: re.search(), re.findall()
You can use regular expressions (regex) with the re module of the standard library.
- Regular expressions with the re module in Python
Use re.search() to extract a substring matching a regex pattern. Specify the regex pattern as the first argument and the target string as the second argument.
import re

s = '012-3456-7890'
print(re.search(r'\d+', s))
# <re.Match object; span=(0, 3), match='012'>
source: str_extract_re.py
In regex, \d matches a digit character, while + matches one or more repetitions of the preceding pattern. Therefore, \d+ matches one or more consecutive digits.
Since backslash \ is used in regex special sequences such as \d, it is convenient to use a raw string by adding r before '' or "".
- Raw strings in Python
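As a small illustration of why raw strings are convenient, the sketch below compares a raw string with the equivalent normal string, in which each backslash has to be doubled. The sample strings are chosen here for illustration.

```python
import re

# The raw string r'\d+' and the escaped normal string '\\d+' are the same value.
same = r'\d+' == '\\d+'
print(same)
# True

# In a raw string, \n stays as two characters (backslash and n), not a newline.
lengths = (len(r'\n'), len('\n'))
print(lengths)
# (2, 1)

# Both spellings therefore produce the same regex.
digits = re.findall('\\d+', '012-3456-7890')
print(digits)
# ['012', '3456', '7890']
```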
When a string matches the pattern, re.search() returns a match object. You can get the matched part as a string (str) using the group() method of the match object.
m = re.search(r'\d+', s)
print(m.group())
# 012
print(type(m.group()))
# <class 'str'>
source: str_extract_re.py
For more information on regex match objects, see the following article.
- How to use regex match objects in Python
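When nothing matches, re.search() returns None instead of a match object, and calling group() on None raises AttributeError. The following sketch shows one way to guard against that; the fallback value '' is an arbitrary choice made for this example.

```python
import re

s = '012-3456-7890'

# No lowercase letters in s, so there is no match.
no_match = re.search(r'[a-z]+', s)
print(no_match)
# None

# Check for None before calling group().
if no_match is not None:
    print(no_match.group())
else:
    print('no match')
# no match

# A compact form: fall back to an empty string when there is no match.
m = re.search(r'\d+', s)
result = m.group() if m else ''
print(result)
# 012
```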
As shown in the example above, re.search() returns the match object for the first occurrence only, even if there are multiple matching parts in the string.
re.findall() returns a list of all matching substrings.
print(re.findall(r'\d+', s))
# ['012', '3456', '7890']
source: str_extract_re.py
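re.findall() returns only the matched strings. If you also need the position of each match, one option from the same module is re.finditer(), which yields a match object per occurrence. This is a brief sketch that goes slightly beyond the two functions covered in this section.

```python
import re

s = '012-3456-7890'

# Collect each matched substring together with its (start, end) span.
spans = [(m.group(), m.span()) for m in re.finditer(r'\d+', s)]
print(spans)
# [('012', (0, 3)), ('3456', (4, 8)), ('7890', (9, 13))]
```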
Regex pattern examples
This section provides examples of regex patterns using metacharacters and special sequences.
Wildcard-like patterns
. matches any single character except a newline, and * matches zero or more repetitions of the preceding pattern.
For example, a.*b matches the string starting with a and ending with b. Since * matches zero repetitions, it also matches ab.
print(re.findall('a.*b', 'axyzb'))
# ['axyzb']
print(re.findall('a.*b', 'a---b'))
# ['a---b']
print(re.findall('a.*b', 'aあいうえおb'))
# ['aあいうえおb']
print(re.findall('a.*b', 'ab'))
# ['ab']
source: str_extract_re.py
+ matches one or more repetitions of the preceding pattern. a.+b does not match ab.
print(re.findall('a.+b', 'ab'))
# []
print(re.findall('a.+b', 'axb'))
# ['axb']
print(re.findall('a.+b', 'axxxxxxb'))
# ['axxxxxxb']
source: str_extract_re.py
? matches zero or one occurrence of the preceding pattern. For example, a.?b matches ab and any string with exactly one character between a and b.
print(re.findall('a.?b', 'ab'))
# ['ab']
print(re.findall('a.?b', 'axb'))
# ['axb']
print(re.findall('a.?b', 'axxb'))
# []
source: str_extract_re.py
Greedy and non-greedy matching
*, +, and ? are greedy matches, matching as much text as possible. In contrast, *?, +?, and ?? are non-greedy, minimal matches, matching as few characters as possible.
s = 'axb-axxxxxxb'
print(re.findall('a.*b', s))
# ['axb-axxxxxxb']
print(re.findall('a.*?b', s))
# ['axb', 'axxxxxxb']
source: str_extract_re.py
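The difference matters most when the closing delimiter appears more than once. As an illustrative sketch with a made-up sample string, the following extracts angle-bracket tags: the greedy pattern runs from the first < to the last >, while the non-greedy one stops at the nearest >.

```python
import re

s = '<b>bold</b> and <i>italic</i>'

# Greedy: .* extends to the last '>' in the string.
greedy = re.findall('<.*>', s)
print(greedy)
# ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the first '>' after each '<'.
non_greedy = re.findall('<.*?>', s)
print(non_greedy)
# ['<b>', '</b>', '<i>', '</i>']
```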
Extract part of the pattern with parentheses
If you enclose part of a regex pattern in parentheses (), re.findall() returns only the substring that matches the enclosed part.
print(re.findall('a(.*)b', 'axyzb'))
# ['xyz']
source: str_extract_re.py
If you want to match parentheses () as characters, escape them with backslash \.
print(re.findall(r'\(.+\)', 'abc(def)ghi'))
# ['(def)']
print(re.findall(r'\((.+)\)', 'abc(def)ghi'))
# ['def']
source: str_extract_re.py
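A pattern can contain more than one pair of parentheses. The sketch below, using sample strings chosen for illustration, shows how the parts are then retrieved: group(n) and groups() on a match object from re.search(), and a list of tuples from re.findall().

```python
import re

s = '012-3456-7890'
m = re.search(r'(\d+)-(\d+)-(\d+)', s)

whole = m.group()   # the entire match
first = m.group(1)  # the first parenthesized part
parts = m.groups()  # all parenthesized parts as a tuple

print(whole)
# 012-3456-7890
print(first)
# 012
print(parts)
# ('012', '3456', '7890')

# With multiple groups, re.findall() returns a list of tuples.
pairs = re.findall(r'(\d+)-(\d+)', 'a 1-2 b 30-40')
print(pairs)
# [('1', '2'), ('30', '40')]
```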
Match any single character
Using square brackets [] in a pattern matches any single character from the enclosed set of characters.
Using a hyphen - between two characters, like [a-z], creates a range covering every character between their Unicode code points. For example, [a-z] matches any single lowercase ASCII letter.
print(re.findall('[abc]x', 'ax-bx-cx'))
# ['ax', 'bx', 'cx']
print(re.findall('[abc]+', 'abc-aaa-cba'))
# ['abc', 'aaa', 'cba']
print(re.findall('[a-z]+', 'abc-xyz'))
# ['abc', 'xyz']
source: str_extract_re.py
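Several ranges can be combined inside one pair of brackets, and a ^ placed immediately after the opening bracket negates the set. The following sketch, with a made-up sample string, shows both forms.

```python
import re

s = 'user_01@Example.com'

# Combine ranges: lowercase letters, uppercase letters, and digits.
alnum = re.findall('[a-zA-Z0-9]+', s)
print(alnum)
# ['user', '01', 'Example', 'com']

# Negated set: one or more characters that are NOT a hyphen.
not_hyphen = re.findall('[^-]+', 'abc-xyz')
print(not_hyphen)
# ['abc', 'xyz']
```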
Match the start/end of the string
^ matches the start of the string, and $ matches the end of the string.
s = 'abc-def-ghi'
print(re.findall('[a-z]+', s))
# ['abc', 'def', 'ghi']
print(re.findall('^[a-z]+', s))
# ['abc']
print(re.findall('[a-z]+$', s))
# ['ghi']
source: str_extract_re.py
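If the string spans several lines, for example the contents of a text file, ^ and $ by default still refer to the whole string. Passing the re.MULTILINE flag makes them match at the start and end of each line as well. The sample string below is made up for illustration.

```python
import re

s = 'abc-123\ndef-456'

# Default: ^ matches only at the very start of the string.
default_start = re.findall('^[a-z]+', s)
print(default_start)
# ['abc']

# re.MULTILINE: ^ and $ also match at the start and end of every line.
line_starts = re.findall('^[a-z]+', s, flags=re.MULTILINE)
line_ends = re.findall(r'\d+$', s, flags=re.MULTILINE)
print(line_starts)
# ['abc', 'def']
print(line_ends)
# ['123', '456']
```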
Extract by multiple patterns
Use | to match a substring that conforms to any of the specified patterns. For example, to match substrings that follow either pattern A or pattern B, use A|B.
s = 'axxxb-012'
print(re.findall('a.*b', s))
# ['axxxb']
print(re.findall(r'\d+', s))
# ['012']
print(re.findall(r'a.*b|\d+', s))
# ['axxxb', '012']
source: str_extract_re.py
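Note that | applies to everything on either side of it, so limiting the alternatives to one part of a pattern requires grouping. The sketch below uses a non-capturing group (?:...), which groups without changing what re.findall() returns. The file names are made up for illustration.

```python
import re

s = 'file_a.png file_b.jpg file_c.txt'

# Without grouping, the alternatives are '\w+\.png' and 'jpg'.
ungrouped = re.findall(r'\w+\.png|jpg', s)
print(ungrouped)
# ['file_a.png', 'jpg']

# A non-capturing group restricts | to the extension only.
grouped = re.findall(r'\w+\.(?:png|jpg)', s)
print(grouped)
# ['file_a.png', 'file_b.jpg']

# A capturing group makes findall() return just the extension.
extensions = re.findall(r'\w+\.(png|jpg)', s)
print(extensions)
# ['png', 'jpg']
```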
Case-insensitive matching
Matching with the re module is case-sensitive by default. Set the flags argument to re.IGNORECASE to perform case-insensitive matching.
s = 'abc-Abc-ABC'
print(re.findall('[a-z]+', s))
# ['abc', 'bc']
print(re.findall('[A-Z]+', s))
# ['A', 'ABC']
print(re.findall('[a-z]+', s, flags=re.IGNORECASE))
# ['abc', 'Abc', 'ABC']
source: str_extract_re.py
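The same flag works with re.search(), and it can be written more briefly as re.I or embedded at the start of the pattern as (?i). A short sketch of these equivalent spellings:

```python
import re

s = 'abc-Abc-ABC'

# re.I is a short alias for re.IGNORECASE.
with_alias = re.findall('abc', s, flags=re.I)
print(with_alias)
# ['abc', 'Abc', 'ABC']

# The inline flag (?i) at the start of the pattern has the same effect.
with_inline = re.findall('(?i)abc', s)
print(with_inline)
# ['abc', 'Abc', 'ABC']

# re.search() still returns only the first occurrence.
first = re.search('ABC', s, flags=re.IGNORECASE).group()
print(first)
# abc
```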