Split string in a spark dataframe column by regular expressions capturing groups
Here's how you can do this without using a udf:
from pyspark.sql import functions as f

df = df.withColumn(
    "numbers",
    f.split(f.regexp_replace("numbers", "([0-9]{3})(?!$)", r"$1,"), ",")
)
df.show(truncate=False)
#+---+------------------------------+
#|id |numbers |
#+---+------------------------------+
#|742|[000, 000, 000] |
#|744|[000, 000] |
#|746|[003, 000, 000, 000, 000, 000]|
#+---+------------------------------+
First use pyspark.sql.functions.regexp_replace to replace each sequence of 3 digits with that sequence followed by a comma, then split the resulting string on the commas. The replacement pattern "$1," means the first capturing group, followed by a comma. In the match pattern, we also include a negative lookahead for end of string, (?!$), to avoid adding a comma to the end of the string.
Reference: REGEXP_REPLACE capturing groups
How to split column on the first occurrence of a string?
It looks like you want to split on the first ., which you can achieve with a regex look-around. This saves you the regexp_replace step:
(df.withColumn("Username", split($"Username", "(?<=^[^.]*)\\."))
.withColumn("A", $"Username"(0))
.withColumn("B", $"Username"(1))).show
+----+-----------+---+------+
|Name| Username| A| B|
+----+-----------+---+------+
| ABC|[a, b.cdef]| a|b.cdef|
+----+-----------+---+------+
Detail on the pattern (?<=^[^.]*)\\.:
- Use a look-behind, (?<=...), to restrict which . to split on; ^[^.]* matches a prefix from the beginning of the string which doesn't contain a . character. Notice the first ^ means beginning of string, [^.] is a character class that matches any character except ., and * is a quantifier which means zero or more.
Combined together, the pattern matches only the first . in the string.
PySpark - String matching to create new column
In short:
regexp_extract(col('Notes'), '(.)(by)(\s+)(\w+)', 4)
This expression extracts the employee name from any position where it appears after the word by and one or more spaces in the text column (col('Notes')).
In Detail:
Create a sample dataframe
data = [('2345', 'Checked by John'),
('2398', 'Verified by Stacy'),
('2328', 'Verified by Srinivas than some random text'),
('3983', 'Double Checked on 2/23/17 by Marsha')]
df = sc.parallelize(data).toDF(['ID', 'Notes'])
df.show()
+----+--------------------+
| ID| Notes|
+----+--------------------+
|2345| Checked by John|
|2398| Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+
Do the needed imports
from pyspark.sql.functions import regexp_extract, col
On df, extract the Employee name from the column using regexp_extract(column_name, regex, group_number). Here the regex '(.)(by)(\s+)(\w+)' means
- (.) - any single character (except newline)
- (by) - the literal word by in the text
- (\s+) - one or more whitespace characters
- (\w+) - one or more word characters (alphanumeric or underscore)
and group_number is 4 because the group (\w+) is in the 4th position in the expression
result = df.withColumn('Employee', regexp_extract(col('Notes'), '(.)(by)(\s+)(\w+)', 4))
result.show()
+----+--------------------+--------+
| ID| Notes|Employee|
+----+--------------------+--------+
|2345| Checked by John| John|
|2398| Verified by Stacy| Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...| Marsha|
+----+--------------------+--------+
Databricks notebook
Note:
regexp_extract(col('Notes'), '.by\s+(\w+)', 1) is a much cleaner version; check the regex in use here.
How can I extract all the instances of a regular expression pattern in PySpark dataframe?
In Spark 3.1+, regexp_extract_all is available:
regexp_extract_all(str, regexp[, idx]) - Extract all strings in str that match the regexp expression and correspond to the regex group index.
df = df.withColumn('output', F.expr(r"regexp_extract_all(stringValue, '[a-z]*(\\d+)', 1)"))
df.show()
#+-----------+-----------+
#|stringValue| output|
#+-----------+-----------+
#| a1234bc123|[1234, 123]|
#| av1tb12h18|[1, 12, 18]|
#| abcd| []|
#+-----------+-----------+
REGEXP_REPLACE capturing groups
Hive's supported notation (at least for 0.14, and I think I recall it being this way for 0.13.x as well) for regex backreferences seems to be $1 for capture group 1, $2 for capture group 2, etc. It looks like it is based upon (and may even be implemented by) the replaceAll method from the Matcher class. This is the germane portion of that documentation:
Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.
So I think what you want is this:
select regexp_replace('2015-01-01 02:03:04 +0:00', ' ([+-])', ' GMT$1');
For example:
hive> select regexp_replace('2015-01-01 02:03:04 +0:00', ' ([+-])', ' GMT$1');
OK
2015-01-01 02:03:04 GMT+0:00
Time taken: 0.072 seconds, Fetched: 1 row(s)
hive> select regexp_replace('2015-01-01 02:03:04 -1:00', ' ([+-])', ' GMT$1');
OK
2015-01-01 02:03:04 GMT-1:00
Time taken: 0.144 seconds, Fetched: 1 row(s)
Find out substring from url/value of a key from url
PySpark:
df \
.withColumn("partialURL", split("url", "tag=")[1]) \
.withColumn("tag", split("partialURL", "&")[0]) \
.drop("partialURL")
I have an issue with regex extract with multiple matches
Here is how you can do it with a python UDF:
from pyspark.sql.types import *
from pyspark.sql.functions import *
import re
data = [('60 ML of paracetomol and 0.5 ML of XYZ',)]
df = sc.parallelize(data).toDF(['str'])
# Define the function you want to apply
def extract(s):
    all_matches = re.findall(r'\d+(?:\.\d+)? ML', s)
    return all_matches
# Create the UDF, note that you need to declare the return schema matching the returned type
extract_udf = udf(extract, ArrayType(StringType()))
# Apply it
df2 = df.withColumn('extracted', extract_udf('str'))
Python UDFs take a significant performance hit compared to native DataFrame operations. After thinking about it a little more, here is another way to do it without a UDF. The general idea is to replace all the text that isn't what you want with commas, then split on the commas to create your array of final values. If you only want the numbers, you can update the regexes to take 'ML' out of the capture group.
pattern = r'\d+(?:\.\d+)? ML'
split_pattern = r'.*?({pattern})'.format(pattern=pattern)
end_pattern = r'(.*{pattern}).*?$'.format(pattern=pattern)
df2 = df.withColumn('a', regexp_replace('str', split_pattern, '$1,'))
df3 = df2.withColumn('a', regexp_replace('a', end_pattern, '$1'))
df4 = df3.withColumn('a', split('a', r','))
How to validate (and drop) a column based on a regex condition in Pyspark without multiple scanning and shuffling?
In terms of performance you should always try to use PySpark functions over plain Python functions: PySpark functions are optimized to utilize the resources of your cluster, and the data doesn't need to be converted to Python objects.
The appropriate PySpark function for your use case is rlike. Have a look at the example below:
from pyspark.sql import Row
my_cols = Row("Column1", "Column2", "Column3", "Column4")
row_1 = my_cols('(617)-283-3811', 'Salah', 'Messi', None)
row_2 = my_cols('617-2833811', 'Messi', 'Virgil', 'Messi')
row_3 = my_cols('617-283-3811', 'Ronaldo', 'Messi', 'Ronaldo')
row_seq = [row_1, row_2, row_3]
df = spark.createDataFrame(row_seq)
numberOfRows = df.count()
# I have simplified your regexes a bit because I don't see a reason
# why you need non-capturing groups
expr = "^(\(\d{3}\)-\d{3}-\d{4})|(\d{3}-\d{3}-\d{4})|(\d{3}-\d{7})|(\d{10})$"
# you can also set it to df.columns
columnsToCheck = ['Column1']
columnsToRemove = []
for col in columnsToCheck:
    numberOfMatchingRows = df.filter(df[col].rlike(expr)).count()
    if numberOfMatchingRows < numberOfRows:
        columnsToRemove.append(col)
df = df.select(*[c for c in df.columns if c not in columnsToRemove])
df.show()
Output:
+--------------+-------+-------+-------+
| Column1|Column2|Column3|Column4|
+--------------+-------+-------+-------+
|(617)-283-3811| Salah| Messi| null|
| 617-2833811| Messi| Virgil| Messi|
| 617-283-3811|Ronaldo| Messi|Ronaldo|
+--------------+-------+-------+-------+
Extracting several regex matches in PySpark
Unfortunately, there is no way to get all the matches in Spark (before the regexp_extract_all of Spark 3.1+ discussed above). You can only specify the matched group index using idx:
func.regexp_extract('col', my_regex, idx=1)
There is an unmerged request for the same, which can be found here.
TL;DR: As of now, you will need to write a UDF for this.