Split String in a Spark Dataframe Column by Regular Expressions Capturing Groups

Split string in a spark dataframe column by regular expressions capturing groups

Here's how you can do this without using a udf:

from pyspark.sql import functions as f

df = df.withColumn(
    "numbers",
    f.split(f.regexp_replace("numbers", r"([0-9]{3})(?!$)", r"$1,"), ",")
)

df.show(truncate=False)
#+---+------------------------------+
#|id |numbers |
#+---+------------------------------+
#|742|[000, 000, 000] |
#|744|[000, 000] |
#|746|[003, 000, 000, 000, 000, 000]|
#+---+------------------------------+

First use pyspark.sql.functions.regexp_replace to replace each sequence of 3 digits with that sequence followed by a comma. Then split the resulting string on the comma.

The replacement pattern "$1," means the first capturing group, followed by a comma.

In the match pattern, we also include a negative lookahead for end of string, (?!$), so that no comma is added after the final group.
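If it helps to see the two steps separately, here is a minimal sketch (assuming the same df with a string column "numbers") that shows the intermediate string produced by regexp_replace before the split:

df.select(
    f.regexp_replace("numbers", r"([0-9]{3})(?!$)", r"$1,").alias("with_commas")
).show(truncate=False)
# e.g. "000000000" becomes "000,000,000", which then splits cleanly on ","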

Reference: REGEXP_REPLACE capturing groups

How to split column on the first occurrence of a string?

It looks like you want to split on the first ., which you can achieve with a regex lookaround. This saves you the regexp_replace step:

(df.withColumn("Username", split($"Username", "(?<=^[^.]*)\\."))
.withColumn("A", $"Username"(0))
.withColumn("B", $"Username"(1))).show

+----+-----------+---+------+
|Name| Username| A| B|
+----+-----------+---+------+
| ABC|[a, b.cdef]| a|b.cdef|
+----+-----------+---+------+

Detail on pattern (?<=^[^.]*)\\.:

  • Use the lookbehind (?<=...) to restrict which . to split on;
  • ^[^.]* matches the part of the string from the beginning that doesn't contain a .. The first ^ means beginning of string, [^.] is a character class matching any character except ., and * is a quantifier meaning zero or more.

Combined together the pattern matches the first . in the string.
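For PySpark users, a rough equivalent sketch of the same idea (the sample data and column names are assumed from the output above):

from pyspark.sql import functions as F

df = spark.createDataFrame([("ABC", "a.b.cdef")], ["Name", "Username"])

df = (df.withColumn("Username", F.split("Username", r"(?<=^[^.]*)\."))
        .withColumn("A", F.col("Username").getItem(0))
        .withColumn("B", F.col("Username").getItem(1)))
df.show()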

PySpark - String matching to create new column

In short:

regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4)

This expression extracts the employee name from any position where it appears after by and one or more spaces in the text column (col('Notes')).



In Detail:

Create a sample dataframe

data = [('2345', 'Checked by John'),
        ('2398', 'Verified by Stacy'),
        ('2328', 'Verified by Srinivas than some random text'),
        ('3983', 'Double Checked on 2/23/17 by Marsha')]

df = sc.parallelize(data).toDF(['ID', 'Notes'])

df.show()

+----+--------------------+
| ID| Notes|
+----+--------------------+
|2345| Checked by John|
|2398| Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+
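As an aside, the same sample DataFrame can be built without going through the RDD API; a quick sketch:

df = spark.createDataFrame(data, ['ID', 'Notes'])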

Do the needed imports

from pyspark.sql.functions import regexp_extract, col

On df, extract the employee name from the Notes column using regexp_extract(column_name, regex, group_number).

Here the regex '(.)(by)(\s+)(\w+)' means

  • (.) - Any character (except newline)
  • (by) - The word by in the text
  • (\s+) - One or more whitespace characters
  • (\w+) - One or more alphanumeric or underscore characters (the name that follows)

and group_number is 4 because the group (\w+) is the 4th group in the expression

result = df.withColumn('Employee', regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4))

result.show()

+----+--------------------+--------+
| ID| Notes|Employee|
+----+--------------------+--------+
|2345| Checked by John| John|
|2398| Verified by Stacy| Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...| Marsha|
+----+--------------------+--------+

Note:

regexp_extract(col('Notes'), r'.by\s+(\w+)', 1) is a much cleaner version of the same extraction.
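As a quick sketch, applied to the same sample df from above:

from pyspark.sql.functions import regexp_extract, col

result = df.withColumn('Employee', regexp_extract(col('Notes'), r'.by\s+(\w+)', 1))
result.show()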

How can I extract all the instances of a regular expression pattern in PySpark dataframe?

In Spark 3.1+ regexp_extract_all is available.

regexp_extract_all(str, regexp[, idx]) - Extract all strings in the str that match the regexp expression and corresponding to the regex group index.
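The input DataFrame isn't shown in the original snippet; a hypothetical construction matching the output below (the column name is taken from that output):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("a1234bc123",), ("av1tb12h18",), ("abcd",)],
    ["stringValue"],
)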

df = df.withColumn('output', F.expr(r"regexp_extract_all(stringValue, '[a-z]*(\\d+)', 1)"))

df.show()
#+-----------+-----------+
#|stringValue| output|
#+-----------+-----------+
#| a1234bc123|[1234, 123]|
#| av1tb12h18|[1, 12, 18]|
#| abcd| []|
#+-----------+-----------+

REGEXP_REPLACE capturing groups

Hive's supported notation (at least for 0.14, and I think I recall it being this way for 0.13.x as well) for regex backreferences seems to be $1 for capture group 1, $2 for capture group 2, etc. It looks like it is based upon (and may even be implemented by) the replaceAll method from the Matcher class. This is the germane portion of that documentation:

Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.

So I think what you want is this:

select regexp_replace('2015-01-01 02:03:04 +0:00', ' ([+-])', ' GMT$1');

For example:

hive> select regexp_replace('2015-01-01 02:03:04 +0:00', ' ([+-])', ' GMT$1');
OK
2015-01-01 02:03:04 GMT+0:00
Time taken: 0.072 seconds, Fetched: 1 row(s)
hive> select regexp_replace('2015-01-01 02:03:04 -1:00', ' ([+-])', ' GMT$1');
OK
2015-01-01 02:03:04 GMT-1:00
Time taken: 0.144 seconds, Fetched: 1 row(s)
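For completeness, a sketch of the same replacement with PySpark's regexp_replace (the DataFrame and column name here are made up for illustration):

from pyspark.sql import functions as F

ts_df = spark.createDataFrame([("2015-01-01 02:03:04 +0:00",),
                               ("2015-01-01 02:03:04 -1:00",)], ["ts"])
ts_df.withColumn("ts_gmt", F.regexp_replace("ts", r" ([+-])", r" GMT$1")) \
     .show(truncate=False)
# 2015-01-01 02:03:04 GMT+0:00
# 2015-01-01 02:03:04 GMT-1:00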

Find out substring from url/value of a key from url

PySpark:

from pyspark.sql.functions import split

df = (df
      .withColumn("partialURL", split("url", "tag=")[1])
      .withColumn("tag", split("partialURL", "&")[0])
      .drop("partialURL"))

I have an issue with regex extract with multiple matches

Here is how you can do it with a python UDF:

from pyspark.sql.types import *
from pyspark.sql.functions import *
import re

data = [('60 ML of paracetomol and 0.5 ML of XYZ',)]
df = sc.parallelize(data).toDF('str:string')

# Define the function you want to apply
def extract(s):
    all_matches = re.findall(r'\d+(?:\.\d+)? ML', s)
    return all_matches

# Create the UDF, note that you need to declare the return schema matching the returned type
extract_udf = udf(extract, ArrayType(StringType()))

# Apply it
df2 = df.withColumn('extracted', extract_udf('str'))

Python UDFs take a significant performance hit compared with native DataFrame operations. After thinking about it a little more, here is another way to do it without a UDF. The general idea is to replace all the text that isn't what you want with commas, then split on the comma to create the array of final values. If you only want the numbers, you can update the regexes to take 'ML' out of the capture group.

pattern = r'\d+(?:\.\d+)? ML'
split_pattern = r'.*?({pattern})'.format(pattern=pattern)
end_pattern = r'(.*{pattern}).*?$'.format(pattern=pattern)

df2 = df.withColumn('a', regexp_replace('str', split_pattern, '$1,'))
df3 = df2.withColumn('a', regexp_replace('a', end_pattern, '$1'))
df4 = df3.withColumn('a', split('a', r','))
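On Spark 3.1+, the same result can be sketched without the intermediate replaces, reusing regexp_extract_all from the section above (group index 0 returns the whole match):

from pyspark.sql import functions as F

df2 = df.withColumn(
    'extracted',
    F.expr(r"regexp_extract_all(str, '\\d+(?:\\.\\d+)? ML', 0)")
)
# e.g. '60 ML of paracetomol and 0.5 ML of XYZ' -> ['60 ML', '0.5 ML']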

How to validate (and drop) a column based on a regex condition in Pyspark without multiple scanning and shuffling?

In terms of performance you should always try to use PySpark functions over plain Python functions. PySpark functions are optimized to utilize the resources of your cluster, and the data doesn't need to be converted to Python objects.

The appropriate PySpark function for your use case is rlike. Have a look at the example below:

from pyspark.sql import Row

my_cols = Row("Column1", "Column2", "Column3", "Column4")
row_1 = my_cols('(617)-283-3811', 'Salah', 'Messi', None)
row_2 = my_cols('617-2833811', 'Messi', 'Virgil', 'Messi')
row_3 = my_cols('617-283-3811', 'Ronaldo', 'Messi', 'Ronaldo')
row_seq = [row_1, row_2, row_3]

df = spark.createDataFrame(row_seq)

numberOfRows = df.count()

#I have simplified your regexes a bit because I don't see a reason
#why you need non-capturing groups
expr = r"^(\(\d{3}\)-\d{3}-\d{4})|(\d{3}-\d{3}-\d{4})|(\d{3}-\d{7})|(\d{10})$"

#you can also set it to df.columns
columnsToCheck = ['Column1']
columnsToRemove = []

for col in columnsToCheck:
    numberOfMatchingRows = df.filter(df[col].rlike(expr)).count()
    if numberOfMatchingRows < numberOfRows:
        columnsToRemove.append(col)

df = df.select(*[c for c in df.columns if c not in columnsToRemove])
df.show()

Output:

+--------------+-------+-------+-------+
| Column1|Column2|Column3|Column4|
+--------------+-------+-------+-------+
|(617)-283-3811| Salah| Messi| null|
| 617-2833811| Messi| Virgil| Messi|
| 617-283-3811|Ronaldo| Messi|Ronaldo|
+--------------+-------+-------+-------+
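Since the question title asks about avoiding multiple scans, here is a sketch (under the same columnsToCheck and expr as above) that counts matches for all columns in a single aggregation pass instead of running one count() job per column:

from pyspark.sql import functions as F

agg_exprs = [
    F.sum(F.when(F.col(c).rlike(expr), 1).otherwise(0)).alias(c)
    for c in columnsToCheck
]
counts = df.agg(F.count(F.lit(1)).alias("_total"), *agg_exprs).first()

columnsToRemove = [c for c in columnsToCheck if counts[c] < counts["_total"]]
df = df.select(*[c for c in df.columns if c not in columnsToRemove])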

Extracting several regex matches in PySpark

Unfortunately, before Spark 3.1 (see regexp_extract_all above) there is no built-in way to get all the matches in Spark. You can only specify which group to extract using idx:

func.regexp_extract('col', my_regex, idx=1)

There is a feature request for this which, at the time of writing, had not been merged.

TL;DR: Before Spark 3.1, you will need to write a UDF for this (see the UDF example earlier on this page).


