PySpark regexp_replace With List Elements Is Not Replacing the String

Pyspark regexp_replace with list elements is not replacing the string

You should write a UDF and loop over your reg_patterns as below:

reg_patterns=["ADVANCED|ADVANCE/ADV/","ASSOCS|AS|ASSOCIATES/ASSOC/"]

import re
from pyspark.sql import functions as f
from pyspark.sql import types as t
def replaceUdf(column):
    # Each pattern has the form "A|B|C/replacement/": the part before the first "/"
    # lists the strings to replace, the part between the slashes is the replacement.
    for pattern in reg_patterns:
        res_split = re.findall(r"[^/]+", pattern)
        for x in res_split[0].split("|"):
            column = column.replace(x, res_split[1])
    return column

reg_replaceUdf = f.udf(replaceUdf, t.StringType())

df = df.withColumn('NotesUPD', reg_replaceUdf(f.col('Notes')))
df.show()

and you should get:

+----+--------------------+--------------------+
| ID| Notes| NotesUPD|
+----+--------------------+--------------------+
|2345| ADVANCED by John| ADV by John|
|2398| ADVANCED by ADVANCE| ADV by ADV|
|2328|Verified by somer...|Verified by somer...|
|3983|Double Checked by...|Double Checked by...|
+----+--------------------+--------------------+
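For reference, a similar result can often be achieved without a Python UDF by chaining regexp_replace calls over the patterns, which keeps the work inside Spark. A minimal sketch, assuming the same reg_patterns format as above (the left-hand part of each pattern is used directly as a regex alternation, so escape it if it ever contains regex metacharacters):

import re
from pyspark.sql import functions as f

def apply_patterns(col):
    # "ADVANCED|ADVANCE/ADV/" -> targets "ADVANCED|ADVANCE", replacement "ADV"
    for pattern in reg_patterns:
        targets, replacement = re.findall(r"[^/]+", pattern)
        col = f.regexp_replace(col, targets, replacement)
    return col

df = df.withColumn('NotesUPD', apply_patterns(f.col('Notes')))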

Using a regular expression in PySpark to replace a string even inside an array?

Having a column with multiple types is not currently supported. However, if the column contains an array of strings, you can explode the array (https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.explode), which creates a row for each element in the array, and apply the regular expression to the new column. Example:

from pyspark import SQLContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

sql_context = SQLContext(spark.sparkContext)

df = sql_context.createDataFrame([("hello world",),
("hello madam",),
("hello sir",),
("hello everybody",),
("goodbye world",)], schema=['test'])

df = df.withColumn('test', F.array(F.col('test')))

df.show()

df = df.withColumn('test-exploded', F.explode(F.col('test')))

df = df.withColumn('test-exploded-regex', F.regexp_replace(F.col('test-exploded'), "hello", "goodbye"))


df.show()

Output:

+-----------------+
| test|
+-----------------+
| [hello world]|
| [hello madam]|
| [hello sir]|
|[hello everybody]|
| [goodbye world]|
+-----------------+

+-----------------+---------------+-------------------+
| test| test-exploded|test-exploded-regex|
+-----------------+---------------+-------------------+
| [hello world]| hello world| goodbye world|
| [hello madam]| hello madam| goodbye madam|
| [hello sir]| hello sir| goodbye sir|
|[hello everybody]|hello everybody| goodbye everybody|
| [goodbye world]| goodbye world| goodbye world|
+-----------------+---------------+-------------------+

And if you wanted to put the results back in an array:

df = df.withColumn('test-exploded-regex-array', F.array(F.col('test-exploded-regex')))

Output:

+-----------------+---------------+-------------------+-------------------------+
| test| test-exploded|test-exploded-regex|test-exploded-regex-array|
+-----------------+---------------+-------------------+-------------------------+
| [hello world]| hello world| goodbye world| [goodbye world]|
| [hello madam]| hello madam| goodbye madam| [goodbye madam]|
| [hello sir]| hello sir| goodbye sir| [goodbye sir]|
|[hello everybody]|hello everybody| goodbye everybody| [goodbye everybody]|
| [goodbye world]| goodbye world| goodbye world| [goodbye world]|
+-----------------+---------------+-------------------+-------------------------+

Hope this helps!
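On Spark 2.4+ you can also skip the explode/re-assemble round trip entirely and apply the replacement to every element of the array with the transform higher-order function. A minimal sketch (the column name test-regex is just illustrative; on Spark 3.1+ there is also a native F.transform):

df = df.withColumn(
    'test-regex',
    F.expr("transform(test, x -> regexp_replace(x, 'hello', 'goodbye'))")
)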

Update

Updated to include the case where the array column has several strings:

from pyspark import SQLContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

sql_context = SQLContext(spark.sparkContext)

df = sql_context.createDataFrame([("hello world", "foo"),
("hello madam", "bar"),
("hello sir", "baz"),
("hello everybody", "boo"),
("goodbye world", "bah")], schema=['test', 'test2'])

df = df.withColumn('test', F.array(F.col('test'), F.col('test2'))).drop('test2')

df = df.withColumn('id', F.monotonically_increasing_id())

df.show()

df = df.withColumn('test-exploded', F.explode(F.col('test')))

df = df.withColumn('test-exploded-regex', F.regexp_replace(F.col('test-exploded'), "hello", "goodbye"))

df = df.groupBy('id').agg(F.collect_list(F.col('test-exploded-regex')).alias('test-exploded-regex-array'))


df.show()

Output:

+--------------------+-----------+
| test| id|
+--------------------+-----------+
| [hello world, foo]| 0|
| [hello madam, bar]| 8589934592|
| [hello sir, baz]|17179869184|
|[hello everybody,...|25769803776|
|[goodbye world, bah]|25769803777|
+--------------------+-----------+

+-----------+-------------------------+
| id|test-exploded-regex-array|
+-----------+-------------------------+
| 8589934592| [goodbye madam, bar]|
| 0| [goodbye world, foo]|
|25769803776| [goodbye everybod...|
|25769803777| [goodbye world, bah]|
|17179869184| [goodbye sir, baz]|
+-----------+-------------------------+

Just drop the id column when you're finished processing!
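For example:

df = df.drop('id')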

Replace more than one element in Pyspark

Use a pipe | (OR) to combine the two patterns into a single regex pattern, www\.|\.com, which will match www. or .com. Note that you need to escape the . to match it literally, since . matches (almost) any character in regex:

df.withColumn('site', regexp_replace('url', 'www\.|\.com', '')).show()
+--------------+------+
| url| site|
+--------------+------+
|www.google.com|google|
| google.com|google|
| www.goole| goole|
+--------------+------+
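As a side note, writing the pattern as a raw string gives the same behaviour but avoids Python's escape-sequence warnings on newer versions:

df.withColumn('site', regexp_replace('url', r'www\.|\.com', '')).show()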

PySpark regexp_replace does not work as expected for the following pattern

IIUC,

If you want only the matched text, use regexp_extract; if you want to replace it, use regexp_replace.

The regex patterns that worked for me are:

from pyspark.sql.functions import regexp_extract, regexp_replace

df.select(regexp_extract('value', r'someMessage=\w+\.\ \[\w+\]', 0)).show(2, False)
# and
df.select(regexp_extract('value', r'someMessage=(.*)]', 0)).show(2, False)

+-------------------------------------------+
|val |
+-------------------------------------------+
|someMessage=Test. [BL056] |
|someMessage=Test. [BL056] |
+-------------------------------------------+


And if you want to replace it, use this:

df.select(regexp_replace('value','someMessage=(.*)]',''))
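For completeness, a small usage sketch that puts the stripped text in its own column and shows it (the column name value_clean is just illustrative):

df.withColumn('value_clean', regexp_replace('value', r'someMessage=(.*)]', '')).show(2, False)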

Can I use regexp_replace or some equivalent to replace multiple values in a pyspark dataframe column with one line of code?

This is what you're looking for:

Using when() (most readable)

from pyspark.sql import functions as F
from pyspark.sql.functions import col, when, expr, create_map, lit, coalesce

df1.withColumn('name',
               when(col('name') == 'George', 'George_renamed1')
               .when(col('name') == 'Ravi', 'Ravi_renamed2')
               .otherwise(col('name'))
              )

With a mapping expr (less explicit but handy if there are many values to replace)

df1 = df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name], name)"))

Or if you already have a flat list of old/new values to use, i.e.
name_changes = ['George', 'George_renamed1', 'Ravi', 'Ravi_renamed2']

# str()[1:-1] to convert list to string and remove [ ]
df1 = df1.withColumn('name', expr(f'coalesce(map({str(name_changes)[1:-1]})[name], name)'))

The above, but only using imported PySpark functions:

mapping_expr = create_map([lit(x) for x in name_changes])

df1 = df1.withColumn('name', coalesce(mapping_expr[df1['name']], 'name'))

Result

df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name],name)")).show()
+---------------+-------------------+-------------+
| name| trial_start_time|purchase_time|
+---------------+-------------------+-------------+
|George_renamed1|2010-03-24 03:19:58| 13|
|George_renamed1|2020-09-24 03:19:06| 8|
|George_renamed1|2009-12-12 17:21:30| 5|
| Micheal|2010-11-22 13:29:40| 12|
| Maggie|2010-02-08 03:31:23| 8|
| Ravi_renamed2|2009-01-01 04:19:47| 2|
| Xien|2010-03-02 04:33:51| 3|
+---------------+-------------------+-------------+
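Also worth noting: if you only need exact, whole-value replacement (no regex), the built-in DataFrame.replace does it in one line as well:

df1 = df1.replace({'George': 'George_renamed1', 'Ravi': 'Ravi_renamed2'}, subset=['name'])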

PySpark replace multiple words in string column based on values in array column

Use the aggregate function on the text_entity array, with the split text column as the initial value, like this:

from pyspark.sql import functions as F

jsonString = """{"id":1,"text":"I talked with Christian today at Cafe Heimdal last Wednesday","text_entity":[{"word":"Christian","index":4,"start":14,"end":23},{"word":"Heimdal","index":8,"start":38,"end":45}]}"""
df = spark.read.json(spark.sparkContext.parallelize([jsonString]))

df1 = df.withColumn(
    "text",
    F.array_join(
        F.expr(r"""aggregate(
            text_entity,
            split(text, " "),
            (acc, x) -> transform(acc, (y, i) -> IF(i=x.index, '(BLEEP)', y))
        )"""),
        " "
    )
)

df1.show(truncate=False)
#+---+----------------------------------------------------------+----------------------------------------------+
#|id |text |text_entity |
#+---+----------------------------------------------------------+----------------------------------------------+
#|1 |I talked with (BLEEP) today at Cafe (BLEEP) last Wednesday|[{23, 4, 14, Christian}, {45, 8, 38, Heimdal}]|
#+---+----------------------------------------------------------+----------------------------------------------+
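If you want to inspect what the aggregate produces before it is joined back into a sentence, keep the intermediate word array in its own column first (the name words_bleeped is just illustrative):

df.withColumn(
    "words_bleeped",
    F.expr(r"""aggregate(
        text_entity,
        split(text, " "),
        (acc, x) -> transform(acc, (y, i) -> IF(i=x.index, '(BLEEP)', y))
    )""")
).select("id", "words_bleeped").show(truncate=False)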

How does regexp_replace function in PySpark?

Your call to REGEXP_REPLACE will find elements in curly braces and replace them with the same elements in square brackets.

Here is an {ELEMENT}.

becomes

Here is an [ELEMENT].

As a side note, you probably want to use a lazy dot in your regex pattern, to avoid matching across separate brace pairs. If so, use this version:

new_df = df.withColumn('a_col', regexp_replace('b_col','\\{(.*?)\\}', '\\[$1\\]'))
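A minimal, self-contained sketch (the sample row is invented) showing why the lazy quantifier matters when a line contains more than one brace pair:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Here is an {ELEMENT} and {ANOTHER} one.",)], ["b_col"])

# Greedy .* would match from the first { to the last }, merging the two pairs;
# the lazy .*? keeps each pair separate.
df.withColumn("a_col", F.regexp_replace("b_col", r"\{(.*?)\}", "[$1]")).show(truncate=False)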

How to replace any instances of an integer with NULL in a column meant for strings using PySpark?

I strongly suggest you look at the PySpark SQL functions and try to use them instead of selectExpr:

from pyspark.sql import functions as F

(df
 .withColumn('states_fixed', F
             .when(F.regexp_replace(F.col('states'), '^-?[0-9]+$', '') == '', None)
             .otherwise(F.col('states'))
 )
 .show()
)

# Output
# +----------+------------+
# | states|states_fixed|
# +----------+------------+
# | Illinois| Illinois|
# | 12| null|
# |California| California|
# | 01| null|
# | Nevada| Nevada|
# +----------+------------+
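An alternative sketch for the same goal: try casting to int and null out the rows where the cast succeeds (numeric strings cast cleanly, state names do not):

df.withColumn(
    'states_fixed',
    F.when(F.col('states').cast('int').isNotNull(), None).otherwise(F.col('states'))
).show()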

How to remove specific strings from a list in pyspark dataframe column

You can use regexp_replace with '|'.join(). The former is commonly used to replace substring matches. The latter will join the different elements of the list with |. The combination of the two will remove any parts of your column that are present in your list.

import pyspark.sql.functions as F

df = df.withColumn('column_a', F.regexp_replace('column_a', '|'.join(lst), ''))
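One caveat: the joined list is interpreted as a regular expression, so if any element contains characters such as . or +, escape it first, e.g. with re.escape (a minimal sketch; the list contents are invented):

import re
import pyspark.sql.functions as F

lst = ['Mr.', 'Mrs.', 'Dr.']  # hypothetical substrings to strip
pattern = '|'.join(re.escape(s) for s in lst)

df = df.withColumn('column_a', F.regexp_replace('column_a', pattern, ''))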

