PySpark regexp_replace With List Elements Is Not Replacing the String

Pyspark regexp_replace with list elements is not replacing the string

You should write a UDF and loop over your reg_patterns as below:

reg_patterns=["ADVANCED|ADVANCE/ADV/","ASSOCS|AS|ASSOCIATES/ASSOC/"]

import re
from pyspark.sql import functions as f
from pyspark.sql import types as t
def replaceUdf(column):
    # Each pattern has the form "A|B|C/replacement/": the part before the first "/"
    # lists the strings to replace, the part between the slashes is the replacement.
    for pattern in reg_patterns:
        res_split = re.findall(r"[^/]+", pattern)
        for x in res_split[0].split("|"):
            column = column.replace(x, res_split[1])
    return column

reg_replaceUdf = f.udf(replaceUdf, t.StringType())

df = df.withColumn('NotesUPD', reg_replaceUdf(f.col('Notes')))
df.show()

and you should get:

+----+--------------------+--------------------+
| ID| Notes| NotesUPD|
+----+--------------------+--------------------+
|2345| ADVANCED by John| ADV by John|
|2398| ADVANCED by ADVANCE| ADV by ADV|
|2328|Verified by somer...|Verified by somer...|
|3983|Double Checked by...|Double Checked by...|
+----+--------------------+--------------------+
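For reference, a similar result can often be achieved without a Python UDF by chaining regexp_replace calls over the patterns, which keeps the work inside Spark. A minimal sketch, assuming the same reg_patterns format as above (the left-hand part of each pattern is used directly as a regex alternation, so escape it if it ever contains regex metacharacters):

import re
from pyspark.sql import functions as f

def apply_patterns(col):
    # "ADVANCED|ADVANCE/ADV/" -> targets "ADVANCED|ADVANCE", replacement "ADV"
    for pattern in reg_patterns:
        targets, replacement = re.findall(r"[^/]+", pattern)
        col = f.regexp_replace(col, targets, replacement)
    return col

df = df.withColumn('NotesUPD', apply_patterns(f.col('Notes')))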

Using a regular expression in PySpark to replace a string even inside an array?

Having a column with multiple types is not currently supported. However, if the column contains an array of strings, you can explode the array (https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.explode), which creates a row for each element in the array, and apply the regular expression to the new column. Example:

from pyspark import SQLContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

sql_context = SQLContext(spark.sparkContext)

df = sql_context.createDataFrame([("hello world",),
("hello madam",),
("hello sir",),
("hello everybody",),
("goodbye world",)], schema=['test'])

df = df.withColumn('test', F.array(F.col('test')))

df.show()

df = df.withColumn('test-exploded', F.explode(F.col('test')))

df = df.withColumn('test-exploded-regex', F.regexp_replace(F.col('test-exploded'), "hello", "goodbye"))


df.show()

Output:

+-----------------+
| test|
+-----------------+
| [hello world]|
| [hello madam]|
| [hello sir]|
|[hello everybody]|
| [goodbye world]|
+-----------------+

+-----------------+---------------+-------------------+
| test| test-exploded|test-exploded-regex|
+-----------------+---------------+-------------------+
| [hello world]| hello world| goodbye world|
| [hello madam]| hello madam| goodbye madam|
| [hello sir]| hello sir| goodbye sir|
|[hello everybody]|hello everybody| goodbye everybody|
| [goodbye world]| goodbye world| goodbye world|
+-----------------+---------------+-------------------+

And if you wanted to put the results back in an array:

df = df.withColumn('test-exploded-regex-array', F.array(F.col('test-exploded-regex')))

Output:

+-----------------+---------------+-------------------+-------------------------+
| test| test-exploded|test-exploded-regex|test-exploded-regex-array|
+-----------------+---------------+-------------------+-------------------------+
| [hello world]| hello world| goodbye world| [goodbye world]|
| [hello madam]| hello madam| goodbye madam| [goodbye madam]|
| [hello sir]| hello sir| goodbye sir| [goodbye sir]|
|[hello everybody]|hello everybody| goodbye everybody| [goodbye everybody]|
| [goodbye world]| goodbye world| goodbye world| [goodbye world]|
+-----------------+---------------+-------------------+-------------------------+

Hope this helps!
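On Spark 2.4+ you can also skip the explode/re-assemble round trip entirely and apply the replacement to every element of the array with the transform higher-order function. A minimal sketch (the column name test-regex is just illustrative; on Spark 3.1+ there is also a native F.transform):

df = df.withColumn(
    'test-regex',
    F.expr("transform(test, x -> regexp_replace(x, 'hello', 'goodbye'))")
)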

Update

Updated to include the case where the array column has several strings:

from pyspark import SQLContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

sql_context = SQLContext(spark.sparkContext)

df = sql_context.createDataFrame([("hello world", "foo"),
("hello madam", "bar"),
("hello sir", "baz"),
("hello everybody", "boo"),
("goodbye world", "bah")], schema=['test', 'test2'])

df = df.withColumn('test', F.array(F.col('test'), F.col('test2'))).drop('test2')

df = df.withColumn('id', F.monotonically_increasing_id())

df.show()

df = df.withColumn('test-exploded', F.explode(F.col('test')))

df = df.withColumn('test-exploded-regex', F.regexp_replace(F.col('test-exploded'), "hello", "goodbye"))

df = df.groupBy('id').agg(F.collect_list(F.col('test-exploded-regex')).alias('test-exploded-regex-array'))


df.show()

Output:

+--------------------+-----------+
| test| id|
+--------------------+-----------+
| [hello world, foo]| 0|
| [hello madam, bar]| 8589934592|
| [hello sir, baz]|17179869184|
|[hello everybody,...|25769803776|
|[goodbye world, bah]|25769803777|
+--------------------+-----------+

+-----------+-------------------------+
| id|test-exploded-regex-array|
+-----------+-------------------------+
| 8589934592| [goodbye madam, bar]|
| 0| [goodbye world, foo]|
|25769803776| [goodbye everybod...|
|25769803777| [goodbye world, bah]|
|17179869184| [goodbye sir, baz]|
+-----------+-------------------------+

Just drop the id column when you're finished processing!
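For example:

df = df.drop('id')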

Replace more than one element in Pyspark

Use a pipe | (OR) to combine the two patterns into a single regex pattern, www\.|\.com, which will match www. or .com. Note that you need to escape the . to match it literally, since . matches (almost) any character in regex:

df.withColumn('site', regexp_replace('url', 'www\.|\.com', '')).show()
+--------------+------+
| url| site|
+--------------+------+
|www.google.com|google|
| google.com|google|
| www.goole| goole|
+--------------+------+
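As a side note, writing the pattern as a raw string gives the same behaviour but avoids Python's escape-sequence warnings on newer versions:

df.withColumn('site', regexp_replace('url', r'www\.|\.com', '')).show()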

PySpark regexp_replace does not work as expected for the following pattern

IIUC,

If you want only the matched text, use regexp_extract; if you want to replace it, use regexp_replace.

The regex patterns that worked for me are:

from pyspark.sql.functions import regexp_extract, regexp_replace

df.select(regexp_extract('value', r'someMessage=\w+\.\ \[\w+\]', 0)).show(2, False)
# and
df.select(regexp_extract('value', r'someMessage=(.*)]', 0)).show(2, False)

+-------------------------------------------+
|val |
+-------------------------------------------+
|someMessage=Test. [BL056] |
|someMessage=Test. [BL056] |
+-------------------------------------------+


And if you want to replace it, use this:

df.select(regexp_replace('value','someMessage=(.*)]',''))
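For completeness, a small usage sketch that puts the stripped text in its own column and shows it (the column name value_clean is just illustrative):

df.withColumn('value_clean', regexp_replace('value', r'someMessage=(.*)]', '')).show(2, False)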

Can I use regexp_replace or some equivalent to replace multiple values in a pyspark dataframe column with one line of code?

This is what you're looking for:

Using when() (most readable)

from pyspark.sql import functions as F
from pyspark.sql.functions import col, when, expr, create_map, lit, coalesce

df1.withColumn('name',
               when(col('name') == 'George', 'George_renamed1')
               .when(col('name') == 'Ravi', 'Ravi_renamed2')
               .otherwise(col('name'))
              )

With a mapping expr (less explicit but handy if there are many values to replace)

df1 = df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name], name)"))

Or if you already have a flat list of old/new values to use, i.e.
name_changes = ['George', 'George_renamed1', 'Ravi', 'Ravi_renamed2']

# str()[1:-1] to convert list to string and remove [ ]
df1 = df1.withColumn('name', expr(f'coalesce(map({str(name_changes)[1:-1]})[name], name)'))

The above, but only using imported PySpark functions:

mapping_expr = create_map([lit(x) for x in name_changes])

df1 = df1.withColumn('name', coalesce(mapping_expr[df1['name']], 'name'))

Result

df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name],name)")).show()
+---------------+-------------------+-------------+
| name| trial_start_time|purchase_time|
+---------------+-------------------+-------------+
|George_renamed1|2010-03-24 03:19:58| 13|
|George_renamed1|2020-09-24 03:19:06| 8|
|George_renamed1|2009-12-12 17:21:30| 5|
| Micheal|2010-11-22 13:29:40| 12|
| Maggie|2010-02-08 03:31:23| 8|
| Ravi_renamed2|2009-01-01 04:19:47| 2|
| Xien|2010-03-02 04:33:51| 3|
+---------------+-------------------+-------------+
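Also worth noting: if you only need exact, whole-value replacement (no regex), the built-in DataFrame.replace does it in one line as well:

df1 = df1.replace({'George': 'George_renamed1', 'Ravi': 'Ravi_renamed2'}, subset=['name'])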

PySpark replace multiple words in string column based on values in array column

Use the aggregate function on the text_entity array, with the split text column as the initial value, like this:

from pyspark.sql import functions as F

jsonString = """{"id":1,"text":"I talked with Christian today at Cafe Heimdal last Wednesday","text_entity":[{"word":"Christian","index":4,"start":14,"end":23},{"word":"Heimdal","index":8,"start":38,"end":45}]}"""
df = spark.read.json(spark.sparkContext.parallelize([jsonString]))

df1 = df.withColumn(
    "text",
    F.array_join(
        F.expr(r"""aggregate(
            text_entity,
            split(text, " "),
            (acc, x) -> transform(acc, (y, i) -> IF(i=x.index, '(BLEEP)', y))
        )"""),
        " "
    )
)

df1.show(truncate=False)
#+---+----------------------------------------------------------+----------------------------------------------+
#|id |text |text_entity |
#+---+----------------------------------------------------------+----------------------------------------------+
#|1 |I talked with (BLEEP) today at Cafe (BLEEP) last Wednesday|[{23, 4, 14, Christian}, {45, 8, 38, Heimdal}]|
#+---+----------------------------------------------------------+----------------------------------------------+
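If you want to inspect what the aggregate produces before it is joined back into a sentence, keep the intermediate word array in its own column first (the name words_bleeped is just illustrative):

df.withColumn(
    "words_bleeped",
    F.expr(r"""aggregate(
        text_entity,
        split(text, " "),
        (acc, x) -> transform(acc, (y, i) -> IF(i=x.index, '(BLEEP)', y))
    )""")
).select("id", "words_bleeped").show(truncate=False)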

How does regexp_replace function in PySpark?

Your call to REGEXP_REPLACE will find elements in curly braces and replace them with the same elements in square brackets.

Here is an {ELEMENT}.

becomes

Here is an [ELEMENT].

As a side note, you probably want to use a lazy dot in your regex pattern, to avoid matching across separate brace pairs. If so, use this version:

new_df = df.withColumn('a_col', regexp_replace('b_col','\\{(.*?)\\}', '\\[$1\\]'))
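A minimal, self-contained sketch (the sample row is invented) showing why the lazy quantifier matters when a line contains more than one brace pair:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Here is an {ELEMENT} and {ANOTHER} one.",)], ["b_col"])

# Greedy .* would match from the first { to the last }, merging the two pairs;
# the lazy .*? keeps each pair separate.
df.withColumn("a_col", F.regexp_replace("b_col", r"\{(.*?)\}", "[$1]")).show(truncate=False)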

How to replace any instances of an integer with NULL in a column meant for strings using PySpark?

I strongly suggest you look at the PySpark SQL functions and try to use them instead of selectExpr:

from pyspark.sql import functions as F

(df
 .withColumn('states_fixed', F
             .when(F.regexp_replace(F.col('states'), '^-?[0-9]+$', '') == '', None)
             .otherwise(F.col('states'))
 )
 .show()
)

# Output
# +----------+------------+
# | states|states_fixed|
# +----------+------------+
# | Illinois| Illinois|
# | 12| null|
# |California| California|
# | 01| null|
# | Nevada| Nevada|
# +----------+------------+
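An alternative sketch for the same goal: try casting to int and null out the rows where the cast succeeds (numeric strings cast cleanly, state names do not):

df.withColumn(
    'states_fixed',
    F.when(F.col('states').cast('int').isNotNull(), None).otherwise(F.col('states'))
).show()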

How to remove specific strings from a list in pyspark dataframe column

You can use regexp_replace with '|'.join(). The former is commonly used to replace substring matches. The latter will join the different elements of the list with |. The combination of the two will remove any parts of your column that are present in your list.

import pyspark.sql.functions as F

df = df.withColumn('column_a', F.regexp_replace('column_a', '|'.join(lst), ''))
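One caveat: the joined list is interpreted as a regular expression, so if any element contains characters such as . or +, escape it first, e.g. with re.escape (a minimal sketch; the list contents are invented):

import re
import pyspark.sql.functions as F

lst = ['Mr.', 'Mrs.', 'Dr.']  # hypothetical substrings to strip
pattern = '|'.join(re.escape(s) for s in lst)

df = df.withColumn('column_a', F.regexp_replace('column_a', pattern, ''))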

