Pyspark regexp_replace with list elements are not replacing the string
You should write a udf function and loop over your reg_patterns, as below:
import re
from pyspark.sql import functions as f
from pyspark.sql import types as t

# Each entry has the form "ALT1|ALT2/REPLACEMENT/"
reg_patterns = ["ADVANCED|ADVANCE/ADV/", "ASSOCS|AS|ASSOCIATES/ASSOC/"]

def replaceUdf(column):
    if column is None:  # leave null values untouched
        return None
    for pattern in reg_patterns:
        # split "ALT1|ALT2/REPLACEMENT/" into ["ALT1|ALT2", "REPLACEMENT"]
        res_split = re.findall(r"[^/]+", pattern)
        for x in res_split[0].split("|"):
            column = column.replace(x, res_split[1])
    return column

reg_replaceUdf = f.udf(replaceUdf, t.StringType())
df = df.withColumn('NotesUPD', reg_replaceUdf(f.col('Notes')))
df.show()
and you should have
+----+--------------------+--------------------+
| ID| Notes| NotesUPD|
+----+--------------------+--------------------+
|2345| ADVANCED by John| ADV by John|
|2398| ADVANCED by ADVANCE| ADV by ADV|
|2328|Verified by somer...|Verified by somer...|
|3983|Double Checked by...|Double Checked by...|
+----+--------------------+--------------------+
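One caveat with the udf above: str.replace substitutes plain substrings, so a short alternative such as AS also matches inside longer words. A minimal plain-Python sketch of a whole-word variant follows; the helper names compile_rules and replace_whole_words are hypothetical, and the function could be wrapped in a udf the same way as replaceUdf.

```python
import re

# Pattern strings in the same "ALT1|ALT2/REPLACEMENT/" shape used above.
reg_patterns = ["ADVANCED|ADVANCE/ADV/", "ASSOCS|AS|ASSOCIATES/ASSOC/"]


def compile_rules(patterns):
    """Turn each 'A|B/REPL/' string into a (compiled regex, replacement) pair."""
    rules = []
    for spec in patterns:
        alternatives, replacement = re.findall(r"[^/]+", spec)
        # Try longer alternatives first, e.g. ADVANCED before ADVANCE.
        words = sorted(alternatives.split("|"), key=len, reverse=True)
        # \b keeps "AS" from matching inside a longer word such as "ASK".
        regex = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
        rules.append((regex, replacement))
    return rules


def replace_whole_words(text, rules):
    """Apply every rule once, in order, to the given string."""
    for regex, replacement in rules:
        text = regex.sub(replacement, text)
    return text


rules = compile_rules(reg_patterns)
print(replace_whole_words("ADVANCED by ADVANCE", rules))  # ADV by ADV
print(replace_whole_words("ASK the ASSOCIATES", rules))   # ASK the ASSOC
```

With plain str.replace, the second input would instead have "ASK" turned into "ASSOCK".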
Using regular expression in pyspark to replace in order to replace a string even inside an array?
Having a column with multiple types is not currently supported. However, if the column contained an array of strings, you could explode the array (https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.explode), which creates a row for each element in the array, and apply the regular expression to the new column. Example:
from pyspark import SQLContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
sql_context = SQLContext(spark.sparkContext)

df = sql_context.createDataFrame([("hello world",),
                                  ("hello madam",),
                                  ("hello sir",),
                                  ("hello everybody",),
                                  ("goodbye world",)], schema=['test'])
# wrap each string in a one-element array
df = df.withColumn('test', F.array(F.col('test')))
df.show()  # show() prints the table itself and returns None

# one row per array element, then apply the regex to the exploded column
df = df.withColumn('test-exploded', F.explode(F.col('test')))
df = df.withColumn('test-exploded-regex', F.regexp_replace(F.col('test-exploded'), "hello", "goodbye"))
df.show()
Output:
+-----------------+
| test|
+-----------------+
| [hello world]|
| [hello madam]|
| [hello sir]|
|[hello everybody]|
| [goodbye world]|
+-----------------+
+-----------------+---------------+-------------------+
| test| test-exploded|test-exploded-regex|
+-----------------+---------------+-------------------+
| [hello world]| hello world| goodbye world|
| [hello madam]| hello madam| goodbye madam|
| [hello sir]| hello sir| goodbye sir|
|[hello everybody]|hello everybody| goodbye everybody|
| [goodbye world]| goodbye world| goodbye world|
+-----------------+---------------+-------------------+
And if you wanted to put the results back in an array:
df = df.withColumn('test-exploded-regex-array', F.array(F.col('test-exploded-regex')))
Output:
+-----------------+---------------+-------------------+-------------------------+
| test| test-exploded|test-exploded-regex|test-exploded-regex-array|
+-----------------+---------------+-------------------+-------------------------+
| [hello world]| hello world| goodbye world| [goodbye world]|
| [hello madam]| hello madam| goodbye madam| [goodbye madam]|
| [hello sir]| hello sir| goodbye sir| [goodbye sir]|
|[hello everybody]|hello everybody| goodbye everybody| [goodbye everybody]|
| [goodbye world]| goodbye world| goodbye world| [goodbye world]|
+-----------------+---------------+-------------------+-------------------------+
Hope this helps!
Update: updated to include the case where the array column has several strings:
from pyspark import SQLContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
sql_context = SQLContext(spark.sparkContext)

df = sql_context.createDataFrame([("hello world", "foo"),
                                  ("hello madam", "bar"),
                                  ("hello sir", "baz"),
                                  ("hello everybody", "boo"),
                                  ("goodbye world", "bah")], schema=['test', 'test2'])
df = df.withColumn('test', F.array(F.col('test'), F.col('test2'))).drop('test2')
# the id lets us regroup the exploded rows back into one array per original row
df = df.withColumn('id', F.monotonically_increasing_id())
df.show()

df = df.withColumn('test-exploded', F.explode(F.col('test')))
df = df.withColumn('test-exploded-regex', F.regexp_replace(F.col('test-exploded'), "hello", "goodbye"))
df = df.groupBy('id').agg(F.collect_list(F.col('test-exploded-regex')).alias('test-exploded-regex-array'))
df.show()
Output:
+--------------------+-----------+
| test| id|
+--------------------+-----------+
| [hello world, foo]| 0|
| [hello madam, bar]| 8589934592|
| [hello sir, baz]|17179869184|
|[hello everybody,...|25769803776|
|[goodbye world, bah]|25769803777|
+--------------------+-----------+
+-----------+-------------------------+
| id|test-exploded-regex-array|
+-----------+-------------------------+
| 8589934592| [goodbye madam, bar]|
| 0| [goodbye world, foo]|
|25769803776| [goodbye everybod...|
|25769803777| [goodbye world, bah]|
|17179869184| [goodbye sir, baz]|
+-----------+-------------------------+
Just drop the id column when you're finished processing!
Replace more than one element in Pyspark
Use a pipe | (OR) to combine the two patterns into a single regex pattern www\.|\.com, which will match www. or .com. Notice that you need to escape . to match it literally, since . matches (almost) any character in regex:
from pyspark.sql.functions import regexp_replace
df.withColumn('site', regexp_replace('url', r'www\.|\.com', '')).show()
+--------------+------+
| url| site|
+--------------+------+
|www.google.com|google|
| google.com|google|
| www.goole| goole|
+--------------+------+
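To see why the escaping matters, here is a small plain-Python sketch using the re module rather than Spark. The "welcome.org" input is a made-up example, and Python's regex engine is only assumed to behave like Spark's Java regex for a pattern this simple.

```python
import re

pattern_escaped = r"www\.|\.com"   # dots are literal
pattern_unescaped = "www.|.com"    # each dot matches any character

urls = ["www.google.com", "google.com", "www.goole"]
print([re.sub(pattern_escaped, "", u) for u in urls])  # ['google', 'google', 'goole']

# Without escaping, ".com" also matches "lcom" inside "welcome".
print(re.sub(pattern_unescaped, "", "welcome.org"))    # wee.org
print(re.sub(pattern_escaped, "", "welcome.org"))      # welcome.org
```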
PySpark regexp_replace does not work as expected for the following pattern
IIUC, if you want only the matched output, use regexp_extract; if you want to replace it, use regexp_replace.
The regexes that worked for me are:
df.select(regexp_extract('value', r'someMessage=\w+\.\ \[\w+\]', 0)).show(2, False)
# and
df.select(regexp_extract('value', r'someMessage=(.*)]', 0)).show(2, False)
+-------------------------------------------+
|val |
+-------------------------------------------+
|someMessage=Test. [BL056] |
|someMessage=Test. [BL056] |
+-------------------------------------------+
And if you want to replace use this
df.select(regexp_replace('value','someMessage=(.*)]',''))
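The extract-versus-replace distinction can be sketched in plain Python with the re module. The input line below is hypothetical (only the someMessage=Test. [BL056] part appears in the output above), and re.search / re.sub are used as rough stand-ins for regexp_extract and regexp_replace.

```python
import re

# Hypothetical log line; only the "someMessage=Test. [BL056]" part comes from the answer.
line = "ts=2020-01-01 someMessage=Test. [BL056] status=done"

pattern = r"someMessage=\w+\. \[\w+\]"

# Roughly what regexp_extract(..., 0) does: return the matched text, or '' if no match.
extracted = re.search(pattern, line)
print(extracted.group(0) if extracted else "")

# Roughly what regexp_replace(..., '') does: return the input with the match removed.
print(re.sub(pattern, "", line))
```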
Can I use regexp_replace or some equivalent to replace multiple values in a pyspark dataframe column with one line of code?
This is what you're looking for:
Using when() (most readable)
df1.withColumn('name',
    when(col('name') == 'George', 'George_renamed1')
    .when(col('name') == 'Ravi', 'Ravi_renamed2')
    .otherwise(col('name'))
)
With a mapping expr (less explicit, but handy if there are many values to replace)
df1 = df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name], name)"))
or, if you already have a list to use, i.e.
name_changes = ['George', 'George_renamed1', 'Ravi', 'Ravi_renamed2']
# str()[1:-1] to convert list to string and remove [ ]
df1 = df1.withColumn('name', expr(f'coalesce(map({str(name_changes)[1:-1]})[name], name)'))
the above but only using pyspark imported functions
mapping_expr = create_map([lit(x) for x in name_changes])
df1 = df1.withColumn('name', coalesce(mapping_expr[df1['name']], 'name'))
Result
df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name],name)")).show()
+---------------+-------------------+-------------+
| name| trial_start_time|purchase_time|
+---------------+-------------------+-------------+
|George_renamed1|2010-03-24 03:19:58| 13|
|George_renamed1|2020-09-24 03:19:06| 8|
|George_renamed1|2009-12-12 17:21:30| 5|
| Micheal|2010-11-22 13:29:40| 12|
| Maggie|2010-02-08 03:31:23| 8|
| Ravi_renamed2|2009-01-01 04:19:47| 2|
| Xien|2010-03-02 04:33:51| 3|
+---------------+-------------------+-------------+
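If the replacements live in a dict rather than a flat list, the expression string can be generated. This is only a sketch: build_rename_expr is a hypothetical helper, and it assumes that neither keys nor values contain quote characters.

```python
def build_rename_expr(column, renames):
    """Build a Spark SQL expression string of the form coalesce(map(...)[col], col)."""
    # Flatten {'old': 'new', ...} into ['old', 'new', ...].
    flat = [item for pair in renames.items() for item in pair]
    # str(list)[1:-1] yields "'old', 'new', ..." (values with quotes would need escaping).
    return f"coalesce(map({str(flat)[1:-1]})[{column}], {column})"


renames = {"George": "George_renamed1", "Ravi": "Ravi_renamed2"}
print(build_rename_expr("name", renames))
# coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name], name)
```

The resulting string can then be passed to expr in withColumn, as in the mapping examples above.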
PySpark replace multiple words in string column based on values in array column
Use the aggregate function on the text_entity array, with the split text column as the initial value, like this:
from pyspark.sql import functions as F

jsonString = """{"id":1,"text":"I talked with Christian today at Cafe Heimdal last Wednesday","text_entity":[{"word":"Christian","index":4,"start":14,"end":23},{"word":"Heimdal","index":8,"start":38,"end":45}]}"""
df = spark.read.json(spark.sparkContext.parallelize([jsonString]))

df1 = df.withColumn(
    "text",
    F.array_join(
        # transform's i is 0-based, while index in the sample data is a 1-based word position
        F.expr(r"""aggregate(
            text_entity,
            split(text, " "),
            (acc, x) -> transform(acc, (y, i) -> IF(i = x.index - 1, '(BLEEP)', y))
        )"""),
        " "
    )
)
df1.show(truncate=False)
#+---+----------------------------------------------------------+----------------------------------------------+
#|id |text |text_entity |
#+---+----------------------------------------------------------+----------------------------------------------+
#|1 |I talked with (BLEEP) today at Cafe (BLEEP) last Wednesday|[{23, 4, 14, Christian}, {45, 8, 38, Heimdal}]|
#+---+----------------------------------------------------------+----------------------------------------------+
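For intuition, the aggregate/transform combination is a fold over the entity list that rewrites the word list at each step. A plain-Python sketch of the same idea follows, assuming (as the sample data suggests) that index is a 1-based word position; bleep is a hypothetical helper name.

```python
from functools import reduce

text = "I talked with Christian today at Cafe Heimdal last Wednesday"
text_entity = [
    {"word": "Christian", "index": 4, "start": 14, "end": 23},
    {"word": "Heimdal", "index": 8, "start": 38, "end": 45},
]


def bleep(words, entity):
    # Assumption: "index" is a 1-based word position; enumerate is 0-based.
    return ["(BLEEP)" if i == entity["index"] - 1 else w for i, w in enumerate(words)]


# reduce plays the role of aggregate: start from the split text, fold in each entity.
masked = " ".join(reduce(bleep, text_entity, text.split(" ")))
print(masked)  # I talked with (BLEEP) today at Cafe (BLEEP) last Wednesday
```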
How does regexp_replace function in PySpark?
Your call to REGEXP_REPLACE will find elements in curly braces and replace them with the same elements in square brackets.
Here is an {ELEMENT}.
becomes
Here is an [ELEMENT].
As a side note, you probably want to use a lazy dot (.*?) in your regex pattern, so that a single match cannot span more than one pair of braces. If so, then use this version:
new_df = df.withColumn('a_col', regexp_replace('b_col','\\{(.*?)\\}', '\\[$1\\]'))
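The difference between the lazy and the greedy dot shows up as soon as a string contains two pairs of braces. Here is a plain-Python sketch with a made-up input; note that Python's re writes the group reference as \1, whereas the Spark call above uses $1.

```python
import re

s = "Here is an {ELEMENT} and {ANOTHER}."

lazy = re.sub(r"\{(.*?)\}", r"[\1]", s)    # stops at the first closing brace
greedy = re.sub(r"\{(.*)\}", r"[\1]", s)   # runs on to the last closing brace

print(lazy)    # Here is an [ELEMENT] and [ANOTHER].
print(greedy)  # Here is an [ELEMENT} and {ANOTHER].
```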
How to replace any instances of an integer with NULL in a column meant for strings using PySpark?
I strongly suggest looking at the PySpark SQL functions and using them directly instead of selectExpr:
from pyspark.sql import functions as F

(df
    # write to a new column so the original values stay visible next to the fixed ones
    .withColumn('states_fixed', F
        .when(F.regexp_replace(F.col('states'), '^-?[0-9]+$', '') == '', None)
        .otherwise(F.col('states'))
    )
    .show()
)
# Output
# +----------+------------+
# | states|states_fixed|
# +----------+------------+
# | Illinois| Illinois|
# | 12| null|
# |California| California|
# | 01| null|
# | Nevada| Nevada|
# +----------+------------+
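The rule being applied is "if the whole string is an optionally signed integer, use null". A plain-Python sketch of that check on the same sample values follows; null_if_integer is a hypothetical helper, and None stands in for Spark's null.

```python
import re

# Whole-string match for an optionally negative run of digits.
INTEGER = re.compile(r"-?[0-9]+")


def null_if_integer(value):
    """Return None for integer-looking strings, otherwise the value unchanged."""
    if value is not None and INTEGER.fullmatch(value):
        return None
    return value


states = ["Illinois", "12", "California", "01", "Nevada"]
print([null_if_integer(s) for s in states])
# ['Illinois', None, 'California', None, 'Nevada']
```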
How to remove specific strings from a list in pyspark dataframe column
You can use regexp_replace with '|'.join(). The first is commonly used to replace substring matches. The latter will join the different elements of the list with |. The combination of the two will remove any parts of your column that are present in your list.
import pyspark.sql.functions as F
df = df.withColumn('column_a', F.regexp_replace('column_a', '|'.join(lst), ''))
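One thing to watch for: if the list elements contain regex metacharacters such as . or +, the joined pattern will not match them literally. A plain-Python sketch with hypothetical list values, using re.escape, is shown below. Whether the escaped pattern carries over unchanged to Spark's Java regex engine is an assumption here, although backslash-escaped punctuation is normally accepted there.

```python
import re

lst = ["a.b", "c+d", "foo"]  # hypothetical values containing metacharacters

# Escape each element, longest first so longer entries are tried before shorter ones.
escaped = "|".join(re.escape(item) for item in sorted(lst, key=len, reverse=True))
unescaped = "|".join(lst)

text = "a.b axb c+d ccd foo"
print(repr(re.sub(escaped, "", text)))    # ' axb  ccd '
print(repr(re.sub(unescaped, "", text)))  # '  c+d  '
```

The unescaped pattern removes axb and ccd by accident and leaves the literal c+d in place.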