Remove Last Few Characters in PySpark DataFrame Column

remove last few characters in PySpark dataframe column

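You can use the expr function, which lets the substring length depend on the length of the column value itself. Here the last five characters (the "_2012"-style suffix) are dropped: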

>>> from pyspark.sql.functions import substring, length, col, expr
>>> df = df.withColumn("flower",expr("substring(name, 1, length(name)-5)"))
>>> df.show()
+--------------+----+---------+
|          name|year|   flower|
+--------------+----+---------+
|     rose_2012|2012|     rose|
|  jasmine_2013|2013|  jasmine|
|     lily_2014|2014|     lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|sunflower|
+--------------+----+---------+
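The expression works because substring is 1-based: position 1 is the first character, and length(name)-5 keeps everything except the trailing five characters. Below is a minimal plain-Python sketch of that index arithmetic, meant only for checking the numbers outside Spark. sql_substring is a hypothetical helper that models positive positions only; it is not a PySpark function.

```python
def sql_substring(s: str, pos: int, length: int) -> str:
    """Model SQL substring(s, pos, length) for pos >= 1 (1-based start)."""
    if pos < 1:
        raise ValueError("this sketch only models positive positions")
    if length <= 0:
        return ""
    start = pos - 1  # convert the 1-based position to a 0-based index
    return s[start:start + length]


names = ["rose_2012", "jasmine_2013", "lily_2014", "daffodil_2017", "sunflower_2016"]

# Same arithmetic as substring(name, 1, length(name)-5)
flowers = [sql_substring(n, 1, len(n) - 5) for n in names]
print(flowers)
```

This relies on every suffix being exactly five characters long. If the suffix length varies, a pattern-based approach such as regexp_replace is probably a better fit.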

remove last character from pyspark df columns

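Use a list comprehension to apply regexp_replace to every column. Here the unwanted character is a single quote ('), and the pattern removes it wherever it occurs in the value, not only at the end:

from pyspark.sql import functions as F

df.select(*[F.regexp_replace(F.col(c), "'", '').alias(c) for c in df.columns]).show()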

+-----+-----+
| col1| col2|
+-----+-----+
|12345|abcde|
+-----+-----+
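The difference between removing every occurrence of a character and removing only a trailing one comes down to the pattern. The plain-Python sketch below contrasts the two, plus a plain "drop the last character" slice. The input row is hypothetical, since the original input DataFrame is not shown, and Python's re module stands in for Spark's regex engine, which should agree for patterns this simple.

```python
import re

# Hypothetical input row; the original input DataFrame is not shown above.
row = {"col1": "12345'", "col2": "ab'cde'"}

# Same idea as regexp_replace(col, "'", ''): drop every single quote.
all_removed = {c: re.sub(r"'", "", v) for c, v in row.items()}

# Anchoring with $ removes a single quote only when it is the last character.
trailing_removed = {c: re.sub(r"'$", "", v) for c, v in row.items()}

# Dropping the last character regardless of what it is.
last_char_removed = {c: v[:-1] for c, v in row.items()}

print(all_removed, trailing_removed, last_char_removed)
```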

Trim String Characters in Pyspark dataframe

Based on your input and expected output, the logic below first strips a trailing _ABZ and then drops the first three characters:

from pyspark.sql.functions import col, expr, regexp_replace, when

df = spark.createDataFrame(data = [("ABC00909083888",) ,("ABC93890380380",) ,("XYZ7394949",) ,("XYZ3898302",) ,("PQR3799_ABZ",) ,("MGE8983_ABZ",)], schema = ["values",])

# new_vals: strip a trailing "_ABZ"; final_vals: keep everything from the 4th character on
(df.withColumn("new_vals", when(col('values').rlike("(_ABZ$)"), regexp_replace(col('values'), r'(_ABZ$)', '')).otherwise(col('values')))
.withColumn("final_vals", expr("substring(new_vals, 4, length(new_vals))"))
).show()

Output

+--------------+--------------+-----------+
|        values|      new_vals| final_vals|
+--------------+--------------+-----------+
|ABC00909083888|ABC00909083888|00909083888|
|ABC93890380380|ABC93890380380|93890380380|
|    XYZ7394949|    XYZ7394949|    7394949|
|    XYZ3898302|    XYZ3898302|    3898302|
|   PQR3799_ABZ|       PQR3799|       3799|
|   MGE8983_ABZ|       MGE8983|       8983|
+--------------+--------------+-----------+
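The two steps can be checked outside Spark with a short plain-Python sketch. strip_suffix and drop_prefix are hypothetical helper names, and Python's re module is used as a stand-in for Spark's regex engine.

```python
import re

values = ["ABC00909083888", "ABC93890380380", "XYZ7394949",
          "XYZ3898302", "PQR3799_ABZ", "MGE8983_ABZ"]


def strip_suffix(value: str) -> str:
    # A pattern that does not match leaves the value unchanged.
    return re.sub(r"_ABZ$", "", value)


def drop_prefix(value: str, n: int = 3) -> str:
    # Mirrors substring(new_vals, 4, length(new_vals)): start at the 4th character.
    return value[n:]


new_vals = [strip_suffix(v) for v in values]
final_vals = [drop_prefix(v) for v in new_vals]
print(list(zip(values, new_vals, final_vals)))
```

Because a replace with no match returns its input unchanged, the when/otherwise guard in the Spark version looks optional: applying regexp_replace directly should give the same new_vals column.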

PySpark: Remove leading numbers and full stop from dataframe column

Given a DataFrame like the following:

df.show()
+---+-----------+
| ID|runner_name|
+---+-----------+
|  1|   123.John|
|  2|   5.42Anna|
|  3|   .203Josh|
|  4|    102Paul|
+---+-----------+

You can remove the leading digits and periods with a regex anchored at the start of the string:

import pyspark.sql.functions as F

df = (df.withColumn("runner_name",
F.regexp_replace('runner_name', r'(^[\d\.]+)', '')))

df.show()
+---+-----------+
| ID|runner_name|
+---+-----------+
|  1|       John|
|  2|       Anna|
|  3|       Josh|
|  4|       Paul|
+---+-----------+
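The pattern does its work through the ^ anchor: the character class [\d.] matches digits and periods, and + consumes the whole leading run, but only at the start of the string. A minimal plain-Python sketch, using the re module as a stand-in for Spark's regex engine on these ASCII inputs:

```python
import re

# ^ anchors the match at the start; [\d.]+ consumes the leading run of digits and periods.
LEADING = re.compile(r"^[\d.]+")

runner_names = ["123.John", "5.42Anna", ".203Josh", "102Paul"]
cleaned = [LEADING.sub("", name) for name in runner_names]
print(cleaned)

# Digits and periods later in the string are left alone because of the anchor.
untouched = LEADING.sub("", "Ann4.5")
print(untouched)
```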

Pyspark dataframe drop last element in list column

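If you just want to remove the last element from an array column, you can use array_except, as in the code below. Note that array_except returns the distinct elements of the first array that are not in the second, so it also drops any earlier element equal to the last one and collapses duplicates.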

from pyspark.sql import functions as F
testdf = spark.createDataFrame([("abc@gmail.com",), ("faden@domain.domain2.com",), ("fba@a.org",)], "id string")
(testdf
    .withColumn("bef@", F.split("id", "@")[0])
    .withColumn("aft@", F.split("id", "@")[1])
    .withColumn("extention", F.element_at(F.split("aft@", r'\.'), -1))
    .withColumn("domain", F.array_except(F.split("aft@", r'\.'), F.array(F.element_at(F.split("aft@", r'\.'), -1))))
    .show())

#output
+--------------------+-----+------------------+---------+-----------------+
|                  id| bef@|              aft@|extention|           domain|
+--------------------+-----+------------------+---------+-----------------+
|       abc@gmail.com|  abc|         gmail.com|      com|          [gmail]|
|faden@domain.doma...|faden|domain.domain2.com|      com|[domain, domain2]|
|           fba@a.org|  fba|             a.org|      org|              [a]|
+--------------------+-----+------------------+---------+-----------------+
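One thing to watch with the array_except approach is that it is a set-style operation, not a positional one. The plain-Python sketch below models it and compares it with simply dropping the last position. array_except_model is a hypothetical stand-in that assumes the order of first appearance is preserved, and the input host name is made up to trigger the difference.

```python
def array_except_model(left, right):
    """Elements of left that are not in right, de-duplicated (model of array_except)."""
    result = []
    for item in left:
        if item not in right and item not in result:
            result.append(item)
    return result


def drop_last(items):
    """Positional removal of the final element only."""
    return items[:-1]


# Hypothetical host where the last label also appears earlier.
parts = "mail.com.example.com".split(".")

set_style = array_except_model(parts, [parts[-1]])
positional = drop_last(parts)
print(set_style, positional)
```

For inputs like the three addresses above the two approaches agree; they diverge only when the last element is repeated earlier in the array or the array contains duplicates.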

As per the update, if you just want to extract the string after the last . (which you call the extension) and the string between @ and the last . (which you call the domain), you can use regexp_extract as below:

from pyspark.sql import functions as F
testdf = spark.createDataFrame([("abc@gmail.com",), ("faden@domain.domain2.com",), ("fba@a.org",), ("faden@domain.domain2.dom3.com",)], "id string")
# Raw strings keep the backslash in \. from being treated as a Python escape
(testdf
    .withColumn("domain", F.regexp_extract("id", r"(?<=@).+(?=\.)", 0))
    .withColumn("extention", F.regexp_extract("id", r"[^\.]+$", 0))
    .show())

#output
+--------------------+-------------------+---------+
|                  id|             domain|extention|
+--------------------+-------------------+---------+
|       abc@gmail.com|              gmail|      com|
|faden@domain.doma...|     domain.domain2|      com|
|           fba@a.org|                  a|      org|
|faden@domain.doma...|domain.domain2.dom3|      com|
+--------------------+-------------------+---------+
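The domain pattern works because .+ is greedy: it first grabs everything after the @ and then backtracks until the lookahead (?=\.) finds a dot, which ends up being the last one. A minimal plain-Python sketch of both patterns follows. The extract helper is hypothetical; it returns an empty string on no match, which is how regexp_extract is documented to behave.

```python
import re

DOMAIN = re.compile(r"(?<=@).+(?=\.)")   # text between @ and the last dot
EXTENSION = re.compile(r"[^.]+$")        # trailing run of non-dot characters


def extract(pattern, text):
    """Return the whole match, or an empty string when nothing matches."""
    match = pattern.search(text)
    return match.group(0) if match else ""


emails = ["abc@gmail.com", "faden@domain.domain2.com",
          "fba@a.org", "faden@domain.domain2.dom3.com"]

domains = [extract(DOMAIN, e) for e in emails]
extensions = [extract(EXTENSION, e) for e in emails]
print(list(zip(emails, domains, extensions)))
```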

Pyspark removing multiple characters in a dataframe column


You can use pyspark.sql.functions.translate() to make multiple single-character replacements. Pass in a string of characters to replace and another string of equal length that gives the replacement for each one, position by position.

For example, let's say you had the following DataFrame:

import pyspark.sql.functions as f
df = sqlCtx.createDataFrame([("$100,00",),("#foobar",),("foo, bar, #, and $",)], ["A"])
df.show()
#+------------------+
#|                 A|
#+------------------+
#|           $100,00|
#|           #foobar|
#|foo, bar, #, and $|
#+------------------+

and wanted to replace ('$', '#', ',') with ('X', 'Y', 'Z'). Simply use translate like:

df.select("A", f.translate(f.col("A"), "$#,", "XYZ").alias("replaced")).show()
#+------------------+------------------+
#|                 A|          replaced|
#+------------------+------------------+
#|           $100,00|           X100Z00|
#|           #foobar|           Yfoobar|
#|foo, bar, #, and $|fooZ barZ YZ and X|
#+------------------+------------------+
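The positional mapping that translate performs ('$' to 'X', '#' to 'Y', ',' to 'Z') has a close analogue in plain Python, which can be handy for checking a mapping before running it on a cluster. This sketch uses str.maketrans and str.translate from the standard library, not the PySpark function:

```python
# Build a character-by-character mapping: "$" -> "X", "#" -> "Y", "," -> "Z".
mapping = str.maketrans("$#,", "XYZ")

values = ["$100,00", "#foobar", "foo, bar, #, and $"]
replaced = [v.translate(mapping) for v in values]
print(replaced)
```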

If instead you wanted to remove all instances of ('$', '#', ','), you could do this with pyspark.sql.functions.regexp_replace().

df.select("A", f.regexp_replace(f.col("A"), r"[\$#,]", "").alias("replaced")).show()
#+------------------+-------------+
#|                 A|     replaced|
#+------------------+-------------+
#|           $100,00|        10000|
#|           #foobar|       foobar|
#|foo, bar, #, and $|foo bar  and |
#+------------------+-------------+

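The pattern "[\$#,]" means match any of the characters inside the brackets. Outside a character class, $ is an end-of-input anchor and would need escaping; inside the brackets it is already a literal, so the backslash here is optional but harmless.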
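A quick way to see how the character class behaves is to try it in plain Python. In this sketch the re module stands in for Spark's regex engine, and the escaped and unescaped forms of the class give the same result:

```python
import re

values = ["$100,00", "#foobar", "foo, bar, #, and $"]

# Character class with the dollar sign escaped, as in the Spark example.
escaped = [re.sub(r"[\$#,]", "", v) for v in values]

# Inside brackets the dollar sign is a literal, so the escape can be dropped.
unescaped = [re.sub(r"[$#,]", "", v) for v in values]

print(escaped)
print(escaped == unescaped)
```

Only the listed characters are removed, so the spaces that surrounded them stay in place, which is why the last value ends up with a double space and a trailing space.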

substring multiple characters from the last index of a pyspark string column using negative indexing

substring accepts a negative position, which counts back from the end of the string. To take the last three characters, use position -3 and length 3.

pyspark.sql.functions.substring(str, pos, len)

You need to change your substring function call to:

from pyspark.sql.functions import substring
df.select(substring(df['number'], -3, 3), 'event_type').show(2)
#+------------------------+----------+
#|substring(number, -3, 3)|event_type|
#+------------------------+----------+
#|                     022|        11|
#|                     715|        11|
#+------------------------+----------+
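The negative position plays the same role as a negative slice index in Python. A minimal plain-Python sketch of "last three characters" follows. The input numbers are hypothetical, since the original column values are not shown; only their last three digits are chosen to match the output above, and each value is assumed to have at least three characters.

```python
# Hypothetical values; only the last three digits match the output shown above.
numbers = ["5551234022", "5559876715"]

# Equivalent in spirit to substring(number, -3, 3): start three from the end, take three.
last_three = [n[-3:] for n in numbers]
print(last_three)
```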

