remove last few characters in PySpark dataframe column
You can use the expr function:
>>> from pyspark.sql.functions import substring, length, col, expr
>>> df = df.withColumn("flower",expr("substring(name, 1, length(name)-5)"))
>>> df.show()
+--------------+----+---------+
| name|year| flower|
+--------------+----+---------+
| rose_2012|2012| rose|
| jasmine_2013|2013| jasmine|
| lily_2014|2014| lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|sunflower|
+--------------+----+---------+
remove last character from pyspark df columns
Use a list comprehension over all columns. Note the consistent use of the F alias for the imported functions:
import pyspark.sql.functions as F
df.select(*[F.regexp_replace(F.col(c), "'", '').alias(c) for c in df.columns]).show()
+-----+-----+
| col1| col2|
+-----+-----+
|12345|abcde|
+-----+-----+
Trim String Characters in Pyspark dataframe
Based on your input and expected output, see the logic below:
from pyspark.sql.functions import *
df = spark.createDataFrame(
    data=[("ABC00909083888",), ("ABC93890380380",), ("XYZ7394949",),
          ("XYZ3898302",), ("PQR3799_ABZ",), ("MGE8983_ABZ",)],
    schema=["values"],
)
(df.withColumn("new_vals",
               when(col('values').rlike("(_ABZ$)"),
                    regexp_replace(col('values'), r'(_ABZ$)', ''))
               .otherwise(col('values')))
   .withColumn("final_vals", expr("substring(new_vals, 4, length(new_vals))"))
).show()
Output
+--------------+--------------+-----------+
| values| new_vals| final_vals|
+--------------+--------------+-----------+
|ABC00909083888|ABC00909083888|00909083888|
|ABC93890380380|ABC93890380380|93890380380|
| XYZ7394949| XYZ7394949| 7394949|
| XYZ3898302| XYZ3898302| 3898302|
| PQR3799_ABZ| PQR3799| 3799|
| MGE8983_ABZ| MGE8983| 8983|
+--------------+--------------+-----------+
PySpark: Remove leading numbers and full stop from dataframe column
With a DataFrame like the following:
df.show()
+---+-----------+
| ID|runner_name|
+---+-----------+
| 1| 123.John|
| 2| 5.42Anna|
| 3| .203Josh|
| 4| 102Paul|
+---+-----------+
You can remove the leading numbers and periods like this:
import pyspark.sql.functions as F
df = (df.withColumn("runner_name",
F.regexp_replace('runner_name', r'(^[\d\.]+)', '')))
df.show()
+---+-----------+
| ID|runner_name|
+---+-----------+
| 1| John|
| 2| Anna|
| 3| Josh|
| 4| Paul|
+---+-----------+
Pyspark dataframe drop last element in list column
If you just want to remove the last element from an array column in Spark, you can try the code below, which uses array_except:
from pyspark.sql import functions as F
testdf=spark.createDataFrame([("abc@gmail.com",),("faden@domain.domain2.com",),("fba@a.org",)],"id string")
testdf.withColumn("bef@", F.split("id", "@")[0]) \
    .withColumn("aft@", F.split("id", "@")[1]) \
    .withColumn("extention", F.element_at(F.split("aft@", r'\.'), -1)) \
    .withColumn("domain",
                F.array_except(F.split("aft@", r'\.'),
                               F.array(F.element_at(F.split("aft@", r'\.'), -1)))) \
    .show()
#output
+--------------------+-----+------------------+---------+-----------------+
| id| bef@| aft@|extention| domain|
+--------------------+-----+------------------+---------+-----------------+
| abc@gmail.com| abc| gmail.com| com| [gmail]|
|faden@domain.doma...|faden|domain.domain2.com| com|[domain, domain2]|
| fba@a.org| fba| a.org| org| [a]|
+--------------------+-----+------------------+---------+-----------------+
As per the update: if you just want to extract the string after the last . (which you named extension) and the string between @ and the last . (which you named domain), then you can use regexp_extract as below:
from pyspark.sql import functions as F
testdf=spark.createDataFrame([("abc@gmail.com",),("faden@domain.domain2.com",),("fba@a.org",),("faden@domain.domain2.dom3.com",)],"id string")
testdf.withColumn("domain", F.regexp_extract("id", r"(?<=@).+(?=\.)", 0)) \
    .withColumn("extention", F.regexp_extract("id", r"[^\.]+$", 0)) \
    .show()
#output
+--------------------+-------------------+---------+
| id| domain|extention|
+--------------------+-------------------+---------+
| abc@gmail.com| gmail| com|
|faden@domain.doma...| domain.domain2| com|
| fba@a.org| a| org|
|faden@domain.doma...|domain.domain2.dom3| com|
+--------------------+-------------------+---------+
Pyspark removing multiple characters in a dataframe column
You can use pyspark.sql.functions.translate() to make multiple replacements. Pass in a string of letters to replace and another string of equal length which represents the replacement values.
For example, let's say you had the following DataFrame:
import pyspark.sql.functions as f
df = sqlCtx.createDataFrame([("$100,00",),("#foobar",),("foo, bar, #, and $",)], ["A"])
df.show()
#+------------------+
#| A|
#+------------------+
#| $100,00|
#| #foobar|
#|foo, bar, #, and $|
#+------------------+
and wanted to replace ('$', '#', ',') with ('X', 'Y', 'Z'). Simply use translate like:
df.select("A", f.translate(f.col("A"), "$#,", "XYZ").alias("replaced")).show()
#+------------------+------------------+
#| A| replaced|
#+------------------+------------------+
#| $100,00| X100Z00|
#| #foobar| Yfoobar|
#|foo, bar, #, and $|fooZ barZ YZ and X|
#+------------------+------------------+
If instead you wanted to remove all instances of ('$', '#', ','), you could do this with pyspark.sql.functions.regexp_replace().
df.select("A", f.regexp_replace(f.col("A"), "[\$#,]", "").alias("replaced")).show()
#+------------------+-------------+
#| A| replaced|
#+------------------+-------------+
#| $100,00| 10000|
#| #foobar| foobar|
#|foo, bar, #, and $|foo bar and |
#+------------------+-------------+
The pattern "[\$#,]" means match any of the characters inside the brackets. The $ has to be escaped because it has a special meaning in regex.
substring multiple characters from the last index of a pyspark string column using negative indexing
This is how you use substring. Your position will be -3 and the length is 3:
pyspark.sql.functions.substring(str, pos, len)
You need to change your substring function call to:
from pyspark.sql.functions import substring
df.select(substring(df['number'], -3, 3), 'event_type').show(2)
#+------------------------+----------+
#|substring(number, -3, 3)|event_type|
#+------------------------+----------+
#| 022| 11|
#| 715| 11|
#+------------------------+----------+