remove last few characters in PySpark dataframe column
You can use the expr function:
>>> from pyspark.sql.functions import substring, length, col, expr
>>> df = df.withColumn("flower",expr("substring(name, 1, length(name)-5)"))
>>> df.show()
+--------------+----+---------+
| name|year| flower|
+--------------+----+---------+
| rose_2012|2012| rose|
| jasmine_2013|2013| jasmine|
| lily_2014|2014| lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|sunflower|
+--------------+----+---------+
remove last character from pyspark df columns
Use a list comprehension over all columns. Note the consistent use of the F alias for the imported functions:
import pyspark.sql.functions as F
df.select(*[F.regexp_replace(F.col(c), "'", '').alias(c) for c in df.columns]).show()
+-----+-----+
| col1| col2|
+-----+-----+
|12345|abcde|
+-----+-----+
Trim String Characters in Pyspark dataframe
Based on your input and expected output, see the logic below:
from pyspark.sql.functions import *
df = spark.createDataFrame(
    data=[("ABC00909083888",), ("ABC93890380380",), ("XYZ7394949",),
          ("XYZ3898302",), ("PQR3799_ABZ",), ("MGE8983_ABZ",)],
    schema=["values"],
)
(df.withColumn("new_vals",
               when(col('values').rlike("(_ABZ$)"),
                    regexp_replace(col('values'), r'(_ABZ$)', ''))
               .otherwise(col('values')))
   .withColumn("final_vals", expr("substring(new_vals, 4, length(new_vals))"))
).show()
Output
+--------------+--------------+-----------+
| values| new_vals| final_vals|
+--------------+--------------+-----------+
|ABC00909083888|ABC00909083888|00909083888|
|ABC93890380380|ABC93890380380|93890380380|
| XYZ7394949| XYZ7394949| 7394949|
| XYZ3898302| XYZ3898302| 3898302|
| PQR3799_ABZ| PQR3799| 3799|
| MGE8983_ABZ| MGE8983| 8983|
+--------------+--------------+-----------+
PySpark: Remove leading numbers and full stop from dataframe column
With a DataFrame like the following:
df.show()
+---+-----------+
| ID|runner_name|
+---+-----------+
| 1| 123.John|
| 2| 5.42Anna|
| 3| .203Josh|
| 4| 102Paul|
+---+-----------+
You can remove the leading numbers and periods like this:
import pyspark.sql.functions as F
df = (df.withColumn("runner_name",
F.regexp_replace('runner_name', r'(^[\d\.]+)', '')))
df.show()
+---+-----------+
| ID|runner_name|
+---+-----------+
| 1| John|
| 2| Anna|
| 3| Josh|
| 4| Paul|
+---+-----------+
Pyspark dataframe drop last element in list column
If you just want to remove the last element from an array column in Spark, you can try the code below, which uses array_except:
from pyspark.sql import functions as F
testdf=spark.createDataFrame([("abc@gmail.com",),("faden@domain.domain2.com",),("fba@a.org",)],"id string")
testdf.withColumn("bef@", F.split("id", "@")[0]) \
    .withColumn("aft@", F.split("id", "@")[1]) \
    .withColumn("extention", F.element_at(F.split("aft@", r'\.'), -1)) \
    .withColumn("domain",
                F.array_except(F.split("aft@", r'\.'),
                               F.array(F.element_at(F.split("aft@", r'\.'), -1)))) \
    .show()
#output
+--------------------+-----+------------------+---------+-----------------+
| id| bef@| aft@|extention| domain|
+--------------------+-----+------------------+---------+-----------------+
| abc@gmail.com| abc| gmail.com| com| [gmail]|
|faden@domain.doma...|faden|domain.domain2.com| com|[domain, domain2]|
| fba@a.org| fba| a.org| org| [a]|
+--------------------+-----+------------------+---------+-----------------+
As per the update: if you just want to extract the string after the last . (which you named extension) and the string between @ and the last . (which you named domain), then you can use regexp_extract as below:
from pyspark.sql import functions as F
testdf=spark.createDataFrame([("abc@gmail.com",),("faden@domain.domain2.com",),("fba@a.org",),("faden@domain.domain2.dom3.com",)],"id string")
testdf.withColumn("domain", F.regexp_extract("id", r"(?<=@).+(?=\.)", 0)) \
    .withColumn("extention", F.regexp_extract("id", r"[^\.]+$", 0)) \
    .show()
#output
+--------------------+-------------------+---------+
| id| domain|extention|
+--------------------+-------------------+---------+
| abc@gmail.com| gmail| com|
|faden@domain.doma...| domain.domain2| com|
| fba@a.org| a| org|
|faden@domain.doma...|domain.domain2.dom3| com|
+--------------------+-------------------+---------+
Pyspark removing multiple characters in a dataframe column
You can use pyspark.sql.functions.translate() to make multiple replacements. Pass in a string of letters to replace and another string of equal length which represents the replacement values.
For example, let's say you had the following DataFrame:
import pyspark.sql.functions as f
df = sqlCtx.createDataFrame([("$100,00",),("#foobar",),("foo, bar, #, and $",)], ["A"])
df.show()
#+------------------+
#| A|
#+------------------+
#| $100,00|
#| #foobar|
#|foo, bar, #, and $|
#+------------------+
and wanted to replace ('$', '#', ',') with ('X', 'Y', 'Z'). Simply use translate like:
df.select("A", f.translate(f.col("A"), "$#,", "XYZ").alias("replaced")).show()
#+------------------+------------------+
#| A| replaced|
#+------------------+------------------+
#| $100,00| X100Z00|
#| #foobar| Yfoobar|
#|foo, bar, #, and $|fooZ barZ YZ and X|
#+------------------+------------------+
If instead you wanted to remove all instances of ('$', '#', ','), you could do this with pyspark.sql.functions.regexp_replace().
df.select("A", f.regexp_replace(f.col("A"), "[\$#,]", "").alias("replaced")).show()
#+------------------+-------------+
#| A| replaced|
#+------------------+-------------+
#| $100,00| 10000|
#| #foobar| foobar|
#|foo, bar, #, and $|foo bar and |
#+------------------+-------------+
The pattern "[\$#,]" means match any of the characters inside the brackets. The $ has to be escaped because it has a special meaning in regex.
substring multiple characters from the last index of a pyspark string column using negative indexing
This is how you use substring. Your position will be -3 and the length is 3:
pyspark.sql.functions.substring(str, pos, len)
You need to change your substring function call to:
from pyspark.sql.functions import substring
df.select(substring(df['number'], -3, 3), 'event_type').show(2)
#+------------------------+----------+
#|substring(number, -3, 3)|event_type|
#+------------------------+----------+
#| 022| 11|
#| 715| 11|
#+------------------------+----------+