How to change dataframe column names in PySpark?
There are many ways to do that:
Option 1. Using selectExpr.
data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)],
["Name", "askdaosdka"])
data.show()
data.printSchema()
# Output
#+-------+----------+
#| Name|askdaosdka|
#+-------+----------+
#|Alberto| 2|
#| Dakota| 2|
#+-------+----------+
#root
# |-- Name: string (nullable = true)
# |-- askdaosdka: long (nullable = true)
df = data.selectExpr("Name as name", "askdaosdka as age")
df.show()
df.printSchema()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)Option 2. Using withColumnRenamed, notice that this method allows you to "overwrite" the same column. For Python3, replace
xrange
withrange
.from functools import reduce
oldColumns = data.schema.names
newColumns = ["name", "age"]
df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), xrange(len(oldColumns)), data)
df.printSchema()
df.show()Option 3. using
alias, in Scala you can also use as.from pyspark.sql.functions import col
data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
data.show()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+Option 4. Using sqlContext.sql, which lets you use SQL queries on
DataFrames
registered as tables.sqlContext.registerDataFrameAsTable(data, "myTable")
df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
df2.show()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
Column Renaming in pyspark dataframe
Try Using it as below -
Input_df
from pyspark.sql.types import *
from pyspark.sql.functions import *
data = [("xyz", 1)]
schema = StructType([StructField("Number_of_data_samples_(input)", StringType(), True), StructField("id", IntegerType())])
df = spark.createDataFrame(data=data, schema=schema)
df.show()
+------------------------------+---+
|Number_of_data_samples_(input)| id|
+------------------------------+---+
| xyz| 1|
+------------------------------+---+
Method 1Using regular expressions
to replace the special characters and then use toDF()
import re
cols=[re.sub("\.|\)|\(","",i) for i in df.columns]
df.toDF(*cols).show()
+----------------------------+---+
|Number_of_data_samples_input| id|
+----------------------------+---+
| xyz| 1|
+----------------------------+---+
Method 2Using .withColumnRenamed()
for i,j in zip(df.columns,cols):
df=df.withColumnRenamed(i,j)
df.show()
+----------------------------+---+
|Number_of_data_samples_input| id|
+----------------------------+---+
| xyz| 1|
+----------------------------+---+
Method 3Using .withColumn
to create a new column and drop the existing column
df = df.withColumn("Number_of_data_samples_input", lit(col("Number_of_data_samples_(input)"))).drop(col("Number_of_data_samples_(input)"))
df.show()
+---+----------------------------+
| id|Number_of_data_samples_input|
+---+----------------------------+
| 1| xyz|
+---+----------------------------+
how to rename column name of dataframe in pyspark?
Possible way of renaming at dataframe level-
oldColumns=['rate%year']
newColumns = ["rateyear"]
df1 = reduce(lambda df, idx: df.withColumnRenamed(oldColumns[idx], newColumns[idx]), xrange(len(oldColumns)), df)
this is working fine at dataframe level. any suggestion how to resolve at table level?
Replace characters in column names in pyspark data frames
Try with regular expression
replace(re.sub) in python way.
import re
cols=[re.sub(r'(^_|_$)','',f.replace("/","_")) for f in df.columns]
df = spark.createDataFrame([(2,'john',1,1),
(2,'john',1,2),
(3,'pete',8,3),
(3,'pete',8,4),
(5,'steve',9,5)],
['id','/na/me','val/ue', 'rank/'])
df.toDF(*cols).show()
#+---+-----+------+----+
#| id|na_me|val_ue|rank|
#+---+-----+------+----+
#| 2| john| 1| 1|
#| 2| john| 1| 2|
#| 3| pete| 8| 3|
#| 3| pete| 8| 4|
#| 5|steve| 9| 5|
#+---+-----+------+----+
#or using for loop on schema.names
for name in df.schema.names:
df = df.withColumnRenamed(name, re.sub(r'(^_|_$)','',name.replace('/', '_')))
df.show()
#+---+-----+------+----+
#| id|na_me|val_ue|rank|
#+---+-----+------+----+
#| 2| john| 1| 1|
#| 2| john| 1| 2|
#| 3| pete| 8| 3|
#| 3| pete| 8| 4|
#| 5|steve| 9| 5|
#+---+-----+------+----+
PySpark - rename more than one column using withColumnRenamed
It is not possible to use a single withColumnRenamed
call.
You can use
DataFrame.toDF
method*data.toDF('x3', 'x4')
or
new_names = ['x3', 'x4']
data.toDF(*new_names)It is also possible to rename with simple
select
:from pyspark.sql.functions import col
mapping = dict(zip(['x1', 'x2'], ['x3', 'x4']))
data.select([col(c).alias(mapping.get(c, c)) for c in data.columns])
Similarly in Scala you can:
Rename all columns:
val newNames = Seq("x3", "x4")
data.toDF(newNames: _*)Rename from mapping with
select
:val mapping = Map("x1" -> "x3", "x2" -> "x4")
df.select(
df.columns.map(c => df(c).alias(mapping.get(c).getOrElse(c))): _*
)or
foldLeft
+withColumnRenamed
mapping.foldLeft(data){
case (data, (oldName, newName)) => data.withColumnRenamed(oldName, newName)
}
* Not to be confused with RDD.toDF
which is not a variadic functions, and takes column names as a list,
Concatenate PySpark Dataframe Column Names by Value and Sum
I don't see anything wrong with writing a for loop here
from pyspark.sql import functions as F
cols = ['a', 'b', 'c', 'd', 'e']
temp = (df.withColumn('key', F.concat(*[F.when(F.col(c) == 1, c).otherwise('') for c in cols])))
+---+---+---+---+---+---+------------+----+
| id| a| b| c| d| e|extra_column| key|
+---+---+---+---+---+---+------------+----+
| 1| 0| 1| 1| 1| 1| something|bcde|
| 2| 0| 1| 1| 1| 0| something| bcd|
| 3| 1| 0| 0| 0| 0| something| a|
| 4| 0| 1| 0| 0| 0| something| b|
| 5| 1| 0| 0| 0| 0| something| a|
| 6| 0| 0| 0| 0| 0| something| |
+---+---+---+---+---+---+------------+----+
(temp
.groupBy('key')
.agg(F.count('*').alias('value'))
.where(F.col('key') != '')
.show()
)
+----+-----+
| key|value|
+----+-----+
|bcde| 1|
| b| 1|
| a| 2|
| bcd| 1|
+----+-----+
Related Topics
Python - Pygame Error When Executing Exe File
Does Python Have a Stack/Heap and How Is Memory Managed
How to Write a File or Data to an S3 Object Using Boto3
How to Stop Flask from Initialising Twice in Debug Mode
Django Smtpauthenticationerror
Python and Beautifulsoup Encoding Issues
Accessing Every 1St Element of Pandas Dataframe Column Containing Lists
How Can One Continuously Generate and Track Several Random Objects with a Time Delay in Pygame
Python: Change the Scripts Working Directory to the Script's Own Directory
How to Hide the Browser in Selenium Rc
Is There a Generator Version of 'String.Split()' in Python
What Is the Pythonic Way to Unpack Tuples
How to Suppress the Newline After a Print Statement
Scrollbar on Matplotlib Showing Page
How to Add a Custom Ca Root Certificate to the Ca Store Used by Pip in Windows