How to Change DataFrame Column Names in PySpark

How to change dataframe column names in PySpark?

There are many ways to do that:

  • Option 1. Using selectExpr. Note that the result contains only the expressions you list, so include every column you want to keep.

    data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)],
                                      ["Name", "askdaosdka"])
    data.show()
    data.printSchema()

    # Output
    #+-------+----------+
    #|   Name|askdaosdka|
    #+-------+----------+
    #|Alberto|         2|
    #| Dakota|         2|
    #+-------+----------+

    #root
    # |-- Name: string (nullable = true)
    # |-- askdaosdka: long (nullable = true)

    df = data.selectExpr("Name as name", "askdaosdka as age")
    df.show()
    df.printSchema()

    # Output
    #+-------+---+
    #|   name|age|
    #+-------+---+
    #|Alberto|  2|
    #| Dakota|  2|
    #+-------+---+

    #root
    # |-- name: string (nullable = true)
    # |-- age: long (nullable = true)
  • Option 2. Using withColumnRenamed. Note that this method lets you "overwrite" the same column. (The snippet below uses Python 3's range; on Python 2 this was xrange.)

     from functools import reduce

    oldColumns = data.schema.names
    newColumns = ["name", "age"]

    df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]),
                range(len(oldColumns)), data)
    df.printSchema()
    df.show()
  • Option 3. Using alias; in Scala you can also use as.

     from pyspark.sql.functions import col

    data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
    data.show()

    # Output
    #+-------+---+
    #|   name|age|
    #+-------+---+
    #|Alberto|  2|
    #| Dakota|  2|
    #+-------+---+
  • Option 4. Using sqlContext.sql, which lets you run SQL queries on DataFrames registered as tables. (A note on the modern SparkSession equivalent follows this list.)

     sqlContext.registerDataFrameAsTable(data, "myTable")
    df2 = sqlContext.sql("SELECT Name AS name, askdaosdka AS age FROM myTable")

    df2.show()

    # Output
    #+-------+---+
    #|   name|age|
    #+-------+---+
    #|Alberto|  2|
    #| Dakota|  2|
    #+-------+---+
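A quick aside, not from the original answer: registerDataFrameAsTable belongs to the legacy SQLContext API. On Spark 2.x and later, the same thing is usually written with a temp view and spark.sql; a minimal sketch, assuming a SparkSession named spark and the data dataframe from Option 1:

    # Same rename via the SparkSession API (temp view + spark.sql)
    data.createOrReplaceTempView("myTable")
    df2 = spark.sql("SELECT Name AS name, askdaosdka AS age FROM myTable")
    df2.show()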

Column Renaming in a PySpark dataframe

Try it as below.

Input dataframe:

from pyspark.sql.types import *
from pyspark.sql.functions import *

data = [("xyz", 1)]

schema = StructType([
    StructField("Number_of_data_samples_(input)", StringType(), True),
    StructField("id", IntegerType()),
])

df = spark.createDataFrame(data=data, schema=schema)

df.show()

+------------------------------+---+
|Number_of_data_samples_(input)| id|
+------------------------------+---+
|                           xyz|  1|
+------------------------------+---+

Method 1
Use regular expressions to strip the special characters, then pass the cleaned names to toDF(). Note that toDF replaces all column names positionally, so the list must cover every column in order.

import re

cols = [re.sub(r"\.|\)|\(", "", c) for c in df.columns]  # drop '.', '(' and ')'
df.toDF(*cols).show()

+----------------------------+---+
|Number_of_data_samples_input| id|
+----------------------------+---+
|                         xyz|  1|
+----------------------------+---+

Method 2
Using .withColumnRenamed()

for old, new in zip(df.columns, cols):
    df = df.withColumnRenamed(old, new)

df.show()

+----------------------------+---+
|Number_of_data_samples_input| id|
+----------------------------+---+
|                         xyz|  1|
+----------------------------+---+

Method 3
Using .withColumn to create a new column and drop the existing column

df = (df.withColumn("Number_of_data_samples_input", col("Number_of_data_samples_(input)"))
        .drop("Number_of_data_samples_(input)"))

df.show()

+---+----------------------------+
| id|Number_of_data_samples_input|
+---+----------------------------+
|  1|                         xyz|
+---+----------------------------+

How to rename a dataframe column in PySpark?

A possible way of renaming at the dataframe level:

from functools import reduce

oldColumns = ['rate%year']
newColumns = ['rateyear']
df1 = reduce(lambda df, idx: df.withColumnRenamed(oldColumns[idx], newColumns[idx]),
             range(len(oldColumns)), df)

This works fine at the dataframe level. Any suggestion on how to resolve it at the table level?
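One hedged possibility, not from the original thread: Spark SQL has an ALTER TABLE ... RENAME COLUMN statement, but only for table formats that support it (v2 catalog tables such as Delta; plain Hive/Parquet tables may reject it). A sketch, with my_table as a hypothetical table name:

    # Sketch only: needs a table format that supports column renames.
    # `my_table` is a hypothetical name; `rate%year` needs backquotes
    # because of the special character.
    spark.sql("ALTER TABLE my_table RENAME COLUMN `rate%year` TO rateyear")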

Replace characters in column names in PySpark dataframes

Try a regular-expression replace (re.sub), the Python way.

import re

df = spark.createDataFrame([(2, 'john', 1, 1),
                            (2, 'john', 1, 2),
                            (3, 'pete', 8, 3),
                            (3, 'pete', 8, 4),
                            (5, 'steve', 9, 5)],
                           ['id', '/na/me', 'val/ue', 'rank/'])

# replace '/' with '_', then trim any leading or trailing '_'
cols = [re.sub(r'(^_|_$)', '', c.replace('/', '_')) for c in df.columns]

df.toDF(*cols).show()
#+---+-----+------+----+
#| id|na_me|val_ue|rank|
#+---+-----+------+----+
#|  2| john|     1|   1|
#|  2| john|     1|   2|
#|  3| pete|     8|   3|
#|  3| pete|     8|   4|
#|  5|steve|     9|   5|
#+---+-----+------+----+

# or using a for loop over schema.names
for name in df.schema.names:
    df = df.withColumnRenamed(name, re.sub(r'(^_|_$)', '', name.replace('/', '_')))

df.show()
#+---+-----+------+----+
#| id|na_me|val_ue|rank|
#+---+-----+------+----+
#|  2| john|     1|   1|
#|  2| john|     1|   2|
#|  3| pete|     8|   3|
#|  3| pete|     8|   4|
#|  5|steve|     9|   5|
#+---+-----+------+----+

PySpark - rename more than one column using withColumnRenamed

It is not possible with a single withColumnRenamed call, which renames only one column at a time.
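(A hedged aside, not part of the original answer: newer Spark releases, 3.4 onward if memory serves, do add a plural DataFrame.withColumnsRenamed that takes a dict of old-to-new names in one call.)

    # Sketch, assuming Spark >= 3.4 where withColumnsRenamed exists
    data.withColumnsRenamed({'x1': 'x3', 'x2': 'x4'})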

  • You can use the DataFrame.toDF method*:

    data.toDF('x3', 'x4')

    or

    new_names = ['x3', 'x4']
    data.toDF(*new_names)
  • It is also possible to rename with a simple select:

    from pyspark.sql.functions import col

    mapping = dict(zip(['x1', 'x2'], ['x3', 'x4']))
    data.select([col(c).alias(mapping.get(c, c)) for c in data.columns])

Similarly in Scala you can:

  • Rename all columns:

    val newNames = Seq("x3", "x4")

    data.toDF(newNames: _*)
  • Rename from mapping with select:

    val mapping = Map("x1" -> "x3", "x2" -> "x4")

    df.select(
      df.columns.map(c => df(c).alias(mapping.get(c).getOrElse(c))): _*
    )

    or foldLeft + withColumnRenamed:

    mapping.foldLeft(data) {
      case (data, (oldName, newName)) => data.withColumnRenamed(oldName, newName)
    }

* Not to be confused with RDD.toDF, which is not variadic and takes column names as a list.
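To make that footnote concrete, a minimal sketch, assuming an active SparkSession named spark:

    # RDD.toDF takes the names as one list argument, not varargs
    rdd = spark.sparkContext.parallelize([("Alberto", 2), ("Dakota", 2)])
    rdd.toDF(['x3', 'x4']).show()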

Concatenate PySpark Dataframe Column Names by Value and Sum

I don't see anything wrong with looping over the columns here:

from pyspark.sql import functions as F

cols = ['a', 'b', 'c', 'd', 'e']

temp = df.withColumn('key', F.concat(*[F.when(F.col(c) == 1, c).otherwise('') for c in cols]))

+---+---+---+---+---+---+------------+----+
| id|  a|  b|  c|  d|  e|extra_column| key|
+---+---+---+---+---+---+------------+----+
|  1|  0|  1|  1|  1|  1|   something|bcde|
|  2|  0|  1|  1|  1|  0|   something| bcd|
|  3|  1|  0|  0|  0|  0|   something|   a|
|  4|  0|  1|  0|  0|  0|   something|   b|
|  5|  1|  0|  0|  0|  0|   something|   a|
|  6|  0|  0|  0|  0|  0|   something|    |
+---+---+---+---+---+---+------------+----+

(temp
 .groupBy('key')
 .agg(F.count('*').alias('value'))
 .where(F.col('key') != '')
 .show()
)

+----+-----+
| key|value|
+----+-----+
|bcde|    1|
|   b|    1|
|   a|    2|
| bcd|    1|
+----+-----+

