PySpark - Pass List as Parameter to UDF

PySpark - Pass list as parameter to UDF

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# sample data
a = sqlContext.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "distances"])
label_list = ["Great", "Good", "OK", "Please Move", "Dead"]

def cate(label_list, feature):
    if feature == 0:
        return label_list[4]
    else:  # without an 'else' branch, rows that miss the condition would get null
        return 'I am not sure!'

def udf_score(label_list):
    return udf(lambda l: cate(label_list, l), StringType())

a.withColumn("category", udf_score(label_list)(col("distances"))).show()

Output is:

+------+---------+--------------+
|Letter|distances|      category|
+------+---------+--------------+
|     A|       20|I am not sure!|
|     B|       30|I am not sure!|
|     D|       80|I am not sure!|
+------+---------+--------------+
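Note that in this snippet label_list is only consulted when the distance is exactly 0, so every row above falls through to 'I am not sure!'. If the goal is actually to bucket the distances into those labels, the same closure pattern extends naturally; here is a minimal sketch with made-up thresholds (the cut-off values are illustrative, not part of the original answer):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def cate_bucketed(label_list, distance):
    # Illustrative thresholds only -- adjust to your own ranges
    if distance < 25:
        return label_list[0]   # "Great"
    elif distance < 50:
        return label_list[1]   # "Good"
    elif distance < 75:
        return label_list[2]   # "OK"
    elif distance < 100:
        return label_list[3]   # "Please Move"
    else:
        return label_list[4]   # "Dead"

def udf_score_bucketed(label_list):
    return udf(lambda d: cate_bucketed(label_list, d), StringType())

a.withColumn("category", udf_score_bucketed(label_list)(col("distances"))).show()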

Passing a data frame column and external list to udf under withColumn

The cleanest solution is to pass additional arguments using closure:

def make_topic_word(topic_words):
    return udf(lambda c: label_maker_topic(c, topic_words))

df = sc.parallelize([(["union"], )]).toDF(["tokens"])

(df.withColumn("topics", make_topic_word(keyword_list)(col("tokens")))
    .show())

This doesn't require any changes to keyword_list or to the function you wrap in the UDF. You can also use this method to pass an arbitrary object, for example a list of sets for efficient lookups.
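For example, a plain Python set can be captured in the closure in exactly the same way, turning the UDF body into a fast membership test (keyword_set and the column names below are illustrative, not from the original answer):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

keyword_set = {"union", "strike"}  # arbitrary Python object captured by the closure

def make_keyword_flag(keywords):
    return udf(lambda tokens: any(t in keywords for t in tokens), BooleanType())

df.withColumn("has_keyword", make_keyword_flag(keyword_set)(col("tokens"))).show()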

If you want to use your current UDF and pass topic_words directly, you'll have to convert it to a column literal first:

from pyspark.sql.functions import array, lit

ks_lit = array(*[array(*[lit(k) for k in ks]) for ks in keyword_list])
df.withColumn("ad", topicWord(col("tokens"), ks_lit)).show()

Depending on your data and requirements, there can be alternative, more efficient solutions which don't require UDFs (explode + aggregate + collapse) or lookups (hashing + vector operations).
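As a rough illustration of the explode + aggregate + collapse route (assuming a flat keyword list, purely for this sketch):

from pyspark.sql.functions import explode, col, first, max as max_, monotonically_increasing_id

flat_keywords = ["union", "strike"]  # assumed flat list, not from the original answer

(df.withColumn("row_id", monotonically_increasing_id())                     # tag each row
   .withColumn("token", explode(col("tokens")))                             # one row per token
   .withColumn("is_keyword", col("token").isin(flat_keywords).cast("int"))  # per-token flag
   .groupBy("row_id")                                                       # collapse back to rows
   .agg(first("tokens").alias("tokens"),
        max_("is_keyword").alias("has_keyword"))
   .show())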

PySpark pass function as a parameter to UDF

You are not so far from the solution. Here is how I would do it:

from pyspark.sql.functions import udf, struct
from pyspark.sql.types import StringType

def foo_fun_udf(func):

    def foo_fun(row) -> str:
        return 'a' + func()

    out_udf = udf(foo_fun, StringType())
    return out_udf

df_to_test.withColumn(
    'foo',
    foo_fun_udf(bar_fun)(struct([df_to_test[x] for x in df_to_test.columns]))
).show()
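Neither bar_fun nor df_to_test is shown in the original answer; hypothetical definitions that make the snippet runnable end to end could look like this:

# Hypothetical inputs, purely for illustration
def bar_fun():
    return 'b'

df_to_test = spark.createDataFrame([(1, 'x'), (2, 'y')], ['id', 'val'])

With these, the withColumn call above adds a foo column containing 'ab' for every row, since foo_fun ignores the struct of columns it receives and only calls func.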

Pass list to udf in dataframe withColumn

The dynamic column approach below might solve your problem.

from pyspark.sql.functions import concat
# Creating an example DataFrame
values = [('A1',11,'A3','A4'),('B1',22,'B3','B4'),('C1',33,'C3','C4')]
df = spark.createDataFrame(values,['col1','col2','col3','col4'])
df.show()

'''
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|  A1|  11|  A3|  A4|
|  B1|  22|  B3|  B4|
|  C1|  33|  C3|  C4|
+----+----+----+----+
'''

col_list = ['col1','col2']
df = df.withColumn('concatenated_cols2',concat(*col_list))
col_list = ['col1','col2','col3']
df = df.withColumn('concatenated_cols3',concat(*col_list))
col_list = ['col1','col2','col3','col4']
df = df.withColumn('concatenated_cols4',concat(*col_list))
df.show()

'''
+----+----+----+----+------------------+------------------+------------------+
|col1|col2|col3|col4|concatenated_cols2|concatenated_cols3|concatenated_cols4|
+----+----+----+----+------------------+------------------+------------------+
|  A1|  11|  A3|  A4|              A111|            A111A3|          A111A3A4|
|  B1|  22|  B3|  B4|              B122|            B122B3|          B122B3B4|
|  C1|  33|  C3|  C4|              C133|            C133C3|          C133C3C4|
+----+----+----+----+------------------+------------------+------------------+
'''
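If the list of columns really has to reach a UDF (rather than a built-in such as concat), one option, sketched below, is to bundle the columns into a single array column with array(*...) and unpack it inside the Python function; the cast to string is there because the columns have mixed types (join_with_dash and joined_cols are illustrative names):

from pyspark.sql.functions import array, col, udf
from pyspark.sql.types import StringType

def join_with_dash(values):
    # values arrives as a Python list of the selected column values
    return '-'.join(values)

join_udf = udf(join_with_dash, StringType())

col_list = ['col1', 'col2', 'col3']
df = df.withColumn('joined_cols', join_udf(array(*[col(c).cast('string') for c in col_list])))
df.show()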


