PySpark - Pass List as Parameter to UDF

PySpark - Pass list as parameter to UDF

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# sample data
a = sqlContext.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "distances"])
label_list = ["Great", "Good", "OK", "Please Move", "Dead"]

def cate(label_list, feature):
    if feature == 0:
        return label_list[4]
    else:  # without an 'else' branch, rows that miss the condition would get null
        return 'I am not sure!'

def udf_score(label_list):
    return udf(lambda l: cate(label_list, l), StringType())

a.withColumn("category", udf_score(label_list)(col("distances"))).show()

Output is:

+------+---------+--------------+
|Letter|distances|      category|
+------+---------+--------------+
|     A|       20|I am not sure!|
|     B|       30|I am not sure!|
|     D|       80|I am not sure!|
+------+---------+--------------+
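Note that in this snippet label_list is only consulted when the distance is exactly 0, so every row above falls through to 'I am not sure!'. If the goal is actually to bucket the distances into those labels, the same closure pattern extends naturally; here is a minimal sketch with made-up thresholds (the cut-off values are illustrative, not part of the original answer):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def cate_bucketed(label_list, distance):
    # Illustrative thresholds only -- adjust to your own ranges
    if distance < 25:
        return label_list[0]   # "Great"
    elif distance < 50:
        return label_list[1]   # "Good"
    elif distance < 75:
        return label_list[2]   # "OK"
    elif distance < 100:
        return label_list[3]   # "Please Move"
    else:
        return label_list[4]   # "Dead"

def udf_score_bucketed(label_list):
    return udf(lambda d: cate_bucketed(label_list, d), StringType())

a.withColumn("category", udf_score_bucketed(label_list)(col("distances"))).show()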

Passing a data frame column and external list to udf under withColumn

The cleanest solution is to pass additional arguments using closure:

def make_topic_word(topic_words):
    return udf(lambda c: label_maker_topic(c, topic_words))

df = sc.parallelize([(["union"], )]).toDF(["tokens"])

(df.withColumn("topics", make_topic_word(keyword_list)(col("tokens")))
    .show())

This doesn't require any changes to keyword_list or to the function you wrap in the UDF. You can also use this method to pass an arbitrary object, for example a list of sets for efficient lookups.
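For example, a plain Python set can be captured in the closure in exactly the same way, turning the UDF body into a fast membership test (keyword_set and the column names below are illustrative, not from the original answer):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

keyword_set = {"union", "strike"}  # arbitrary Python object captured by the closure

def make_keyword_flag(keywords):
    return udf(lambda tokens: any(t in keywords for t in tokens), BooleanType())

df.withColumn("has_keyword", make_keyword_flag(keyword_set)(col("tokens"))).show()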

If you want to use your current UDF and pass topic_words directly, you'll have to convert it to a column literal first:

from pyspark.sql.functions import array, lit

ks_lit = array(*[array(*[lit(k) for k in ks]) for ks in keyword_list])
df.withColumn("ad", topicWord(col("tokens"), ks_lit)).show()

Depending on your data and requirements, there can be alternative, more efficient solutions which don't require UDFs (explode + aggregate + collapse) or lookups (hashing + vector operations).
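As a rough illustration of the explode + aggregate + collapse route (assuming a flat keyword list, purely for this sketch):

from pyspark.sql.functions import explode, col, first, max as max_, monotonically_increasing_id

flat_keywords = ["union", "strike"]  # assumed flat list, not from the original answer

(df.withColumn("row_id", monotonically_increasing_id())                     # tag each row
   .withColumn("token", explode(col("tokens")))                             # one row per token
   .withColumn("is_keyword", col("token").isin(flat_keywords).cast("int"))  # per-token flag
   .groupBy("row_id")                                                       # collapse back to rows
   .agg(first("tokens").alias("tokens"),
        max_("is_keyword").alias("has_keyword"))
   .show())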

PySpark pass function as a parameter to UDF

You are not so far from the solution. Here is how I would do it:

from pyspark.sql.functions import udf, struct
from pyspark.sql.types import StringType

def foo_fun_udf(func):

    def foo_fun(row) -> str:
        return 'a' + func()

    out_udf = udf(foo_fun, StringType())
    return out_udf

df_to_test.withColumn(
    'foo',
    foo_fun_udf(bar_fun)(struct([df_to_test[x] for x in df_to_test.columns]))
).show()
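Neither bar_fun nor df_to_test is shown in the original answer; hypothetical definitions that make the snippet runnable end to end could look like this:

# Hypothetical inputs, purely for illustration
def bar_fun():
    return 'b'

df_to_test = spark.createDataFrame([(1, 'x'), (2, 'y')], ['id', 'val'])

With these, the withColumn call above adds a foo column containing 'ab' for every row, since foo_fun ignores the struct of columns it receives and only calls func.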

Pass list to udf in dataframe withColumn

The dynamic column approach below might solve your problem.

from pyspark.sql.functions import concat
# Creating an example DataFrame
values = [('A1',11,'A3','A4'),('B1',22,'B3','B4'),('C1',33,'C3','C4')]
df = spark.createDataFrame(values,['col1','col2','col3','col4'])
df.show()

'''
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|  A1|  11|  A3|  A4|
|  B1|  22|  B3|  B4|
|  C1|  33|  C3|  C4|
+----+----+----+----+
'''

col_list = ['col1','col2']
df = df.withColumn('concatenated_cols2',concat(*col_list))
col_list = ['col1','col2','col3']
df = df.withColumn('concatenated_cols3',concat(*col_list))
col_list = ['col1','col2','col3','col4']
df = df.withColumn('concatenated_cols4',concat(*col_list))
df.show()

'''
+----+----+----+----+------------------+------------------+------------------+
|col1|col2|col3|col4|concatenated_cols2|concatenated_cols3|concatenated_cols4|
+----+----+----+----+------------------+------------------+------------------+
|  A1|  11|  A3|  A4|              A111|            A111A3|          A111A3A4|
|  B1|  22|  B3|  B4|              B122|            B122B3|          B122B3B4|
|  C1|  33|  C3|  C4|              C133|            C133C3|          C133C3C4|
+----+----+----+----+------------------+------------------+------------------+
'''
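If the list of columns really has to reach a UDF (rather than a built-in such as concat), one option, sketched below, is to bundle the columns into a single array column with array(*...) and unpack it inside the Python function; the cast to string is there because the columns have mixed types (join_with_dash and joined_cols are illustrative names):

from pyspark.sql.functions import array, col, udf
from pyspark.sql.types import StringType

def join_with_dash(values):
    # values arrives as a Python list of the selected column values
    return '-'.join(values)

join_udf = udf(join_with_dash, StringType())

col_list = ['col1', 'col2', 'col3']
df = df.withColumn('joined_cols', join_udf(array(*[col(c).cast('string') for c in col_list])))
df.show()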


