Filtering Dataframe Using the Length of a Column

Filtering DataFrame using the length of a column

In Spark >= 1.5 you can use the size function:

from pyspark.sql.functions import col, size

df = sqlContext.createDataFrame([
    (["L", "S", "Y", "S"],  ),
    (["L", "V", "I", "S"],  ),
    (["I", "A", "N", "A"],  ),
    (["I", "L", "S", "A"],  ),
    (["E", "N", "N", "Y"],  ),
    (["E", "I", "M", "A"],  ),
    (["O", "A", "N", "A"],  ),
    (["S", "U", "S"],  )], 
    ("tokens", ))

df.where(size(col("tokens")) <= 3).show()

## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+

In Spark < 1.5 a UDF should do the trick:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

size_ = udf(lambda xs: len(xs), IntegerType())

df.where(size_(col("tokens")) <= 3).show()

## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+

If you use HiveContext, the size UDF with raw SQL should work with any version:

df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE size(tokens) <= 3").show()

## +--------------------+
## |              tokens|
## +--------------------+
## |ArrayBuffer(S, U, S)|
## +--------------------+

For string columns you can use either the UDF defined above or the length function:

from pyspark.sql.functions import length

df = sqlContext.createDataFrame([("fooo", ), ("bar", )], ("k", ))
df.where(length(col("k")) <= 3).show()

## +---+
## |  k|
## +---+
## |bar|
## +---+

Filter string data based on its string length

import pandas as pd

df = pd.read_csv('filex.csv')
df['A'] = df['A'].astype('str')
df['B'] = df['B'].astype('str')
mask = (df['A'].str.len() == 10) & (df['B'].str.len() == 10)
df = df.loc[mask]
print(df)

Applied to filex.csv:

A,B
123,abc
1234,abcd
1234567890,abcdefghij

the code above prints

            A           B
2  1234567890  abcdefghij
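The same filter can be tried without a file on disk. The sketch below is only an illustration: it feeds the sample rows through io.StringIO (an assumption standing in for filex.csv) and passes dtype=str to read_csv so that both columns are read as strings up front instead of being cast afterwards.

```python
import io

import pandas as pd

# Sample rows from above, held in memory instead of in filex.csv.
csv_text = "A,B\n123,abc\n1234,abcd\n1234567890,abcdefghij\n"

# dtype=str reads every column as text, so no astype step is needed.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Keep rows where both columns are exactly 10 characters long.
mask = (df["A"].str.len() == 10) & (df["B"].str.len() == 10)
result = df.loc[mask]
print(result)
```

Only the third row (index 2) satisfies both conditions, which matches the printed output above.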

filter dataframe rows based on length of column values

If based on column A

In [865]: df[~(df.A.str.len() > 10)]
Out[865]:
     A  B
0    1  2
1  NaN  1

If based on all columns

In [866]: df[~df.applymap(lambda x: len(str(x)) > 10).any(axis=1)]
Out[866]:
     A  B
0    1  2
1  NaN  1
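The frame behind the session above is not shown, so here is a rough, self-contained sketch with made-up data. It uses the negated form from the first snippet, which keeps rows whose value is missing because a NaN length never compares as greater than 10. For the all-columns case it swaps applymap for apply plus Series.map, which should behave the same way across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 has a long value in A, row 3 a long value in B.
df = pd.DataFrame(
    {
        "A": ["1", np.nan, "12345678901", "x"],
        "B": ["2", "1", "3", "abcdefghijkl"],
    },
    dtype=object,
)

# Based on column A only: NaN > 10 is False, so the missing row is kept.
kept_by_a = df[~(df["A"].str.len() > 10)]

# Based on all columns: flag every cell whose string form exceeds 10 characters,
# then drop any row that contains at least one flagged cell.
long_cell = df.apply(lambda col: col.map(lambda x: len(str(x)) > 10))
kept_all = df[~long_cell.any(axis=1)]

print(kept_by_a)
print(kept_all)
```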

How to filter a pandas dataframe based on the length of an entry

If you specifically need len, then @MaxU's answer is best.

For a more general solution, you can use the map method of a Series.

df[df['amp'].map(len) == 495]

This will apply len to each element, which is what you want. With this method, you can use any arbitrary function, not just len.
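As a small illustration of that last point, the sketch below uses an invented 'amp' column of short lists (the real data and the target length of 495 are not reproduced here) and filters once with len and once with sum.

```python
import pandas as pd

# Hypothetical data: each entry in 'amp' is a list of readings.
df = pd.DataFrame({"amp": [[1, 2, 3], [4, 5], [6, 7, 8]]})

# Same pattern as above, with a target length of 3 instead of 495.
by_len = df[df["amp"].map(len) == 3]

# Any function of a single element works, for example the sum of each list.
by_sum = df[df["amp"].map(sum) > 8]

print(by_len)
print(by_sum)
```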

Creating a column based on filtering two data frames of different lengths using R

Actually, you are doing it the wrong way. Let me explain.

It works on the sample data only because the number of rows in the larger df (20) is a multiple of the number of rows in the smaller df (10).
With your syntax you are comparing one complete vector against another complete vector (a column of the other df), because R normally works in a vectorised way and recycles the shorter vector to the length of the longer one.
The correct way to match one to many is through purrr::map, where each individual value of the first argument (code2 here) is checked against another vector, i.e. df1$code, which is not itself an argument of map.

df1 <- structure(
  list(
    id = 1:10,
    activity = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0),
    code = c(2, 5, 11, 15, 3, 18, 21, 3, 27, 55)
  ),
  class = "data.frame",
  row.names = c(NA, -10L)
)
df2 <- structure(
  list(
    id2 = 1:20,
    code2 = c(2, 5, 11, 15, 9, 18, 21, 3, 27, 55,
              2, 5, 11, 15, 3, 18, 21, 3, 27, 55),
    d_Activity = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                   0, 0, 0, 0, 1, 0, 0, 1, 0, 0)
  ),
  class = "data.frame",
  row.names = c(NA, -20L)
)
library(tidyverse)

df2 %>%
  mutate(d_Activity = map_int(code2, ~ +(.x %in% df1$code[df1$activity == 1])))
#>    id2 code2 d_Activity
#> 1    1     2          0
#> 2    2     5          0
#> 3    3    11          0
#> 4    4    15          0
#> 5    5     9          0
#> 6    6    18          0
#> 7    7    21          0
#> 8    8     3          1
#> 9    9    27          0
#> 10  10    55          0
#> 11  11     2          0
#> 12  12     5          0
#> 13  13    11          0
#> 14  14    15          0
#> 15  15     3          1
#> 16  16    18          0
#> 17  17    21          0
#> 18  18     3          1
#> 19  19    27          0
#> 20  20    55          0

Created on 2021-06-17 by the reprex package (v2.0.0)

Filtering a dataframe (where one specific column is of type 'object') based on the number of elements in that column (Association Rule Analysis)

You can use the Series .apply() method.

def antecedent_length(s):
    # Calculate the length of the antecedent
    # `s` is the value in each row
    return len(s)

df[df.antecedents.apply(antecedent_length) == 1]

Depending on the exact format of the data, you may need to adjust the function such that it calculates the length properly.
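As one possible adjustment, the sketch below assumes two formats that the question does not confirm: antecedents stored as set-like objects, or as comma-separated strings. The column values and the 'confidence' column are invented for illustration.

```python
import pandas as pd


def antecedent_length(s):
    # Set-like and sequence values: count the elements directly.
    if isinstance(s, (set, frozenset, list, tuple)):
        return len(s)
    # Assumption: a string holds items separated by commas, e.g. "eggs, tea".
    return len([item for item in str(s).split(",") if item.strip()])


# Hypothetical rules table mixing both formats.
df = pd.DataFrame(
    {
        "antecedents": [
            frozenset({"milk"}),
            frozenset({"milk", "bread"}),
            "eggs",
            "eggs, tea",
        ],
        "confidence": [0.8, 0.6, 0.7, 0.5],
    }
)

# Keep only the rules with a single antecedent.
single = df[df["antecedents"].apply(antecedent_length) == 1]
print(single)
```

Calling len directly on a string would count characters rather than items, which is why the string case is handled separately here.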
