Filtering Dataframe Using the Length of a Column

In Spark >= 1.5 you can use the size function:

from pyspark.sql.functions import col, size

df = sqlContext.createDataFrame([
    (["L", "S", "Y", "S"], ),
    (["L", "V", "I", "S"], ),
    (["I", "A", "N", "A"], ),
    (["I", "L", "S", "A"], ),
    (["E", "N", "N", "Y"], ),
    (["E", "I", "M", "A"], ),
    (["O", "A", "N", "A"], ),
    (["S", "U", "S"], )],
    ("tokens", ))

df.where(size(col("tokens")) <= 3).show()

## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+

In Spark < 1.5 a UDF should do the trick:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

size_ = udf(lambda xs: len(xs), IntegerType())

df.where(size_(col("tokens")) <= 3).show()

## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+

If you use HiveContext, then the size UDF with raw SQL should work with any version:

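# registerTempTable is deprecated since Spark 2.0;
# there, use df.createOrReplaceTempView("df") instead.
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE size(tokens) <= 3").show()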

## +--------------------+
## |              tokens|
## +--------------------+
## |ArrayBuffer(S, U, S)|
## +--------------------+

For string columns you can use either the UDF defined above or the length function:

from pyspark.sql.functions import length

df = sqlContext.createDataFrame([("fooo", ), ("bar", )], ("k", ))
df.where(length(col("k")) <= 3).show()

## +---+
## |  k|
## +---+
## |bar|
## +---+

Filter string data based on its string length

import pandas as pd

df = pd.read_csv('filex.csv')
df['A'] = df['A'].astype('str')
df['B'] = df['B'].astype('str')
mask = (df['A'].str.len() == 10) & (df['B'].str.len() == 10)
df = df.loc[mask]
print(df)

Applied to filex.csv:

A,B
123,abc
1234,abcd
1234567890,abcdefghij

the code above prints

            A           B
2  1234567890  abcdefghij
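
A side effect of astype('str') is that missing values become the literal string 'nan', which has length 3. As a minimal sketch, assuming the same three-row CSV content, the columns can instead be read as strings from the start with dtype=str; the in-memory csv_text below stands in for filex.csv so the snippet runs without a file on disk.

```python
import io

import pandas as pd

# Stand-in for filex.csv so the example is self-contained.
csv_text = "A,B\n123,abc\n1234,abcd\n1234567890,abcdefghij\n"

# dtype=str keeps every column as text, so no astype('str') call is needed.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Keep only the rows where both columns are exactly 10 characters long.
mask = (df["A"].str.len() == 10) & (df["B"].str.len() == 10)
filtered = df.loc[mask]
print(filtered)
```

With this approach a missing cell stays missing, its length is NaN, and the comparison with 10 evaluates to False for that row.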

Filter dataframe rows based on length of column values

If based on column A

In [865]: df[~(df.A.str.len() > 10)]
Out[865]:
     A  B
0    1  2
1  NaN  1

If based on all columns

In [866]: df[~df.applymap(lambda x: len(str(x)) > 10).any(axis=1)]
Out[866]:
     A  B
0    1  2
1  NaN  1
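
Note that DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map. As a minimal sketch that avoids both names, the same element-wise check can be written with Series.map applied column by column. The sample frame below is hypothetical, chosen only to resemble the output above.

```python
import pandas as pd

# Hypothetical data: the last row holds a value longer than 10 characters.
df = pd.DataFrame({
    "A": ["1", float("nan"), "12345678901"],
    "B": ["2", "1", "3"],
})

# Element-wise check, one column at a time.
# str(NaN) is "nan" (length 3), so missing values never count as too long.
too_long = df.apply(lambda column: column.map(lambda x: len(str(x)) > 10))

# Keep the rows in which no column exceeds 10 characters.
kept = df[~too_long.any(axis=1)]
print(kept)
```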

How to filter a pandas dataframe based on the length of an entry

If you specifically need len, then @MaxU's answer is best.

For a more general solution, you can use the map method of a Series.

df[df['amp'].map(len) == 495]

This will apply len to each element, which is what you want. With this method, you can use any arbitrary function, not just len.
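
As a rough illustration of both uses, the sketch below filters first with len and then with a custom function. The contents of the amp column and the helper count_above are hypothetical; the original question does not show its data.

```python
import pandas as pd

# Hypothetical data: each entry of 'amp' is a list of readings.
df = pd.DataFrame({"amp": [[0.1, 0.2, 0.3], [0.4], [0.5, 0.6, 0.7]]})

# Filter on the plain length of each entry.
exact = df[df["amp"].map(len) == 3]


def count_above(values, threshold=0.25):
    """Count the readings in one entry that exceed the threshold."""
    return sum(v > threshold for v in values)


# Filter with an arbitrary function instead of len.
custom = df[df["amp"].map(count_above) >= 2]

print(exact)
print(custom)
```

Keep in mind that len raises a TypeError on missing values (NaN is a float), so entries may need to be filled or dropped first.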

Creating a column based on filtering two data frames of different lengths using R

Your approach does not do what you expect. To explain:

  • It appears to work on the sample data only because the number of rows in the larger df (20) is a multiple of the number of rows in the smaller df (10).
  • Your syntax compares one complete vector with another complete vector (a column of the other df), because R operations are normally vectorised and the shorter vector is recycled.
  • A correct way to match one to many is purrr::map, where each individual value of the first argument (code2 here) is checked against another vector, here df1$code, which is not itself an argument of map.

df1 <- structure(list(id = 1:10, activity = c(0, 0, 0, 0, 1, 0, 0, 1, 
0, 0), code = c(2, 5, 11, 15, 3, 18, 21, 3, 27, 55)), class = "data.frame", row.names = c(NA,
-10L))
df2 <- structure(list(id2 = 1:20, code2 = c(2, 5, 11, 15, 9, 18, 21,
3, 27, 55, 2, 5, 11, 15, 3, 18, 21, 3, 27, 55), d_Activity = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0)), class = "data.frame", row.names = c(NA,
-20L))
library(tidyverse)

df2 %>%
  # map_int() returns a plain integer vector; map() would create a list column
  mutate(d_Activity = map_int(code2, ~ +(.x %in% df1$code[df1$activity == 1])))
#> id2 code2 d_Activity
#> 1 1 2 0
#> 2 2 5 0
#> 3 3 11 0
#> 4 4 15 0
#> 5 5 9 0
#> 6 6 18 0
#> 7 7 21 0
#> 8 8 3 1
#> 9 9 27 0
#> 10 10 55 0
#> 11 11 2 0
#> 12 12 5 0
#> 13 13 11 0
#> 14 14 15 0
#> 15 15 3 1
#> 16 16 18 0
#> 17 17 21 0
#> 18 18 3 1
#> 19 19 27 0
#> 20 20 55 0

Created on 2021-06-17 by the reprex package (v2.0.0)
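
For readers working in pandas rather than R, the same one-to-many membership test can be sketched with Series.isin. This is a translation of the idea above and not part of the original answer; the frames reuse the values of df1 and df2, and the name active_codes is introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": range(1, 11),
    "activity": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
    "code": [2, 5, 11, 15, 3, 18, 21, 3, 27, 55],
})
df2 = pd.DataFrame({
    "id2": range(1, 21),
    "code2": [2, 5, 11, 15, 9, 18, 21, 3, 27, 55,
              2, 5, 11, 15, 3, 18, 21, 3, 27, 55],
})

# Codes that belong to an active row in df1.
active_codes = df1.loc[df1["activity"] == 1, "code"]

# Check each code2 value against the whole set of active codes,
# independent of how many rows each frame has.
df2["d_Activity"] = df2["code2"].isin(active_codes).astype(int)
print(df2)
```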

Filtering a dataframe (where one specific column is of type 'object') based on the number of elements in that column (Association Rule Analysis)

You can use the pandas Series .apply() method.

def antecedent_length(s):
    # Calculate the length of the antecedent
    # `s` is the value in each row
    return len(s)

df[df.antecedents.apply(antecedent_length) == 1]

Depending on the exact format of the data, you may need to adjust the function such that it calculates the length properly.
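
As a sketch of what such an adjustment might look like, the example below assumes the antecedents are stored either as frozensets (which is an assumption about how the rules were generated) or as comma-separated strings, for instance after a round trip through a CSV file. The rules frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rules table with set-valued antecedents.
rules = pd.DataFrame({
    "antecedents": [
        frozenset({"milk"}),
        frozenset({"milk", "bread"}),
        frozenset({"eggs"}),
    ],
    "consequents": [
        frozenset({"bread"}),
        frozenset({"eggs"}),
        frozenset({"milk"}),
    ],
})


def antecedent_length(s):
    """Count the items in an antecedent given as a collection or a string."""
    if isinstance(s, str):
        # Assumed format: items separated by commas, e.g. "milk, bread".
        return len([item for item in s.split(",") if item.strip()])
    return len(s)


# Keep only the rules with exactly one antecedent item.
single = rules[rules["antecedents"].apply(antecedent_length) == 1]
print(single)
```

Calling len directly on a string would count characters instead of items, which is why the string case is handled separately here.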


