What Are Python pandas Equivalents for R Functions Like str(), summary(), and head()?

What are Python pandas equivalents for R functions like str(), summary(), and head()?

  • summary() ~ describe()
  • head() ~ head()

I'm not sure about the str() equivalent.
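
In pandas terms, a quick sketch of all three (df here is a hypothetical example frame; info() and dtypes are the candidates usually offered in place of str()):

import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': ['a', 'b', 'c']})  # hypothetical data

df.describe()  # ~ summary(): count, mean, std, quartiles per numeric column
df.head()      # ~ head(): first 5 rows by default; pass n to change
df.info()      # closest to str(): dtypes, non-null counts, memory usage
df.dtypes      # just the per-column types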

R summary() equivalent in numpy

No. You'll need to use pandas.

R is a language for statistics, so much of the basic functionality you need, like summary() and lm(), is loaded when you boot it up. Python is a general-purpose language, so you need to install and import the appropriate statistical packages. numpy isn't a statistics package - it's for numerical computation more generally - so you need packages like pandas, scipy, and statsmodels to let Python do what R can do out of the box.
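
To make the contrast concrete, here is a minimal sketch (the array values are made up) of assembling summary()-style statistics by hand in numpy versus one call in pandas:

import numpy as np
import pandas as pd

a = np.array([1.0, 2.0, 3.0, 4.0])  # made-up data

# numpy: you compute each statistic yourself
stats = (a.mean(), a.std(ddof=1), np.percentile(a, [25, 50, 75]))

# pandas: one call, like R's summary()
print(pd.Series(a).describe())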

Python equivalent of R's head and tail functions

Suppose you want to output the first and last 10 rows of the iris data set.

In R:

data(iris)
head(iris, 10)
tail(iris, 10)

In Python (scikit-learn required to load the iris data set):

import pandas as pd
from sklearn import datasets

# Load the iris data, keeping the original feature names as column labels
raw = datasets.load_iris()
iris = pd.DataFrame(raw.data, columns=raw.feature_names)
iris.head(10)
iris.tail(10)

Now, as previously answered, if your data frame is too large for the terminal display you use, a summarized view is output. To inspect your data in a terminal, you could either widen the terminal or reduce the number of columns to display, as follows.

iris.iloc[:, 1:2].head(10)

EDIT. Changed .ix to .iloc. From the pandas documentation,

Starting in 0.20.0, the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.

datatype in python (looking for something similar to str in R)

You probably want dtypes:

>>> import pandas as pd
>>> df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [1.0, 2.0, 3.0], 'baz': ['qux', 'quux', 'quuux']})
>>> df.dtypes
bar    float64
baz     object
foo      int64
dtype: object
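
If you want something closer to R's str() in a single view, DataFrame.info() prints the dtypes together with non-null counts and memory usage (the exact output format varies by pandas version; the sketch below is from a recent release):

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   bar     3 non-null      float64
 1   baz     3 non-null      object
 2   foo     3 non-null      int64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes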

Polars python equivalent to glimpse and summary in R

Polars has a describe method:

import polars as pl

df = pl.DataFrame({
    'a': [1.0, 2.8, 3.0],
    'b': [4, 5, 6],
    "c": [True, False, True]
})

df.describe()
shape: (5, 4)
╭──────────┬───────┬─────┬──────╮
│ describe ┆ a     ┆ b   ┆ c    │
│ ---      ┆ ---   ┆ --- ┆ ---  │
│ str      ┆ f64   ┆ f64 ┆ f64  │
╞══════════╪═══════╪═════╪══════╡
│ "mean"   ┆ 2.267 ┆ 5   ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "std"    ┆ 1.102 ┆ 1   ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "min"    ┆ 1     ┆ 4   ┆ 0.0  │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "max"    ┆ 3     ┆ 6   ┆ 1    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "median" ┆ 2.8   ┆ 5   ┆ null │
╰──────────┴───────┴─────┴──────╯

This reports descriptive statistics per column, much like R's summary(). I have not used glimpse() myself, but a quick search suggests it does something similar to Polars' head(), only with the output transposed so it is easier to digest when there are many columns.
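
For what it's worth, recent Polars releases also ship a glimpse() method directly (check that your installed version has it); it prints roughly one line per column:

df.glimpse()
# Rows: 3
# Columns: 3
# $ a  <f64> 1.0, 2.8, 3.0
# $ b  <i64> 4, 5, 6
# $ c <bool> True, False, True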

Why is my pandas df all object data types as opposed to e.g. int, string etc?

Sample:

df = pd.DataFrame({'strings': ['a', 'd', 'f'],
                   'dicts': [{'a': 4}, {'c': 8}, {'e': 9}],
                   'lists': [[4, 8], [7, 8], [3]],
                   'tuples': [(4, 8), (7, 8), (3,)],
                   'sets': [set([1, 8]), set([7, 3]), set([0, 1])]})

print (df)
      dicts   lists    sets strings  tuples
0  {'a': 4}  [4, 8]  {8, 1}       a  (4, 8)
1  {'c': 8}  [7, 8]  {3, 7}       d  (7, 8)
2  {'e': 9}     [3]  {0, 1}       f    (3,)

All columns share the same dtype, object:

print (df.dtypes)
dicts      object
lists      object
sets       object
strings    object
tuples     object
dtype: object

But the Python type of each value differs; you can check it with a loop:

for col in df:
    print (df[col].apply(type))

0 <class 'dict'>
1 <class 'dict'>
2 <class 'dict'>
Name: dicts, dtype: object
0 <class 'list'>
1 <class 'list'>
2 <class 'list'>
Name: lists, dtype: object
0 <class 'set'>
1 <class 'set'>
2 <class 'set'>
Name: sets, dtype: object
0 <class 'str'>
1 <class 'str'>
2 <class 'str'>
Name: strings, dtype: object
0 <class 'tuple'>
1 <class 'tuple'>
2 <class 'tuple'>
Name: tuples, dtype: object

Or check the type of the first value in each column:

print (type(df['strings'].iat[0]))
<class 'str'>

print (type(df['dicts'].iat[0]))
<class 'dict'>

print (type(df['lists'].iat[0]))
<class 'list'>

print (type(df['tuples'].iat[0]))
<class 'tuple'>

print (type(df['sets'].iat[0]))
<class 'set'>

Or apply type element-wise with applymap:

print (df.applymap(type))
         strings           dicts           lists           tuples  \
0  <class 'str'>  <class 'dict'>  <class 'list'>  <class 'tuple'>
1  <class 'str'>  <class 'dict'>  <class 'list'>  <class 'tuple'>
2  <class 'str'>  <class 'dict'>  <class 'list'>  <class 'tuple'>

            sets
0  <class 'set'>
1  <class 'set'>
2  <class 'set'>
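
If the goal is real dtypes rather than just inspecting object columns, explicit conversion is the usual route; a small sketch (assuming pandas >= 1.0 for the dedicated string dtype):

df['strings'] = df['strings'].astype('string')   # StringDtype instead of object
# numeric-looking object columns can be coerced similarly:
# df['col'] = pd.to_numeric(df['col'], errors='coerce')

Container columns (dicts, lists, sets, tuples) have no dedicated dtype and will stay object.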

R function for sorting in Python Zip like structure

You can reverse the vector and combine it with the original, selecting one or the other based on an even/odd sequence:

library(dplyr)

newdata <- data.frame(cbind(dat = data,
                            rev = c(1, rev(data)[-length(data)]),
                            n = seq(1:length(data)))) %>%
  mutate(res = ifelse(n %% 2 == 1, dat, rev)) %>%
  select(res)
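
Since the question asks for a Python counterpart, here is a rough numpy translation of the same even/odd interleaving idea (data is a made-up input vector):

import numpy as np

data = np.array([5, 2, 9, 1, 7])                  # made-up input
rev = np.concatenate(([1], data[::-1][:-1]))      # 1, then reversed data minus its last element
n = np.arange(1, len(data) + 1)                   # 1-based positions, as in the R code
res = np.where(n % 2 == 1, data, rev)             # odd positions keep data, even take rev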

Python Pandas equivalent in JavaScript

This wiki answer summarizes and compares many pandas-like JavaScript libraries.

In general, you should check out the d3 JavaScript library. d3 is a very useful "Swiss Army knife" for handling data in JavaScript, just as pandas is for Python. You may see d3 used as frequently as pandas, even though d3 is not exactly a DataFrame/pandas replacement (i.e. d3 doesn't have the same API; it has no Series / DataFrame classes with methods that match pandas behavior).

Ahmed's answer explains how d3 can be used to achieve some DataFrame functionality, and some of the libraries below were inspired by things like LearnJsData which uses d3 and lodash.

As for DataFrame-style data transformation (splitting, joining, group by, etc.), here is a quick list of some of the JavaScript libraries.

Note that some libraries are Node.js (server-side JavaScript), some are browser-compatible (client-side JavaScript), and some are TypeScript; use the option that's right for you.

  • danfo-js (browser-support AND NodeJS-support)
    • From Vignesh's answer

    • danfo (often imported and aliased as dfd) has a basic DataFrame-type data structure, with the ability to plot directly

    • Built by the team at Tensorflow: "One of the main goals of Danfo.js is to bring data processing, machine learning and AI tools to JavaScript developers. ... Open-source libraries like Numpy and Pandas..."

    • pandas is built on top of numpy; likewise danfo-js is built on tensorflow-js

    • please note danfo may not (yet?) support multi-column indexes

  • pandas-js
    • UPDATE The pandas-js repo has not been updated in a while
    • From STEEL and Feras' answers
    • "pandas.js is an open source (experimental) library mimicking the Python pandas library. It relies on Immutable.js as the NumPy logical equivalent. The main data objects in pandas.js are, like in Python pandas, the Series and the DataFrame."
  • dataframe-js
    • "DataFrame-js provides an immutable data structure for javascript and datascience, the DataFrame, which allows to work on rows and columns with a sql and functional programming inspired api."
  • data-forge
    • Seen in Ashley Davis' answer
    • "JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ."
    • Note the old data-forge JS repository is no longer maintained; now a new repository uses Typescript
  • jsdataframe
    • "Jsdataframe is a JavaScript data wrangling library inspired by data frame functionality in R and Python Pandas."
  • dataframe
    • "explore data by grouping and reducing."
  • SQL Frames
    • "DataFrames meet SQL, in the Browser"
    • "SQL Frames is a low code data management framework that can be directly embedded in the browser to provide rich data visualization and UX. Complex DataFrames can be composed using familiar SQL constructs. With its powerful built-in analytics engine, data sources can come in any shape, form and frequency and they can be analyzed directly within the browser. It allows scaling to big data backends by transpiling the composed DataFrame logic to SQL."

Then after coming to this question, checking other answers here and doing more searching, I found options like:

  • Apache Arrow in JS
    • Thanks to user Back2Basics suggestion:
    • "Apache Arrow is a columnar memory layout specification for encoding vectors and table-like containers of flat and nested data. Apache Arrow is the emerging standard for large in-memory columnar data (Spark, Pandas, Drill, Graphistry, ...)"
  • polars
    • Polars is a blazingly fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as memory model.
  • Observable
    • At first glance, seems like a JS alternative to the IPython/Jupyter "notebooks"
    • Observable's page promises: "Reactive programming", a "Community", on a "Web Platform"
    • See 5 minute intro here
  • portal.js (formerly recline; from Rufus' answer)
    • MAY BE OUTDATED: Does not use a "DataFrame" API
    • MAY BE OUTDATED: Instead emphasizes its "Multiview" (the UI) API, (similar to jQuery/DOM model) which doesn't require jQuery but does require a browser! More examples
    • MAY BE OUTDATED: Also emphasizes its MVC-ish architecture; including back-end stuff (i.e. database connections)
  • js-data
    • Really more of an ORM! Most of its modules correspond to different data storage backends (js-data-mongodb, js-data-redis, js-data-cloud-datastore), sorting, filtering, etc.
    • On plus-side does work on Node.js as a first-priority; "Works in Node.js and in the Browser."
  • miso (another suggestion from Rufus)
    • Impressive backers like Guardian and bocoup.
  • AlaSQL
    • "AlaSQL" is an open source SQL database for Javascript with a strong focus on query speed and data source flexibility for both relational data and schemaless data. It works in your browser, Node.js, and Cordova."
  • Some thought experiments:
    • "Scaling a DataFrame in Javascript" - Gary Sieling

Here are the criteria we used to consider the above choices

  • General Criteria
    • Language (NodeJS vs browser JS vs Typescript)
    • Dependencies (i.e. if it uses an underlying library / AJAX / remote APIs)
    • Actively supported (active user-base, active source repository, etc)
    • Size/speed of JS library
  • pandas' criteria in its R comparison
    • Performance
    • Functionality/flexibility
    • Ease-of-use
  • Similarity to pandas / DataFrame APIs
    • Specifically hits on their main features
    • Data-science emphasis
    • Built-in visualization functions
    • Demonstrated integration in combination with other tools like Jupyter
      (interactive notebooks), etc

Difference between describe() and summary() in Apache Spark

These functions serve different purposes depending on the args you pass:

The .describe() function takes cols: String* (columns of the df) as optional args.

The .summary() function takes statistics: String* (count, mean, stddev, etc.) as optional args.

Example:

scala> val df_des = Seq((1,"a"),(2,"b"),(3,"c")).toDF("id","name")
scala> df_des.describe().show(false) //without args
//Result:
//+-------+---+----+
//|summary|id |name|
//+-------+---+----+
//|count  |3  |3   |
//|mean   |2.0|null|
//|stddev |1.0|null|
//|min    |1  |a   |
//|max    |3  |c   |
//+-------+---+----+
scala> df_des.summary().show(false) //without args
//+-------+---+----+
//|summary|id |name|
//+-------+---+----+
//|count  |3  |3   |
//|mean   |2.0|null|
//|stddev |1.0|null|
//|min    |1  |a   |
//|25%    |1  |null|
//|50%    |2  |null|
//|75%    |3  |null|
//|max    |3  |c   |
//+-------+---+----+
scala> df_des.describe("id").show(false) //describe on id column only
//+-------+---+
//|summary|id |
//+-------+---+
//|count  |3  |
//|mean   |2.0|
//|stddev |1.0|
//|min    |1  |
//|max    |3  |
//+-------+---+
scala> df_des.summary("count").show(false) //get count summary only
//+-------+---+----+
//|summary|id |name|
//+-------+---+----+
//|count  |3  |3   |
//+-------+---+----+
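
The same distinction holds in PySpark, if you prefer to stay in Python; a quick sketch assuming a running SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_des = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "name"])

df_des.describe().show()        # count, mean, stddev, min, max
df_des.summary().show()         # adds the 25%/50%/75% percentiles
df_des.summary("count").show()  # restrict to specific statistics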

