Creating Dataframe from a Dictionary Where Entries Have Different Lengths

In Python 3.x:

import pandas as pd
import numpy as np

d = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )

pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in d.items() ]))

Out[7]:
     A  B
0    1  1
1    2  2
2  NaN  3
3  NaN  4

In Python 2.x:

Replace d.items() with d.iteritems().
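
The padding works because each pd.Series carries its own index, and the DataFrame constructor aligns the columns on the union of those indexes, filling the gaps with NaN. A minimal sketch of the same idea, written with a dict comprehension instead of dict([...]); the two counts at the end are only there for illustration:

```python
import numpy as np
import pandas as pd

d = {"A": np.array([1, 2]), "B": np.array([1, 2, 3, 4])}

# Wrap every value in a Series; the constructor aligns on the index
# and fills positions missing from the shorter column with NaN.
df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})

n_rows = len(df)                       # length of the longest entry
n_missing = int(df["A"].isna().sum())  # rows padded in the shorter column
print(n_rows, n_missing)
```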

Creating a Pandas DataFrame from a Dictionary of Different-Length Lists of Tuples

You are on the right track. Just further flatten the tupled values so the resulting table can be pivoted easily.

Code

dic is the given dict data.

df = pd.DataFrame(
    [[k, v[0], v[1]] for k, ls_v in dic.items() for v in ls_v],
    columns=["Person", "GUID", "value"]
).pivot(index="GUID", columns="Person")

# drop hierarchical level of "value"
df.columns = df.columns.droplevel(0)

Result

print(df)

Person  Person A  Person B  Person C
GUID
abc123       1.0       4.0       2.0
bcc222       2.0       NaN       NaN
icy558       NaN       NaN       7.0
igh643       1.0       NaN       NaN
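
The answer does not show dic itself, so here is a self-contained sketch for trying it out. The input below is an assumption, reconstructed from the printed result, and the names rows and long_df are illustrative. It also passes values="value" to pivot, which is a small variation on the code above: the columns come out flat, so the droplevel step is not needed.

```python
import pandas as pd

# Hypothetical input, reconstructed from the printed result above.
dic = {
    "Person A": [("abc123", 1), ("bcc222", 2), ("igh643", 1)],
    "Person B": [("abc123", 4)],
    "Person C": [("abc123", 2), ("icy558", 7)],
}

# Flatten to one (Person, GUID, value) row per tuple.
rows = [(person, guid, value)
        for person, pairs in dic.items()
        for guid, value in pairs]
long_df = pd.DataFrame(rows, columns=["Person", "GUID", "value"])

# Passing values= yields flat column labels, so no droplevel is needed.
df = long_df.pivot(index="GUID", columns="Person", values="value")
print(df)
```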

How to create a DataFrame from a dict of unequal-length lists, truncating to a specific length?

You can slice each value of the dict in a dict comprehension so that all lists have the same length; the DataFrame constructor then works directly:

print ({k:v[:min_length] for k,v in data_dict.items()})
{'b': [1, 2, 3], 'c': [2, 45, 67], 'a': [1, 2, 3]}

df = pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()})
print (df)
   a  b   c
0  1  1   2
1  2  2  45
2  3  3  67

If some lists may be shorter than min_length, wrap each slice in a Series so the missing positions are filled with NaN:

data_dict = {'a': [1,2,3,4], 'b': [1,2], 'c': [2,45,67,93,82,92]}
min_length = 3

df = pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()})
print (df)
   a    b   c
0  1  1.0   2
1  2  2.0  45
2  3  NaN  67
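
The snippets above take min_length as given. If the goal is instead to truncate everything to the shortest list, it can be computed from the data. A minimal sketch under that assumption, reusing the data_dict from the example:

```python
import pandas as pd

data_dict = {'a': [1, 2, 3, 4], 'b': [1, 2], 'c': [2, 45, 67, 93, 82, 92]}

# Assumption: truncate to the shortest list rather than to a fixed length.
min_length = min(len(v) for v in data_dict.values())

# Every slice now has exactly min_length items, so no Series wrapper is needed.
df = pd.DataFrame({k: v[:min_length] for k, v in data_dict.items()})
print(df)
```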

Timings:

In [355]: %timeit (pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()}))
The slowest run took 5.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 520 µs per loop

In [356]: %timeit (pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()}))
The slowest run took 4.50 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 937 µs per loop

#Allen's solution
In [357]: %timeit (pd.DataFrame.from_dict(data_dict,orient='index').T.dropna())
1 loop, best of 3: 16.7 s per loop

Code for timings:

np.random.seed(123)
L = list('ABCDEFGH')
N = 500000
min_length = 10000

data_dict = {k:np.random.randint(10, size=np.random.randint(N)) for k in L}

Pandas DataFrame from Dict with Different-Length Values

Build a Series from the dict so that each list is kept as a single cell value:

pd.Series(my_dict).rename_axis('Key_Column').reset_index(name='Value_Column')

  Key_Column Value_Column
0          a    [1, 2, 3]
1          b          [0]
2          c       [3, 5]
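
For a runnable version, the sketch below defines my_dict with values inferred from the printed output (an assumption, since the answer does not show it). The explode step at the end is an optional extra, not part of the answer, in case one row per list element is wanted; it assumes a pandas version that supports ignore_index in explode.

```python
import pandas as pd

# Hypothetical input, inferred from the printed output above.
my_dict = {'a': [1, 2, 3], 'b': [0], 'c': [3, 5]}

wide = (
    pd.Series(my_dict)
    .rename_axis('Key_Column')
    .reset_index(name='Value_Column')
)

# Optional: one row per list element instead of one list per cell.
long = wide.explode('Value_Column', ignore_index=True)
print(long)
```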

Generate a DataFrame from Lists with Different Lengths

Use

In [9]: pd.DataFrame({'a': pd.Series(a), 'b': pd.Series(b)})
Out[9]:
   a    b
0  1  2.0
1  2  3.0
2  3  NaN

Or,

In [10]: pd.DataFrame.from_dict({'a': a, 'b': b}, orient='index').T
Out[10]:
     a    b
0  1.0  2.0
1  2.0  3.0
2  3.0  NaN
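
The lists a and b are not shown in the answer. The sketch below assumes the values implied by the outputs and runs both variants side by side. As the outputs above suggest, the Series route leaves the complete column as integers, while the transpose route shows it as floats.

```python
import pandas as pd

# Hypothetical inputs, inferred from Out[9] and Out[10].
a = [1, 2, 3]
b = [2, 3]

# Variant 1: wrap each list in a Series; only the short column gets NaN.
df_series = pd.DataFrame({'a': pd.Series(a), 'b': pd.Series(b)})

# Variant 2: build one row per key, then transpose back to columns.
df_transposed = pd.DataFrame.from_dict({'a': a, 'b': b}, orient='index').T

print(df_series.dtypes)
print(df_transposed.dtypes)
```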

Create a pandas dataframe from a dict of uneven length

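Here is one possible solution:

import numpy as np

d = {
    "a": [1],
    "b": 2,
    "c": [[7, 8, 9], ["a", "b", "c"], [9, 10, 11]],
    "d": None,
}

# length of the longest list; scalars and None count as length 1
max_len = max(len(l) if isinstance(l, list) else 1 for l in d.values())

for key in d.keys():
    if isinstance(d[key], list):
        # repeat each element max_len times (this yields max_len items
        # only for one-element lists)
        if len(d[key]) != max_len:
            d[key] = np.repeat(d[key], max_len).tolist()
    else:
        # wrap scalars (including None) in an array, then repeat
        d[key] = np.repeat(np.array(d[key]), max_len).tolist()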

Result:

{
'a': [1, 1, 1],
'b': [2, 2, 2],
'c': [[7, 8, 9], ['a', 'b', 'c'], [9, 10, 11]],
'd': [None, None, None]
}

But this obviously works only in the particular case where all columns but one have a single element. To solve the task in general, one must also specify how columns of different lengths should be handled: should the whole column be repeated and the rest trimmed on the last iteration, should only the first or last value be repeated, or should some other approach be used?
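
The strategies named above can be sketched as a small helper. This is only an illustration under stated assumptions: pad_column and its strategy names are hypothetical, and the right choice depends on what the data means.

```python
from itertools import cycle, islice


def pad_column(values, target_len, strategy="last"):
    """Hypothetical helper: extend (or trim) `values` to `target_len` items.

    strategy="cycle" repeats the whole column and trims the final pass,
    strategy="last"  repeats only the last value,
    strategy="first" repeats only the first value.
    """
    values = list(values)
    if not values:
        raise ValueError("cannot pad an empty column")
    if strategy == "cycle":
        return list(islice(cycle(values), target_len))
    if strategy == "last":
        filler = values[-1]
    elif strategy == "first":
        filler = values[0]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Append enough filler items, then cut to the requested length.
    return (values + [filler] * target_len)[:target_len]


cycled = pad_column([1, 2], 5, "cycle")
last = pad_column([1, 2], 5, "last")
first = pad_column([1, 2], 5, "first")
print(cycled, last, first)
```

Applying the helper to every value of the dict with the same target_len gives equal-length lists that pd.DataFrame accepts directly.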

Creating a dataframe where one of the arrays has a different length

You can loop over each element of forecast_items and use next with iter to select the first matching value; if no match exists, NaN is stored in the dictionary instead:

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")

soup = BeautifulSoup(page.content, 'html.parser')

seven_day = soup.find(id="seven-day-forecast")

forecast_items = seven_day.find_all(class_="tombstone-container")

out = []
for x in forecast_items:
    # take the first match in each container, or NaN if there is none
    periods = next(iter([t.get_text() for t in x.select('.period-name')]), np.nan)
    short_descs = next(iter([t.get_text() for t in x.select('.short-desc')]), np.nan)
    temps = next(iter([t.get_text() for t in x.select('.temp')]), np.nan)
    descs = next(iter([d['alt'] for d in x.select('img')]), np.nan)
    out.append({'period':periods, 'short_desc':short_descs, 'temp':temps, 'descs':descs})

weather = pd.DataFrame(out)
print (weather)
descs period \
0 NOW until4:00pm Sat
1 Today: Showers, with thunderstorms also possib... Today
2 Tonight: Showers likely and possibly a thunder... Tonight
3 Sunday: A chance of showers before 11am, then ... Sunday
4 Sunday Night: Rain before 11pm, then a chance ... SundayNight
5 Monday: A 40 percent chance of showers. Cloud... Monday
6 Monday Night: A 30 percent chance of showers. ... MondayNight
7 Tuesday: A 50 percent chance of rain. Cloudy,... Tuesday
8 Tuesday Night: Rain. Cloudy, with a low aroun... TuesdayNight

                               short_desc         temp
0                           Wind Advisory          NaN
1                       Showers andBreezy  High: 56 °F
2                           ShowersLikely   Low: 49 °F
3                     Heavy Rainand Windy  High: 56 °F
4  Heavy Rainand Breezythen ChanceShowers   Low: 52 °F
5                           ChanceShowers  High: 58 °F
6                           ChanceShowers   Low: 53 °F
7                             Chance Rain  High: 59 °F
8                                    Rain   Low: 53 °F
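
The next(iter(...), default) idiom from the loop can be tried without the network request. A minimal sketch, where first_or_nan is a hypothetical helper name and the plain lists stand in for the results of x.select(...):

```python
import math


def first_or_nan(items):
    """Return the first element of `items`, or NaN when it is empty."""
    # next() falls back to its second argument when the iterator is exhausted.
    return next(iter(items), math.nan)


found = first_or_nan(["High: 56 °F", "Low: 49 °F"])
missing = first_or_nan([])
print(found, missing)
```

Because every record then has a value for every key, the list of dicts converts to a DataFrame with NaN in the cells where nothing was found.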

