[Download the Nobel Prize dataset from Kaggle](https://www.kaggle.com/datasets/imdevskp/nobel-prize)
import pandas as pd
import numpy as np
# Use the full option name; the short 'max_columns' is ambiguous in recent pandas versions
pd.set_option('display.max_columns', None)
df = pd.read_csv(r'F:\curso_python_uruguay\datos\complete.csv')
df = df[['awardYear', 'category', 'prizeAmount', 'prizeAmountAdjusted', 'name', 'gender', 'birth_continent']]
print(df.shape)
df.head(2)
Splitting the Original Object into Groups
grouped = df.groupby('category')
grouped.groups
print(grouped)
dir(grouped)
To briefly inspect the resulting GroupBy object and check exactly how the groups were split, we can use its groups or indices attribute. Both return a dictionary whose keys are the group names. The values are the row labels (for the groups attribute) or the integer row positions (for the indices attribute) of the rows of each group in the original DataFrame:
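As a minimal sketch of the difference, the toy DataFrame below (hypothetical data, not the Nobel dataset) uses a string index so that labels and positions differ. The names `toy`, `group_labels` and `group_positions` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with a string index so labels differ from positions
toy = pd.DataFrame(
    {"category": ["Physics", "Peace", "Physics", "Peace"],
     "prizeAmount": [100, 200, 300, 400]},
    index=["a", "b", "c", "d"],
)
toy_grouped = toy.groupby("category")

# groups: key -> Index of row labels
group_labels = {key: list(labels) for key, labels in toy_grouped.groups.items()}
# indices: key -> array of integer row positions
group_positions = {key: pos.tolist() for key, pos in toy_grouped.indices.items()}

print(group_labels)
print(group_positions)
```

With the default integer index used in this notebook, labels and positions coincide, which is why the two attributes look alike here.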
# How many groups are there?
grouped.ngroups
# We can inspect what we have
for name, entries in grouped:
    print(f'First two prizes in category: "{name}"')
    print(30*'-')
    print(entries.head(2), '\n\n')
grouped.indices
The indices are used to look up which observations (rows) correspond to each group.
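A short sketch of that lookup, on hypothetical toy data: the positions stored in `indices` can be passed to `iloc`, which selects the same rows that `get_group` returns for the same key.

```python
import pandas as pd

# Hypothetical toy data, not the Nobel dataset
toy = pd.DataFrame({
    "category": ["Physics", "Peace", "Physics", "Peace", "Chemistry"],
    "prizeAmount": [100, 200, 300, 400, 500],
})
toy_grouped = toy.groupby("category")

# Integer positions of the rows that belong to the "Physics" group
positions = toy_grouped.indices["Physics"]

rows_by_position = toy.iloc[positions]           # manual lookup by position
rows_by_key = toy_grouped.get_group("Physics")   # built-in lookup by key

print(rows_by_position)
```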
pd.options.display.float_format = '{:.2f}'.format
grouped["prizeAmount"].mean()
The agg method works on the subgroups created by groupby and returns one output per subgroup.
grouped["prizeAmount"].agg(lambda x: x.mean())
def media(x):
    return x.mean()
grouped["prizeAmount"].agg(media)
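As a sketch of that behaviour on hypothetical toy data, the result of `agg` has one entry per group no matter how many rows each group contains. Passing a list of function names returns one column per function.

```python
import pandas as pd

# Hypothetical toy data, not the Nobel dataset
toy = pd.DataFrame({
    "category": ["Physics", "Peace", "Physics", "Peace", "Physics"],
    "prizeAmount": [100.0, 200.0, 300.0, 400.0, 500.0],
})
toy_grouped = toy.groupby("category")

# A single aggregation: one value per group
group_means = toy_grouped["prizeAmount"].agg("mean")

# Several aggregations at once: one row per group, one column per function
summary = toy_grouped["prizeAmount"].agg(["mean", "max", "count"])

print(group_means)
print(summary)
```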
Note that transform also works at the level of the subgroups created by groupby, but it returns a value for every observation in the original DataFrame.
df["premio_estandarizado_cat"] = grouped["prizeAmount"].transform(lambda x: (x - x.mean()) / x.std())
df.columns
df.head(2)
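# Same standardization as the lambda above, written as a named function
# (this redefines the earlier media, which returned the plain mean)
def media(x):
    return (x - x.mean())/x.std()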
df["premio_estandarizado_cat"] = grouped["prizeAmount"].transform(media)
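To contrast the two methods on hypothetical toy data: `agg` collapses each group to a single value, while `transform` broadcasts the group result back to every row, so its output aligns with the original index and can be assigned directly as a new column. The column name `deviation_from_group_mean` is illustrative only.

```python
import pandas as pd

# Hypothetical toy data, not the Nobel dataset
toy = pd.DataFrame({
    "category": ["Physics", "Peace", "Physics", "Peace"],
    "prizeAmount": [100.0, 200.0, 300.0, 400.0],
})
toy_grouped = toy.groupby("category")

per_group = toy_grouped["prizeAmount"].agg("mean")       # one value per group
per_row = toy_grouped["prizeAmount"].transform("mean")   # one value per row

# Because per_row shares the index of toy, row-wise arithmetic just works
toy["deviation_from_group_mean"] = toy["prizeAmount"] - per_row

print(per_group)
print(toy)
```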
Example:
With transformation methods, we can also replace missing data with the group mean, median, mode, or any other value:
grouped['gender'].transform(lambda x: x.fillna(0))
grouped['gender'].transform(lambda x: x.fillna(x.mode()[0]))
Look at what the lambda function contains:
x.fillna(x.mode()[0])
The trailing [0] is not very clear. Let's try two things. First:
df['gender'].mode()
Second:
df['gender'].mode()[0]
Compare the output of each.
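To make the difference concrete, here is a small sketch on a hypothetical Series. `mode()` always returns a Series, because several values can tie for most frequent, so `[0]` is what selects a single scalar suitable for `fillna`.

```python
import pandas as pd

# Hypothetical values, not the Nobel dataset
genders = pd.Series(["male", "female", "male", None, "male"])

all_modes = genders.mode()       # a Series, even when there is a single mode
first_mode = genders.mode()[0]   # a scalar: the first mode

# With ties, mode() returns every tied value in sorted order
tied = pd.Series(["a", "b", "a", "b"])
tied_modes = tied.mode()

# fillna needs the scalar, not the Series of modes
filled = genders.fillna(first_mode)

print(all_modes)
print(first_mode)
print(tied_modes.tolist())
print(filled.tolist())
```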
def miss_fill(x):
    m = x.mode()[0]
    return x.fillna(m)
grouped['gender'].transform(miss_fill)
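The same pattern works for numeric columns. As a sketch on hypothetical toy data, missing amounts are replaced with the median of their own category rather than with a single global value. The column name `prizeAmount_filled` is illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing amounts, not the Nobel dataset
toy = pd.DataFrame({
    "category": ["Physics", "Physics", "Physics", "Peace", "Peace"],
    "prizeAmount": [100.0, np.nan, 300.0, 50.0, np.nan],
})

# Each group's NaNs are filled with that group's own median
toy["prizeAmount_filled"] = (
    toy.groupby("category")["prizeAmount"]
       .transform(lambda x: x.fillna(x.median()))
)

print(toy)
```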
Filtration
Filtration methods discard the groups or particular rows from each group based on a predefined condition and return a subset of the original data.
grouped['prizeAmountAdjusted'].filter(lambda x: len(x) < 100)
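As a sketch of how the condition is applied, the hypothetical toy data below keeps only the rows that belong to groups with fewer than three entries. The function receives each whole group and must return a single boolean; groups for which it returns False are dropped entirely.

```python
import pandas as pd

# Hypothetical toy data, not the Nobel dataset
toy = pd.DataFrame({
    "category": ["Physics", "Physics", "Physics", "Peace", "Peace", "Chemistry"],
    "prizeAmountAdjusted": [10, 20, 30, 40, 50, 60],
})
toy_grouped = toy.groupby("category")

# Series version: the surviving values keep their original index
small_values = toy_grouped["prizeAmountAdjusted"].filter(lambda x: len(x) < 3)

# DataFrame version: the full surviving rows are returned
small_rows = toy_grouped.filter(lambda g: len(g) < 3)

print(small_values)
print(small_rows)
```

Unlike `agg` and `transform`, the result is neither one value per group nor one value per original row: it is a subset of the original rows.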