Pandas: Faster way than rollforward?

measure everything

I am preparing some data for cohort analysis. The information I have is similar to a fake dataset that can be generated with the following code:

import random
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# prepare some fake data to build frames
subscription_prices = [x - 0.05 for x in range(100, 500, 25)]
companies = ['initech','ingen','weyland','tyrell']
starting_periods = ['2014-12-10','2015-1-15','2014-11-20','2015-2-9']

# use the lists from above to create a fake dataset
pieces = []
for company, period in zip(companies,starting_periods):
    data = {
        'company': company,
        'revenue': random.choice(subscription_prices),
        'invoice_date': pd.date_range(period,periods=12,freq='31D')
    }
    frame = DataFrame(data)
    pieces.append(frame)
df = pd.concat(pieces, ignore_index=True)

I need to normalize the invoice dates to monthly periods. For a number of reasons, it is best to move all invoice_date values to the end of the month. I use this method:

from pandas.tseries.offsets import MonthEnd
df['rev_period'] = df['invoice_date'].apply(lambda x: MonthEnd(normalize=True).rollforward(x))

However, even with just a million rows (this is the size of my actual dataset) this gets very slow:

In [11]: %time df['invoice_date'].apply(lambda x: MonthEnd(normalize=True).rollforward(x))
CPU times: user 3min 11s, sys: 1.44 s, total: 3min 12s
Wall time: 3min 17s

The nice thing about this method of offsetting dates with pandas is that if invoice_date already falls on the last day of the month, the date stays on that last day. Another benefit is that it keeps the dtype as datetime, whereas df['invoice_date'].apply(lambda x: x.strftime('%Y-%m')) is faster but converts the values to str.

Is there a vectorized way to do this? I tried MonthEnd(normalize=True).rollforward(df['invoice_date']) but got the error TypeError: Cannot convert input to Timestamp.

vitamins

Yes, there is:

df['rev_period'] = df['invoice_date'] + pd.offsets.MonthEnd(0)

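MonthEnd(0) rolls each date forward to the end of its month and leaves dates that are already on a month end unchanged. This should be at least an order of magnitude faster than the apply version.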
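
As a quick sanity check, here is a minimal sketch comparing the row-by-row rollforward from the question with the vectorized addition. The four sample dates are made up for illustration (two of them already sit on a month end), and the variable names slow, fast and matches are just placeholders.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical sample: mid-month dates plus two that already sit on a month end.
dates = pd.Series(pd.to_datetime(
    ['2014-12-10', '2015-01-31', '2015-02-09', '2016-02-29']
))

# Row-by-row version from the question.
slow = dates.apply(lambda x: MonthEnd(normalize=True).rollforward(x))

# Vectorized version: MonthEnd(0) leaves month-end dates in place.
fast = dates + MonthEnd(0)

# Both approaches should agree element by element.
matches = (slow == fast).all()
print(fast.dt.strftime('%Y-%m-%d').tolist())
print(matches)
```

One caveat: the sample dates here are all at midnight. If your timestamps carry a time of day, adding MonthEnd(0) keeps that time, so you would probably want to call .dt.normalize() on the result to get the same midnight values that normalize=True produces.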
