Pandas: Faster way than rollforward?


measure everything

I am preparing some data for cohort analysis. The information I have is similar to a fake dataset that can be generated with the following code:

import random
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# prepare some fake data to build frames
subscription_prices = [x - 0.05 for x in range(100, 500, 25)]
companies = ['initech','ingen','weyland','tyrell']
starting_periods = ['2014-12-10','2015-1-15','2014-11-20','2015-2-9']

# use the lists above to create a fake dataset
pieces = []
for company, period in zip(companies,starting_periods):
    data = {
        'company': company,
        'revenue': random.choice(subscription_prices),
        'invoice_date': pd.date_range(period,periods=12,freq='31D')
    }
    frame = DataFrame(data)
    pieces.append(frame)
df = pd.concat(pieces, ignore_index=True)

I need to normalize invoice_date to monthly periods. For a number of reasons, it is best to move all invoice_date values to the end of the month. I use this method:

from pandas.tseries.offsets import MonthEnd
df['rev_period'] = df['invoice_date'].apply(lambda x: MonthEnd(normalize=True).rollforward(x))

However, even with just a million rows (this is the size of my actual dataset) this gets very slow:

In [11]: %time df['invoice_date'].apply(lambda x: MonthEnd(normalize=True).rollforward(x))
CPU times: user 3min 11s, sys: 1.44 s, total: 3min 12s
Wall time: 3min 17s

The important thing about this method of date offsetting with pandas is that if an invoice_date happens to fall on the last day of the month, the date is preserved as the last day of that month. Another benefit is that it keeps the dtype as datetime, whereas df['invoice_date'].apply(lambda x: x.strftime('%Y-%m')) is faster but converts the values to str.

Is there a vectorized way to do this? I tried MonthEnd(normalize=True).rollforward(df['invoice_date']) but got the error TypeError: Cannot convert input to Timestamp.

vitamins

Yes, there is:

df['rev_period'] = df['invoice_date'] + pd.offsets.MonthEnd(0)

Should be at least an order of magnitude faster.
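
To check that the two approaches agree, here is a minimal sketch on a small made-up Series (the sample dates and the names dates, slow and fast are assumptions for illustration, not part of the original data). It only covers timestamps at midnight, as in the question's fake dataset. If your real dates carry a time of day, calling .dt.normalize() on the column first is one option, but test that against your own data.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical sample: two mid-month dates and two dates already on a month end
dates = pd.Series(pd.to_datetime(
    ['2014-12-10', '2015-01-31', '2015-02-09', '2015-02-28']
))

# Row-by-row version from the question (slow on large frames)
slow = dates.apply(lambda x: MonthEnd(normalize=True).rollforward(x))

# Vectorized version from the answer: MonthEnd(0) rolls forward to the
# month end and leaves dates that are already on a month end unchanged
fast = dates + MonthEnd(0)

print(fast)
print((fast == slow).all())
```

On your own frame, wrapping each version in %timeit (as with %time in the question) should show how large the difference is for your data size.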
