python - Pandas to_csv() slow saving large dataframe
I'm guessing this is an easy fix, but I'm running into an issue where it's taking an hour to save a pandas dataframe to a csv file using the to_csv() function. I'm using Anaconda Python 2.7.12 with pandas (0.19.1).
import os
import glob
import pandas as pd

src_files = glob.glob(os.path.join('/my/path', "*.csv.gz"))

# 1 - takes 2 min to read 20m records from 30 files
for file_ in sorted(src_files):
    stage = pd.DataFrame()
    iter_csv = pd.read_csv(file_
                           , sep=','
                           , index_col=False
                           , header=0
                           , low_memory=False
                           , iterator=True
                           , chunksize=100000
                           , compression='gzip'
                           , memory_map=True
                           , encoding='utf-8')
    df = pd.concat([chunk for chunk in iter_csv])
    stage = stage.append(df, ignore_index=True)

# 2 - takes 55 min to write 20m records from 1 dataframe
stage.to_csv('output.csv'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , encoding='utf-8')

del stage
I've confirmed the hardware and memory are working fine; these are wide tables (~100 columns) of numeric (decimal) data.

Thank you.
You are reading compressed files and writing a plaintext file, so IO is the bottleneck.

Writing a compressed file can speed up the write by as much as 10x:
stage.to_csv('output.csv.gz'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , compression='gzip'
             , encoding='utf-8')
Additionally, you could experiment with different chunk sizes and compression methods ('bz2', 'xz').
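One way to compare the options is to time a few combinations on a small sample before committing to one. The sketch below is only an illustration: it uses a synthetic dataframe and placeholder output paths rather than the original data, and the sizes are reduced so it runs quickly.

import time
import numpy as np
import pandas as pd

# synthetic stand-in for the real 20m-row, ~100-column dataframe
stage = pd.DataFrame(np.random.rand(200000, 100))

def time_write(df, path, compression, chunksize):
    # write with the given settings and return elapsed seconds
    start = time.time()
    df.to_csv(path, sep='|', header=True, index=False,
              compression=compression, chunksize=chunksize)
    return time.time() - start

# compare plaintext output against gzip/bz2/xz, at two chunk sizes
for compression, suffix in [(None, 'csv'), ('gzip', 'csv.gz'),
                            ('bz2', 'csv.bz2'), ('xz', 'csv.xz')]:
    for chunksize in (100000, 500000):
        elapsed = time_write(stage, 'output.' + suffix, compression, chunksize)
        print(compression, chunksize, round(elapsed, 1), 'sec')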