python - Pandas to_csv() slow saving large dataframe


I'm guessing this is an easy fix, but I'm running into an issue where it's taking nearly an hour to save a pandas dataframe to a csv file using the to_csv() function. I'm using Anaconda Python 2.7.12 with pandas (0.19.1).

    import os
    import glob
    import pandas as pd

    src_files = glob.glob(os.path.join('/my/path', "*.csv.gz"))

    # 1 - takes 2 min to read 20m records from 30 files
    for file_ in sorted(src_files):
        stage = pd.DataFrame()
        iter_csv = pd.read_csv(file_
                         , sep=','
                         , index_col=False
                         , header=0
                         , low_memory=False
                         , iterator=True
                         , chunksize=100000
                         , compression='gzip'
                         , memory_map=True
                         , encoding='utf-8')

        df = pd.concat([chunk for chunk in iter_csv])
        stage = stage.append(df, ignore_index=True)

    # 2 - takes 55 min to write 20m records from 1 dataframe
    stage.to_csv('output.csv'
                 , sep='|'
                 , header=True
                 , index=False
                 , chunksize=100000
                 , encoding='utf-8')

    del stage

I've confirmed the hardware and memory are working fine, but these are fairly wide tables (~100 columns) of mostly numeric (decimal) data.

Thank you.

You are reading compressed files and writing a plaintext file. The IO is the bottleneck.

Writing a compressed file can speed up the write by as much as 10x:

    stage.to_csv('output.csv.gz'
                 , sep='|'
                 , header=True
                 , index=False
                 , chunksize=100000
                 , compression='gzip'
                 , encoding='utf-8')
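
As a quick sanity check that the compressed output is still usable downstream, it can be loaded back with read_csv. This is a minimal sketch, assuming the 'output.csv.gz' file name from the snippet above and that the consumer is also pandas:

    # Read the pipe-delimited, gzip-compressed output back into pandas.
    # read_csv accepts an explicit compression argument; recent pandas
    # versions can usually also infer it from the .gz extension.
    check = pd.read_csv('output.csv.gz'
                        , sep='|'
                        , compression='gzip'
                        , encoding='utf-8')
    print(check.shape)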

Additionally, experiment with different chunk sizes and compression methods ('bz2', 'xz'), as in the sketch below.
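
A rough way to compare the options is to time to_csv over a few combinations on a slice of the data. This is only a sketch: the sample size, the chunk sizes tried, and the set of codecs available ('xz' in particular depends on your pandas/Python build) are assumptions, and `stage` is the dataframe from the question:

    import time

    sample = stage.head(1000000)  # time a slice first rather than all 20m rows

    for comp in [None, 'gzip', 'bz2']:        # add 'xz' if your build supports it
        for chunk in [10000, 100000, 500000]:
            out = 'sample.csv' + ('.' + comp if comp else '')
            start = time.time()
            sample.to_csv(out
                          , sep='|'
                          , header=True
                          , index=False
                          , chunksize=chunk
                          , compression=comp
                          , encoding='utf-8')
            print('{} {} {:.1f} sec'.format(comp, chunk, time.time() - start))

Whichever combination wins will depend on how much of the cost is CPU (compression) versus disk throughput on your machine.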

