注册 | 登录

解决python - "Large data" work flows using pandas

itPublisher 分享于 2017-03-13

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for it's out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.

One day I hope to replace my use of SAS with python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory but small enough to fit on a hard-drive.

My first thought is to use HDFStore to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier to use alternative. My question is this:

What are some best-practice workflows for accomplishing the following:

  1. Loading flat files into a permanent, on-disk database structure
  2. Querying that database to retrieve data to feed into a pandas data structure
  3. Updating the database after manipulating pieces in pandas

Real-world examples would be much appreciated, especially from anyone who uses pandas on "large data".

Edit -- an example of how I would like this to work:

  1. Iteratively import a large flat-file and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.
  2. In order to use Pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.
  3. I would create new columns by performing various operations on the selected columns.
  4. I would then have to append these new columns into the database structure.

I am trying to find a best-practice way of performing these steps. Reading links about pandas and pytables it seems that appending a new column could be a problem.

Edit -- Responding to Jeff's questions specifically:

  1. I am building consumer credit risk models. The kinds of data include phone, SSN and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc... The datasets I use every day have nearly 1,000 to 2,000 fields on average of mixed data types: continuous, nominal and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.
  2. Typical operations involve combining several columns using conditional logic into a new, compound column. For example, if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'. The result of these operations is a new column for every record in my dataset.
  3. Finally, I would like to append these new columns into the on-disk data structure. I would repeat step 2, exploring the data with crosstabs and descriptive statistics trying to find interesting, intuitive relationships to model.
  4. A typical project file is usually about 1GB. Files are organized into such a manner where a row consists of a record of consumer data. Each row has the same number of columns for every record. This will always be the case.
  5. It's pretty rare that I would subset by rows when creating a new column. However, it's pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say Retail credit cards. To do this, I would select only those records where the line of business = retail in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations.
  6. The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. The columns that I explore are usually done in small sets. For example, I will focus on a set of say 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I'm doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.

It is rare that I would ever add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).

python mongodb pandas large-data hdf5
  this question
edited May 23 '14 at 11:40 Benjamin 12.3k 16 97 184 asked Jan 10 '13 at 16:20 Zelazny7 12.8k 9 37 55 1   Is the ratio core size / full size 1 %, 10 % ? Does it matter -- if you could compress cols to int8, or filter out noisy rows, would that change your compute-think loop from say hours to minutes ? (Also add tag large-data.) –  denis Mar 18 '13 at 11:26      Storing float32 instead of float64, and int8 where possible, should be trivial (don't know what tools / functions do float64 internally though) –  denis Mar 18 '13 at 11:59 33   Really appreciate all the specific details of your use case. –  Stian Håklev Oct 2 '13 at 0:28 3   Scroll down for a new tool: Dask. –  Private Jul 14 '16 at 9:47


10 Answers

解决方法 +50

I routinely use tens of gigabytes of data in just this fashion e.g. I have tables on disk that I read via queries, create data and append back.

It's worth reading the docs and late in this thread for several suggestions for how to store your data.

Details which will affect how you store your data, like:
Give as much detail as you can; and I can help you develop a structure.

  1. Size of data, # of rows, columns, types of columns; are you appending rows, or just columns?
  2. What will typical operations look like. E.g. do a query on columns to select a bunch of rows and specific columns, then do an operation (in-memory), create new columns, save these.
    (Giving a toy example could enable us to offer more specific recommendations.)
  3. After that processing, then what do you do? Is step 2 ad hoc, or repeatable?
  4. Input flat files: how many, rough total size in Gb. How are these organized e.g. by records? Does each one contains different fields, or do they have some records per file with all of the fields in each file?
  5. Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5)? and then do something, or do you just select fields A, B, C with all of the records (and then do something)?
  6. Do you 'work on' all of your columns (in groups), or are there a good proportion that you may only use for reports (e.g. you want to keep the data around, but don't need to pull in that column explicity until final results time)?


Ensure you have pandas at least 0.10.1 installed.

Read iterating files chunk-by-chunk and multiple table queries.

Since pytables is optimized to operate on row-wise (which is what you query on), we will create a table for each group of fields. This way it's easy to select a small group of fields (which will work with a big table, but it's more efficient to do it this way... I think I may be able to fix this limitation in the future... this is more intuitive anyhow):
(The following is pseudocode.)

import numpy as np
import pandas as pd

# create a store
store = pd.HDFStore('mystore.h5')

# this is the key to your storage:
#    this maps your fields to a specific group, and defines 
#    what you want to have as data_columns.
#    you might want to create a nice class wrapping this
#    (as you will want to have this map and its inversion)  
group_map = dict(
    A = dict(fields = ['field_1','field_2',.....], dc = ['field_1',....,'field_5']),
    B = dict(fields = ['field_10',......        ], dc = ['field_10']),
    REPORTING_ONLY = dict(fields = ['field_1000','field_1001',...], dc = []),


group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([ (f,g) for f in v['fields'] ]))

Reading in the files and creating the storage (essentially doing what append_to_multiple does):

for f in files:
   # read in the file, additional options hmay be necessary here
   # the chunksize is not strictly necessary, you may be able to slurp each 
   # file into memory in which case just eliminate this part of the loop 
   # (you can also change chunksize if necessary)
   for chunk in pd.read_table(f, chunksize=50000):
       # we are going to append to each table by group
       # we are not going to create indexes at this time
       # but we *ARE* going to create (some) data_columns

       # figure out the field groupings
       for g, v in group_map.items():
             # create the frame for this group
             frame = chunk.reindex(columns = v['fields'], copy = False)    

             # append it
             store.append(g, frame, index=False, data_columns = v['dc'])

Now you have all of the tables in the file (actually you could store them in separate files if you wish, you would prob have to add the filename to the group_map, but probably this isn't necessary).

This is how you get columns and create new ones:

frame =
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
#     select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows

# do calculations on this frame
new_frame = cool_function_on_frame(frame)

# to 'add columns', create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones
# (e.g. so you don't overlap from the other tables)
# add this info to the group_map
store.append(new_group, new_frame.reindex(columns = new_columns_created, copy = False), data_columns = new_columns_created)

When you are ready for post_processing:

# This may be a bit tricky; and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple([groups_1,groups_2,.....], where =['field_1>0', 'field_1000=foo'], selector = group_1)

About data_columns, you don't actually need to define ANY data_columns; they allow you to sub-select rows based on the column. E.g. something like:, where = ['field_1000=foo', 'field_1001>0'])

They may be most interesting to you in the final report generation stage (essentially a data column is segregated from other columns, which might impact efficiency somewhat if you define a lot).

You also might want to:

  • create a function which takes a list of fields, looks up the groups in the groups_map, then selects these and concatenates the results so you get the resulting frame (this is essentially what select_as_multiple does). This way the structure would be pretty transparent to you.
  • indexes on certain data columns (makes row-subsetting much faster).
  • enable compression.

Let me know when you have questions!

  this answer
edited Aug 18 '16 at 1:51 Cogito 77 5 answered Jan 10 '13 at 22:57 Jeff 52.8k 5 91 97 2   Thanks for the links. The second link makes me a bit worried that I can't append new columns to the tables in HDFStore? Is that correct? Also, I added an example of how I would use this setup. –  Zelazny7 Jan 10 '13 at 23:53 2   The actual structure in the hdf is up to you. Pytables is row oriented, with fixed columns at creation time. You cannot append columns once a table is created. However, you can create a new table indexed the same as your existing table. (see the select_as_multiple examples in the docs). This way you can create arbitrary sized objects while having pretty efficient queries. The way you use the data is key to how it should be organized on-disk. Send me an off-list e-mail with pseudo code of a more specific example. –  Jeff Jan 11 '13 at 0:27      I have updated my question to respond to your detailed points. I will work on an example to send you off-list. Thanks! –  Zelazny7 Jan 11 '13 at 4:29      I have added a couple of more questions about your ata in my answer –  Jeff Jan 11 '13 at 12:58 4   @Jeff, with Pandas being at 0.17.x now have the issues outlined above been resolved in Pandas? –  toasteez Jan 13 '16 at 9:33  |  show more comments

I think the answers above are missing a simple approach that I've found very useful.

When I have a file that is too large to load in memory, I break up the file into multiple smaller files (either by row or cols)

Example: In case of 30 days worth of trading data of ~30GB size, I break it into a file per day of ~1GB size. I subsequently process each file separately and aggregate results at the end

One of the biggest advantages is that it allows parallel processing of the files (either multiple threads or processes)

The other advantage is that file manipulation (like adding/removing dates in the example) can be accomplished by regular shell commands, which is not be possible in more advanced/complicated file formats

This approach doesn't cover all scenarios, but is very useful in a lot of them

  this answer
edited Dec 23 '13 at 15:21 answered Dec 19 '13 at 19:46 user1827356 2,972 9 19 3   Agreed. With all the hype, it's easy to forget that command-line tools can be 235x faster than a Hadoop cluster –  zelusp Oct 6 '16 at 21:01


This is the case for pymongo. I have also prototyped using sql server, sqlite, HDF, ORM (SQLAlchemy) in python. First and foremost pymongo is a document based DB, so each person would be a document (dict of attributes). Many people form a collection and you can have many collections (people, stock market, income).

pd.dateframe -> pymongo Note: I use the chunksize in read_csv to keep it to 5 to 10k records(pymongo drops the socket if larger)

aCollection.insert((a[1].to_dict() for a in df.iterrows()))

querying: gt = greater than...

pd.DataFrame(list(mongoCollection.find({'anAttribute':{'$gt':2887000, '$lt':2889000}})))

.find() returns an iterator so I commonly use ichunked to chop into smaller iterators.

How about a join since I normally get 10 data sources to paste together:

aJoinDF = pandas.DataFrame(list(mongoCollection.find({'anAttribute':{'$in':Att_Keys}})))

then (in my case sometimes I have to agg on aJoinDF first before its "mergeable".)

df = pandas.merge(df, aJoinDF, on=aKey, how='left')

And you can then write the new info to your main collection via the update method below. (logical collection vs physical datasources).


On smaller lookups, just denormalize. For example, you have code in the document and you just add the field code text and do a dict lookup as you create documents.

Now you have a nice dataset based around a person, you can unleash your logic on each case and make more attributes. Finally you can read into pandas your 3 to memory max key indicators and do pivots/agg/data exploration. This works for me for 3 million records with numbers/big text/categories/codes/floats/...

You can also use the two methods built into MongoDB (MapReduce and aggregate framework). See here for more info about the aggregate framework, as it seems to be easier than MapReduce and looks handy for quick aggregate work. Notice I didn't need to define my fields or relations, and I can add items to a document. At the current state of the rapidly changing numpy, pandas, python toolset, MongoDB helps me just get to work :)

  this answer
edited Aug 14 '14 at 13:50 ali_m 30.6k 6 74 127 answered Jan 11 '13 at 22:11 brian_the_bungler 493 5 10      Hi, I'm playing around with your example as well and I run into this error when trying to insert into a database: In [96]: test.insert((a[1].to_dict() for a in df.iterrows())) --------------- InvalidDocument: Cannot encode object: 0. Any ideas what might be wrong? My dataframe consists of all int64 dtypes and is very simple. –  Zelazny7 Jan 16 '13 at 2:53 2   Yeah i did the same for a simple range DF and the int64 from numpy seems to bother pymongo. All the data I have played with converts from CSV (vs artificially via range()) and has types long and hence no issues. In numpy you can convert but I do see that as detracting. I must admit the 10.1 items for HDF look exciting. –  brian_the_bungler Jan 17 '13 at 22:01


I know this is an old thread but I think the Blaze library is worth checking out. It's built for these types of situations.

From the docs:

Blaze extends the usability of NumPy and Pandas to distributed and out-of-core computing. Blaze provides an interface similar to that of the NumPy ND-Array or Pandas DataFrame but maps these familiar interfaces onto a variety of other computational engines like Postgres or Spark.

Edit: By the way, it's supported by ContinuumIO and Travis Oliphant, author of NumPy.

  this answer
answered Dec 3 '14 at 22:09 chishaku 2,035 1 10 24      Another library that might be worth looking at is GraphLab Create: It has an efficient DataFrame-like structure that is not limited by memory capacity.… –  waterproof Apr 22 '15 at 16:17


If your datasets are between 1 and 20GB, you should get a workstation with 48GB of RAM. Then Pandas can hold the entire dataset in RAM. I know its not the answer you're looking for here, but doing scientific computing on a notebook with 4GB of RAM isn't reasonable.

  this answer
answered Nov 2 '13 at 7:14 rjurney 1,930 2 20 40 2   "doing scientific computing on a notebook with 4GB of RAM isn't reasonable" Define reasonable. I think UNIVAC would take a different view.… –  grisaitis Aug 26 '15 at 14:42 11   Turns out performance is relative for time and place ;) –  rjurney Sep 10 '15 at 19:12


I spotted this a little late, but I work with a similar problem (mortgage prepayment models). My solution has been to skip the pandas HDFStore layer and use straight pytables. I save each column as an individual HDF5 array in my final file.

My basic workflow is to first get a CSV file from the database. I gzip it, so it's not as huge. Then I convert that to a row-oriented HDF5 file, by iterating over it in python, converting each row to a real data type, and writing it to a HDF5 file. That takes some tens of minutes, but it doesn't use any memory, since it's only operating row-by-row. Then I "transpose" the row-oriented HDF5 file into a column-oriented HDF5 file.

The table transpose looks like:

def transpose_table(h_in, table_path, h_out, group_name="data", group_path="/"):
    # Get a reference to the input data.
    tb = h_in.getNode(table_path)
    # Create the output group to hold the columns.
    grp = h_out.createGroup(group_path, group_name, filters=tables.Filters(complevel=1))
    for col_name in tb.colnames:
        logger.debug("Processing %s", col_name)
        # Get the data.
        col_data = tb.col(col_name)
        # Create the output array.
        arr = h_out.createCArray(grp,
        # Store the data.
        arr[:] = col_data

Reading it back in then looks like:

def read_hdf5(hdf5_path, group_path="/data", columns=None):
    """Read a transposed data set from a HDF5 file."""
    if isinstance(hdf5_path, tables.file.File):
        hf = hdf5_path
        hf = tables.openFile(hdf5_path)

    grp = hf.getNode(group_path)
    if columns is None:
        data = [(, child[:]) for child in grp]
        data = [(, child[:]) for child in grp if in columns]

    # Convert any float32 columns to float64 for processing.
    for i in range(len(data)):
        name, vec = data[i]
        if vec.dtype == np.float32:
            data[i] = (name, vec.astype(np.float64))

    if not isinstance(hdf5_path, tables.file.File):
    return pd.DataFrame.from_items(data)

Now, I generally run this on a machine with a ton of memory, so I may not be careful enough with my memory usage. For example, by default the load operation reads the whole data set.

This generally works for me, but it's a bit clunky, and I can't use the fancy pytables magic.

Edit: The real advantage of this approach, over the array-of-records pytables default, is that I can then load the data into R using h5r, which can't handle tables. Or, at least, I've been unable to get it to load heterogeneous tables.

  this answer
edited Mar 22 '13 at 15:38 answered Mar 21 '13 at 21:19 Johann Hibschman 1,112 9 14      would you mind sharing with me some of your code? I am interested in how you load the data from some flat text format without knowing the data types before pushing to pytables. Also, it looks like you only work with data of one type. Is that correct? –  Zelazny7 Mar 27 '13 at 2:01 1   First of all, I assume I know the types of the columns before loading, rather than trying to guess from the data. I save a JSON "data spec" file with the column names and types and use that when processing the data. (The file is usually some awful BCP output without any labels.) The data types I use are strings, floats, integers, or monthly dates. I turn the strings into ints by saving an enumeration table and convert the dates into ints (months past 2000), so I'm just left with ints and floats in my data, plus the enumeration. I save the floats as float64 now, but I experimented with float32. –  Johann Hibschman Mar 28 '13 at 16:34 1   if you have time, pls give this a try for external compat with R:…, and if you have difficulty, maybe we can tweak it –  Jeff Apr 1 '13 at 16:50      I'll try, if I can. rhdf5 is a pain, since it's a bioconductor package, rather than just being on CRAN like h5r. I'm at the mercy of our technical architecture team, and there was some issue with rhdf5 last time I asked for it. In any case, it just seems a mistake to go row-oriented rather than column-oriented with an OLAP store, but now I'm rambling. –  Johann Hibschman Apr 1 '13 at 20:09


There is now, two years after the question, an 'out-of-core' pandas equivalent: dask. It is excellent! Though it does not support all of pandas functionality, you can get really far with it.

  this answer
answered Mar 23 '16 at 20:30 Private 805 9 19 1   and for a fully worked out example with dask, just have a look here… –  Noobie Feb 9 at 15:58 1   This answer would strongly benefit from a code example. –  Ninjakannon Feb 21 at 17:29


One more variation

Many of the operations done in pandas can also be done as a db query (sql, mongo)

Using a RDBMS or mongodb allows you to perform some of the aggregations in the DB Query (which is optimized for large data, and uses cache and indexes efficiently)

Later, you can perform post processing using pandas.

The advantage of this method is that you gain the DB optimizations for working with large data, while still defining the logic in a high level declarative syntax - and not having to deal with the details of deciding what to do in memory and what to do out of core.

And although the query language and pandas are different, it's usually not complicated to translate part of the logic from one to another.

  this answer
answered Apr 28 '15 at 5:22 Ophir Yoktan 2,510 20 50


Consider Ruffus if you go the simple path of creating a data pipeline which is broken down into multiple smaller files.

  this answer
answered Oct 9 '14 at 19:07 Golf Monkey 65 1 5


I recently came across a similar issue. I found simply reading the data in chunks and appending it as I write it in chunks to the same csv works well. My problem was adding a date column based on information in another table, using the value of certain columns as follows. This may help those confused by dask and hdf5 but more familiar with pandas like myself.

def addDateColumn():
"""Adds time to the daily rainfall data. Reads the csv as chunks of 100k 
   rows at a time and outputs them, appending as needed, to a single csv. 
   Uses the column of the raster names to get the date.
    df = pd.read_csv(pathlist[1]+"CHIRPS_tanz.csv", iterator=True, 
                     chunksize=100000) #read csv file as 100k chunks

    '''Do some stuff'''

    count = 1 #for indexing item in time list 
    for chunk in df: #for each 100k rows
        newtime = [] #empty list to append repeating times for different rows
        toiterate = chunk[chunk.columns[2]] #ID of raster nums to base time
        while count <= toiterate.max():
            for i in toiterate: 
                if i ==count:
        print "Finished", str(chunknum), "chunks"
        chunk["time"] = newtime #create new column in dataframe based on time
        outname = "CHIRPS_tanz_time2.csv"
        #append each output to same csv, using no header
        chunk.to_csv(pathlist[2]+outname, mode='a', header=None, index=None)

  this answer
answered Oct 4 '16 at 15:32 timpjohns 154 1 8


protected by chridam Dec 24 '15 at 12:01

Thank you for your interest in this question. Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).

Would you like to answer one of these unanswered questions instead?

Not the answer you're looking for? Browse other questions tagged python mongodb pandas large-data hdf5 or ask your own question.










您的注册邮箱: 修改

重新发送激活邮件 进入我的邮箱