2013-11-20 23:31:39 8 Comments
Having spent a decent amount of time watching both the r and pandas tags on SO, the impression that I get is that pandas questions are less likely to contain reproducible data. This is something that the R community has been pretty good about encouraging, and thanks to guides like this, newcomers are able to get some help on putting together these examples. People who are able to read these guides and come back with reproducible data will often have much better luck getting answers to their questions.
How can we create good reproducible examples for pandas questions? Simple dataframes can be put together, e.g.:
import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000]})
But many example datasets need more complicated structure, e.g.:
- datetime indices or data
- Multiple categorical variables (is there an equivalent to R's expand.grid() function, which produces all possible combinations of some given variables?)
- MultiIndex or Panel data
For datasets that are hard to mock up using a few lines of code, is there an equivalent to R's dput() that allows you to generate copy-pasteable code to regenerate your data structure?
5 comments
@sds 2016-12-16 17:57:32
Here is my version of dput (the standard R tool to produce reproducible reports) for pandas DataFrames. It will probably fail for more complex frames, but it seems to do the job in simple cases.
Note that this produces much more verbose output than DataFrame.to_dict for a test frame du, but it preserves column types. E.g., in the test case, du.dtypes is uint8, whereas pd.DataFrame(du.to_dict()).dtypes is int64.
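As a rough sketch of the idea only (this is not the code originally posted with this answer; the body of `dput`, the contents of `du`, and the column name `x` are all assumptions made for illustration), a dtype-preserving dump could look something like this:

```python
import numpy as np
import pandas as pd


def dput(df):
    """Return a Python expression, as a string, that rebuilds ``df``.

    Hypothetical helper: it only handles plain numeric, boolean, or string
    columns without missing values, unique column names, and a flat index.
    """
    index = df.index.tolist()
    columns = ", ".join(
        f"{name!r}: pd.Series({df[name].tolist()!r}, "
        f"index={index!r}, dtype={str(df[name].dtype)!r})"
        for name in df.columns
    )
    return f"pd.DataFrame({{{columns}}})"


# Assumed test frame with a small unsigned-integer column.
du = pd.DataFrame({"x": np.array([1, 0, 1], dtype="uint8")})

code = dput(du)
rebuilt = eval(code)                   # the generated code needs `pd` in scope
via_dict = pd.DataFrame(du.to_dict())  # round trip through plain Python ints

print(code)
print(rebuilt.dtypes["x"], via_dict.dtypes["x"])
```

The generated string is longer than the `to_dict` output because every column carries its own dtype, which is exactly what lets the round trip keep `uint8` instead of falling back to the default integer type.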
@Paul H 2017-02-23 16:03:50
I didn't downvote, but what is dput, why is it needed, and how does it answer this question? You state that it produces "much more verbose output than DataFrame.to_dict" but don't show us what that output is.
@sds 2017-02-23 16:11:25
@PaulH: I wrongly assumed that everyone who uses pandas knows R :-) Sorry. I edited the answer, is it clearer now?
@Paul H 2017-02-23 16:55:01
It is clearer, though I admit I don't see why I would want to use it over to_dict.
@sds 2017-02-23 16:57:25
Because it preserves column types. More specifically, du.equals(eval(dput(df))).
@piRSquared 2016-07-19 18:35:13
Diary of an Answerer
My best advice for asking questions would be to play on the psychology of the people who answer questions. Being one of those people, I can give insight into why I answer certain questions and why I don't answer others.
Motivations
I'm motivated to answer questions for several reasons.
All my purest intentions are great and all, but I get that satisfaction whether I answer 1 question or 30. What drives my choices for which questions to answer has a huge component of point maximization.
I'll also spend time on interesting problems, but those are few and far between and don't help an asker who needs a solution to a non-interesting question. Your best bet to get me to answer a question is to serve that question up on a platter, ripe for me to answer it with as little effort as possible. If I'm looking at two questions and one has code I can copy and paste to create all the variables I need... I'm taking that one! I'll come back to the other one if I have time, maybe.
Main Advice
Make it easy for the people answering questions.
Your reputation is more than just your reputation.
I like points (I mentioned that above). But those points aren't really my reputation. My real reputation is an amalgamation of what others on the site think of me. I strive to be fair and honest, and I hope others can see that. What that means for an asker is that we remember the behaviors of askers. If you don't select answers and upvote good answers, I remember. If you behave in ways I don't like or in ways I do like, I remember. This also plays into which questions I'll answer.
Anyway, I can probably go on, but I'll spare all of you who actually read this.
@Andy Hayden 2013-11-23 06:19:13
Note: The ideas here are pretty generic for Stack Overflow, indeed for questions in general.
Disclaimer: Writing a good question is HARD.
The Good:
- do include a small* example DataFrame, either as runnable code or made "copy and pasteable" using pd.read_clipboard(sep='\s\s+'). You can format the text for Stack Overflow by highlighting it and pressing Ctrl+K (or prepend four spaces to each line). Test pd.read_clipboard(sep='\s\s+') yourself.
  * I really do mean small: the vast majority of example DataFrames could be fewer than 6 rows [citation needed], and I bet I can do it in 5 rows. Can you reproduce the error with df = df.head()? If not, fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.
  * Every rule has an exception. The obvious one is for performance issues (in which case definitely use %timeit and possibly %prun), where you should generate the data (consider using np.random.seed so we have the exact same frame): df = pd.DataFrame(np.random.randn(100000000, 10)). Saying that, "make this code fast for me" is not strictly on topic for the site...
- do write out the outcome you desire (similarly to above). Explain where the numbers come from, e.g. "the 5 is the sum of the B column for the rows where A is 1".
- do show the code you've tried. But say what's incorrect, e.g. "the A column is in the index rather than a column".
- do show you've done some research (search the docs, search Stack Overflow), and give a summary. Aside: the answer here is to use df.groupby('A', as_index=False).sum().
- if it's relevant that you have Timestamp columns, e.g. you're resampling or something, then be explicit and apply pd.to_datetime to them for good measure**.
  ** Sometimes this is the issue itself: they were strings.
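To make the points above concrete, here is a minimal sketch of what such a question body might contain. The frame contents are made up for illustration (chosen so that 5 is the sum of B where A is 1); only the `groupby` calls come from the advice above.

```python
import pandas as pd

# Small, made-up input frame that anyone can paste and run.
df = pd.DataFrame({"A": [1, 1, 4], "B": [2, 3, 6]})

# Desired outcome, written out explicitly: B summed within each value of A.
expected = pd.DataFrame({"A": [1, 4], "B": [5, 6]})

# What was tried: the grouping key A ends up in the index, not in a column.
attempt = df.groupby("A").sum()

# The fix mentioned in the aside: keep A as an ordinary column.
result = df.groupby("A", as_index=False).sum()

print(attempt)
print(result)
```

Showing the input, the desired output, and the failed attempt side by side lets an answerer see the gap at a glance.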
The Bad:
- don't include a MultiIndex, which we can't copy and paste (see above). This is kind of a grievance with pandas' default display, but nonetheless annoying. The correct way is to include an ordinary DataFrame with a set_index call.
- do provide insight into what it is when giving the outcome you want. Be specific about how you got the numbers (what are they)... double check they're correct.
- If your code throws an error, do include the entire stacktrace (this can be edited out later if it's too noisy). Show the line number (and the corresponding line of your code which it's raising against).
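For the MultiIndex point, a minimal sketch of the "ordinary DataFrame plus set_index" approach might look like this (column names and values are invented for illustration):

```python
import pandas as pd

# Ordinary, flat frame: easy to copy, paste, and run.
df = pd.DataFrame({
    "A": [1, 1, 2],
    "B": ["x", "y", "x"],
    "C": [10, 20, 30],
})

# Build the MultiIndex explicitly instead of pasting its printed form.
df = df.set_index(["A", "B"])
print(df)
```

Readers get the exact index levels without having to reverse-engineer them from the blank cells in the printed output.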
The Ugly:
- don't link to a csv we don't have access to (ideally don't link to an external source at all...). Most data is proprietary, we get that: make up similar data and see if you can reproduce the problem (something small).
- don't explain the situation vaguely in words, like you have a DataFrame which is "large", mention some of the column names in passing (be sure not to mention their dtypes). Try and go into lots of detail about something which is completely meaningless without seeing the actual context. Presumably no one is even going to read to the end of this paragraph. Essays are bad, it's easier with small examples.
- don't include 10+ (100+??) lines of data munging before getting to your actual question. Please, we see enough of this in our day jobs. We want to help, but not like this.... Cut the intro, and just show the relevant DataFrames (or small versions of them) in the step which is causing you trouble.
Anyways, have fun learning python, numpy and pandas!
@zelusp 2016-04-13 17:32:12
+1 for the pd.read_clipboard(sep='\s\s+') tip. When I post SO questions that need a special but easily shared dataframe, like this one, I build it in Excel, copy it to my clipboard, then instruct SOers to do the same. Saves so much time!
@user5359531 2016-12-09 17:50:15
The pd.read_clipboard(sep='\s\s+') suggestion does not seem to work if you're using Python on a remote server, which is where a lot of large data sets live.
@MarianD 2018-12-26 22:32:33
Why pd.read_clipboard(sep='\s\s+'), and not a simpler pd.read_clipboard() (with the default '\s+')? The first needs at least 2 whitespace characters, which may cause problems if there is only 1 (e.g. see such in @JohnE's answer).
@Andy Hayden 2018-12-27 20:45:57
@MarianD the reason that \s\s+ is so popular is that there is often a single space, e.g. in a column name, but multiple spaces are rarer, and pandas output nicely puts at least two between columns. Since this is just for toy/small datasets, it's pretty powerful and covers the majority of cases. Note: tab-separated data would be a different story, though Stack Overflow replaces tabs with spaces, but if you have a tsv then just use \t.
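The difference between the two separators can be sketched without a clipboard. `read_clipboard` passes the clipboard text on to `read_csv`, so an `io.StringIO` buffer stands in for the clipboard here; the table itself is made up for illustration.

```python
import io

import pandas as pd

# Printed frame with a single space inside one column name ("unit price").
text = """\
item    unit price  qty
apple   0.5         10
pear    0.75        4
"""

# Two or more whitespace characters: "unit price" stays one column.
# A multi-character regex separator needs the python parsing engine.
two_plus = pd.read_csv(io.StringIO(text), sep=r"\s\s+", engine="python")

# One or more whitespace characters: the header splits into four names.
one_plus = pd.read_csv(io.StringIO(text), sep=r"\s+")

print(two_plus.columns.tolist())
print(one_plus.columns.tolist())
```

The raw-string prefix (`r"\s\s+"`) is a small addition here: it avoids the invalid-escape warning that recent Python versions emit for a plain `'\s\s+'` literal.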
@JohnE 2015-05-24 14:22:30
How to create sample datasets
This is mainly to expand on @AndyHayden's answer by providing examples of how you can create sample dataframes. Pandas and (especially) numpy give you a variety of tools for this, such that you can generally create a reasonable facsimile of any real dataset with just a few lines of code.
After importing numpy and pandas, be sure to provide a random seed if you want folks to be able to exactly reproduce your data and results.
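A minimal sketch of why the seed matters (the seed value 123 and the column name are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

np.random.seed(123)  # any fixed integer works; 123 is arbitrary
first = pd.DataFrame({"a": np.random.randn(5)})

np.random.seed(123)  # re-seeding replays the same random draws
second = pd.DataFrame({"a": np.random.randn(5)})

print(first.equals(second))
```

Anyone who runs the seeded snippet gets the same numbers as the asker, so results quoted in the question can be checked exactly.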
A kitchen sink example
Here's an example showing a variety of things you can do. All kinds of useful sample dataframes could be created from a subset of this:
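One way such a frame might be put together is sketched below. This is an illustrative reconstruction, not necessarily the original listing: the values are invented, and the column letters are chosen so that d and e use np.repeat and np.tile and g holds random dates.

```python
import numpy as np
import pandas as pd

np.random.seed(123)

df = pd.DataFrame({
    # some ways to create random data
    "a": np.random.randn(6),
    "b": np.random.choice([5, 7, np.nan], 6),
    "c": np.random.choice(["panda", "python", "shark"], 6),
    # some ways to create systematic groups for indexing or groupby
    "d": np.repeat(range(3), 2),   # 0, 0, 1, 1, 2, 2
    "e": np.tile(range(2), 3),     # 0, 1, 0, 1, 0, 1
    # a regular date range and a set of unique random dates from 2011
    "f": pd.date_range("2011-01-01", periods=6, freq="D"),
    "g": np.random.choice(
        pd.date_range("2011-01-01", periods=365, freq="D"), 6, replace=False
    ),
})
print(df)
```

Together, columns d and e enumerate every (group, subgroup) pair once, which is the two-column analogue of a full grid.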
This produces:
Some notes:
- np.repeat and np.tile (columns d and e) are very useful for creating groups and indices in a very regular way. For 2 columns, this can be used to easily duplicate R's expand.grid(), but it is also more flexible in its ability to provide a subset of all permutations. However, for 3 or more columns the syntax quickly becomes unwieldy.
- For a more general replacement for expand.grid(), see the itertools solution in the pandas cookbook or the np.meshgrid solution shown here. Those will allow any number of dimensions.
- np.random.choice is also very useful. For example, in column g, we have a random selection of 6 dates from 2011. Additionally, by setting replace=False we can ensure these dates are unique -- very handy if we want to use this as an index with unique values.
Fake stock market data
In addition to taking subsets of the above code, you can further combine the techniques to do just about anything. For example, here's a short example that combines np.tile and date_range to create sample ticker data for 4 stocks covering the same dates:
Now we have a sample dataset with 100 lines (25 dates per ticker), but we have only used 4 lines to do it, making it easy for everyone else to reproduce without copying and pasting 100 lines of code. You can then display subsets of the data if it helps to explain your question:
@Marius 2015-05-24 23:29:16
Great answer. After writing this question I actually did write a very short, simple implementation of expand.grid() that's included in the pandas cookbook; you could include that in your answer as well. Your answer shows how to create more complex datasets than my expand_grid() function could handle, which is great.
@Alexander 2015-09-12 07:06:37
The Challenge
One of the most challenging aspects of responding to SO questions is the time it takes to recreate the problem (including the data). Questions which don't have a clear way to reproduce the data are less likely to be answered. Given that you are taking the time to write a question and you have an issue that you'd like help with, you can easily help yourself by providing data that others can then use to help solve your problem.
The instructions provided by @Andy for writing good Pandas questions are an excellent place to start. For more information, refer to how to ask and how to create Minimal, Complete, and Verifiable examples.
Please clearly state your question upfront. After taking the time to write your question and any sample code, try to read it and provide an 'Executive Summary' for your reader which summarizes the problem and clearly states the question.
Original question:
Depending on the amount of data, sample code and error stacks provided, the reader needs to go a long way before understanding what the problem is. Try restating your question so that the question itself is on top, and then provide the necessary details.
Revised Question:
PROVIDE SAMPLE DATA IF NEEDED!!!
Sometimes just the head or tail of the DataFrame is all that is needed. You can also use the methods proposed by @JohnE to create larger datasets that can be reproduced by others. Using his example to generate a 100 row DataFrame of stock prices:
If this was your actual data, you may just want to include the head and/or tail of the dataframe as follows (be sure to anonymize any sensitive data):
You may also want to provide a description of the DataFrame (using only the relevant columns). This makes it easier for others to check the data types of each column and identify other common errors (e.g. dates as string vs. datetime64 vs. object):
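As a small sketch of what such a description can reveal (the frame and its column names are made up), here a date column arrives as strings and only becomes a real datetime column after an explicit conversion:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2011-01-01", "2011-01-02", "2011-01-03"],  # dates stored as strings
    "price": [1.5, 2.0, 2.5],
})
dtype_before = str(df["date"].dtype)

# Convert explicitly so readers know the column really holds timestamps.
df["date"] = pd.to_datetime(df["date"])
dtype_after = str(df["date"].dtype)

print(dtype_before, "->", dtype_after)
df.info()  # prints column names, non-null counts, and dtypes
```

Pasting the `df.info()` or `df.dtypes` output into the question lets others rule out type problems before reading anything else.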
NOTE: If your DataFrame has a MultiIndex:
If your DataFrame has a MultiIndex, you must first reset the index before calling to_dict. You then need to recreate the index using set_index:
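A minimal sketch of that round trip, using a made-up frame (dates are kept as plain strings here so that the printed dict can be pasted back without extra imports):

```python
import pandas as pd

df = pd.DataFrame({
    "ticker": ["aapl", "aapl", "goog", "goog"],
    "date": ["2011-01-01", "2011-01-02", "2011-01-01", "2011-01-02"],
    "price": [1.0, 1.5, 2.0, 2.5],
}).set_index(["ticker", "date"])

# Flatten the index, then dump to a plain dict that can be pasted into a question.
as_dict = df.reset_index().to_dict()
print(as_dict)

# Anyone pasting that dict can rebuild the frame and restore the index.
rebuilt = pd.DataFrame(as_dict).set_index(["ticker", "date"])
print(rebuilt)
```

If the frame holds real Timestamp values, the printed dict contains `Timestamp(...)` calls, so readers would also need `from pandas import Timestamp` before pasting it.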