But pandas’ magical simplicity makes things like computed columns immediately intuitive:
  > data['% of total'] = data.amount / data.amount.sum()

Is that immediately intuitive? I'm staring at this trying to understand what it's doing. Is the / operator overloaded? data.amount is one particular amount, and data.amount.sum() is the sum of all amounts? Why does the "computed column" property go on the same data object as the actual data? Maybe it's immediately intuitive if you've used pandas.

OTOH I think it's immediately intuitive if you are not a programmer. :)

When you see amount / sum, you think of how a list can be divided by what appears to be a scalar.

When they see it, they parse it out for what they naturally understand a percentage to mean. And all is well.
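For anyone puzzling over the same questions, here is a minimal sketch of what that line does. The three-row `data` frame is made up for illustration; only the final assignment comes from the post.

```python
import pandas as pd

# Made-up data, just to have something to divide.
data = pd.DataFrame({"amount": [10.0, 30.0, 60.0]})

column = data.amount          # the whole column (a Series), not one amount
total = data.amount.sum()     # a single number: the sum of all amounts

# "/" is overloaded on Series: every element is divided by the scalar.
data["% of total"] = column / total

shares = data["% of total"].tolist()
print(shares)
```

So the "computed column" is just another column stored on the same frame, next to the one it was derived from.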

david_eads

Exactly this. I'm the author of the post and was a programmer by trade for a long time before I became a journalist. I _don't_ actually find this more intuitive than more explicit and fundamental programming techniques. But my students grokked it immediately, whereas even simple structures like loops seem to be harder for them to get their heads around.

Given I had ten weeks to cram a lot of material in but did want to show them some amount of programming, this worked pretty nicely.

gravypod

I've been very troubled by coming to this stuff as a programmer. I'm having the same instant dissatisfied response that your students are having with looping structures.

I've recently started working on some projects where I need to do a lot of data visualization, storytelling, and investigation "into the data". As a programmer, getting into this stuff is far worse than I expected. Nothing works the way I would think makes sense. My biggest problem is that I'm thinking like a programmer, not like a mathematician. I expect objects, segregation or elimination of state, application and reduction, re-usability, and algorithms.

Are there any good frameworks that allow for processing, caching, data visualization (layout -> data population -> rendering), then exporting to some format (PNG/PDF/TeX)?

What follows, below this line, is my groveling about the things that have bothered me. Be warned if you don't like rambling and complaining. -------

Pandas, one of the biggest "offenders", is trying to be an in-memory database with only one table but ends up having far fewer features and a far clunkier interface (want to do a simple map/reduce? Welcome to chaining a strange combination of '.loc', '&', and ':,' "operators"). Matplotlib is unintuitive and poorly documented for anyone who isn't a mathematician (.plot(lons, lats, latlons=True) is correct). Dealing with anything more than 100,000 data points is a pain to iterate on. State everywhere it shouldn't be (matplotlib.pyplot).

While I've been working on this project I probably (each spin) spend an hour or two getting the data out of a format that doesn't make sense from a programmer's perspective, I spend another 5 to 10 minutes writing an application/reduction, then I spend another hour to go back into the strange data formats that matplotlib will take. All the while re-running expensive computations and waiting because I have no good persistence layer for my project.

There are just things in this community that are common that I'd never dream of. What follows is a list of these things.

1. Functions with 20-40 arguments are the norm for some reason. They also love to throw in a few insane defaults, undocumented options, and even magical flags (not enums).

Things like "draw a line, connect the dots" make it so you need to know 5 to 7 arguments of a massive function. In C/Java when I need some flags they probably look like this:

    some_operation(some_data, DO_A | DO_C | DO_Z)

Or, if someone was feeling really nice and defined an enum & used varargs, it looks more like this:

    some_operation(some_data, SomeOperationFeatures.DO_A, SomeOperationFeatures.DO_C, SomeOperationFeatures.DO_Z)

Where all of these have appropriate documentation. My IDE plays nice and can complete these things. My compiler likes it and can typecheck these things. I like it because I know all of my options available (SomeOperationFeatures.).

With matplotlib you have things like `linestyle=""`. You have to go to a webpage, look through the docs, and figure out what you want. It's worth reading the docs [1] if you never have. This could very easily have been LineStyle.DOTTED, LineStyle.DASHED, LineStyle.BLANK. IDEs would have played nice. The 3.6 runtime's typechecking would have played nice. You would be able to see what your options are (LineStyle.).
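A rough sketch of what that could look like as a thin wrapper today. `LineStyle` and `line_kwargs` are hypothetical names, not part of matplotlib; the string values are the ones matplotlib accepts for `linestyle`.

```python
from enum import Enum


class LineStyle(Enum):
    """Hypothetical enum over matplotlib's linestyle strings."""
    SOLID = "-"
    DASHED = "--"
    DASHDOT = "-."
    DOTTED = ":"
    BLANK = ""      # draw nothing between the points


def line_kwargs(style: LineStyle) -> dict:
    """Translate the enum member into the keyword matplotlib expects."""
    return {"linestyle": style.value}


print(line_kwargs(LineStyle.DOTTED))
```

Usage would be something like `ax.plot(x, y, **line_kwargs(LineStyle.DASHED))`, and typing `LineStyle.` in an IDE lists the options.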

2. Non-standard ways of treating python-isms

Pandas, for some reason, cannot stick to python-isms. I can't do simple things like...

    if not df: # Check if DF is empty
        return ...

    for row in df: # Iterate through the rows of a DF
        row.date = datetime(row.year, row.month, row.day, ...) # Create a new column in the row based on the row's data.

    subset = [a for a in df if some_condition(a)] # Do simple filtering

Pandas also implements its own versions of standard python objects! You need to know, and go back and forth between, two ways of doing things.
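For reference, a sketch of the nearest pandas idioms for those three snippets, as far as I know them. The frame, its columns (`year`, `month`, `day`, `value`) and the `> 5` condition are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2017, 2017],
    "month": [1, 6],
    "day": [15, 30],
    "value": [3, 8],
})

# "if not df" raises ValueError (ambiguous truth value); use .empty instead.
is_empty = df.empty

# "for row in df" yields column labels, not rows. A derived column is usually
# built for the whole frame at once rather than row by row.
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

# Filtering uses a boolean mask instead of a list comprehension.
subset = df[df["value"] > 5]

# If you really want row objects, itertuples() yields one namedtuple per row.
values = [row.value for row in df.itertuples(index=False)]
print(is_empty, len(subset), values)
```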

3. All these libraries separate logically grouped concepts.

Let's say I have time series data from 10 sensors.

    class SomeMagicalSample:
        def __init__(self, a, b, c, d, ..., occurred):
            self.a = a
            ...
            self.occurred = occurred

With this code I can generate very complex filtering, combinations, and whatnot. Things like extracting "real" meaning from measured values become easy to express.

    def get_magical_scalar(self): return ... some interpolation ...

    def is_some_magical_type(self): return ... some check ...

Now I can use my already tried and true reduction and application.

    sum(map(SomeMagicalSample.get_magical_scalar,
            filter(SomeMagicalSample.is_some_magical_type, samples)))
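Filled in so it actually runs, the pieces above might look like this. The two fields, the averaging in `get_magical_scalar`, and the sign check in `is_some_magical_type` are placeholders made up for the sketch, standing in for the real interpolation and check.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SomeMagicalSample:
    a: float
    b: float
    occurred: datetime

    def get_magical_scalar(self) -> float:
        # Placeholder for "some interpolation": average the two readings.
        return (self.a + self.b) / 2

    def is_some_magical_type(self) -> bool:
        # Placeholder for "some check": keep samples with a positive reading.
        return self.a > 0


samples = [
    SomeMagicalSample(1.0, 3.0, datetime(2017, 1, 1)),
    SomeMagicalSample(-2.0, 4.0, datetime(2017, 1, 2)),
    SomeMagicalSample(5.0, 7.0, datetime(2017, 1, 3)),
]

total = sum(map(SomeMagicalSample.get_magical_scalar,
                filter(SomeMagicalSample.is_some_magical_type, samples)))
print(total)
```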

Pandas, matplotlib, numpy, scipy and the lot are designed to make me avoid this style of organization. I'm instead forced to do something like this.

    a = [...]
    b = [...]
    c = [...]
    d = [...]
    ....
    occurred = [...]

Then I have to jump through hoops to keep all of this data in the same order, shift it around together.
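One thing that may ease the bookkeeping, sketched here with made-up records: if the per-field lists are loaded into a single DataFrame, each row stays together, so reordering by `occurred` moves every column at once. The field names and values are assumptions for the example.

```python
import pandas as pd

# Hypothetical samples, one dict per measurement.
records = [
    {"a": 1.0, "b": 3.0, "occurred": "2017-01-03"},
    {"a": -2.0, "b": 4.0, "occurred": "2017-01-01"},
    {"a": 5.0, "b": 7.0, "occurred": "2017-01-02"},
]

df = pd.DataFrame(records)
df["occurred"] = pd.to_datetime(df["occurred"])

# Sorting by time reorders whole rows, so a and b stay paired.
ordered = df.sort_values("occurred")
print(ordered)
```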

4. Because everything is meaningless lists of numbers there are no ways to reuse code.

Most of the code I have written to show off a single value over time, or pull some data out of some other data and visualize it, is never going to be used again. Unless I want to look at this exact same thing, this code will not be useful. If there were some way to pass objects around, hide the internals, and process them independently of their meaning then this would not be the case.

The one case where this was not true in the past few days was when I rendered a model's prediction into a pcolormesh and drew it onto a basemap. By passing it a basemap it will automatically find the place to generate data for with the model. This was an undocumented feature that I had to read the source of basemap to find was possible (pulling the top left and bottom right Lat Lons from a basemap regardless of projection).

Maybe these warts just hurt for a little while? Do these go away? Are there alternatives that can handle >10 million data points? I don't have a good analysis framework set up for the work I'm doing. Maybe this is the issue. I don't even know what a good analysis framework would look like.

[1] - https://matplotlib.org/api/lines_api.html#matplotlib.lines.L...

stdbrouw

Many of the things you list are indeed annoyances when doing data analysis in Python and they make things harder than they should be, but others are typical grievances I see from people new to it, and these do actually go away once you've been working with e.g. Pandas for longer.

> Pandas, one of the biggest "offenders", is trying to be an in-memory database with only one table but ends up having far fewer features and a far clunkier interface (want to do a simple map/reduce? Welcome to chaining a strange combination of '.loc', '&', and ':,' "operators").

What makes Pandas so great is that you can apply arbitrary functions to rows and columns, with the full expressivity of Python. In some cases it might be clunkier (though you should almost never need `.loc` and other indexing methods) but mostly it's just `df.groupby(...).apply(...)` or vectorized methods like `df.column + df.other_column`. This is a huge improvement over having half of your analysis in database queries and half in a programming language.
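A minimal sketch of both patterns, on an invented frame (the `sensor` grouping column and the max-minus-min function are assumptions chosen for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "sensor": ["x", "x", "y"],
    "column": [1, 2, 3],
    "other_column": [10, 20, 30],
})

# Vectorized: adds the two columns element by element, no loop needed.
df["combined"] = df.column + df.other_column

# Split-apply-combine: run an arbitrary Python function on each group.
spread = df.groupby("sensor")["combined"].apply(lambda s: s.max() - s.min())
print(spread)
```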

> Matplotlib is unintuitive and poorly documented

Try https://seaborn.pydata.org/ for statistical graphics.

> Pandas also implements its own versions of standard python objects! You need to know, and go back and forth between, two ways of doing things.

This sucks but is unavoidable, because Python does not have fast data types with support for missing values built in, so all your columns would have to be of mixed type (the actual type + None) and everything would slow down and simple things like computing the mean of a column with missing values would not work.

Note that you don't actually "need to go back and forth" because Pandas will happily convert plain Python objects to their Numpy equivalents for you.
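A small sketch of the missing-value point, using a made-up three-element list:

```python
import pandas as pd

values = [1.0, None, 3.0]   # plain Python list with a missing value

# Plain Python arithmetic trips over the None.
try:
    sum(values) / len(values)
    plain_python_ok = True
except TypeError:
    plain_python_ok = False

# pandas converts the list to a float64 column, storing None as NaN,
# and mean() skips missing values by default.
s = pd.Series(values)
mean = s.mean()
print(plain_python_ok, s.dtype, mean)
```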

> 3. All these libraries separate logically grouped concepts.

It's not functional, you're just going to have to deal with that. But split-apply-combine and similar patterns are quite elegant in Pandas: http://pandas.pydata.org/pandas-docs/stable/groupby.html

> 4. Because everything is meaningless lists of numbers there are no ways to reuse code.

A lot of data analysis is throw-away code. Some of it can be abstracted into reusable code, some of it can't.

Lastly, don't forget that Python does have a lot of things going for it when it comes to data analysis, from geospatial tools (http://toblerity.org/shapely/) to Bayesian modeling (http://pymc-devs.github.io/pymc3/index.html), as well as interactive coding with Jupyter and Hydrogen for the Atom editor (https://github.com/nteract/hydrogen).