With org-mode you can keep data, code, and documentation in one file.
Suppose you have an org-mode file containing the following table.
#+NAME: mydata
| Drug | Patients |
|------+----------|
| X    |      232 |
| Y    |      351 |
| Z    |      117 |
Note that there cannot be a blank line between the NAME header and the beginning of the table.
You can bring this table into Python simply by declaring it to be a variable in the header of a Python code block.
#+begin_src python :var tbl=mydata :results output
print(tbl)
#+end_src
When you evaluate this block, you see that the table is imported as a list of lists.
[['X', 232], ['Y', 351], ['Z', 117]]
Note that the column headings were not imported into Python. Now suppose you would like to retain the headers, and use them as column names in a pandas data frame.
#+begin_src python :var tbl=mydata :colnames no :results output
import pandas as pd
df = pd.DataFrame(tbl[1:], columns=tbl[0])
print(df, "\n")
print(df["Patients"].mean())
#+end_src
When evaluated, this block produces the following.
  Drug  Patients
0    X       232
1    Y       351
2    Z       117

233.33333333333334
Note that in order to import the column names, we told org-mode that there are no column names! We did this with the header option
:colnames no
This seems backward, but it makes sense. It tells org-mode to bring in the first row of the table even though it looks like a column header, which isn't imported by default. We then tell pandas to make a data frame out of all but the first row (i.e. tbl[1:]) and to use the first row (i.e. tbl[0]) as the column names.
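The same header/row split can be tried outside of org-mode. The following is a minimal sketch using only the standard library, assuming tbl holds what org-mode passes in when :colnames no is set. The list is typed in by hand here, and the names header, rows, and records are illustrative only.

```python
from statistics import mean

# Hand-typed stand-in for what org-mode passes in with `:colnames no`:
# the header row arrives as the first element of the list.
tbl = [["Drug", "Patients"], ["X", 232], ["Y", 351], ["Z", 117]]

# Split the header row from the data rows, as in the pandas example.
header, rows = tbl[0], tbl[1:]

# Build one dict per data row, keyed by column name.
records = [dict(zip(header, row)) for row in rows]

print(records[0])
print(mean(r["Patients"] for r in records))
```

Without :colnames no, tbl would start at the ['X', 232] row, so tbl[0] would silently consume a data row as the header.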
A possible disadvantage to keeping data and code together is that the data could be large. But since org files are naturally in outline mode, you could collapse the part of the outline containing the data so that you don’t have to look at it unless you need to.
I’ve tried this approach a few times, and my experience is that it doesn’t scale well.
Org-mode's performance is getting better, but there is still a lag of several seconds when hiding or displaying a large table of 20,000 x 100 cells or so. In fact, a table that size slows down all editing operations in the buffer. As datasets go this is tiny, of course.
Feeding the data to the underlying Python process ends up being memory intensive, because the table is passed in whole rather than being read and processed one line at a time with a Python reader object. I think the memory use is at least 2x the size of the table because the buffer is already loaded into memory in Emacs.
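For contrast, here is a rough sketch of the line-at-a-time style that the :var approach rules out. It assumes the data lives in a CSV file outside the org buffer. An in-memory string stands in for that file so the sketch runs on its own, and the file name in the comment is hypothetical.

```python
import csv
import io

# Stand-in for an external file. In practice this might be
# open("mydata.csv", newline=""), where the file name is hypothetical.
source = io.StringIO("Drug,Patients\nX,232\nY,351\nZ,117\n")

reader = csv.reader(source)
header = next(reader)            # the first line holds the column names
col = header.index("Patients")   # position of the column to average

total = 0
count = 0
for row in reader:               # the reader yields one row at a time
    total += int(row[col])
    count += 1

running_mean = total / count
print(running_mean)
```

Only the running totals are kept between rows, so memory use does not grow with the number of rows.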
Finally, it has the same problem as Jupyter notebooks: it's never clear whether code blocks that depend on other blocks have been rerun after the state of the buffer has changed. I don't know if Org-mode provides hooks to automatically rerun dependent blocks.
That said, having code, data and documentation (including equations) in the same buffer is a really neat feature for creating pedagogical documents or webpages.
The data sets I run across are usually small (e.g. on the order of 100 or 1000 rows) or enormous (say 100,000,000 rows). I wouldn’t consider using org-mode for the latter, but I may use it on the former next time the need arises.