Integrated Workflows¶
There are three primary interfaces to running python within stata: interactively, in a do file, and by running a python script.
We will then look at how to transfer data between python and stata in both directions through the stata function interface.
Running python Interactively with a First Example¶
You can run python interactively within stata in a manner that is the equivalent of running the python REPL program through a terminal.
This is activated by typing python in the command window.
You are now interfacing directly with the python interpreter as indicated in the Results window.
You can now write python code such as:
print("Hello World!")
Once you hit enter, stata sends the code snippet to the python interpreter for processing and shows the result.
To stop interfacing with the python interpreter, type end in the command window. This will return you to the standard stata interface.
Tip
If you have a one-line python command you can use
python: print("Hello World!")
which will pass the code to python, display the results directly below in the Results window, and return you to the stata command environment.
Running python in a do file¶
Another option for running python code is through the do file.
Let’s open the do file editor and add:
di "Stata Here"
python: print("Python Here")
and when you click on the Do button you get the result:
where the results from python are displayed similarly to stata output.
However, most of the time you will want to add in a block of code such as:
for i in range(0,2):
    print("Python Here")
This can be done by delimiting the python code within the do file using either
python
<python code>
end
or
python:
<python code>
end
The difference between these two delimiters is in how stata handles any errors in python.
The python delimiter will continue to execute the rest of the python code if an error is encountered, while the python: delimiter will immediately return control to stata once the error is encountered.
di "Stata Here"
python
for i in rang(2):
    print("Python Here")
print("Python Done")
end
di "Back in Stata Land!"
As you can see, stata has continued to execute code past the point at which there is an error.
However, if you use python: the execution will halt at the point of the error.
di "Stata Here"
python:
for i in rang(2):
    print("Python Here")
print("Python Done")
end
di "Back in Stata Land!"
Tip
I tend to use python: as I prefer to get to the error quickly to fix the problem without any distracting output below it. Also, in a long-running program you will want to fix the issue before the rest of the program executes.
We can use the error message to fix the issue now and run the fixed do file
di "Stata Here"
python:
for i in range(2):
    print("Python Here")
print("Python Done")
end
di "Back in Stata Land!"
The Do File Editor and White Space¶
Reminder
Whitespace is used by python to declare scopes and is an integral part of the language definition.
The do file editor doesn’t provide you with full text editor support when writing python code.
For example if you type:
python:
for i in range(10):
|<cursor placed here>
the editor will not automatically indent your code.
However, once you have set the cursor to the correct indentation level it will retain that indentation level for subsequent lines.
python:
for i in range(10):
    |
    |<cursor placed here>
So you need to be careful with whitespace.
Also, what you type within the delimiters is passed directly to python, so you can’t indent these code-blocks, such as:
di "Stata Here"
python:
    print("Python Here")
end
will return the following error:
Running python scripts in stata¶
A third option is to run a python script that contains some python code.
If you save the following code in a file example3.py:
print("Python Here")
for i in range(2):
    print(f"{i} times hello")
print("I'm outta here")
you can then run this script in stata using:
python script example3.py
with the output:
Tip
This can be a very useful way to run python code as it leaves you free to write python code in any text editor you like, such as VS Code.
Interacting between Stata and Python¶
Tip
In many cases it can be simpler to keep python and stata workflows independent of each other and use files to transfer data between them.
This is covered in File based Workflows.
So far the python and stata runtime environments have been independent of each other while learning how to run python code within stata (i.e. they haven’t shared any data).
For many applications we want some level of interaction between stata and python by copying objects back and forth between the two runtime environments.
Stata makes various components of its internals available to python via the stata function interface (sfi) to enable such interaction with:
Dataset, which connects python with the current in-memory stata dataset
Macros, which connects python with stata macros
In addition it also provides access to many other stata components.
Copying Data from Stata to Python¶
Stata Blog Post
This section is heavily inspired by this excellent stata blog post
sysuse auto
list foreign
Listing the foreign data in stata shows
We can then use sfi.Data to transfer the raw data to python using the .get method of the Data object from the stata function interface package.
python
from sfi import Data
dataraw = Data.get('foreign')
dataraw
end
and it looks like
Notice that the data looks different.
Note
stata has a concept of labels. If you use the data explorer you will see that the foreign variable consists of 0,1 that are associated with labels domestic and foreign (respectively).
We may want to get more information about the get method so the best place to look is the documentation on sfi.Data.
Then you can click on the get method
Tip
You can’t use the ipython features such as Data.get? in this context because stata is interfacing directly with the python interpreter and not the ipython interpreter (such as when you’re using jupyter).
That page looks like:
You can see that an option is to fetch the value label using valuelabel=True
python
from sfi import Data
dataraw = Data.get('foreign', valuelabel=True)
dataraw
end
and the raw data is now returned as strings taking the value of the labels that have been applied to the data.
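Conceptually, a value label is a lookup from the stored code to its text. The minimal sketch below mimics that lookup in plain python. The dictionary is a hand-written stand-in for the label attached to foreign and the list of codes is made up for illustration; neither is something returned by sfi.

```python
# Hand-written stand-in for the value label attached to `foreign`
# (codes 0 and 1, as described in the note above).
origin_labels = {0: "Domestic", 1: "Foreign"}

# A few made-up raw codes, i.e. what you see without valuelabel=True.
raw_codes = [0, 0, 1, 0, 1]

# Replace each stored code with its label text.
labelled = [origin_labels[code] for code in raw_codes]
print(labelled)
```

This is only meant to show why the same variable can appear as numbers in one call and as strings in another.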
Obtaining more variables at once¶
You can obtain more variables using the get method. Based on the documentation you can use the following argument to specify which variables to fetch:
var (int, str, or list-like, optional) – Variables to access.
It can be specified as a single variable index or name, or an
iterable of variable indices or names. If var is not specified,
all the variables are specified.
In addition you can also specify which observations (obs) you would like:
obs (int or list-like, optional) – Observations to access.
It can be specified as a single observation index or an iterable
of observation indices. If obs is not specified, all the
observations are specified.
So let’s use this information and run
python
from sfi import Data
dataraw = Data.get('foreign mpg rep78', range(45,56))
dataraw
end
This code saves a list of lists into the python object dataraw.
The data is written as a list of rows/obs in the order that the variables are requested, which in this case is: foreign mpg rep78, such as the first element:
[[0, 18, 2], ...
The range(45,56) request will fetch observations 46 to 56, because python indices start at 0 while stata observation numbers start at 1, as shown in the data browser
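The two conventions above (zero-based observation indices and row-oriented lists) can be checked in plain python without stata. This is only an illustrative sketch: the first row is the one shown above, while the remaining rows and all variable names in the snippet are made up.

```python
names = "foreign mpg rep78".split()

# Zero-based indices passed from python ...
py_obs = range(45, 56)
# ... correspond to one-based observation numbers in stata.
stata_obs = [i + 1 for i in py_obs]
print(stata_obs[0], stata_obs[-1], len(stata_obs))  # 46 56 11

# Row-oriented data: one inner list per observation.
rows = [[0, 18, 2], [0, 19, 3], [1, 23, 4]]  # rows 2 and 3 are made up

# Transpose the rows to get one tuple of values per variable.
columns = dict(zip(names, zip(*rows)))
print(columns["mpg"])  # (18, 19, 23)
```

Transposing with zip is handy when you want a single variable out of the row-oriented result without going through pandas.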
As per the documentation you can also specify a list-like object instead of a space-separated string, such as ['foreign', 'mpg', 'rep78']:
python
from sfi import Data
dataraw = Data.get(['foreign', 'mpg', 'rep78'], range(45,56))
dataraw
end
which will return the same data.
What happens now if you specify valuelabel=True for the above python code?
pd.DataFrame and pd.Series:¶
The discussion so far has focused on fetching raw data out of stata and copying it to the python environment. But in many applications we are likely to want higher productivity objects such as pandas DataFrame and Series.
Let’s try
python
from sfi import Data
import pandas as pd
dataraw = Data.get('foreign mpg rep78', range(45,56))
df = pd.DataFrame(dataraw)
df
end
You will notice that the raw data has now been placed in a pd.DataFrame but the columns and index labels haven’t come across:
You may want to parameterize your requests so you can use them in both the sfi.Data.get method and the pd.DataFrame constructor when converting the raw data into a pd.DataFrame.
You can save the variable selection as a python variable:
vars = ['foreign', 'mpg', 'rep78']
then you can use this variable for both stata and python
python
from sfi import Data
import pandas as pd
vars = ['foreign', 'mpg', 'rep78']
dataraw = Data.get(vars, range(45,56), valuelabel=True)
df = pd.DataFrame(dataraw, index=range(46,57), columns=vars)
df
end
which provides a much more consistent pd.DataFrame and lines up closely with the stata context.
You can compare with stata by running in the command window
list foreign mpg rep78 in 46/56
How can you explain the value for the variable rep78 for observation 51?
Note
There is also a method available, sfi.Data.getAsDict(), that includes the variable names in a returned dictionary so you can use:
python
from sfi import Data
import pandas as pd
vars = ['foreign', 'mpg', 'rep78']
dataraw = Data.getAsDict(vars, range(45,56), valuelabel=True)
df = pd.DataFrame(dataraw)
df
end
Missing Values:¶
Missing values in stata are internally represented by the largest value for each type.
Within stata you typically work with missing values using . such as:
list rep78 if rep78 != .
and much of this detail is taken care of for you.
As missing values are represented by the maximum value, python will interpret this data as an actual value.
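To see why this matters, here is a minimal sketch in plain python. It assumes that a system missing value stored as a double arrives as 2**1023 (about 8.99e307); treat that constant, the helper name to_nan and the second row of sample data as illustrative assumptions rather than documented sfi behaviour.

```python
import math

# Assumed sentinel for a stata system missing value stored as a double.
ASSUMED_MISSING = 2.0 ** 1023

def to_nan(value):
    """Return NaN for floats at or above the assumed missing sentinel."""
    if isinstance(value, float) and value >= ASSUMED_MISSING:
        return math.nan
    return value

rows = [[0, 18, 2], [1, 23, ASSUMED_MISSING]]  # second row is made up
cleaned = [[to_nan(v) for v in row] for row in rows]

print(rows[1][2] > 1e300)         # the raw value looks like a huge number
print(math.isnan(cleaned[1][2]))  # after cleaning it is NaN
```

Left untouched, a value like this would silently distort any mean or sum computed on the python side.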
You will want to specify missingval=np.nan
python
from sfi import Data
import numpy as np
import pandas as pd
vars = ['foreign', 'mpg', 'rep78']
dataraw = Data.get(vars, range(45,56), valuelabel=True, missingval=np.nan)
df = pd.DataFrame(dataraw, index=range(46,57), columns=vars)
df
end
which returns the following
Copying Data from Python to Stata¶
Stata Blog Post
This section is heavily inspired by this excellent stata blog post
It is often the case that you will want to do some data work in python and then need to transfer it to stata to do some statistical analysis.
The sfi.Data interface also contains methods for saving data from python into the default stata dataset (or a frame, which is new in Stata 16).
Let us fetch some data from Yahoo Finance using the yfinance package in python
python:
import yfinance as yf
dowjones = yf.Ticker("^DJI")
data = dowjones.history(start="2010-01-01", end="2020-12-31")[['Close', 'Volume']]
data
end
The yfinance package has returned the dowjones history table containing data between 2010-01-01 and 2020-12-31.
Now we need to migrate that data from python into stata
python:
from sfi import Data
Data.setObsTotal(len(data))
end
The stata data editor now contains space for len(data) observations to be transferred.
You can then set up 3 variables in stata to save date, close, and volume information across.
python:
Data.addVarStr("date", 10) # Str10
Data.addVarDouble("close") # Double
Data.addVarInt("volume") # Int
end
The stata data editor now contains 3 variables
Warning
You should start this work with an empty stata dataset. The sfi.Data package can return some cryptic errors. When trying to create a date Str variable using the code above you will get the following error if the variable already exists in the dataset.
>>> Data.addVarStr("date", 10)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Applications/Stata/ado/base/py/sfi.py", line 487, in addVarStr
return _stp._st_addvarstr(name, length)
SystemError: failed to add a variable of type str to the current Stata dataset
r(7102);
Clearing can be done in stata using
clear
The next step is to migrate the actual data.
You might try saving the data directly from the pandas dataframe into the stata dataset using the sfi.Data.store() method.
Note
This method interface is expecting
static store(var, obs, val, selectvar=None)
where var, obs, and val are python positional arguments, and selectvar=None is a python keyword argument with a default value of None.
This means that var, obs, and val are required inputs.
This deviates from sfi.Data.get(), where every argument is optional.
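The difference between required positional arguments and a keyword argument with a default can be tried out in plain python. The function below is a hypothetical stand-in that only copies the shape of the signature quoted above; it does not call stata and is not the sfi implementation.

```python
def store(var, obs, val, selectvar=None):
    """Hypothetical stand-in with the same signature shape as Data.store."""
    return {"var": var, "obs": obs, "val": val, "selectvar": selectvar}

# All three positional arguments supplied; obs is passed explicitly as None.
print(store("date", None, ["2010-01-04"]))

# Leaving one of them out raises a TypeError before any work is done.
try:
    store("date", ["2010-01-04"])
except TypeError as err:
    print("TypeError:", err)
```

This is why the calls that follow always pass None in the obs position rather than omitting it.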
python
Data.store("date", None, data.index)
end
however you will run into trouble with the following error:
Stata is similar to numpy in that it is very specific about how it saves data in memory in accordance with specified types.
In the code above we tried to send through a list of datetime objects from pandas and the stata function interface doesn’t know how to represent this data in the stata dataset.
python
data.index
data.index[0]
end
As you can see, the index from the pandas dataframe data consists of Timestamp objects:
Therefore some translation is required in this case to convert dates into a format that stata can copy into its dataset and then use stata tools to convert to stata dates.
We know stata has a date function that we can use:
clear
gen stringdates = ""
set obs 1
replace stringdates = "2010-01-04" in 1
gen date = date(stringdates, "YMD")
list
format %tdCCYY-NN-DD date
list
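Before the display format is applied, the new date variable holds a plain number, because stata daily dates count days since 1 January 1960. A small python check of that arithmetic, using only the standard library (the names in the snippet are illustrative):

```python
from datetime import date

STATA_EPOCH = date(1960, 1, 1)  # day 0 for stata daily dates

# Days elapsed between the epoch and the example date used above.
days = (date(2010, 1, 4) - STATA_EPOCH).days
print(days)  # 18266
```

The format command only changes how this number is displayed, not what is stored.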
So now we can look to convert the pandas.Timestamp objects to be represented as simpler string based data that contain the information needed for stata to convert those dates.
Pandas has a useful method .astype() for data conversions.
python
data.index = data.index.astype(str)
data.index[0]
end
This has used the built-in type converter to represent the index as strings that are formatted as YYYY-MM-DD.
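The width of these strings matters because the date variable was created as a 10-character string. The sketch below uses standard-library datetime objects as a stand-in for pandas.Timestamp (pandas may format its own objects differently) to show that str() can carry a time or timezone part, while an explicit strftime pattern always yields 10 characters. The sample values are made up.

```python
from datetime import datetime, timedelta, timezone

# Made-up sample values: one naive datetime and one timezone-aware datetime.
stamps = [
    datetime(2010, 1, 4),
    datetime(2010, 1, 5, tzinfo=timezone(timedelta(hours=-5))),
]

# Generic string conversion keeps the time (and any timezone offset).
as_str = [str(s) for s in stamps]
# An explicit pattern keeps only the calendar date.
as_ymd = [s.strftime("%Y-%m-%d") for s in stamps]

print([len(s) for s in as_str])  # [19, 25]
print(as_ymd)                    # ['2010-01-04', '2010-01-05']
```

If your index turns out to carry times or timezones, data.index.strftime("%Y-%m-%d") is an explicit alternative to .astype(str) in pandas.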
Now let’s try to save this information into the stata dataset:
python
Data.store("date", None, data.index)
end
You can now open the data viewer and see that the dates (as strings) have been copied over to stata:
Let’s bring in the numerical data, which is a much simpler process
python
Data.store("close", None, data.Close)
Data.store("volume", None, data.Volume)
end
We now have the data we need in the stata dataset as seen in the data editor
Now that the data is copied across we can switch back to stata to run any analysis or construct a plot.
We will first want to convert those dates in stata as a post-transfer step
gen sdate = date(date, "YMD")
format %tdCCYY-NN-DD sdate
and we can check the conversion in the stata data editor
and then we can construct the plot
as demonstrated in the original blog post
replace volume = volume / 1000000
twoway (line close sdate, lcolor(green) lwidth(medium)) ///
(bar volume sdate, fcolor(blue) lcolor(blue) yaxis(2)), ///
title("Dow Jones Industrial Average (2010 - 2020)") ///
xtitle("") ytitle("") ytitle("", axis(2)) ///
xlabel(, labsize(small) angle(horizontal)) ///
ylabel(5000(5000)30000, ///
labsize(small) labcolor(green) ///
angle(horizontal) format(%9.0fc)) ///
ylabel(0(5)30, ///
labsize(small) labcolor(blue) ///
angle(horizontal) axis(2)) ///
legend(order(1 "Closing Price" 2 "Volume (millions)") ///
cols(1) position(10) ring(0))
which produces the following stata chart
You may be interested in comparing this to a chart built with matplotlib and pandas in the python environment.
You can download this notebook, or open this notebook in the cloud, which produces the following matplotlib figure:
Persistence between python code-blocks in stata¶
Once the python interpreter is initialised it is used throughout the stata session.
This means that once variables are created in python they will be available in future python code-blocks.
python:
import pandas as pd
df = pd.DataFrame(range(4), index=['a','b','c','d'])
df
end
then you can run some other things in stata and then return to python and fetch the df object
python:
df
end
such as in this short demonstration
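The behaviour resembles running several snippets against one shared namespace. The sketch below imitates that idea in plain python with exec; it is an analogy only and makes no claim about how stata manages its embedded interpreter. The names session, values and total are made up.

```python
# One shared namespace plays the role of the long-lived interpreter session.
session = {}

# First "block": create a variable.
exec("values = list(range(4))", session)

# ... other work could happen here ...

# Second "block": the variable created earlier is still visible.
exec("total = sum(values)", session)
print(session["total"])  # 6
```

The practical consequence is the same as in stata: names defined in an earlier block can collide with, or be reused by, later blocks in the same session.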
The stata function interface sfi¶
The python api documentation contains the details about the sfi package from stata.
Class |
Description |
---|---|
Access |
|
Access to the current |
|
Access to |
|
Access to |
|
Access to |
|
An interface with global |
|
Access to |
|
Access to |
|
Access to |
|
Access to |
|
a set of |
|
Provide access to |
|
Access to |