Data Science - blood & tears

My experience while working in Data Science, so I don't have to relearn or reinvent anything.

  1. Early Stopping, but when?

According to this paper, slower stopping criteria allow for small improvements in generalization (here: about 4% on average), but cost much more training time (here: about a factor of 4 longer on average).
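A minimal Keras sketch (toy data and layer sizes are made up) of how the strictness of the stopping criterion is typically controlled, via the patience argument of EarlyStopping:

import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 10)
y = np.random.rand(1000, 1)

model = keras.Sequential([keras.layers.Dense(32, activation="relu"), keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

# A larger patience waits longer before stopping: slightly better generalization
# on average, at the cost of noticeably more training time.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=200, callbacks=[early_stop])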

  1. No Free Lunch theorem as I understand it

If we have absolutely no knowledge or insights about the problem we intend to solve, it doesn't matter which algorithm we choose.

  1. Bayes error rate

In statistical classification, Bayes error rate is the lowest possible error rate for any classifier of a random outcome (into, for example, one of two categories) and is analogous to the irreducible error.

  1. Common Dimension Reduction techniques

Dimension Reduction

  1. Use backslash escapes in f-strings

It is possible to use backslash escapes in the string portion of an f-string. However, you can't use backslashes in the expression part of an f-string:

>>> f"{\"Hello World\"}"
  File "<stdin>", line 1
    f"{\"Hello World\"}"
                      ^
SyntaxError: f-string expression part cannot include a backslash
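Two common workarounds are to mix quote styles or to move the value into a variable outside the f-string (and Python 3.12+ lifts this restriction entirely):

>>> f"{'Hello World'}"
'Hello World'
>>> greeting = "Hello World"
>>> f"{greeting}"
'Hello World'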
  1. Compare notebooks for version control

Use nbdime

nbdiff notebook_1.ipynb notebook_2.ipynb
nbdiff-web notebook_1.ipynb notebook_2.ipynb
  1. Jupyter notebook tools
  • facets: Data visualization
  • papermill: Parameter settings
  • nbconvert: convert notebooks to other formats
  1. Close multiple Chrome windows with PowerShell
Stop-Process -Name chrome
  1. Tukey test

The Tukey Test (or Tukey procedure), also called Tukey's Honest Significant Difference test, is a post-hoc test based on the studentized range distribution. An ANOVA test can tell you if your results are significant overall, but it won't tell you exactly where those differences lie. After you have run an ANOVA and found significant results, you can run Tukey's HSD to find out which specific groups' means (compared with each other) are different. The test compares all possible pairs of means.
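A minimal sketch (with made-up group data) using pairwise_tukeyhsd from statsmodels:

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0.0, 1, 30), rng.normal(0.5, 1, 30), rng.normal(2.0, 1, 30)])
groups = ["a"] * 30 + ["b"] * 30 + ["c"] * 30

# Compares every pair of group means and flags which differences are significant.
print(pairwise_tukeyhsd(values, groups))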

  1. SQL tips and tricks
  • Find which tables contain a given column
SELECT c.name AS ColName, t.name AS TableName
FROM sys.columns c
    JOIN sys.tables t ON c.object_id = t.object_id
WHERE c.name LIKE '%MyCol%';
  • Declare a variable
DECLARE @my_var int = 10;
  • Calculate a column using a calculated column

Not possible, recalculate.

  • Read SQL script from file on Windows

Use UTF-16 encoding, not utf-8

query = open(filepath, 'r', encoding="UTF-16").read()

Add GO at the end of the function.

  • Cannot save a sql file directly from a text file, must do that from MSSS (for now)
  1. Time series - ARMA model

You should always examine the residuals, because the model assumes the errors are Gaussian white noise. Test for white noise with the astsa package in R:

# Generate 100 observations from the AR(1) model
x <- arima.sim(model = list(order = c(1, 0, 0), ar = .9), n = 100) 

# Plot the generated data 
plot(x)

# Plot the sample P/ACF pair
plot(acf2(x))

# Fit an AR(1) to the data and examine the t-table
sarima(x, p = 1, d = 0, q = 0)

Bad residuals

  • Pattern in the residuals
  • ACF has large values
  • Q-Q plot suggests departure from normality
  • Q-statistics - all points below line
  1. Various kinds of transformation for categorical variables

  2. Jupyter Notebook tips and tricks

Halfway solution: use the %load magic of IPython

  • Autoreload submodule in Jupyter Notebook
%load_ext autoreload
%autoreload 2

It's faster to cache repeated results with joblib:

# sklearn.externals.joblib has been removed; import joblib directly
from joblib import Memory
memory = Memory(location='/tmp', verbose=0)
@memory.cache
def computation(p1, p2):
    ...
import qgrid
# Adjust column width with grid_options
qgrid_widget = qgrid.show_grid(df, show_toolbar=True, grid_options={'forceFitColumns': False, 'defaultColumnWidth': 200})
qgrid_widget
  • Sometimes `qgrid.show_grid` does not work. One workaround is to close and reopen the notebook once for `qgrid` to start working. Not sure why yet; can't reproduce in an isolated environment.

Some related settings:

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
  1. SVN on Windows tips and tricks
  • Add all new files
svn --force add .
  • Delete files

Cannot delete manually, must use svn rm path/to/file

svn status | ? { $_ -match '^!\s+(.*)' } | % { svn rm $Matches[1] }
  • After ignoring files, must commit for it to work.
  1. Python tips and tricks

To silence LightGBM output with the sklearn interface, set verbose=-1 when defining the model (not in fit).

For the lgb.train interface, set verbose=-1 in the params dict.
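A rough LightGBM sketch (toy data) of both ways:

import numpy as np
import lightgbm as lgb

X = np.random.rand(200, 5)
y = np.random.rand(200)

# sklearn interface: verbose goes on the constructor, not on fit()
model = lgb.LGBMRegressor(verbose=-1)
model.fit(X, y)

# lgb.train interface: verbose goes in the params dict
params = {"objective": "regression", "verbose": -1}
booster = lgb.train(params, lgb.Dataset(X, y))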

  1. Financial data
  • We need to back-adjust all historical data for an instrument when there is a new split or dividend (a minimal sketch follows).
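A minimal pandas sketch (made-up prices, split date, and ratio) of back-adjusting after a 2-for-1 split:

import pandas as pd

prices = pd.Series([100.0, 102.0, 104.0, 53.0, 54.0],
                   index=pd.date_range("2020-01-01", periods=5))
split_date, split_ratio = pd.Timestamp("2020-01-04"), 2.0

# Divide everything before the split so the whole series is on the post-split scale.
adjusted = prices.copy()
adjusted[adjusted.index < split_date] /= split_ratio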
  1. Cross Validation is used for model comparison, not model building

After we pick the most suitable model, we retrain it on all the data and use that model in production.
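A rough sklearn sketch of that workflow (the dataset and candidate models are arbitrary):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
candidates = {"logreg": LogisticRegression(max_iter=1000), "forest": RandomForestClassifier()}

# Cross-validation only ranks the candidates...
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in candidates.items()}
best = max(scores, key=scores.get)

# ...the chosen model is then refit on all the data for production.
final_model = candidates[best].fit(X, y)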

  1. Travis CI tips and tricks
  • Lint a .travis.yml file
travis lint [path to your .travis.yml]
  1. Python toolbox
  • Web Scraping: BeautifulSoup
  • Data Visualization: seaborn, matplotlib, facets, qgrid
  • Feature Engineering: category_encoders
  • Classical Machine Learning modelling: scikit-learn, LightGBM, CatBoost, XGBoost
  • Deep Learning modelling: Keras, Tensorflow, Pytorch
  • Clustering: hdbscan
  • Autoformat and linting: pylint, pycodestyle, pydocstyle
  • Database: pyodbc
  • Dimension Reduction: Factor Analysis, Principal Component Analysis, Independent Component Analysis, t-SNE (scikit-learn), UMAP
  • Reporting: papermill, scrapbook
  1. Panel data

In statistics and econometrics, panel data or longitudinal data are multi-dimensional data involving measurements over time.
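A tiny pandas sketch of what that shape looks like (entities observed repeatedly over time, indexed by entity and year; numbers are made up):

import pandas as pd

panel = pd.DataFrame(
    {"entity": ["A", "A", "B", "B"],
     "year": [2020, 2021, 2020, 2021],
     "income": [50, 52, 40, 43]}
).set_index(["entity", "year"])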

  1. R tips and tricks
library(RODBC)
dbhandle <- odbcDriverConnect('driver={SQL Server};server=mysqlhost;database=mydbname;trusted_connection=true')
res <- sqlQuery(dbhandle, 'select * from information_schema.tables')
  1. Comparison of Likelihood Ratio, Wald test and Rao's Score test

Statistical tests' comparison

  • The Wald test assumes that the likelihood is normally distributed, and on that basis, uses the degree of curvature to estimate the standard error. Then, the parameter estimate divided by the SE yields a z-score. This holds under large N, but isn't quite true with smaller Ns. It is hard to say when your N is large enough for this property to hold, so this test can be slightly risky.
  • Likelihood ratio tests look at the ratio of the likelihoods (or the difference in log likelihoods) at its maximum and at the null. This is often considered the best test; a minimal sketch follows this list.
  • The score test is based on the slope of the likelihood at the null value. This is typically less powerful, but there are times when the full likelihood cannot be computed and so this is a nice fallback option.
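A minimal likelihood-ratio-test sketch with simulated data: twice the log-likelihood difference between the full and the nested model is compared to a chi-squared distribution with one degree of freedom per restricted parameter.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 500))
y = (rng.random(500) < 1 / (1 + np.exp(-0.8 * x1))).astype(int)

full = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
reduced = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)

# One restricted parameter (x2), so df = 1.
lr_stat = 2 * (full.llf - reduced.llf)
p_value = stats.chi2.sf(lr_stat, df=1)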
  1. Trading algorithm testing
  • White Reality Check test
  • Hansen Superior Predictive Ability test
  1. Update Python syntax to a newer version

Use flynt or pyupgrade
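Typical invocations (paths are placeholders):

flynt path/to/project
pyupgrade --py38-plus path/to/file.py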

  1. Add new file types to Atom

  2. Example of tf.data.Dataset.from_generator

Note: Add .repeat() to loop infinitely.

def _input_fn():
  sent1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=np.int64)
  sent2 = np.array([20, 25, 35, 40, 600, 30, 20, 30], dtype=np.int64)
  sent1 = np.reshape(sent1, (8, 1, 1))
  sent2 = np.reshape(sent2, (8, 1, 1))

  labels = np.array([40, 30, 20, 10, 80, 70, 50, 60], dtype=np.int64)
  labels = np.reshape(labels, (8, 1))

  def generator():
    for s1, s2, l in zip(sent1, sent2, labels):
      yield {"input_1": s1, "input_2": s2}, l

  dataset = tf.data.Dataset.from_generator(generator, output_types=({"input_1": tf.int64, "input_2": tf.int64}, tf.int64))
  dataset = dataset.batch(2).repeat()
  return dataset

...

model.fit(_input_fn(), epochs=10, steps_per_epoch=4)
  1. Download all files in a Jupyter Notebook server

In a cell, run !tar chvfz notebook.tar.gz *

  1. Save content of a web page to a file with the same name
curl http://example.com/folder/big-file.iso -O
  1. Choose ARMA parameters from ACF and PACF

Rule of thumb: the ACF cuts off after lag q for an MA(q) process, the PACF cuts off after lag p for an AR(p) process, and both tail off for an ARMA(p, q) process.
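A quick statsmodels sketch (simulated AR(1) series) for eyeballing the sample ACF and PACF:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima_process import ArmaProcess

# AR(1) with phi = 0.9: expect the ACF to tail off and the PACF to cut off at lag 1.
x = ArmaProcess(ar=[1, -0.9]).generate_sample(nsample=200)

fig, axes = plt.subplots(2, 1)
plot_acf(x, ax=axes[0])
plot_pacf(x, ax=axes[1])
plt.show()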

  1. Use the correct alternative option in Fisher Exact test

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html
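A small scipy sketch (made-up 2x2 table) showing how the alternative argument changes the question being asked, and therefore the p-value:

from scipy.stats import fisher_exact

table = [[8, 2], [1, 5]]

# Two-sided vs. one-sided test of association.
print(fisher_exact(table, alternative="two-sided"))
print(fisher_exact(table, alternative="greater"))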

  1. Scientific Python project template

https://github.com/scientific-python/cookie
