Data Science - blood & tears

My experience while working in Data Science, so I don't have to relearn or reinvent anything.

  1. Early Stopping, but when?

According to this paper, slower stopping criteria allow for small improvements in generalization (here: about 4% on average), but cost much more training time (here: about a factor of 4 longer on average).
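A minimal Keras sketch (toy data and layer sizes are made up) of how the strictness of the stopping criterion is typically controlled, via the patience argument of EarlyStopping:

import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 10)
y = np.random.rand(1000, 1)

model = keras.Sequential([keras.layers.Dense(32, activation="relu"), keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

# A larger patience waits longer before stopping: slightly better generalization
# on average, at the cost of noticeably more training time.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=200, callbacks=[early_stop])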

  1. No Free Lunch theorem as I understand it

If we have absolutely no knowledge or insights about the problem we intend to solve, it doesn't matter which algorithm we choose.

  1. Bayes error rate

In statistical classification, Bayes error rate is the lowest possible error rate for any classifier of a random outcome (into, for example, one of two categories) and is analogous to the irreducible error.

  1. Common Dimension Reduction techniques

Dimension Reduction

  1. Use backslash escapes in f-strings

It is possible to use backslash escapes in the string portion of an f-string. However, you can't use backslashes in the expression part of an f-string:

>>> f"{\"Hello World\"}"
  File "<stdin>", line 1
    f"{\"Hello World\"}"
                      ^
SyntaxError: f-string expression part cannot include a backslash
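Two common workarounds are to mix quote styles or to move the value into a variable outside the f-string (and Python 3.12+ lifts this restriction entirely):

>>> f"{'Hello World'}"
'Hello World'
>>> greeting = "Hello World"
>>> f"{greeting}"
'Hello World'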
  1. Compare notebooks for version control

Use nbdime

nbdiff notebook_1.ipynb notebook_2.ipynb
nbdiff-web notebook_1.ipynb notebook_2.ipynb
  1. Jupyter notebook tools
  • facets: Data visualization
  • papermill: Parameter settings
  • nbconvert: convert notebooks to other formats
  1. Close multiple Chrome windows with PowerShell
Stop-Process -Name chrome
  1. Tukey test

The Tukey Test (or Tukey procedure), also called Tukey's Honest Significant Difference test, is a post-hoc test based on the studentized range distribution. An ANOVA test can tell you if your results are significant overall, but it won't tell you exactly where those differences lie. After you have run an ANOVA and found significant results, you can run Tukey's HSD to find out which specific groups' means (compared with each other) are different. The test compares all possible pairs of means.
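A minimal sketch (with made-up group data) using pairwise_tukeyhsd from statsmodels:

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0.0, 1, 30), rng.normal(0.5, 1, 30), rng.normal(2.0, 1, 30)])
groups = ["a"] * 30 + ["b"] * 30 + ["c"] * 30

# Compares every pair of group means and flags which differences are significant.
print(pairwise_tukeyhsd(values, groups))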

  1. SQL tips and tricks
  • Find which tables contain a given column
SELECT c.name AS ColName, t.name AS TableName
FROM sys.columns c
    JOIN sys.tables t ON c.object_id = t.object_id
WHERE c.name LIKE '%MyCol%';
  • Declare a variable
DECLARE @my_var int = 10;
  • Calculate a column using a calculated column

Not possible, recalculate.

  • Read SQL script from file on Windows

Use UTF-16 encoding, not utf-8

query = open(filepath, 'r', encoding="UTF-16").read()

Add GO at the end of the function.

  • Cannot save a sql file directly from a text file, must do that from MSSS (for now)
  1. Time series - ARMA model

You should always examine the residuals, because the model assumes the errors are Gaussian white noise. Test for white noise with the astsa package in R:

# Generate 100 observations from the AR(1) model
x <- arima.sim(model = list(order = c(1, 0, 0), ar = .9), n = 100) 

# Plot the generated data 
plot(x)

# Plot the sample P/ACF pair
plot(acf2(x))

# Fit an AR(1) to the data and examine the t-table
sarima(x, p = 1, d = 0, q = 0)

Bad residuals

  • Pattern in the residuals
  • ACF has large values
  • Q-Q plot suggests departure from normality
  • Q-statistics - all points below line
  1. Various kinds of transformation for categorical variables

  2. Jupyter Notebook tips and tricks

Halfway solution: use the %load magic of IPython

  • Autoreload submodule in Jupyter Notebook
%load_ext autoreload
%autoreload 2

It's faster to cache repeated results with joblib:

# sklearn.externals.joblib has been removed; import joblib directly
from joblib import Memory
memory = Memory(location='/tmp', verbose=0)
@memory.cache
def computation(p1, p2):
    ...
import qgrid
# Adjust column width with grid_options
qgrid_widget = qgrid.show_grid(df, show_toolbar=True, grid_options={'forceFitColumns': False, 'defaultColumnWidth': 200})
qgrid_widget
  • Sometimes `qgrid.show_grid` does not work. One workaround is to close and reopen the notebook once for `qgrid` to start working. Not sure why yet; can't reproduce in an isolated environment.

Some related settings:

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
  1. SVN on Windows tips and tricks
  • Add all new files
svn --force add .
  • Delete files

Cannot delete manually, must use svn rm path/to/file

svn status | ? { $_ -match '^!\s+(.*)' } | % { svn rm $Matches[1] }
  • After ignoring files, must commit for it to work.
  1. Python tips and tricks

To silence LightGBM output with the sklearn interface, set verbose=-1 when defining the model (not in fit).

For the lgb.train interface, set verbose=-1 in the params dict.
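A rough LightGBM sketch (toy data) of both ways:

import numpy as np
import lightgbm as lgb

X = np.random.rand(200, 5)
y = np.random.rand(200)

# sklearn interface: verbose goes on the constructor, not on fit()
model = lgb.LGBMRegressor(verbose=-1)
model.fit(X, y)

# lgb.train interface: verbose goes in the params dict
params = {"objective": "regression", "verbose": -1}
booster = lgb.train(params, lgb.Dataset(X, y))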

  1. Financial data
  • We need to back-adjust all historical data for an instrument when there is a new split or dividend (a minimal sketch follows).
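A minimal pandas sketch (made-up prices, split date, and ratio) of back-adjusting after a 2-for-1 split:

import pandas as pd

prices = pd.Series([100.0, 102.0, 104.0, 53.0, 54.0],
                   index=pd.date_range("2020-01-01", periods=5))
split_date, split_ratio = pd.Timestamp("2020-01-04"), 2.0

# Divide everything before the split so the whole series is on the post-split scale.
adjusted = prices.copy()
adjusted[adjusted.index < split_date] /= split_ratio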
  1. Cross Validation is used for model comparison, not model building

After we pick the most suitable model, we retrain it on all the data and use that model in production.
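A rough sklearn sketch of that workflow (the dataset and candidate models are arbitrary):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
candidates = {"logreg": LogisticRegression(max_iter=1000), "forest": RandomForestClassifier()}

# Cross-validation only ranks the candidates...
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in candidates.items()}
best = max(scores, key=scores.get)

# ...the chosen model is then refit on all the data for production.
final_model = candidates[best].fit(X, y)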

  1. Travis CI tips and tricks
  • Lint a .travis.yml file
travis lint [path to your .travis.yml]
  1. Python toolbox
  • Web Scraping: BeautifulSoup
  • Data Visualization: seaborn, matplotlib, facets, qgrid
  • Feature Engineering: category_encoders
  • Classical Machine Learning modelling: scikit-learn, LightGBM, CatBoost, XGBoost
  • Deep Learning modelling: Keras, Tensorflow, Pytorch
  • Clustering: hdbscan
  • Autoformat and linting: pylint, pycodestyle, pydocstyle
  • Database: pyodbc
  • Dimension Reduction: Factor Analysis, Principal Component Analysis, Independent Component Analysis, t-SNE (scikit-learn), UMAP
  • Reporting: papermill, scrapbook
  1. Panel data

In statistics and econometrics, panel data or longitudinal data are multi-dimensional data involving measurements over time.
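A tiny pandas sketch of what that shape looks like (entities observed repeatedly over time, indexed by entity and year; numbers are made up):

import pandas as pd

panel = pd.DataFrame(
    {"entity": ["A", "A", "B", "B"],
     "year": [2020, 2021, 2020, 2021],
     "income": [50, 52, 40, 43]}
).set_index(["entity", "year"])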

  1. R tips and tricks
library(RODBC)
dbhandle <- odbcDriverConnect('driver={SQL Server};server=mysqlhost;database=mydbname;trusted_connection=true')
res <- sqlQuery(dbhandle, 'select * from information_schema.tables')
  1. Comparison of Likelihood Ratio, Wald test and Rao's Score test

Statistical tests' comparison

  • The Wald test assumes that the likelihood is normally distributed, and on that basis, uses the degree of curvature to estimate the standard error. Then, the parameter estimate divided by the SE yields a z-score. This holds under large N, but isn't quite true with smaller Ns. It is hard to say when your N is large enough for this property to hold, so this test can be slightly risky.
  • Likelihood ratio tests look at the ratio of the likelihoods (or the difference in log likelihoods) at its maximum and at the null. This is often considered the best test; a minimal sketch follows this list.
  • The score test is based on the slope of the likelihood at the null value. This is typically less powerful, but there are times when the full likelihood cannot be computed and so this is a nice fallback option.
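A minimal likelihood-ratio-test sketch with simulated data: twice the log-likelihood difference between the full and the nested model is compared to a chi-squared distribution with one degree of freedom per restricted parameter.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 500))
y = (rng.random(500) < 1 / (1 + np.exp(-0.8 * x1))).astype(int)

full = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
reduced = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)

# One restricted parameter (x2), so df = 1.
lr_stat = 2 * (full.llf - reduced.llf)
p_value = stats.chi2.sf(lr_stat, df=1)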
  1. Trading algorithm testing
  • White Reality Check test
  • Hansen Superior Predictive Ability test
  1. Update Python syntax to a newer version

Use flynt or pyupgrade
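Typical invocations (paths are placeholders):

flynt path/to/project
pyupgrade --py38-plus path/to/file.py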

  1. Add new file types to Atom

  2. Example of tf.data.Dataset.from_generator

Note: Add .repeat() to loop infinitely.

def _input_fn():
  sent1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=np.int64)
  sent2 = np.array([20, 25, 35, 40, 600, 30, 20, 30], dtype=np.int64)
  sent1 = np.reshape(sent1, (8, 1, 1))
  sent2 = np.reshape(sent2, (8, 1, 1))

  labels = np.array([40, 30, 20, 10, 80, 70, 50, 60], dtype=np.int64)
  labels = np.reshape(labels, (8, 1))

  def generator():
    for s1, s2, l in zip(sent1, sent2, labels):
      yield {"input_1": s1, "input_2": s2}, l

  dataset = tf.data.Dataset.from_generator(generator, output_types=({"input_1": tf.int64, "input_2": tf.int64}, tf.int64))
  dataset = dataset.batch(2).repeat()
  return dataset

...

model.fit(_input_fn(), epochs=10, steps_per_epoch=4)
  1. Download all files in a Jupyter Notebook server

In a cell, run !tar chvfz notebook.tar.gz *

  1. Save content of a web page to a file with the same name
curl http://example.com/folder/big-file.iso -O
  1. Choose ARMA parameters from ACF and PACF

Rule of thumb: the ACF cuts off after lag q for an MA(q) process, the PACF cuts off after lag p for an AR(p) process, and both tail off for an ARMA(p, q) process.
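A quick statsmodels sketch (simulated AR(1) series) for eyeballing the sample ACF and PACF:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima_process import ArmaProcess

# AR(1) with phi = 0.9: expect the ACF to tail off and the PACF to cut off at lag 1.
x = ArmaProcess(ar=[1, -0.9]).generate_sample(nsample=200)

fig, axes = plt.subplots(2, 1)
plot_acf(x, ax=axes[0])
plot_pacf(x, ax=axes[1])
plt.show()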

  1. Use the correct alternative option in Fisher Exact test

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html
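A small scipy sketch (made-up 2x2 table) showing how the alternative argument changes the question being asked, and therefore the p-value:

from scipy.stats import fisher_exact

table = [[8, 2], [1, 5]]

# Two-sided vs. one-sided test of association.
print(fisher_exact(table, alternative="two-sided"))
print(fisher_exact(table, alternative="greater"))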

  1. Scientific Python project template

https://github.com/scientific-python/cookie
