Projects

Python, numpy, networkx, matplotlib

Project description:

The aim of the project is to build python package useful to investigate information diffusion in social networks. DifPy is in early development stage. MIT licence.

Project contains modules with following features:

initiatialize.py

Create random graph ready to simulation performance
Adjust existing graph
Add manually features to graph nodes
Visualize graph with information spread view
Extract graph statistics

simulate.py

perform one simulation step
perform whole simulation
perform simulation sequence with statistics computation

optimize.py

compute centrality for all nodes and return n nodes with best scores
perform simulation for given nodes set size with random search method and return n nodes’ sets with best diffusion capability

feature_importance.py

Compute importance of nodes’ features due to nodes information diffusion capability

Basic information about functions usage you may read in functions docstrings.

Requirements:

DifPy is based mostly on NumPy, NetworkX, Matplotlib libraries. Also contemporary version of Python 3.7 + is needed. It works well in 2019.03 Anaconda environment.

Installation:

DifPy is available directly from the Github repository. To install DifPy, git installed on your local machine is needed.

You may install DifPy on your local machine with the line below:

$ pip install git+git://github.com/John-smith-889/difpy.git

And import with the line below:

import difpy as dp

To do:

Additional optimization metaheuristics functions
Additional methods of computing diffusion speed
Implement paralell computing for better performance
Extend unit tests fot better code coverage
Test big size networks 1 million + nodes
Examples of code usage
Extended documentation

Check more:

Link to Github

2. Numbers’ images classification

Python, numpy, tensorflow, keras, scikit-learn, matplotlib

Project description:

The aim of the project is to predict if number on an image is prime or composite.
The model used in the project is Multilayer Perceptron (MLP), a kind of Artificial Neural Network. The model of the network was created with Tensorflow 2.0 and Keras as high level API, and was tuned with Scikit-learn library. The project is based on MNIST dataset accessible here. Dataset contains pictures of images which represent numbers from 0 to 9, and labels of those numbers. There are 4 files, which divide data on images taining set, labels training set, images test set, and labels test set as follows:

train-images.idx3-ubyte
train-labels.idx1-ubyte
t10k-images.idx3-ubyte
t10k-labels.idx1-ubyte

Modified MNIST dataset has been used to build the model. ‘0’ and ‘1’ numbers’ pictures has been deleted. Labels has been encoded into 0 and 1, where 0 is a prime number (2, 3, 5 or 7) and 1 is a composite number (4, 6, 8, or 9).

Project contains Jupyter Notebook file named “MNIST_binary_cls_tf.ipynb” with:

Data exploration
Data preparation
Data modelling
Model assessment (1)
Hyperparameters tuning
Model assessment (2)

To do:

experiments with model’s hyperparameters tuning
improve model architecture
implement Convolutional Neural Networks

Check more:

Link to Github

3. Housing Prices Modelling in Python

Python, numpy, pandas, scikit-learn, xgboost, matplotlib, seaborn

Project description:

The aim of the project is to analyze data and build model to predict housing prices. Project is based on dataset accessible on Kaggle.com here. Dataset contains information on sold houses in King County (including Seattle) between May 2014 and May 2015. Variables include date of sale, price, number of beedrooms, number of beedrooms, floor area, and more.

Project contains a few Jupyter notebooks files, and each is based on following sections:

Data exploration
Data preparation
Data modelling
Models evaluation
Choosen model optimization

Notebook named “housing_case.ipynb” is a first version of the notebook in which regression problem of prediction each house price is solved. Results of a few algorithms are juxtaposed, where the best performance has XGBoost.

In further notebooks we operate on modified dataset. Variable ‘price_bin’ is constructed with values 0 and 1. Value of 0 means certain house is <= $1 mln worth, value 1 means it is > $1 mln worth. The rest of variables stay the same.

Notebook housing_case_2_reg.ipynb is the improved version of previous one. It has more expanded explanation of taking steps and results analysis, although the same algorithms are used.

In the notebook housing_case_2_cls.ipynb problem of classification is approached. Variable ‘price_bin’ with values 0 and 1 is treated as target variable, and other variables (excluding ‘price’ variable - prices of houses) are treated as explanatory variables. Classificator version of XGBoost algorithm is used to perform binary classification.

In notebook housing_case_2_reg_MLP.ipynb the problem of regression is considered. Solution is prepared with Scikit-learn Multilayer Perceptron (MLP) implementation, a genre of Artificial Neural Network.

To do:

\/ Experiment with Artificial Neural Networks
\/ Modify variables with binning and include date of sale as a explanatory variable
Experiment with more efficient metaheuristics during hyperparameters optimization

Check more:

Link to Github

4. Spatial Data Analysis of Taxi Vehicles’ Movements in Python

Python, SQL, numpy, pandas, geopandas, shapely, matplotlib

Project description:

The aim of the project is to analyze spatial data of taxi fleet movements. Analysis include Big Query standard SQL queries, Python code for matching points with polygons, and data visualisation. Project is based on dataset accessible on BigQuery here. Dataset contains information of taxi vehicles activity in New York in 2014.

Key research questions are as following: in which day of the week there is highest transits rate, what were the trends of payments during the year if we take under consideration cash and a card, what is the most popular customer’s pick up place?

To acquire data a few SQL queries were performed on tlc_green_trips_2014, and taxi_zone_geom BigQuery tables. Queries are saved in “queries.sql” in the project repository. Variables include pick-up datatime, pick-up longitude, pick-up latitude, payment type and more. Polygons data of New York area was downloaded from taxi_zone_geom table.

Project contains also Python script file named “spatial_case_solution.py” with following tasks:

Import and transform polygons data and points data
Match points to polygons with Python’s shapely package
Check data consistency
Compose charts and map chart

To do:

More map charts in the division into particular months
Chart with juxtaposition of all payment profiles
Apply code optimization for Python to enable better efficiency (e.g. numba)

Check more:

Link to Github

5. Data Processing in Python - Repository

Python, numpy, pandas

Project description:

The aim of the project is to collect usefull Python code snippets and design small programs for data processing. Data processing operations we may divide into insertion, deletion, merging, searching, traversal and sorting. Various functions including Python idiomatic expressions were applied to example random data according to repository’s author interpretation of operations classification.

Those operations may be applied on various Python data structures. Python script file data-processing.py contains application of data processing operations on most common data structures as follows:

lists
tuples
dictionaries
ndarrays from Numpy package
DataFrames from Pandas package

To do:

Add less common data structures

Check more:

Link to Github

6. Exploratory Data Analysis of Movies Data in Python

Python, numpy, pandas, matplotlib

Project description:

The aim of the project is to explore data and build model to predict rating of particular movies. Project is based on Movies Dataset accessible on Kaggle.com here. Used file movies_metadata.csv contains information on 45_000 movies released on or before July 2017. Movies belong to the Full MovieLens Dataset. Variables include budget, revenue, release dates, languages, and more. Ratings are on a scale of 1-10.

Project contains Jupyter Notebook file named “movies_case_eda.ipynb” with:

Data collection
General overview
Data exploration of choosen variables

To do:

Explore relations between variables
Create movies_case_feat_eng.ipynb where features will be prepared for data modelling
Create movies_case_model.ipynb for data modelling
Deploy model in Flask web app

Check more:

Link to Github

7. Credit Scoring in R

R, MXNetR, dplyr, scorecard, corrplot, fastDummies

Project description:

The aim of the project is to predict defaults of bank customers. Project contains two files: “feature-engineering.R” and “credit-scoring.R”.

In feature engineering part there are used techniques like:

IV (Information Value) calculating
dummy variables
merging correlated variables with PCA

In credit scoring part following activities were done:

setting up Multi Layer Perceptron
creating custom callback function for training process monitoring
model evaluation with gini coefficient

Check more:

Link to Github

8. Migrations in Europe Visualization in Shiny App

R, Shiny, googleCharts, magrittr, devtools

Project description:

The aim of the project is to build simple web application to visualize migrations across Europe over the years. Application frontend consists an interactive chart showing changes in migrants and refugees levels in European countries. Small button below the chart may be used to animate changes over the years. In the chart it is possible to discover:

particular country names in the map represented as bubbles
population of particular country at the choosen moment in time represented as a bubble size
migrants and refugees level at the choosen moment in time represented as a position in 2-dimensional space
color division on northern, southern, eastern and western countries

Check more:

Link to deployed app

Link to Github

Last but not least…

9. Jekyll Blog About Data Science Deployed on GitHub Pages

Ruby, html, css, javascript

Project description:

The aim of the project is to build blog for posting technical issues associated with data science and software engineering. Posts presented on blog include topics associated with Python, social network analysis, devops, space industry and more.

Ruby framework Jekyll with its Lanyon theme were used to create this blog. A few features have been added to personalize blog, including:

Logo
Social media icons
Code font modificiation
Code highlighting modification
Favicon
Long code snippets rolling
Footer
Post layout modification
Archive
Categories
Disqus integration
Google Analytics integration

To do:

Upgrade categories and tags system

Check more:

Link to the blog

Link to Github

To be continued...

Everyday normal hacker data science and coding

Projects

1. Difpy - Python Package for Information Diffusion Investigation in Social Networks

Project description:

Requirements:

Installation:

To do:

Check more:

2. Numbers’ images classification

Project description:

To do:

Check more:

3. Housing Prices Modelling in Python

Project description:

To do:

Check more:

4. Spatial Data Analysis of Taxi Vehicles’ Movements in Python

Project description:

To do:

Check more:

5. Data Processing in Python - Repository

Project description:

To do:

Check more:

6. Exploratory Data Analysis of Movies Data in Python

Project description:

To do:

Check more:

7. Credit Scoring in R

Project description:

Check more:

8. Migrations in Europe Visualization in Shiny App

Project description:

Check more:

9. Jekyll Blog About Data Science Deployed on GitHub Pages

Project description:

To do:

Check more: