Python, numpy, networkx, matplotlib
The aim of the project is to build a Python package for investigating information diffusion in social networks. DifPy is in an early development stage. MIT license.
The project contains modules with the following features:
initiatialize.py
simulate.py
optimize.py
feature_importance.py
Basic information about function usage can be found in the functions' docstrings.
DifPy is based mostly on the NumPy, NetworkX, and Matplotlib libraries. A recent version of Python (3.7+) is also needed. It works well in the Anaconda 2019.03 environment.
DifPy is available directly from the GitHub repository. To install DifPy, Git must be installed on your local machine.
You may install DifPy on your local machine with the line below:
$ pip install git+git://github.com/John-smith-889/difpy.git
And import it with the line below:
import difpy as dp
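DifPy's own API is best learned from the docstrings mentioned above; as a rough illustration of the kind of simulation such a package performs, here is a minimal independent-cascade diffusion sketch built directly on NetworkX (the function name and parameters are illustrative, not DifPy's actual interface):

```python
import random
import networkx as nx

def independent_cascade(graph, seeds, p=0.2, rng=None):
    """Simple independent-cascade diffusion: each newly activated node
    gets one chance to activate each inactive neighbor with probability p.
    Returns the set of all activated nodes."""
    rng = rng or random.Random(0)
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        new_frontier = []
        for node in frontier:
            for neighbor in graph.neighbors(node):
                if neighbor not in active and rng.random() < p:
                    active.add(neighbor)
                    new_frontier.append(neighbor)
        frontier = new_frontier
    return active

G = nx.karate_club_graph()
activated = independent_cascade(G, seeds=[0], p=0.3)
print(len(activated))
```

The seeded random generator makes a single run reproducible, which is useful when comparing seed-selection strategies.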
Python, numpy, tensorflow, keras, scikit-learn, matplotlib
The aim of the project is to predict whether the digit shown on an image is prime or composite.
The model used in the project is Multilayer Perceptron (MLP), a kind of Artificial Neural Network.
The model was created with TensorFlow 2.0 using Keras as a high-level API, and was tuned with the Scikit-learn library.
The project is based on the MNIST dataset, accessible here.
The dataset contains images of handwritten digits from 0 to 9, together with labels for those digits.
There are 4 files, which split the data into
an image training set, a label training set, an image test set, and a label test set as follows:
A modified MNIST dataset has been used to build the model. Pictures of the digits ‘0’ and ‘1’ have been deleted. Labels have been encoded as 0 and 1, where 0 denotes a prime digit (2, 3, 5, or 7) and 1 a composite digit (4, 6, 8, or 9).
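The filtering and relabeling step can be sketched with NumPy as follows (function and variable names are assumptions for illustration, not taken from the notebook):

```python
import numpy as np

PRIMES = {2, 3, 5, 7}

def relabel_for_primality(images, labels):
    """Drop samples labeled 0 or 1, then encode the remaining digit
    labels as 0 (prime: 2, 3, 5, 7) or 1 (composite: 4, 6, 8, 9)."""
    mask = labels >= 2
    images, labels = images[mask], labels[mask]
    binary = np.where(np.isin(labels, list(PRIMES)), 0, 1)
    return images, binary

# toy check with fake labels standing in for MNIST labels
labels = np.array([0, 1, 2, 3, 4, 7, 9])
images = np.arange(7).reshape(7, 1)  # stand-in for image data
X, y = relabel_for_primality(images, labels)
print(y.tolist())  # digits 2, 3, 4, 7, 9 -> [0, 0, 1, 0, 1]
```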
The project contains a Jupyter Notebook file named “MNIST_binary_cls_tf.ipynb” with:
Python, numpy, pandas, scikit-learn, xgboost, matplotlib, seaborn
The aim of the project is to analyze the data and build a model to predict housing prices. The project is based on a dataset accessible on Kaggle.com here. The dataset contains information on houses sold in King County (including Seattle) between May 2014 and May 2015. Variables include the date of sale, price, number of bedrooms, floor area, and more.
The project contains a few Jupyter notebook files, each organized into the following sections:
The notebook named “housing_case.ipynb” is the first version of the notebook, in which the regression problem of predicting each house’s price is solved. The results of a few algorithms are compared, with XGBoost achieving the best performance.
In the further notebooks we operate on a modified dataset. A variable ‘price_bin’ is constructed with values 0 and 1: a value of 0 means a given house is worth <= $1 mln, and a value of 1 means it is worth > $1 mln. The rest of the variables stay the same.
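The binary target can be derived in one line with pandas (the column name ‘price’ is taken from the description; the toy values are invented):

```python
import pandas as pd

# toy frame standing in for the King County data
df = pd.DataFrame({"price": [450_000, 1_250_000, 999_999, 2_000_000]})

# 0: price <= $1 mln, 1: price > $1 mln
df["price_bin"] = (df["price"] > 1_000_000).astype(int)
print(df["price_bin"].tolist())  # [0, 1, 0, 1]
```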
The notebook housing_case_2_reg.ipynb is an improved version of the previous one. It has a more detailed explanation of the steps taken and of the results analysis, although the same algorithms are used.
In the notebook housing_case_2_cls.ipynb a classification problem is approached. The variable ‘price_bin’ with values 0 and 1 is treated as the target variable, and the other variables (excluding the ‘price’ variable, i.e. house prices) are treated as explanatory variables. The classifier version of the XGBoost algorithm is used to perform binary classification.
In the notebook housing_case_2_reg_MLP.ipynb the regression problem is considered. The solution is prepared with Scikit-learn’s Multilayer Perceptron (MLP) implementation, a kind of Artificial Neural Network.
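A minimal version of this classification setup can be sketched with scikit-learn, here using GradientBoostingClassifier as a stand-in for XGBoost’s classifier, on synthetic data (feature names and values are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
# synthetic stand-ins for two explanatory variables
X = np.column_stack([
    rng.integers(1, 6, n),       # e.g. number of bedrooms
    rng.uniform(40, 400, n),     # e.g. floor area
])
# synthetic binary target loosely tied to floor area
y = (X[:, 1] + rng.normal(0, 30, n) > 250).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(round(accuracy, 2))
```

On the real dataset, the same fit/score pattern applies with ‘price_bin’ as y and the remaining columns (minus ‘price’) as X.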
Python, SQL, numpy, pandas, geopandas, shapely, matplotlib
The aim of the project is to analyze spatial data on taxi fleet movements. The analysis includes BigQuery standard SQL queries, Python code for matching points with polygons, and data visualization. The project is based on a dataset accessible on BigQuery here. The dataset contains information on taxi vehicle activity in New York in 2014.
The key research questions are as follows: on which day of the week is the number of transits highest; what were the payment trends during the year, comparing cash and card; and what is the most popular customer pick-up place?
To acquire the data, a few SQL queries were run on the tlc_green_trips_2014 and taxi_zone_geom BigQuery tables. The queries are saved in “queries.sql” in the project repository. Variables include pick-up datetime, pick-up longitude, pick-up latitude, payment type, and more. Polygon data for the New York area was downloaded from the taxi_zone_geom table.
The project also contains a Python script file named “spatial_case_solution.py” with the following tasks:
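The point-to-polygon matching step can be sketched with Shapely (the polygon coordinates below are invented for illustration, not the actual taxi-zone geometry):

```python
from shapely.geometry import Point, Polygon

# toy rectangular zone in (longitude, latitude) coordinates,
# roughly placed over part of Manhattan for illustration only
zone = Polygon([(-74.02, 40.70), (-73.97, 40.70),
                (-73.97, 40.78), (-74.02, 40.78)])

# two pick-up points: one inside the zone, one outside
pickups = [Point(-74.00, 40.74), Point(-73.90, 40.74)]
matches = [zone.contains(p) for p in pickups]
print(matches)  # [True, False]
```

On the real data, the same `contains` check is applied per taxi zone, which is what makes it possible to count pick-ups per zone.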
Python, numpy, pandas
The aim of the project is to collect useful Python code snippets and design small programs for data processing. Data processing operations can be divided into insertion, deletion, merging, searching, traversal, and sorting. Various functions, including idiomatic Python expressions, were applied to example random data according to the repository author’s interpretation of this classification of operations.
These operations may be applied to various Python data structures. The Python script file data-processing.py contains applications of data processing operations to the most common data structures, as follows:
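A few of these operations, applied to a plain list, can be sketched as (the example values are invented):

```python
data = [5, 3, 9, 1]

# insertion and deletion
data.insert(2, 7)         # -> [5, 3, 7, 9, 1]
data.remove(9)            # -> [5, 3, 7, 1]

# merging two lists
merged = data + [4, 2]    # -> [5, 3, 7, 1, 4, 2]

# searching with an idiomatic membership test
found = 7 in merged       # -> True

# sorting (non-destructive)
ordered = sorted(merged)  # -> [1, 2, 3, 4, 5, 7]

# traversal with a list comprehension
doubled = [x * 2 for x in ordered]
print(ordered, doubled)
```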
Python, numpy, pandas, matplotlib
The aim of the project is to explore the data and build a model to predict the ratings of particular movies. The project is based on the Movies Dataset accessible on Kaggle.com here. The file used, movies_metadata.csv, contains information on 45,000 movies released on or before July 2017. The movies belong to the Full MovieLens Dataset. Variables include budget, revenue, release dates, languages, and more. Ratings are on a scale of 1-10.
The project contains a Jupyter Notebook file named “movies_case_eda.ipynb” with:
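Typical first EDA steps on such a file can be sketched with pandas (the column names and toy values below are assumptions based on the description, not the real movies_metadata.csv contents):

```python
import pandas as pd

# toy frame mimicking a slice of movies_metadata.csv
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [10_000_000, 0, 50_000_000, 5_000_000],
    "vote_average": [7.2, 5.1, 8.0, 6.3],
})

# basic EDA: shape, summary statistics, simple filtering
print(df.shape)
print(df["vote_average"].describe())
high_rated = df[df["vote_average"] >= 7.0]
print(high_rated["title"].tolist())  # ['A', 'C']
```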
R, MXNetR, dplyr, scorecard, corrplot, fastDummies
The aim of the project is to predict defaults of bank customers. The project contains two files: “feature-engineering.R” and “credit-scoring.R”.
The feature engineering part uses techniques such as:
In the credit scoring part the following activities were carried out:
R, Shiny, googleCharts, magrittr, devtools
The aim of the project is to build a simple web application to visualize migrations across Europe over the years. The application frontend consists of an interactive chart showing changes in migrant and refugee levels in European countries. A small button below the chart may be used to animate the changes over the years. In the chart it is possible to discover:
Last but not least…
Ruby, HTML, CSS, JavaScript
The aim of the project is to build a blog for posting on technical issues associated with data science and software engineering. Posts presented on the blog cover topics associated with Python, social network analysis, DevOps, the space industry, and more.
The Ruby framework Jekyll with its Lanyon theme was used to create this blog. A few features have been added to personalize the blog, including: