Create a small application for data science using Streamlit and Light Gradient Boosting Machine (LightGBM) in PyCharm

Arman Davtyan
Jan 17, 2021

Using Streamlit and LightGBM, we create a small app to predict a time series. The dataset we work on contains the historical values of the XRP coin. The chosen dataset is for demonstration purposes only, and in this post we do not aim to accurately predict the Volume or any other parameter of the XRP coin. You will see how to build a Machine Learning model, visualize the data and the preprocessing steps, and check the results in the Streamlit app.

In this post, we will learn how a basic Machine Learning model can be used to predict the price or volume of the XRP coin, and how to visualize the results in a Streamlit application. It is important to understand that our main goal is to learn Streamlit as a tool for building a small, basic application; predicting XRP values is not the objective.

In order to start exploring our application you need to:

  1. Clone the GitHub repo https://github.com/arm-star/Streamlit_example
  2. Create a venv and install the dependencies
  3. Open a command prompt (tested on Windows 10 and Ubuntu 18.1)
  4. Navigate to the folder where call_streamlit.py is located
  5. Execute the command below in the command prompt to activate the virtual environment:

.\venv\Scripts\activate

  6. Execute the command below to start Streamlit:

streamlit run call_streamlit.py

  7. Open the link below in a browser (Google Chrome was used in this example):

http://localhost:8501

Now you are all set and should see the following window appear in the browser.

What you must see before you have uploaded any data

It is time to upload a dataset. I provide one dataset in the GitHub repo for tests; a fresh dataset can be downloaded here: https://finance.yahoo.com/quote/XRP-USD/history?p=XRP-USD. Click on Browse files, or drag and drop the data file. You will see the CSV file, a number of radio buttons, and additional options pop up in the browser.

Let us now go step by step through the options you have to run the application. Our app acknowledges the upload of the data file and prints out the shape of the initial dataset. In PyCharm we have used pandas.read_csv to read the CSV file and st.write(df) to visualize it in the browser. Further down we see the column names of the dataframe and the corresponding line plot once the target and date columns have been selected. By default, Date is selected as the target column, so it is important to select a different column at this point; in the example below, Volume is selected as the target column. Click on the dataframe column names to sort the table. In PyCharm you can use st.radio("Your text", df.columns) to generate a multiple-choice radio button.

Data frame

In the case of a line plot, move the cursor over the data points to see the date and the value. In this interactive plot you can zoom in or out, pan, autoscale, toggle spike lines, make the plot fullscreen, and download the plot as a .png file.

Linear plot of XRP Volume as a function of date

The radio buttons and the line plot can be produced by the following part of the script in call_streamlit.py:

Now let us define the start date and end date to select a time window for the data we would like to use to train our algorithm. As can be seen in the line plot, the Volume has very low values up to the beginning of 2017 (evidently something happened with the XRP price/volume at the beginning of 2017). We therefore decide to get rid of that part of the data and somewhat arbitrarily set the start of the data to 2018-6-1. This changes the shape of the initial data to (682, 8). At this point we also check whether any missing values are present in the data and remove them; we see that there are no missing values in our data.
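The filtering and missing-value check above can be sketched in plain pandas. The column names follow the Yahoo Finance CSV layout, and the cutoff is the start date chosen in the post; the tiny dataframe is only a stand-in.

```python
import pandas as pd

# Stand-in for the Yahoo Finance export (Date plus price/volume columns)
df = pd.DataFrame({
    "Date": ["2016-01-01", "2018-05-31", "2018-06-01", "2019-01-01"],
    "Volume": [5.0, 80.0, 120.0, None],
})
df["Date"] = pd.to_datetime(df["Date"])

# Keep only rows from the chosen start date onward
start = pd.Timestamp("2018-06-01")
df = df[df["Date"] >= start]

# Count missing values per column, then drop the affected rows
print(df.isna().sum())
df = df.dropna().reset_index(drop=True)
```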

Let us prepare the data by splitting it into train and test sets. You can set the test size to 20% or more. Similarly, set how many days into the future you would like to predict the Volume of XRP. Having set these, click on Start LightGBM and predict Test set to start the prediction.
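For a time series, the split should be chronological (no shuffling) and the target shifted by the forecast horizon. A minimal sketch with a configurable test fraction; the function name and exact shifting scheme are my assumptions, not necessarily what model_engine.py does.

```python
import pandas as pd

def time_split(df, target="Volume", horizon=1, test_frac=0.2):
    """Shift the target `horizon` days ahead, then split chronologically."""
    out = df.copy()
    out["y"] = out[target].shift(-horizon)   # value to predict N days ahead
    out = out.dropna(subset=["y"])           # last `horizon` rows have no label
    cut = int(len(out) * (1 - test_frac))    # final test_frac forms the test set
    return out.iloc[:cut], out.iloc[cut:]

df = pd.DataFrame({"Volume": range(100)})
train, test = time_split(df, horizon=1, test_frac=0.2)
```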

Data preparation

Once training and prediction are finished, you will see which model performed best based on parameter tuning with GridSearchCV from sklearn.model_selection. The grid search is performed in LightGBM_model in model_engine.py. We tested different combinations for the model using different values of learning_rate, n_estimators, and max_depth. You can read more on this topic elsewhere; here we used a basic grid search with cv=5.

The output of model and metrics

It is always good to check the feature importance to understand which entries in our dataset have the most impact on predicting new results. In this case, we see that Volume has the highest impact. To avoid data leakage, we used the Volume information from the past N days. You can visualize the feature importance as a table or a histogram, as shown below.

Feature importance

For training, you used all the data available up to Apr 12, 2020 ("yesterday", which in reality is the day before yesterday). To check how good the prediction for "today" (in reality yesterday) is, you can compare the true value and the prediction shown below for Apr 13, 2020. For "tomorrow" (in reality today, as if you were running the script on Apr 14), Apr 14, 2020, you get a new prediction. Logically, the true value of the Volume for Apr 14, 2020 is missing, since the transactions on that day are not yet complete.

Prediction of Volume for today and tomorrow

Finally, you can use the model to predict the Volume for the next N days. In our example, we defined 14 days.
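Forecasting N future points from a one-step-ahead model is typically done recursively: predict one step, append the prediction to the inputs, and repeat. A minimal sketch with a single lag feature and a linear model standing in for LightGBM; whether the app uses this recursive scheme is my assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One-step-ahead model trained on a lag-1 feature
series = np.arange(50, dtype=float)          # stand-in for the Volume series
X, y = series[:-1].reshape(-1, 1), series[1:]
model = LinearRegression().fit(X, y)

# Recursively roll the prediction forward N days (N = 14 as in the post)
n_days, history = 14, list(series)
for _ in range(n_days):
    nxt = model.predict([[history[-1]]])[0]  # predict the next day
    history.append(nxt)                      # feed it back in as input

future = history[-n_days:]
```

Note that errors compound with this scheme: each step feeds a prediction, not a true value, back into the model, so the forecast gets less reliable as N grows.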

Prediction for next N points/days

This is the end of the post. Thank you for reading, and I will be happy to receive comments and recommendations about how interesting it was and about possible improvements.

link to GitHub: https://github.com/arm-star/Streamlit_example
