Data Science and Game

I recently completed a test task for an organization. I was given a dataset, and the main task was to predict profitability from it. The values were widely scattered and unbalanced. No information was provided about what each value means or how the values relate to one another. I decided right away to run the analysis with popular boosting methods.

The first step was to import the necessary libraries.

Now you can look at the data itself.

Now you can look at the data as a whole.

Now let’s plot the distributions of the variables.

I dropped the spurious column and stored the names of the numeric columns in a separate variable. Many columns contain a large number of null values, and the text columns in particular contain many NaN values. The values are widely spread. The dataframe has 25,000 rows and 39 columns. The analysis was carried out in Colab.
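The inspection step can be sketched roughly as follows. This is a minimal illustration on a tiny synthetic dataframe, not the original notebook code: the column names (including the leftover index column `Unnamed: 0`) are hypothetical stand-ins, since the real dataset's columns are not described.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (all column names are hypothetical).
df = pd.DataFrame({
    "Unnamed: 0": range(6),  # assumed leftover index column
    "revenue": [10.0, np.nan, 35.5, 1200.0, np.nan, 7.2],
    "visits": [3, 8, 1, 40, 2, 5],
    "country": ["DE", None, None, "US", None, "FR"],
})

# Drop the column that carries no information.
df = df.drop(columns=["Unnamed: 0"])

# Keep the names of the numeric columns in a separate variable.
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Share of missing values per column, largest first.
null_share = df.isna().mean().sort_values(ascending=False)

print(df.shape)
print(null_share)
print(df[numeric_cols].describe())
```

On the real data, `null_share` is what shows which columns are mostly empty, and `describe()` makes the spread of the numeric values visible.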

Prepare int and float

Prepare data

Prepare Boolean data

Prepare object columns

I deleted the columns with a large number of NaN values from the dataset and created a separate dataset to hold the prepared data, along with a separate variable for the text columns. The Pearson correlation showed that some variables are strongly related to each other. I normalized the numeric data with MinMaxScaler(), converted the date values to UNIX time, and changed the Boolean values to 0 and 1.
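A minimal sketch of these preparation steps, under stated assumptions: the data is synthetic, the column names are hypothetical, and the 50% NaN threshold (`MAX_NAN_SHARE`) is an illustrative choice, not the value used in the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (column names are hypothetical).
raw = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0],
    "score": [0.5, np.nan, 1.5, 2.5],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
    "is_active": [True, False, True, True],
    "created": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]
    ),
    "segment": ["a", "b", "a", None],
})

# Drop columns whose share of NaN values exceeds the assumed threshold.
MAX_NAN_SHARE = 0.5
keep = raw.columns[raw.isna().mean() <= MAX_NAN_SHARE]
prepared = raw[keep].copy()

# Remember numeric and text columns separately.
num_cols = prepared.select_dtypes(include="number").columns.tolist()
text_cols = prepared.select_dtypes(
    exclude=["number", "bool", "datetime"]
).columns.tolist()

# Pearson correlation between the numeric columns.
corr = prepared[num_cols].corr(method="pearson")

# Scale numeric columns to [0, 1]; MinMaxScaler leaves NaN values as NaN.
prepared[num_cols] = MinMaxScaler().fit_transform(prepared[num_cols])

# Dates to UNIX time (whole seconds since 1970-01-01).
prepared["created"] = (
    prepared["created"] - pd.Timestamp("1970-01-01")
) // pd.Timedelta(seconds=1)

# Boolean values to 0 and 1.
prepared["is_active"] = prepared["is_active"].astype(int)

print(prepared)
print(corr)
```

In a real train/test setup the scaler would be fitted on the training split only and then applied to the test split, so that the test data does not leak into the scaling.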

Undersampling

Oversampling

Because the target value is so widely spread, I decided to apply and compare two methods: undersampling and oversampling.
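The two resampling ideas can be sketched with plain pandas. This is an illustration under assumptions, not the original code: the data is synthetic, the target is assumed to be binned into three hypothetical classes (`low`, `mid`, `high`) so that there is something to balance, and `resample_to` is a helper written for this example.

```python
import numpy as np
import pandas as pd

# Synthetic, skewed target (column names and bin edges are hypothetical).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "profit": rng.exponential(scale=100.0, size=1000),
})
df["profit_bin"] = pd.cut(
    df["profit"],
    bins=[0, 50, 200, np.inf],
    labels=["low", "mid", "high"],
    include_lowest=True,
)


def resample_to(frame, label, n_per_class, replace, seed=0):
    """Draw n_per_class rows from every class, then shuffle the result."""
    parts = [
        group.sample(n=n_per_class, replace=replace, random_state=seed)
        for _, group in frame.groupby(label, observed=True)
    ]
    combined = pd.concat(parts)
    return combined.sample(frac=1.0, random_state=seed).reset_index(drop=True)


counts = df["profit_bin"].value_counts()

# Undersampling: shrink every class to the size of the smallest one.
under = resample_to(df, "profit_bin", int(counts.min()), replace=False)

# Oversampling: grow every class to the size of the largest one,
# repeating rows of the smaller classes.
over = resample_to(df, "profit_bin", int(counts.max()), replace=True)

print(counts)
print(under["profit_bin"].value_counts())
print(over["profit_bin"].value_counts())
```

Resampling of this kind is normally applied to the training split only, otherwise repeated rows can end up in both the training and the test data and make the errors look better than they are.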

The experiments were carried out with CatBoost and a second boosting algorithm; the second one additionally requires the text data to be prepared.
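To show what preparing text data for a boosting model can look like, here is a minimal sketch. It uses scikit-learn's `GradientBoostingRegressor` purely as a stand-in for a booster that does not accept string columns; it is not the library used in the original analysis. The data, the column names, and the choice of mean absolute error as the metric are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Synthetic data with one numeric and one text column (hypothetical names).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "segment": rng.choice(["a", "b", "c"], size=n),
})
y = (
    3.0 * X["x1"]
    + X["segment"].map({"a": 0.0, "b": 5.0, "c": -5.0})
    + rng.normal(scale=0.1, size=n)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
X_train, X_test = X_train.copy(), X_test.copy()

# Text preparation: map each category to a number. Categories unseen
# during fitting are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train["segment"] = encoder.fit_transform(X_train[["segment"]]).ravel()
X_test["segment"] = encoder.transform(X_test[["segment"]]).ravel()

model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)

mae = mean_absolute_error(y_test, model.predict(X_test))

# Baseline for comparison: always predict the training mean.
baseline_mae = mean_absolute_error(
    y_test, np.full(len(y_test), y_train.mean())
)

print(f"model MAE: {mae:.3f}, baseline MAE: {baseline_mae:.3f}")
```

Comparing against a trivial baseline like this is a quick way to judge whether a reported error is actually large for the given target.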

Usual Data

Undersampling

Oversampling

Usual Data

Undersampling

Oversampling

Both algorithms produced significant errors, and the limitations of the free version of Colab did not allow me to optimize them. It is important to note that neither undersampling nor oversampling brought tangible improvements; both worsened the results.

Conclusions:

If you need the code in Colab, you can visit my blog and take it directly from the article:
