I recently completed a test task for an organization. I was given a dataset and asked to predict profitability from it. The values were widely scattered and unbalanced. The dataset also came with no description of what each column means or how the columns relate to one another. I decided right away to run the analysis with popular boosting methods.
The first step was to download the necessary libraries.
Now you can look at the data itself.
Now you can look at the data as a whole.
Next, let's plot the distributions of the variables.
I dropped the extraneous column and stored the names of the numeric columns in a separate variable. Many columns contain a large number of null values, and the text columns in particular are full of NaN values. The values are widely spread. The dataframe is 25,000 rows by 39 columns. The analysis itself was carried out in Colab.
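A minimal sketch of this inspection step, assuming the data is in a pandas dataframe. The toy table and every column name in it (including the leftover index column `Unnamed: 0`) are hypothetical stand-ins, not the columns of the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real 25,000 x 39 table; all column names are hypothetical.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "revenue": [10.0, np.nan, 30.0, 40.0],
    "units": [1, 2, 3, 4],
    "region": ["north", None, None, "south"],
})

# Drop the extraneous column (assumed here to be a leftover index).
df = df.drop(columns=["Unnamed: 0"])

# Keep the names of the numeric columns in a separate variable.
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Share of missing values per column, largest first.
null_share = df.isna().mean().sort_values(ascending=False)

print(df.shape)
print(numeric_cols)
print(null_share)
```

Looking at the share of missing values per column, rather than raw counts, makes it easier to decide later which columns are too empty to keep.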
Prepare int and float
Prepare date
Prepare Boolean data
Prepare object columns
I deleted the columns with a large number of NaN values from the dataset and created a separate dataset to hold the prepared data. I also created a separate variable for the text columns. After computing the Pearson correlation matrix, I saw that some variables are strongly correlated with each other. I normalized the numeric data with MinMaxScaler(), converted the date value to a Unix timestamp, and changed the Boolean values to 0 and 1.
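The preparation steps described above could look roughly like the sketch below. The column names, the toy values, and the 50% missing-value threshold are assumptions made for illustration; the article does not state which threshold was used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical columns standing in for the real ones.
raw = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "created": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"],
    "is_active": [True, False, True, True],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})

# Drop columns where more than half of the values are missing
# (the 0.5 threshold is an assumption, not taken from the article).
max_nan_share = 0.5
prepared = raw.loc[:, raw.isna().mean() <= max_nan_share].copy()

# Scale the numeric columns to the [0, 1] range.
numeric_cols = ["price"]
prepared[numeric_cols] = MinMaxScaler().fit_transform(prepared[numeric_cols])

# Convert the date to whole seconds since the Unix epoch.
dates = pd.to_datetime(prepared["created"])
prepared["created"] = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# Map Boolean values to 0 and 1.
prepared["is_active"] = prepared["is_active"].astype(int)

# Pearson correlation between the prepared columns.
corr = prepared.corr(method="pearson")
print(corr)
```

Pairs of columns with a correlation close to 1 or -1 carry nearly the same information, which is what the correlation matrix is meant to reveal.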
Undersampling
Oversampling
Due to the large spread of the target value, I decided to apply and compare two methods: Undersampling and Oversampling.
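The article does not say how the resampling was implemented. One possible way to apply it to a continuous target is to bin the target first and then resample each bin, as in the sketch below. The `profit` column, the bin edges, and the helper `resample_bins` are all hypothetical.

```python
import pandas as pd

# Toy data: a hypothetical "profit" target with many small values and few large ones.
data = pd.DataFrame({
    "feature": range(12),
    "profit": [1, 2, 2, 3, 3, 4, 4, 5, 50, 60, 500, 900],
})

# Bin the continuous target so that each bin can be treated like a class.
# The bin edges are an assumption chosen for this toy example.
data["bin"] = pd.cut(data["profit"], bins=[0, 10, 100, 1000], labels=["low", "mid", "high"])
counts = data["bin"].value_counts()

def resample_bins(frame, size, replace, seed=0):
    """Draw `size` rows from every target bin (hypothetical helper)."""
    parts = [
        group.sample(n=size, replace=replace, random_state=seed)
        for _, group in frame.groupby("bin", observed=True)
    ]
    return pd.concat(parts, ignore_index=True)

# Undersampling: shrink every bin to the size of the smallest one.
under = resample_bins(data, size=int(counts.min()), replace=False)

# Oversampling: bring every bin up to the size of the largest one,
# sampling with replacement so that small bins can be repeated.
over = resample_bins(data, size=int(counts.max()), replace=True)

print(under["bin"].value_counts())
print(over["bin"].value_counts())
```

Undersampling discards rows from the dense part of the target range, while oversampling repeats rows from the sparse part, so neither adds new information about the rare values.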
The experiments were run with CatBoost and XGBoost; the second method additionally requires the text data to be prepared beforehand.
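For an algorithm that expects purely numeric input, the text columns have to be encoded first. The sketch below shows one common option, integer category codes with a separate label for missing values. The column names and the helper `encode_text_columns` are hypothetical, and this is not necessarily the encoding used in the original notebook.

```python
import pandas as pd

# Hypothetical text columns with missing values.
text_df = pd.DataFrame({
    "city": ["Oslo", None, "Rome", "Oslo"],
    "channel": ["web", "shop", None, "web"],
})

def encode_text_columns(frame):
    """Replace each text column with integer codes; missing values get their own label."""
    encoded = pd.DataFrame(index=frame.index)
    for col in frame.columns:
        filled = frame[col].fillna("missing")
        # Categories are sorted, and each value is replaced by its category index.
        encoded[col] = filled.astype("category").cat.codes
    return encoded

encoded = encode_text_columns(text_df)
print(encoded)
```

Integer codes impose an arbitrary order on the categories. Tree-based boosting models usually tolerate this, but one-hot encoding is an alternative when the number of distinct values is small.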
Usual Data
Undersampling
Oversampling
Usual Data
Undersampling
Oversampling
Both algorithms produced significant errors, and the limits of the free version of Colab left no room to optimize them. It is important to note that neither undersampling nor oversampling brought tangible improvement; both made the results worse.
Conclusions:
If you need the code in Colab, you can visit my blog and take it directly from the article: