In my previous post I mentioned that I took a machine learning class based on Python and a library called PyTorch. Since the class ended, I have been working on applying the PyTorch library and machine learning ideas to my work with Oracle databases. I do not have a fully baked script to share today, but I wanted to show some of the things I am exploring about the relationship between the date and time and database performance metrics such as host CPU utilization percentage. I have an example that you can download here: datetimeml2.zip
There is a relationship between the day of the week and the hour of the day and database performance metrics on many Oracle database systems. This is a graph from my hostcpu.py script that shows CPU utilization on a production Oracle database by date and hour of the day:
During the weekdays the CPU peaks at a certain hour, and on the weekends there are some valleys. So, I thought I would use PyTorch to model this relationship. Here is what the graph of actual host CPU used versus the CPU predicted by PyTorch looks like:
It’s the same source database but an earlier date. The prediction is close. I guess the big question after I got to this point was, so what? I’m not sure exactly what to do with it now that I have a model of the database CPU. I guess if the CPU is at 100% for an entire day instead of going up and down, I should throw an alert? What if CPU stays near 0% for a day during the week? It must be helpful to have a prediction of the host CPU but exactly how to alert on deviations from the predicted value is still a work in progress.
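One rough idea, and this is just a sketch that is not in the zip, is to compare a day's actual hourly CPU numbers against the model's predictions and alert when the average difference gets too big. Something like this, where the check_day name and the 20 point threshold are just placeholders:
# Sketch of one possible alerting check - not in the zip.
# actual_cpu and predicted_cpu are lists of hourly CPU percentages
# for the same day; the 20 point threshold is just a placeholder.
def check_day(actual_cpu, predicted_cpu, threshold=20.0):
    diffs = [abs(a - p) for a, p in zip(actual_cpu, predicted_cpu)]
    avg_diff = sum(diffs) / len(diffs)
    if avg_diff > threshold:
        print(f"ALERT: actual CPU is off from predicted by an average of {avg_diff:.1f} points")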
I thought it would be helpful to talk about the date and time inputs to this model. If you look at datetimeoscpu.sql in the zip it has this SQL for getting the date and time values:
select
    to_char(sn.END_INTERVAL_TIME,'D') day_of_week,
    to_char(sn.END_INTERVAL_TIME,'DD') day_of_month,
    to_char(sn.END_INTERVAL_TIME,'DDD') day_of_year,
    to_char(sn.END_INTERVAL_TIME,'HH24') hour_of_day,
    to_char(sn.END_INTERVAL_TIME,'MM') month,
    to_char(sn.END_INTERVAL_TIME,'Q') quarter,
    to_char(sn.END_INTERVAL_TIME,'WW') week_of_year,
    to_char(sn.END_INTERVAL_TIME,'W') week_of_month,
    to_char(sn.END_INTERVAL_TIME,'YYYY') year,
...
I ended up ignoring year because it is not a cyclical value. The rest have a range, like 1 to 7 for day of week or 1 to 31 for day of month. Having all eight of these is probably overkill. I could just focus on day of week and hour of day and forget the other six. We keep 6 weeks of AWR history, so I'm not sure why I care about things like month, quarter, or day of year when I don't have multiple years of history in which to find those patterns.
My input data looks like this:
4 17 199 03 07 3 29 3 2024 15.4210261
4 17 199 04 07 3 29 3 2024 15.5799532
4 17 199 05 07 3 29 3 2024 26.2080227
4 17 199 06 07 3 29 3 2024 24.1532019
4 17 199 07 07 3 29 3 2024 23.2947767
4 17 199 08 07 3 29 3 2024 25.5198717
4 17 199 09 07 3 29 3 2024 26.029638
4 17 199 10 07 3 29 3 2024 27.1160204
4 17 199 11 07 3 29 3 2024 27.7076496
Each line represents an AWR snapshot. The first nine columns are the date and time values from the query, and once the year is dropped the remaining eight cyclical values are the inputs, or X. The last value is the host CPU utilization percentage, or Y. The point of the program is to create a model from this data that takes in the eight values and puts out a predicted CPU percentage. Near the end of the train.py script, this line was used to make the predictions for the graph:
predictions = model(new_X)
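The full train.py is in the zip, so I will not walk through all of it here. But to give a rough idea of the kind of thing it does, here is a minimal sketch of a small PyTorch regression model and training loop. The layer sizes, variable names, and settings below are just illustrative and the real script is different:
import torch
import torch.nn as nn

# Sketch only - X is assumed to already hold the sine and cosine
# transformed date and time inputs described below (16 columns) and
# Y the host CPU percentages, both as NumPy arrays.
X_t = torch.tensor(X, dtype=torch.float32)
Y_t = torch.tensor(Y, dtype=torch.float32).reshape(-1, 1)

# Small feedforward network: 16 inputs -> hidden layer -> 1 output
model = nn.Sequential(
    nn.Linear(X_t.shape[1], 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(X_t), Y_t)
    loss.backward()
    optimizer.step()

# new_X holds transformed date and time values for the period being predicted
predictions = model(new_X)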
When I first started working on this script it was not working well at all. I talked with ChatGPT about it and discovered that cyclical values like hour of day work better with a PyTorch model if they are run through sine and cosine to transform them into the range -1 to 1. Otherwise PyTorch treats an hour like 23 as far away from hour 0 when really they are adjacent. You need both sine and cosine because they are out of phase with each other: sine alone maps two different points in the cycle to the same value, and adding cosine makes each point in the cycle unique. So, here is the code of the function which does sine and cosine:
import numpy as np

def sinecosineone(dttmval, period):
    """
    Use both sine and cosine for each of the periodic
    date and time values like hour of day or
    day of month
    """
    # Convert dttmval to radians
    radians = (2 * np.pi * dttmval) / period
    # Apply sine and cosine transformations
    sin_dttmval = np.sin(radians)
    cos_dttmval = np.cos(radians)
    return sin_dttmval, cos_dttmval
def sinecosineall(X):
    """
    Column          Number
    day_of_week          7
    day_of_month        31
    day_of_year        366
    hour_of_day         24
    month               12
    quarter              4
    week_of_year        52
    week_of_month        5
    """
    ...
The period is the number of values in the cycle, for example 24 for hour of the day. Here is the hour 23 and hour 0 example:
>>> sinecosineone(23.0,24.0)
(-0.25881904510252157, 0.9659258262890681)
>>> sinecosineone(0.0,24.0)
(0.0, 1.0)
Notice how the two outputs are close together for the two hours that are close in time. After running all my input data through this sine and cosine transformation, every value is in the range -1.0 to 1.0, and these transformed values were fed into the model during training. Once I switched to this way of transforming the data, the model suddenly became much better at predicting the CPU.
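To show how the per-column transform fits together, here is roughly how all eight columns could be run through sinecosineone. This is just my sketch of the idea, using a made-up transform_all name, and it is not necessarily how sinecosineall in the zip is written:
# Sketch of running every cyclical column through sinecosineone.
# The periods match the column order of the query output with year dropped:
# day_of_week, day_of_month, day_of_year, hour_of_day,
# month, quarter, week_of_year, week_of_month
periods = [7, 31, 366, 24, 12, 4, 52, 5]

def transform_all(X):
    # X has one row per AWR snapshot and eight columns
    columns = []
    for i, period in enumerate(periods):
        sin_col, cos_col = sinecosineone(X[:, i], period)
        columns.append(sin_col)
        columns.append(cos_col)
    # 16 columns, all between -1.0 and 1.0
    return np.column_stack(columns)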
You can play with the scripts in the zip. I really hacked the Python script together based on code from ChatGPT, and I think one function from my class. It isn’t pretty. But you are welcome to it.
I’m starting to get excited about PyTorch. There are a bunch of things you must learn initially but the fundamental point seems simple. You start with a bunch of existing data and train a model that can be used as a function that maps the inputs to the outputs just as they were in the original data. This post was about an example of inputting cyclical date and time values like time of day or day of week and outputting performance metrics like CPU used percentage. I don’t have a perfect script that I have used to great results in production, but I am impressed by PyTorch’s ability to predict database server host CPU percent based on date and time values. More posts to come, but I thought I would get this out there even though it is imperfect. I hope that it is helpful.
Bobby