Querying Many Databases in Parallel in Python

I have some Python scripts that I use to access a bunch of databases to gather information such as the size of the database. Usually it is not that important how long they take so I have been running queries against a list of 100 databases one at a time and it can take a while depending on what I do. Coworkers of mine have worked on running Unix/Linux shell scripts in parallel to hit a bunch of systems like this more quickly. So, I thought I would look at how to do the same in Python. I am sure there are more complicated ways to do it, but I got good results based on the simplest example from this part of the Python documentation:

https://docs.python.org/3.8/library/multiprocessing.html

Here is a simple python script to get the size of the database files from four databases at the same time:

All I did was take the first example from the multiprocessing documentation and replace f() which squared a number with dbspace() which connects to a database and runs a query to get the total db size. I experimented with different Pool() sizes. It is 4 in the example.

I picked 82 non-production databases that I have access to and ran the same query sequentially in a loop and it took 375 seconds. I took the code above and increased the Pool to 32 processes and it ran in 24 seconds. That is about a 15x speedup.

I am sure that I could get a lot more complicated with all this stuff, but this works well with minimal effort.

Bobby

About Bobby

I live in Chandler, Arizona with my wife and three daughters. I work for US Foods, the second largest food distribution company in the United States. I have worked in the Information Technology field since 1989. I have a passion for Oracle database performance tuning because I enjoy challenging technical problems that require an understanding of computer science. I enjoy communicating with people about my work.
This entry was posted in Uncategorized. Bookmark the permalink.

2 Responses to Querying Many Databases in Parallel in Python

  1. Tim N. says:

    Parallelizing tasks across server targets or across server => database targets is cool stuff. I started with parallelizing in 2005 using ClusterIt dsh (hard to find on Linux) and GNU Parallel.

    I found it takes well designed set of helper tools/scripts to scale up a robust/reliable parallelizing infrastructure. Error checking at scale is tough.

    Your example using GNU Parallel might look like this:
    me@myhost:1950 [/home/me/Library/dsh]
    $ cat z.sql
    set heading off
    set pagesize 0

    column instance_name format a20
    column mbytes format 999,999,999,999
    with
    my$view as
    (
    select bytes from dba_data_files union all
    select bytes from dba_temp_files union all
    select bytes from v$log
    )
    select max( i.instance_name ) instance_name ,( sum(bytes) / power( 2 ,20 ) ) mbytes
    from my$view ,v$instance i
    ;
    me@myhost:1950 [/home/me/Library/dsh]
    $ time parallel ‘sqlplus -S -L {} @z.sql;’ ::: myuser/mypass@orcl1db myuser/mypass@orcl2db myuser/mypass@orcl3db myuser/mypass@orcl4db
    orcl1db1 110,830

    orcl2db1 6,030

    orcl3db1 327,187

    orcl4db1 54,037

    real 0m4.80s
    user 0m1.54s
    sys 0m0.34s
    me@myhost:1950 [/home/me/Library/dsh]
    $
    *****
    I abbreviated the output, 4.8s hit 15 databases on a single server.

    Ciao, Tim…

    • Bobby says:

      Thanks Tim. I have been meaning to look at what my coworkers are using with Linux shell scripts. It is great to have your example and information as well.

      Bobby

Leave a Reply to Bobby Cancel reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.