I'd like some feedback regarding how best to introduce parallelism into an existing serial application.
I have written an application which retrieves data from each table under a particular schema in an Oracle database. Assume for discussion purposes that the Oracle database schema specified contains several hundred tables, and the data is being retrieved over the network (i.e., application is not running on the same machine as the database).
The following are the high-level tasks that the application performs:
1) Given a schema name, retrieve the names of all the tables under the schema
2) For each table contained in the schema, retrieve the current number of records for the table (i.e., SELECT COUNT(*) ...). These record counts will be used during data retrieval (task 4) to verify that we extracted the expected number of records.
3) For each empty table, write out a zero-byte file to disk (i.e., no further database interaction performed)
4) For each non-empty table, retrieve all records from the table using the most efficient database retrieval method, and write the data to disk (1 file per table). Assume that the number of records returned per network round trip is configurable. The number of records for some tables may only be a few dozen, but a small number of tables contain a very large number of records (between 10 and 50 million). Also assume that you have one or two tables with over 200 million records.
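For concreteness, the serial version of tasks 3 and 4 could look something like the sketch below. The database layer here is an in-memory stand-in (`FAKE_DB`, `fetch_in_batches` are hypothetical names, not part of the original application); in practice these would be Oracle queries, with the batch size playing the role of the configurable records-per-round-trip setting.

```python
from typing import Iterator, List

# Stand-in for the real database; in the actual application these would be
# Oracle tables queried over the network.
FAKE_DB = {
    "EMPLOYEES": [("1", "Alice"), ("2", "Bob"), ("3", "Carol")],
    "AUDIT_LOG": [],  # an empty table
}

def fetch_in_batches(table: str, batch_size: int) -> Iterator[List[tuple]]:
    """Yield a table's rows in batches, mimicking a configurable number
    of records per network round trip."""
    rows = FAKE_DB[table]
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def extract_table(table: str, expected_count: int, batch_size: int = 1000) -> int:
    """Serial extract of one table; returns the number of rows written."""
    written = 0
    with open(f"{table}.dat", "w") as out:
        if expected_count == 0:
            return 0  # zero-byte file, no further database interaction
        for batch in fetch_in_batches(table, batch_size):
            for row in batch:
                out.write(",".join(row) + "\n")
                written += 1
    if written != expected_count:
        raise RuntimeError(f"{table}: expected {expected_count}, got {written}")
    return written
```

The count from task 2 is passed in as `expected_count`, so the verification step falls out naturally at the end of each table's extract.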
The objective is to minimize the total time it takes to extract all the data because the application needs to be offline (unavailable to end users) during the extract.
It seems to me that task 1) would probably be serial, but tasks 2) and 4) could potentially be good candidates for parallel execution. How would you recommend threading tasks 2) and 4)? Should task 3) be treated differently from a threading perspective (perhaps a dedicated thread to deal with this special case)?
I'm interested to hear thoughts on how best to thread this scenario.
if IF if IF (read: if)
If your database has, or can, distribute its tables across different hard drives (one table per drive, or per RAID system), and
if your application has, or can, distribute the result tables across different hard drives (one table per drive, or per RAID system), different from the database's hard drives (or RAID),
Then, once you determine the number of tables, spawn that number of threads and have each thread work on a different table.
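A minimal sketch of the one-thread-per-table idea, assuming the per-table dump is I/O-bound and each table/output file sits on its own spindle (the `dump_table` body here is a placeholder, not the real extract):

```python
import threading

def dump_table(table: str, results: dict) -> None:
    """Placeholder for the per-table extract; in the real application this
    would read the table from its own drive and write to a separate drive."""
    results[table] = f"{table}.dat"  # record the output file produced

def dump_all(tables: list) -> dict:
    # One thread per table, as suggested above; only sensible when the
    # drives (not one shared disk) are the unit of parallelism.
    results: dict = {}
    threads = [threading.Thread(target=dump_table, args=(t, results))
               for t in tables]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With several hundred tables you would likely cap the thread count rather than spawn one per table, which is where the pool approach mentioned elsewhere comes in.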
If you end up with a bottleneck on your Ethernet, then consider using multiple systems (through different ports).
If your database is stored on one RAID system
Then read/copy each table to a local hard drive (one PC per table). Note: you may have to experiment with how many workers you can use before it becomes counterproductive when pulling data off the database.
Hard drives and PCs are cheap, time is not.
Configure your database such that you not only keep the record count in the database (in some header), but also keep transaction times for each record (and for each delete of a record).
Then your dump of the database can work off the changes (like a transaction system).
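The change-based dump above could be sketched as a simple filter on those transaction times, assuming each record carries a last-modified timestamp (`txn_time` is a hypothetical column name):

```python
def changed_since(rows, last_dump_time):
    """Select only records whose transaction time is newer than the last
    dump, so each extract moves only the delta instead of the full table."""
    return [r for r in rows if r["txn_time"] > last_dump_time]

# Hypothetical rows with a transaction-time column:
rows = [
    {"id": 1, "txn_time": 10},
    {"id": 2, "txn_time": 30},
]
delta = changed_since(rows, last_dump_time=20)  # only id 2 has changed
```

Deletes would need the same treatment (a tombstone with its own transaction time), otherwise the incremental dump can never remove records from the output files.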
I think you could work with a pool of threads to schedule the tasks, which take different amounts of time to complete. At least you can test how it performs. You can combine the thread pool with additional threading, according to the number of cores the application is going to run on.
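As a sketch of the pool approach applied to task 2 (the per-table counts), a bounded `ThreadPoolExecutor` lets fast and slow counts interleave without spawning hundreds of threads; the `count_rows` body and table sizes here are placeholders for the real `SELECT COUNT(*)` queries:

```python
from concurrent.futures import ThreadPoolExecutor

def count_rows(table: str) -> int:
    """Placeholder for SELECT COUNT(*) against one table
    (hypothetical table names and sizes)."""
    sizes = {"SMALL": 12, "BIG": 50_000_000, "EMPTY": 0}
    return sizes[table]

def count_all(tables, workers=4):
    # A bounded pool of workers; tune `workers` against the database's
    # capacity rather than the number of tables.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(tables, pool.map(count_rows, tables)))
```

The same pool shape works for task 4, with the caveat that one or two 200-million-row tables will dominate the wall clock, so it helps to schedule the largest tables first.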