I'd like somefeedback regardinghow best to introduce parallelism into an existing serial application.
I have written an application which retrieves data from each table under a particular schema in an Oracle database. Assume for discussion purposes that the Oracle database schema specified contains several hundred tables, and the data is being retrieved over the network (i.e., application is not running on the same machine as the database).
The following are the high-level tasks that the application performs:
1) Given a schema name, retrieve the names of all the tables under the schema
2) For each table contained in the schema, retrieve the current number of records for the table (i.e., SELECT COUNT(*) ...). These record counts will be used during data retrieval (task 4) toverify thatweextracted theexpected number of records.
3) For each empty table, write out a zero-byte file to disk (i.e., no further database interaction performed)
4) For each non-empty table, retrieve all records from thetable using the most efficient database retrieval method, and write the data to disk (1 file per table). Assume that the number of records returned per network round trip is configurable. The number of records for some tables may only be a few dozen, but some small number of tables contain a very large number of records (between 10 and 50 million). Also assume that you have one or two tables that have over 200 million records.
The objective is to minimize the total time it takes to extract all the data because the application needs to be offline (unavailable to end users) during the extract.
It seems to me that task 1) would probably be serial, but tasks 2) and 4) could potentially be good candidates for parallel execution. How would you recommend threading tasks 2) and 4)? Should task 3) be treated differently from a threading perspective (perhaps a dedicated thread to deal with this special case)?
I'm interested to hear thoughts on how best to thread this scenario.