#15429: Unacceptably slow insert performance

bobfromtn reported 2017-08-08T05:08:04Z · last modified 2017-09-26T18:40:44Z

Dev	funfun
QA	tariqrahiman

Priority	Critical
Complexity	Unknown

Component	Open API - IO
Version	18.0

I have a draft script that serves an ETL purpose of reading rows from a source and writing to a target. The script took over 60 seconds to insert 10k rows (of only one column). My real script runs around 20 minutes to insert 170k rows, which obviously is not acceptable for a production system. Is this the expected/normal performance of ADS datawriter on inserts? (I hope something has been missed.)

My first effort was to move to the attached draft script that strips out my abstract, data-driven logic in favor of direct and simple coding to transfer the source value to target datarow. Next I tried to address performance by turning off auto-commit, which I believe I now have success with (previously, I could see rows in the table from another session; now in this draft where it will not commit until the end, I do not see table rows from another session while the script runs). I also monitored the sessions while the script was running and did not see the transaction renewing, when previously I did.

However, this simplified attached draft script is still taking over 60 seconds:

Computername is MONOLITH

Executing query to NetSuite...

Target result column structure=customer_id(int8)

Table Column[customer_id]

Table Column[date_last_modified]

PrimaryKey:

DB Status: powell Connected: true on session 8011 DB version: POSTGRESQL.9.5 Autocommit status: undefined javaConn.getAutoCommit()=false

..........

Read 10000 source rows and 0 target rows; 10000 rows inserted, 0 rows updated, 2 commits, 0 rollbacks

Script execution time: 66071

*** Script Completed ***

If I comment out the data writer .write on line 272, the whole script runs (including reading 10k source rows) in under 5 seconds.

Computername is MONOLITH

Executing query to NetSuite...

Target result column structure=customer_id(int8)

Table Column[customer_id]

Table Column[date_last_modified]

PrimaryKey:

DB Status: powell Connected: true on session 8048 DB version: POSTGRESQL.9.5 Autocommit status: undefined javaConn.getAutoCommit()=false

..........

Read 10000 source rows and 0 target rows; 10000 rows inserted, 0 rows updated, 2 commits, 0 rollbacks

Script execution time: 4833

*** Script Completed ***

The table itself is trivial, without even a primary key. It is truncated before running the script.

CREATE TABLE nsrep.keylist_customers  ( 
   customer_id        bigint NOT NULL,
   date_last_modified timestamp NULL 
   )
GO

4 attachments

Store single column direct insert only commit ctrl.xjs · 2017-08-08T05:08:04Z · 24 KB
support_info.txt · 2017-08-08T05:08:04Z · 1 KB
LoggingOutput.png · 2017-08-09T19:09:00Z · 52 KB
NewAPIs · 2017-08-17T01:14:25Z · 72 KB

All Comments (14) Change History

tomconrad 2017-08-08T18:14:40Z

I did a quick test using V18 import tool for importing 10000 and 60000 rows into the bistudio_example table. I also used our insert random generator test script to generate 10000 and 60000 rows into the bistudio_example table. PostgreSQL database was used.

Here are the results:

10000 rows

<1 sec inserting from the import tool from a CSV file using batch(1000).

12 secs inserting from the import tool from a CSV file using full.

35 - 11 = 24 secs using the random-generation datawriter script, where 35 is the process total, 11 is the JS random generation, and 24 is the datawriter alone.

60000 rows

4 secs inserting from the import tool from a CSV file using batch(1000).

70 secs inserting from the import tool from a CSV file using full.

211 - 60 = 151 secs using the random-generation datawriter script, where 211 is the process total, 60 is the JS random generation, and 151 is the datawriter alone.

151 - 52 = 99 secs inside the datawriter, where 151 is the datawriter total, 52 is statement construction, and 99 is database statement creation and execution.
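To make the gap easier to compare, the timings above can be converted to rows per second. This is only a sketch of the arithmetic on the figures reported in this comment. The variable and function names are illustrative, and "<1 sec" is entered as 1 second, so that entry is a lower bound.

```javascript
// Timings reported above, in seconds. "<1 sec" is entered as 1,
// so the first figure understates the real throughput.
var timings = [
    { rows: 10000, method: "import batch(1000)", seconds: 1 },
    { rows: 10000, method: "import full", seconds: 12 },
    { rows: 10000, method: "datawriter", seconds: 35 - 11 },
    { rows: 60000, method: "import batch(1000)", seconds: 4 },
    { rows: 60000, method: "import full", seconds: 70 },
    { rows: 60000, method: "datawriter", seconds: 211 - 60 }
];

// Rows inserted per second, rounded to the nearest whole row.
function rowsPerSecond(t) {
    return Math.round(t.rows / t.seconds);
}

// One summary line per measurement.
var throughput = timings.map(function (t) {
    return t.method + " @ " + t.rows + " rows: " + rowsPerSecond(t) + " rows/sec";
});
```

At 60000 rows this works out to about 15000 rows/sec for the batch import against about 400 rows/sec for the row-by-row datawriter.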

bobfromtn 2017-08-08T19:59:20Z

Love you guys' responsiveness and grateful for your help!! OK, your test was insightful. I modified my real script, which reads about 170k source rows, to write to a CSV file instead of doing DB writes. Then I manually used the import tool to load that data into the DB with a batch size of 1000. My results:

a) just 74 seconds to read source rows, run all my logic in javascript and create the CSV!!

Read 171085 source rows and 0 target rows; 171085 rows inserted, 0 rows updated

Script execution time: 74585

b) The import tool had Elapsed time of 6s and Inserted 171,085 rows. Avg Row Time 0.04 mills with 24902 rows/sec!!

So the combined total with this approach is 80 seconds, which is very acceptable. So here then are my follow up questions:

1) Can the import tool be driven programmatically from Aquascript? Do you have an example?

2) The full source table is unfortunately fat, with just over 700 columns. I know Excel itself will only handle 255 or so columns in a CSV. Do you know if ADS / aquascript / import tool / DB itself will have this work-around technique still work with such a fat table?

3) Is there any chance that whatever magic you guys do in the import tool could become available without an intermediary file so that the import/inserts can have that kind of performance without the extra steps (assuming you can even run the import tool programmatically from aquascript)? From your own tests above, essentially the "normal" way with a datawriter is orders of magnitude slower than import tool.

SachinPrakash 2017-08-08T23:27:05Z

From Tom:

>>1) Can the import tool be driven programmatically from Aquascript?

No. However, I would recommend using FluidShell. To invoke FluidShell, right-click your registered server and choose "FluidShell" (see the FluidShell Overview). FluidShell provides a sqlimport command, which should give the same performance as the Import Wizard. Example syntax:

sqlimport -d tom4 -s public bistudio_example C:/Users/tom/Projects/data/tom3.public.bistudio_example.csv -TE Batch -TS 1000

From FluidShell you can also invoke an AquaScript using the source command:

source "C:/Users/tom/Projects/Database Schema and Data Exporter3/AquaScripts/Database Schema and Data Exporter.xjs"

For Export, you have the option of using AquaScript, our Export UI (which allows for a custom SQL statement) or FluidShell's sqlexport command.
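When the table and file names are data-driven, the sqlimport line can be assembled from parameters rather than typed by hand. The helper below is a hypothetical sketch in plain JavaScript, not an ADS API. It uses only the flags that appear in the example above. Reading -d as the database and -s as the schema is inferred from that example, and quoting paths that contain spaces is an assumption based on the quoted path in the source command.

```javascript
// Hypothetical helper: builds a FluidShell sqlimport line from parameters.
// Only the flags shown in the example above (-d, -s, -TE, -TS) are used.
function buildSqlImportCommand(opts) {
    return [
        "sqlimport",
        "-d", opts.database,   // assumed: target database
        "-s", opts.schema,     // assumed: target schema
        opts.table,
        quoteIfNeeded(opts.csvPath),
        "-TE", "Batch",
        "-TS", String(opts.batchSize)
    ].join(" ");
}

// Wraps a path in double quotes when it contains whitespace (assumed convention).
function quoteIfNeeded(path) {
    return /\s/.test(path) ? '"' + path + '"' : path;
}

var command = buildSqlImportCommand({
    database: "tom4",
    schema: "public",
    table: "bistudio_example",
    csvPath: "C:/Users/tom/Projects/data/tom3.public.bistudio_example.csv",
    batchSize: 1000
});
```

With the values from the example, command holds the same line as the one shown above.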

>>2) The full source table is unfortunately fat, with just over 700 columns. I know Excel itself will only handle 255 or so ...

This is strictly an Excel limitation and not a limitation in ADS. If you choose to import a CSV file with more than 255 columns, that should work fine in ADS. Our Batch import & export functionality, whether through the UI or through FluidShell, is designed to be highly optimized and very memory efficient.

>>3) Is there any chance that whatever magic you guys do in the import tool could become available without an intermediary file....

This seems like an ideal use case to use FluidShell. On this page, look at the Example "Extract Data from Oracle and Load into SQL Server".

bobfromtn 2017-08-09T05:30:13Z

Essentially, I'm creating a proof-of-concept/prototype of ETL with Aquascript. If all of this pans out, there might be the possibility that my company would adopt Aqua Data Server. If we had Aqua Data Server, then we would probably be looking for it to do some scheduling, invoking various aqua scripts.

A couple of things I have not stated thus far:

a) Our cloud system, NetSuite, allows user customization of its tables, which means that tables and columns in the source system can, to some extent, be added, dropped and renamed.

b) I'm seeking to handle this (unfortunate) mutability via data-driven logic based in Postgres, where I also export from NetSuite a data dictionary of the current source structure. Thus, I am dynamically generating various SQL statements. I've got this working in a combination of AquaScript and Postgres functions. I am less confident of getting this level of dynamic behavior with FluidShell (and it has been 20 years since I've done Unix shell scripting, which is also a less common skill in our shop than JavaScript).

c) While JavaScript is part of our company stack, it is a bit of a sell to adopt AquaScript and Aqua Data Server, where the JavaScript is a bit "non-standard" in that it is single-threaded and not oriented around callbacks, and one needs to learn the Open APIs too. I'm really hoping not to add another technology to the mix in FluidShell, at least not if I can avoid it.

Given your comments / suggestions, here are further questions:

1) You mention FluidShell can invoke aquascript. Can aquascript invoke FluidShell, setting command line parameters as needed?

2) Can Aqua Data Server invoke FluidShell in a scheduled manner?

Essentially, I'd prefer an approach where the ADServer invokes an aqua script that then drives a set of work, including invoking FluidShell for an import if needed, rather than scheduling FluidShell at the OS level and having it invoke aquascript. I'm not quite sure where ADServer would even fit in in that last scenario.

Since we are talking about which kind of thing can invoke another kind of thing... As I've been contemplating how this might all work using Aqua Data Server (and I have not had time to try that product yet and doc is somewhat limited - I've read everything on the website), I have some further questions:

3) Does Aqua Data Server provide any standard logging? If an Aquascript runs in the context of Aqua Data Server, do any aqua.console.println() statements automatically start going into the server logging system or would different code need to be written or would we need to create our own separate logging approach entirely?

4) I have not gotten the full picture of how I might package up a more complete system. Essentially, there is a set of actions that would need to be coordinated in a certain sequence. I'm guessing that shared javascript/aquascript functions could be put into a common file that is imported by various scripts? Can an aqua script invoke another aqua script, especially passing in parameters and/or receiving a value back from the child aqua script? Since there is no multi-threading, I'm guessing if a parent aqua script did programmatically invoke another "child" aqua script, then execution in the parent is blocked until the child finishes or throws an uncaught exception?

5) I'm hoping that an aqua script could be invoked interactively from ADServer with a user providing parameter input to the aqua script. It looks like there is a "workspace" in which some kind of form can be created and maybe that supports this scenario? Any examples of this?

6) I imagine we might have other systems that want to make web service/REST calls to ADServer, which may invoke an aquascript. The caller might pass parameters that go to the aquascript, and the caller can then receive a payload back. I've seen various indications that this may be possible, including a URI construction that shows a path including a project, and also a concept of some specific aqua script being the default script for a project. However, it isn't fully clear how all of this would work.

SachinPrakash 2017-08-09T19:24:58Z

Hi Bob,

>>Essentially, I'd prefer an approach where the ADServer invokes an aqua script that then drives a set of work

The below answers focus on how you can accomplish the task using AquaScript.

>>3) do any aqua.console.println() statements automatically start going into the server logging system

If you use aqua.console.println() and execute an AquaScript inside the AquaScript tab, there is a "Console Output" section where these messages are displayed. If you execute the AquaScript via a Scheduled Job, then the output can be viewed by clicking on the job's detail results - see LoggingOutput.png. If you find this insufficient, you can use the standard java.util.logging framework from AquaScript:

importPackage(java.util.logging);

// Send log records to app.log using the plain-text SimpleFormatter.
var fileHandler = new FileHandler('app.log');
var logger = Logger.getLogger("app.example.com");
fileHandler.setFormatter(new SimpleFormatter());
fileHandler.setLevel(Level.ALL);
logger.addHandler(fileHandler);

logger.log(Level.INFO, "Starting app...");
// application logic
logger.log(Level.INFO, "Exiting app...");

>> 4) I'm guessing that shared javascript/aquascript functions could be put into a common file that is imported by various scripts? Can an aqua script invoke another aqua script, especially passing in parameters and/or receiving a value back from the child aqua script?

Yes to both. The runScript API is used to accomplish this.

// START Import Aquascript.xjs

var util = aqua.project.getAquaScript("util.xjs");
aqua.system.runScript(util, null, true);
sayHello("Allen");

// END Import Aquascript.xjs

// START util.xjs

function sayHello(name) {
    print("hello: " + name);
}

// END util.xjs

>> I'm guessing if a parent aqua script did programmatically invoke another "child" aqua script, then execution in the parent is blocked until the child finishes or throws an uncaught exception?

Correct.

>>5) & 6)

Yes, parameters can be passed in either of the following ways:

URL Querystring (HelloWorld.xjs)
Form input (typeHelloWorld.xjs)

Take a look at our HelloWorld solution. These can be run inside ADStudio or ADServer. If run in ADServer, choose the "Debug in Browser" option in the AquaScript toolbar to see the HTML form.

We also provide an API to call remote HTTP services:

var client = aqua.net.newWebClient();
var url = "https://login.example.com" + "?location=" + aqua.util.urlEncode("abcd@example.com");
var request = client.newWebRequest(url);

// Currently ADS can POST data as key-value pairs or as file upload.
request.addParameter('username', "test");
print("Request: " + request);

var response = client.submitPostRequest(request);
print("Response: " + response.getContent());

SachinPrakash 2017-08-09T19:30:00Z

>>1) You mention FluidShell can invoke aquascript. Can aquascript invoke FluidShell, setting command line parameters as needed?

>>2) Can Aqua Data Server invoke FluidShell in a scheduled manner?

FluidShell is not supported inside of Aqua Data Server.

In Aqua Data Studio (ADS), FluidShell can be invoked directly from the command line. In your [ADS_HOME] directory, you'll notice runfluid*.bat files. This also allows FluidShell to be invoked programmatically using the OS's native job scheduler. An AquaScript can invoke an external process using the runCommand API, so an AquaScript could invoke the runfluid*.bat file.

funfun 2017-08-11T23:29:29Z

SVN r55470/ADS 18.0.18-3
SVN r55472/ADS 19.0.0-beta-40

Checkpoint: Improved performance of the AQTableWriter.write(AQDataRow row) Open API. After the above check-in is applied, AquaScript's performance should be close to the Import Tool's performance with Transaction Type set to FULL.

tomconrad 2017-08-15T21:07:40Z · (edited)

Hi Bob,

Is it possible, instead of processing each row of the result set in JavaScript, to invoke dataWriter.write with your entire result set? This eliminates a lot of JavaScript processing overhead, since all the processing is done on the API side. I tried this using a result set with 60k rows and it cut the insert time in half compared to processing each row in a loop in JavaScript. I had autocommit turned on for the test.

So something like this:

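try {
    var resultset = conn.executeQuery(sqlstmt);
    // Hand the whole result set to the writer in a single call.
    dataWriter.write(resultset);
} catch (e) {
    aqua.console.println("Insert failed: " + e);
}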

If you need to process each row from the source, you can process the row and add it to a data set. When you are done with all of the source processing, call dataWriter.write with the data set. This way all of the inserts are done inside the API, which is faster. Here is an example:

var resultset = conn.executeQuery(sqlstmt);
var rows = aqua.data.newDataSet();

while (resultset.next()) {
    var row = resultset.getRow();
    var value = row.getString(0);

    // Keep only the rows of interest.
    if (value == "Quality" || value == "Manufacturer" ||
        value == "Price" || value == "Review" ||
        value == "On Promotion") {
        rows.add(row);
    }
}

// Write the collected rows in a single transaction.
conn.setAutoCommit(false);
dataWriter.write(rows);
conn.commit();
conn.setAutoCommit(true);

Thanks,

Tom

bobfromtn 2017-08-15T17:18:12Z

It is nice to know about this capability. In this case, my actual script reads a resultset stream from my source and simultaneously reads a resultset stream from my target. These are processed in nested loops to determine whether source rows are inserts, updates, or to be ignored (no changes). So, in my mainstream use case, I am not committing the entire resultset to the output. Also, in my mainstream use case, there may be some data transformation here and there.
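The comparison described here can be illustrated with a single-pass merge over two key-ordered row lists. This is a sketch of one common way to classify rows, not the actual script from this issue. The row shape ({ id, modified }), the function name, and the assumption that both inputs are sorted ascending by a unique id are all hypothetical.

```javascript
// Hypothetical row shape: { id: <number>, modified: <string> }.
// Both arrays are assumed to be sorted ascending by id, with unique ids.
function classifyRows(sourceRows, targetRows) {
    var result = { inserts: [], updates: [], unchanged: [] };
    var t = 0;
    for (var s = 0; s < sourceRows.length; s++) {
        var src = sourceRows[s];
        // Skip target rows whose key is below the current source key.
        while (t < targetRows.length && targetRows[t].id < src.id) {
            t++;
        }
        var tgt = t < targetRows.length ? targetRows[t] : null;
        if (tgt === null || tgt.id !== src.id) {
            result.inserts.push(src);      // no matching target row
        } else if (tgt.modified !== src.modified) {
            result.updates.push(src);      // key matches, content differs
        } else {
            result.unchanged.push(src);    // nothing to do
        }
    }
    return result;
}

var plan = classifyRows(
    [{ id: 1, modified: "2017-08-01" }, { id: 2, modified: "2017-08-09" }, { id: 4, modified: "2017-08-02" }],
    [{ id: 1, modified: "2017-08-01" }, { id: 2, modified: "2017-08-03" }, { id: 3, modified: "2017-07-30" }]
);
```

Collecting the insert candidates first, instead of writing each one as it is found, leaves open the option of passing them to the writer in a single call.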

bobfromtn 2017-08-15T17:24:14Z

I note the patch available from funfun. Once that is made available to me, I'll retry and retime with that update.

funfun 2017-08-17T01:21:48Z

SVN r55476/ADS 18.0.18-4
SVN r55477/ADS 19.0.0-beta-40
Support batch processing in AquaScript on invocation of AQDataWriter.write(AQDataReader) and AQDataWriter.write(AQDataSet).

AquaScript now supports batch processing on execution of AQDataWriter.write(AQDataReader) and AQDataWriter.write(AQDataSet). Please note that batch processing does not apply to AQDataWriter.write(AQDataRow).

The following APIs are added to AQTableWriter:

public void setBatchSize(int size);
public void setBatchTransactionType();
public void setFullTransactionType();
public int getBatchSize();
public boolean isBatchTransactionType();
public boolean isFullTransactionType();

Please see this screenshot for javadoc.

To enable batch processing, apply the following settings
writer.setBatchTransactionType();
writer.setBatchSize(N); // where N > 0
prior to calling writer.write(AQDataReader) or writer.write(AQDataSet).

With batch processing enabled, on PostgreSQL, AquaScript's performance should be close to Import tool with transaction type set to BATCH.
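This thread does not describe how ADS implements batching internally. As a conceptual illustration only, the general idea of grouping rows into batches of N can be sketched in plain JavaScript. The function name and the executeBatch callback are hypothetical stand-ins, not part of the Open API.

```javascript
// Conceptual stand-in only: this is not the ADS implementation.
// executeBatch is a hypothetical callback that handles one group of rows.
function writeInBatches(rows, batchSize, executeBatch) {
    if (batchSize <= 0) {
        throw new Error("batchSize must be greater than 0");
    }
    var batches = 0;
    for (var start = 0; start < rows.length; start += batchSize) {
        // slice() stops at the end of the array, so the last batch may be short.
        executeBatch(rows.slice(start, start + batchSize));
        batches++;
    }
    return batches;
}

// 2500 sample rows, grouped into batches of 1000.
var sampleRows = [];
for (var i = 0; i < 2500; i++) {
    sampleRows.push({ customer_id: i });
}
var batchSizes = [];
var batchCount = writeInBatches(sampleRows, 1000, function (batch) {
    batchSizes.push(batch.length);
});
```

In this sketch the callback runs three times for 2500 rows instead of once per row, which is the kind of per-row overhead that batching is meant to reduce.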

bobfromtn 2017-08-17T01:59:04Z

This is really encouraging. Do you think it might be possible to have the patch at least by Friday afternoon? I expect to do a good bit of coding on the weekend and likely on Friday as well. I'd love to try out these new capabilities.

SachinPrakash 2017-08-17T03:09:44Z

Hi Bob,

We will be testing the patch tomorrow. If it passes our testing, then we should be able to provide it to you by Friday.

SachinPrakash 2017-08-17T19:07:07Z

Hi Bob,

Patch uploaded:

Patch: http://www.aquafold.com/download/v18.0.0/ads-18.0.18-4-patch.zip

Update Instructions: http://www.aquafold.com/support-update.html#v18
