I have a text file with a pipe delimiter. The text file happens to have a blank row as the first row in the file which is a problem in itself but it led to the discovery of what looks like a bug in the import tool. When I change the delimiter from a comma to a pipe, the columns in the preview window do not expand to show the other columns in the file. And, when clicking next, the columns do not get recognized in the grid which shows the columns being assigned to the table fields. If I go into the file and delete the blank row, everything works as expected. I would expect the behavior of the import tool to inspect the whole file and not the first row which is what it seems to do when rescanning the file after a delimiter change.
@swissarmyknife: pls try & reproduce and attach a sample file
@brad: You have correctly identified the issue. Import assumes a property formatted file with each row having the exact same # of columns. Even if we enhanced the parser to take care of the situation you mentioned, the next question would be what INSERT statement should be generated for the blank row.
Question for you: Is this a problem you are running into frequently? If so, an approach that might work better is for us to introduce a "From row" field (similar to LibreOffice) which specifies the starting row # the parser should use.
@brad: You have correctly identified the issue. Import assumes a property formatted file with each row having the exact same # of columns. Even if we enhanced the parser to take care of the situation you mentioned, the next question would be what INSERT statement should be generated for the blank row.
Question for you: Is this a problem you are running into frequently? If so, an approach that might work better is for us to introduce a "From row" field (similar to LibreOffice) which specifies the starting row # the parser should use.
I load millions of records a week from various sources across my company and this is the first time I have encountered this. It is clearly an issue that should be corrected on the source side. But, realizing the source systems aren't always easy to manipulate, I think the loader needs to be a bit more intelligent. Having said that, I see absolutely no value in inserting a blank row. With proper constraints, this shouldn't even be possible. So, my feeling is all blank rows should be suppressed from an import. If you suppress blank rows from the parser and the import engine, this fix should be relatively straight forward I would think.
I load millions of records a week from various sources across my company and this is the first time I have encountered this. It is clearly an issue that should be corrected on the source side. But, realizing the source systems aren't always easy to manipulate, I think the loader needs to be a bit more intelligent. Having said that, I see absolutely no value in inserting a blank row. With proper constraints, this shouldn't even be possible. So, my feeling is all blank rows should be suppressed from an import. If you suppress blank rows from the parser and the import engine, this fix should be relatively straight forward I would think.
I reviewed ADS source, the method that is used to read an input row is ETLInputFile.readRecord(boolean) (line #200). It seems that we can modify this method to suppress blank rows if needed, assuming blank row means a line containing only new line character (UNIX) or only carriage return + new line (Windows). In this method, there is a special handling for a line containing only a new line character (lines #216-219), this special handling causes import behave differently on files generated from Windows where line terminated by carriage return + new line. I discussed this with Niels, and Niels said he needs to do a review and see why it was implemented that way.
If import file contains multiple blank rows (header and/or contents), ADS might generate different number of INSERT statements on blank rows, depends on whether they are terminated by new line character or by carriage return + new line.
In the case of attached 'Pipe delimited file' test data, there is only one blank row at the very beginning of the file. This file seems to use carriage return + new line as line terminator. If we save this file in UNIX way, i.e. using only new line as line terminator, then, both check/uncheck the 'First row contains column names' field in the Import dialog should generate the correct results.
I reviewed ADS source, the method that is used to read an input row is ETLInputFile.readRecord(boolean) (line #200). It seems that we can modify this method to suppress blank rows if needed, assuming blank row means a line containing only new line character (UNIX) or only carriage return + new line (Windows). In this method, there is a special handling for a line containing only a new line character (lines #216-219), this special handling causes import behave differently on files generated from Windows where line terminated by carriage return + new line. I discussed this with Niels, and Niels said he needs to do a review and see why it was implemented that way.
If import file contains multiple blank rows (header and/or contents), ADS might generate different number of INSERT statements on blank rows, depends on whether they are terminated by new line character or by carriage return + new line.
In the case of attached 'Pipe delimited file' test data, there is only one blank row at the very beginning of the file. This file seems to use carriage return + new line as line terminator. If we save this file in UNIX way, i.e. using only new line as line terminator, then, both check/uncheck the 'First row contains column names' field in the Import dialog should generate the correct results.
In this method, there is a special handling for a line containing only a new line character (lines #216-219), this special handling causes import behave differently on files generated from Windows where line terminated by carriage return + new line.
Modify the code so that the handling for windows (cr+lf) is the same as for Unix (lf). By making this change, the behavior would be:
Both check/uncheck the 'First row contains column names' field in the Import dialog should generate the correct results.
In this method, there is a special handling for a line containing only a new line character (lines #216-219), this special handling causes import behave differently on files generated from Windows where line terminated by carriage return + new line.
Modify the code so that the handling for windows (cr+lf) is the same as for Unix (lf). By making this change, the behavior would be:
Both check/uncheck the 'First row contains column names' field in the Import dialog should generate the correct results.
There are 2 problems in the current implementation:
(1) Blank line (a line that only contains line terminator) is not handled correctly:
This can be resolved by handling UNIX LF and Windows CR+LF the same way.
(2) Multiple blank lines:
Using UNIX file as an example, if there are multiple blank lines in the input file, all of these blank lines will be skipped if none of them are consecutive. However, if some of them are consecutive, then odd number blank lines are skipped but even number blank lines will be scripted. This does not seem to be correct, and is also caused by lines #216-219 in ETLInputFile. I briefly mentioned multiple blank rows problem in 6/24/13 comment but did not provide details there.
Since we remove blank row, there is an edge case we could not handle:
If there is a table that has only one varchar column defined in it, then export this table to a CSV file could contain blank rows. Import the generated CSV file to another table will skip all of these blank rows.
There are 2 problems in the current implementation:
(1) Blank line (a line that only contains line terminator) is not handled correctly:
This can be resolved by handling UNIX LF and Windows CR+LF the same way.
(2) Multiple blank lines:
Using UNIX file as an example, if there are multiple blank lines in the input file, all of these blank lines will be skipped if none of them are consecutive. However, if some of them are consecutive, then odd number blank lines are skipped but even number blank lines will be scripted. This does not seem to be correct, and is also caused by lines #216-219 in ETLInputFile. I briefly mentioned multiple blank rows problem in 6/24/13 comment but did not provide details there.
Since we remove blank row, there is an edge case we could not handle:
If there is a table that has only one varchar column defined in it, then export this table to a CSV file could contain blank rows. Import the generated CSV file to another table will skip all of these blank rows.
Lets go ahead & fix both #1 & 2.
Since we remove blank row, there is an edge case we could not handle:
Understood.
Lets go ahead & fix both #1 & 2.
Since we remove blank row, there is an edge case we could not handle:
Understood.
Fixed - all of blank rows are skipped on import - SVN r33114/14.0.0-beta-55
Fixed - all of blank rows are skipped on import - SVN r33114/14.0.0-beta-55
I tested files with 1, 2 and 10 blank lines at their beginnings, and they all worked fine.
I tested files with 1, 2 and 10 blank lines at their beginnings, and they all worked fine.
tested files with the following EOL (end of line)
CR+LF (carriage return + line feed)
LF (linefeed)
On OS X, Windows and Linux
tested files with the following EOL (end of line)
CR+LF (carriage return + line feed)
LF (linefeed)
On OS X, Windows and Linux
> I tested files with 1, 2 and 10 blank lines at their beginnings, and they all worked fine.
Please note that SVN r33114 skips all blank lines which can appear in any place in a file, not just the beginning.
> I tested files with 1, 2 and 10 blank lines at their beginnings, and they all worked fine.
Please note that SVN r33114 skips all blank lines which can appear in any place in a file, not just the beginning.
I've attached more example test cases that include internal blank lines, with different EOL
Don't use the EOL of CR, we don't support carriage return only as a format of EOL.
I've attached more example test cases that include internal blank lines, with different EOL
Don't use the EOL of CR, we don't support carriage return only as a format of EOL.
Issue #9352 |
Closed |
Fixed |
Resolved |
Completion |
No due date |
Fixed Build 14.0.0-beta-55 |
No time estimate |
1 issue link |
relates to #10368
Issue #10368OS X - not identifying CR (carriage return) for new line during import |
@swissarmyknife: pls try & reproduce and attach a sample file