This issue follows up on the test case below, mentioned in issue #7803:
:$ cat enwiki-20121001-pages-articles-multistream-index.txt2 > enwiki-index.txt
// 44 seconds in FluidShell.
// 3 seconds in Ubuntu/Bash
I ran the test case above on my desktop (file size is 567,746,536 bytes) and measured CPU statistics with Java VisualVM. The results show that most of the time is spent in CommandOutput.write() (see attachments for more info). We need to find out why.
After looking into this closely, the root cause of the problem can be divided into 2 areas:
(1) Unnecessary input processing is performed.
(2) Non-buffered I/O is used when writing output to disk.
Before going into details on these 2 items, let's take a look at the test file first:
prompt> \ls -l enwiki-20121001-pages-articles-multistream-index.txt
-rwx- 567746536 2012-10-27 12:25 enwiki-20121001-pages-articles-multistream-index.txt
prompt> exec wc enwiki-20121001-pages-articles-multistream-index.txt
12767780 37700546 567746536 enwiki-20121001-pages-articles-multistream-index.txt
From the output of the \ls command and the UNIX wc command, we know the test file is ~541 MB and contains ~12 million lines, i.e. a file made up of many short lines with an average length of ~45 bytes.
(1) Unnecessary input processing:
The \cat command currently reads its input line by line, which is required by the '-n' option. After a line is read and processed, \cat simply writes it to the destination stream, which can be a file or something else. If the destination is a file, the stream object is created by the shell (in the current implementation, the shell creates a FileOutputStream).
The test file used by this test case contains 12 million lines. Even with buffered I/O used to read the input, \cat has to extract one line at a time from the buffer and then write it out; doing this 12 million times is time-consuming.
The improvement that can be made here: if the -n option is NOT on, \cat can simply read a chunk of data and write it out without any interpretation.
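The chunked copy described above could look roughly like the following. This is only a sketch, not the actual \cat source: the class name ChunkCopy, the method copy(), and the 64 KB chunk size are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for illustration; not taken from the \cat implementation.
final class ChunkCopy {
    // Assumed chunk size; the real value would need tuning.
    private static final int CHUNK_SIZE = 64 * 1024;

    // Copies everything from 'in' to 'out' without looking for line
    // boundaries, and returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] chunk = new byte[CHUNK_SIZE];
        long total = 0;
        int n;
        // read() returns -1 at end of stream; otherwise the number of bytes read.
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
            total += n;
        }
        return total;
    }
}
```

With ~45-byte lines and a 64 KB chunk, each loop iteration moves on the order of a thousand lines, so the number of iterations drops from millions to a few thousand for the test file.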
(2) Non-buffered I/O is used when writing output to disk:
When writing data, \cat simply calls the output stream's write(byte[]) method. In this test case, the shell creates a FileOutputStream, which does not perform buffered I/O. The test file contains 12 million lines; hence 12 million disk writes are performed, which is very expensive.
The improvement that can be made here: when writing data to a file, a stream object that offers buffered I/O is preferred. It seems that we can do this at the shell level so that individual commands do not need to deal with it.
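As a rough illustration of the shell-level change suggested above, the shell could wrap the FileOutputStream in a java.io.BufferedOutputStream before handing it to a command. The class name RedirectTarget, the method open(), and the 64 KB buffer size are assumptions for this sketch; the actual shell code may be structured differently.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical helper for illustration; not the shell's real redirection code.
final class RedirectTarget {
    // Assumed buffer size; BufferedOutputStream defaults to 8 KB if omitted.
    private static final int BUFFER_SIZE = 64 * 1024;

    // Opens 'file' for writing. Small write(byte[]) calls are collected in
    // memory and passed to the underlying FileOutputStream in larger blocks.
    static OutputStream open(File file) throws IOException {
        return new BufferedOutputStream(new FileOutputStream(file), BUFFER_SIZE);
    }
}
```

The caller must close (or flush) the returned stream when the command finishes; otherwise the last partially filled buffer never reaches the file.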
---------------------------------
See the attached screenshot from the VisualVM Profiler for more info.
Changes were made for both (1) and (2) described above in SVN r30034.
> cat enwiki-20121001-pages-articles-multistream-index.txt > tmp.txt
This command used to take ~35 seconds to complete on my desktop. With r30034 applied, it now takes ~3 seconds.
(2) is a bigger factor than (1): applying only the change for (2) cuts the run time from ~35 seconds to ~12 seconds.
> cat enwiki-20121001-pages-articles-multistream-index.txt -n > tmp.txt
This is still slow; it takes ~20 seconds to complete on my desktop. Will take a look tomorrow.
>> cat enwiki-20121001-pages-articles-multistream-index.txt -n > tmp.txt
> This is still slow, takes ~20 seconds to complete on my desktop. Will take a look tomorrow.
I reopened this issue because I thought \cat did not use buffered I/O to read input. I was wrong: \cat does use buffered I/O to read input in this case. It simply takes time to process those 12 million lines in memory. See the attached Sampler snapshot from VisualVM for more info.
Note that, as mentioned in the 10/30/2012 comment in issue #7803, \cat does not use buffered I/O to read input if the output destination is the FS terminal.
Issue #7874 |
Closed |
Fixed |
Resolved |
Completion |
No due date |
Fixed Build trunk/30034 |
No time estimate |