This issue follows up on the test case below, which was mentioned in issue #7803:
:$ cat enwiki-20121001-pages-articles-multistream-index.txt2 | grep oracle
// 1 minute 26 seconds in FluidShell.
// 2 seconds in Ubuntu/Bash.
I replaced the test case above with
:$ cat enwiki-20121001-pages-articles-multistream-index.txt2 | grep pattern-match-nothing
so that nothing is written to the FS terminal.
I ran the test case above on my desktop (the file size is 567,746,536 bytes) and measured CPU statistics using Java VisualVM. The result shows that I/O on the pipeline consumes half of the test time. We need to look into why.
The profiling result shows that 1/3 of the overall time is consumed by \grep on read. \grep uses non-buffered I/O to read input from standard input, which can be improved with buffered I/O when \grep is executed inside a pipeline.
> Profiling result shows 1/3 of the overall time is consumed by \grep on read, \grep uses non-buffered I/O to read input from standard input which can be improved with buffered I/O if \grep is executed inside a pipeline.
This is logged as issue #7876.
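To illustrate why non-buffered reads are expensive, here is a minimal sketch that counts how often the underlying stream is hit when a consumer reads one byte at a time, with and without a BufferedInputStream in front of it. This is not FluidShell's \grep code. The class and method names (BufferedReadSketch, CountingInputStream, callsUnbuffered, callsBuffered) are hypothetical, and an in-memory array stands in for standard input.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedReadSketch {
    /** Hypothetical helper: counts how many read requests reach the wrapped stream. */
    static class CountingInputStream extends FilterInputStream {
        long calls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Reads the stream one byte at a time, as a non-buffered consumer would. */
    static long drain(InputStream in) throws IOException {
        long n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    /** Number of requests that reach the source when no buffer is used. */
    static long callsUnbuffered(byte[] data) throws IOException {
        CountingInputStream src = new CountingInputStream(new ByteArrayInputStream(data));
        drain(src);
        return src.calls;
    }

    /** Number of requests that reach the source behind a BufferedInputStream. */
    static long callsBuffered(byte[] data, int bufSize) throws IOException {
        CountingInputStream src = new CountingInputStream(new ByteArrayInputStream(data));
        drain(new BufferedInputStream(src, bufSize));
        return src.calls;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[64 * 1024]; // stand-in for piped standard input
        System.out.println("unbuffered requests: " + callsUnbuffered(data));
        System.out.println("buffered requests:   " + callsBuffered(data, 8192));
    }
}
```

When the source is a pipe rather than an array, each request that reaches it involves synchronization with the writing thread, so reducing the number of requests is where a buffered reader would be expected to save time.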
The pipeline is slow mainly because of the way a command reads input from standard input (logged as issue #7876). JDK PipedInputStream creates a 1K buffer by default; FS now uses a 4K buffer by default, which seems to help in some cases. SVN trunk/r30034.
> cat enwiki-20121001-pages-articles-multistream-index.txt | grep oracle
This command used to take ~1 minute 27 seconds to complete on my desktop; the execution time goes down to ~25 seconds after r30034 is applied.
Here is a test case showing how PipedInputStream's buffer size might affect a command's execution time:
> cat enwiki-20121001-pages-articles-multistream-index.txt | cat > tmp.txt
Buffer size: 1K - execution time: ~14 seconds.
Buffer size: 4K - execution time: ~6 seconds.
Buffer size: 8K - execution time: ~5 seconds.
FS currently creates a 4K buffer.
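For reference, the pipe buffer size can be set through the two-argument PipedInputStream constructor (available since Java 6). The sketch below is not the r30034 change itself. It is a small, hypothetical harness (PipeBufferSketch, transfer) that pushes bytes from a producer thread through a pipe of a given size so that different sizes can be timed. The 8K chunk size and the 16 MB payload are arbitrary assumptions, and timings will vary by machine.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class PipeBufferSketch {
    /** Pushes totalBytes through a pipe with the given buffer size; returns bytes received. */
    static long transfer(int pipeSize, int totalBytes) throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        // The second argument sets the pipe's internal buffer size in bytes.
        PipedInputStream in = new PipedInputStream(out, pipeSize);

        // Producer side, standing in for the upstream command of a pipeline.
        Thread producer = new Thread(() -> {
            byte[] chunk = new byte[8192];
            try (PipedOutputStream o = out) {
                int remaining = totalBytes;
                while (remaining > 0) {
                    int n = Math.min(chunk.length, remaining);
                    o.write(chunk, 0, n);
                    remaining -= n;
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
        producer.start();

        // Consumer side, standing in for the downstream command.
        long received = 0;
        byte[] buf = new byte[8192];
        try (PipedInputStream i = in) {
            int n;
            while ((n = i.read(buf)) != -1) {
                received += n;
            }
        }
        producer.join();
        return received;
    }

    public static void main(String[] args) throws Exception {
        int total = 16 * 1024 * 1024; // arbitrary payload size for the comparison
        for (int size : new int[] {1024, 4096, 8192}) {
            long start = System.nanoTime();
            long got = transfer(size, total);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(size + " byte pipe: " + got + " bytes in " + ms + " ms");
        }
    }
}
```

A larger pipe buffer lets the producer write more data before it has to block and wait for the consumer, which is one plausible explanation for the drop from ~14 seconds to ~6 seconds reported above.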
Issue #7875
Closed
Fixed
Resolved
Completion
No due date
Fixed Build trunk/30034
No time estimate
2 issue links
relates to #9070
Issue #9070: Performance regression on FluidShell pipeline
relates to #7876
Issue #7876: Performance problem on command reading standard input using non-buffered I/O