Show HN: Samchika – A Java Library for Fast, Multithreaded File Processing

(github.com)

Comments

sieve 11 minutes ago
A note on the name.

The nasal "m" takes on the form of the nasal in the row/class of the letter that follows it. As "ñ" is the nasal of the "c" class, the "m" becomes "ñ".

Writing Sanskrit terms using the roman script without using something like IAST/ISO-15919 is a pain in the neck. They are going to be mispronounced one way or the other. I try to get the ISO-15919 form and strip away everything that is not a-z.

So, सञ्चिका (sañcikā) = sancika

You probably want to keep the "ch," as the average English speaker is not going to remember that the "c" is the "ch" of "cheese" and not "see."

Calzifer 2 hours ago

    for (int i = 0; i < 10000; ++i) {
        // do nothing just compute hash again and again.
        hash = str.hashCode();
    }
https://github.com/MayankPratap/Samchika/blob/ebf45acad1963d...

"Do nothing" is correct; "again and again" not so much. Java caches the hash code for Strings, and since the JIT knows that (at least in recent versions[1]), it might even remove this loop entirely.

[1] https://news.ycombinator.com/item?id=43854337
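
If the point of that loop is to simulate CPU work per line, here is a minimal sketch of one way to keep it from being a no-op. `BusyWork` and `recomputeHash` are made-up names for illustration, not part of Samchika. It recomputes a String-style polynomial hash by hand with a different seed each round and returns the accumulated value, so there is no cached field to reuse and the result is actually consumed. The JIT may still optimize parts of it; a real benchmark would be better served by JMH and its `Blackhole`.

```java
// Hypothetical helper, not from the Samchika repository.
final class BusyWork {

    // Recomputes a 31-based polynomial hash over the characters 'rounds' times.
    // Seeding each round with the loop index makes every round produce a
    // different value, and XOR-ing into 'acc' makes every round's result used.
    static int recomputeHash(String str, int rounds) {
        int acc = 0;
        for (int i = 0; i < rounds; i++) {
            int h = i; // seed differs per round
            for (int j = 0; j < str.length(); j++) {
                h = 31 * h + str.charAt(j);
            }
            acc ^= h;
        }
        return acc;
    }
}
```

The caller should do something with the return value (store it, print it, compare it); otherwise the work is dead code again.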

sureglymop 3 hours ago
Perhaps I misunderstand something, but doesn't reading from a file require a system call? And doesn't a system call mean a context switch? So wouldn't using multiple threads to read from a file mean that they can't really read in parallel anyway, because they block each other while executing that system call?
sidcool 3 hours ago
It would be even more amazing if it had tests. It's already pretty good.
SillyUsername 1 hour ago
An ArrayList for huge numbers of add operations is not performant. Switching to a LinkedList will at least double your list throughput. There are other optimisations you can do, but on a brief perusal this stood out like a sore thumb.
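
A claim like this is easy to measure rather than assume. Below is a crude timing sketch, with `ListAddTiming` and `timeAdds` as made-up names. Keep in mind that `ArrayList.add` is documented as amortized constant time and that `LinkedList` allocates a node per element, so the numbers may not come out the way the comment predicts. Naive `System.nanoTime()` timings are also noisy (JIT warm-up, GC), so treat the output as a hint and use JMH for anything serious.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical micro-timing helper; not a rigorous benchmark.
final class ListAddTiming {

    // Appends n boxed integers to a fresh list and returns elapsed nanoseconds.
    static long timeAdds(Supplier<List<Integer>> factory, int n) {
        long start = System.nanoTime();
        List<Integer> list = factory.get();
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        long elapsed = System.nanoTime() - start;
        // Touch the list afterwards so the appends cannot be discarded.
        if (list.size() != n) {
            throw new IllegalStateException("unexpected size " + list.size());
        }
        return elapsed;
    }
}
```

Call it a few times in alternation, for example `timeAdds(java.util.ArrayList::new, 5_000_000)` and `timeAdds(java.util.LinkedList::new, 5_000_000)`, and ignore the first rounds while the JIT warms up.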
codetiger 3 hours ago
Do you have a benchmark comparison with other similar tools?
ldjkfkdsjnv 1 hour ago
I could write this library with an LLM in a few hours.
VWWHFSfQ 3 hours ago
Am I wrong in thinking that this duplicates lines in memory repeatedly: once when buffering lines into batches, again when submitting batches to threads, and then again when calling the line processor? Seems like it might be a memory hog.
gavinray 3 hours ago
Please don't do this.

Have the OS handle memory paging and buffering for you and then use Java's parallel algorithms to do concurrent processing.

Create a "MappedByteBuffer" and mmap the file into memory.
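
A minimal sketch of what that could look like, assuming the per-segment work is just counting newlines. `MmapLineCount`, `countNewlines`, and `demo` are illustrative names, not anything from Samchika. The file is mapped read-only once, split into byte ranges, and the ranges are scanned on a parallel stream using absolute `get(index)` reads, which do not touch the buffer's position. A single mapping is limited to `Integer.MAX_VALUE` bytes, so this only covers files up to about 2 GB.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.stream.IntStream;

// Hypothetical example class; names are assumptions for illustration.
final class MmapLineCount {

    // Maps the file and counts '\n' bytes, scanning 'segments' ranges in parallel.
    static long countNewlines(Path file, int segments) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size == 0) {
                return 0;
            }
            if (size > Integer.MAX_VALUE) {
                throw new IllegalArgumentException("too large for a single mapping");
            }
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            int n = (int) size;
            int parts = Math.max(1, Math.min(segments, n));
            return IntStream.range(0, parts).parallel().mapToLong(p -> {
                int start = (int) ((long) n * p / parts);
                int end = (int) ((long) n * (p + 1) / parts);
                long count = 0;
                for (int i = start; i < end; i++) {
                    if (buf.get(i) == '\n') { // absolute read, no shared position
                        count++;
                    }
                }
                return count;
            }).sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Usage example: writes the text to a temp file and counts its newlines.
    static long demo(String text, int segments) {
        try {
            Path tmp = Files.createTempFile("mmap-demo", ".txt");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
            return countNewlines(tmp, segments);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Real line processing would also have to deal with lines that straddle a segment boundary; counting bytes sidesteps that on purpose.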

If the file is too large, use an "AsynchronousFileChannel" and asynchronously read + process segments of the buffer.
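
A rough sketch of the segment-wise variant, again with newline counting standing in for real work. `AsyncSegmentCount`, `countNewlines`, and `demo` are made-up names. It uses the `Future`-returning `AsynchronousFileChannel.read(ByteBuffer, long)` with two buffers, so the read of the next segment is in flight while the current one is being processed, and only two segments are ever held in memory. Short reads are handled by advancing the position by the number of bytes actually read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical example class; names are assumptions for illustration.
final class AsyncSegmentCount {

    // Counts '\n' bytes, overlapping the read of segment k+1 with the scan of segment k.
    static long countNewlines(Path file, int segmentSize) {
        try (AsynchronousFileChannel ch =
                 AsynchronousFileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size == 0) {
                return 0;
            }
            ByteBuffer current = ByteBuffer.allocate(segmentSize);
            ByteBuffer next = ByteBuffer.allocate(segmentSize);
            long pos = 0;
            long count = 0;
            Future<Integer> pending = ch.read(current, pos);
            while (true) {
                int read = pending.get(); // wait for the current segment
                if (read <= 0) {
                    break;
                }
                long nextPos = pos + read;
                boolean more = nextPos < size;
                if (more) {
                    next.clear();
                    pending = ch.read(next, nextPos); // start next read early
                }
                for (int j = 0; j < read; j++) {
                    if (current.get(j) == '\n') {
                        count++;
                    }
                }
                if (!more) {
                    break;
                }
                ByteBuffer tmp = current; // swap buffers
                current = next;
                next = tmp;
                pos = nextPos;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Usage example: writes the text to a temp file and counts its newlines.
    static long demo(String text, int segmentSize) {
        try {
            Path tmp = Files.createTempFile("async-demo", ".txt");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
            return countNewlines(tmp, segmentSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The `CompletionHandler` overload of `read` would avoid blocking on `get()` at all, at the cost of more callback plumbing.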