
I did start a branch with a mode config of tail, read, but I stopped work on this because most of the heavy lifting in this input is done in the filewatch library. It is heavily optimised for the tail use case.

In the read case I would like to support zipped files and breadth or depth first operations. Breadth first means reading 32K from each file in turn until 'done'. When we open a file to read it, very occasionally, if the timing is perfect, we see an empty file - we get an eof then. Obviously this depends on how the user puts the files in the watched folder.

In the tail case, assume for a moment that there are 5000 files detected from the glob and they have all just been rotated and so are empty. The current tail implementation will open all 5000, see that they are empty and loop; we do not open-read-close now. In time, as content is appended, we will loop through each file reading what we can, and there is no logical eof.

In the read case, assume the same number of files, 5000, but content complete and > 800K big. If we reuse the same filewatch lib, we will open each one in turn and begin reading to eof. It will be many minutes before we get to the 5000th file. Further, because filewatch is tailing, we will monitor those 'done' files for changes every stat_interval, and the 5000 files will stay open. This is why we are introducing the ignore_older and close_older configs and the auto_flush config on the multiline codec, but really they are hacks to help support the read use case.

If the user has, say, 20 file inputs operating on different folders, then when all the inputs are opening all of their files one can run out of file handles, and many things start going wrong - e.g. the sincedb can't be saved, and filters/outputs that use files or sockets begin to fail. I suspect that some of the 'can't recreate' weirdness issues may be related to this.

A specialised readfile input using a read-optimised filewatch can do things much more efficiently. We have no need for the stat_interval or start_position settings. We can have a scan_mode of depth or breadth first. We know we only need to stat, open, loop-read and close, and when 'done' we can 'tidy up' - flush buffered multilines and 'signal' that we are done. We can operate over a smaller bunch of files, say 256, at a time and therefore support reading many thousands of files across multiple readfile inputs with low file-handle impact and lower memory usage.

What would your guess be as to the percentages of use cases for tail only, read only, and mixed (tail some, read some)? My guess is that it may be as much as 49.5, 49.5 and 1.

Thanks for the explanation - I hear some of the challenges given a combined implementation. IMO we would not want a synchronous eof_action that runs a script or operates on a file - we just need to signal to the user that we are done with a file, and the user can action this out-of-band - presuming that we can know what 'done' is.

Here are my thoughts. For files that are being added to or rotated, things should work as they do now - as messy as that is. These files won't have an eof_action and will work as they do today. There will be other file globs whose config specifies an eof_action; those we will read until the eof and then do whatever the eof_action says.

The main problem I see is that if I want to be able to add complete files into the directory, it will take time before a file is fully copied, and I might hit the eof while it's really still copying. Here the stat_interval might be helpful - you might hit the eof, but you would wait until the next stat interval (or two?) to make sure the file didn't change. You could also handle this by just changing the filename when the copy is finished, e.g. renaming to file.json - that's basically instant and avoids the problem.

I think it is important to manage the case where logstash dies or is killed - so I assume you do need file/position state as well; it's important that's not overlooked. It seems that some of the problems related to a high number of files will automatically be taken care of - e.g. you probably won't have to handle 5000 files, because they'll be processed out when finished.
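The breadth-first idea discussed here - reading 32K from each file in turn while keeping only a bounded bunch (say 256) open at once - can be sketched roughly as follows. This is a hypothetical illustration, not filewatch code; the names `CHUNK_SIZE`, `WINDOW_SIZE`, and `read_breadth_first` are my own:

```ruby
# Hypothetical sketch of breadth-first reading over a bounded window of files.
# One chunk (32K) is read from each open file in turn; when a file hits eof it
# is closed ('done', tidy up), which frees a handle slot for the next path.
CHUNK_SIZE  = 32 * 1024  # 32K per turn, as in the discussion
WINDOW_SIZE = 256        # cap on simultaneously open file handles

def read_breadth_first(paths, window_size: WINDOW_SIZE)
  pending    = paths.dup
  open_files = {}                                    # path => IO
  contents   = Hash.new { |h, k| h[k] = String.new } # stand-in for emitting events

  until pending.empty? && open_files.empty?
    # Top up the window with newly opened files.
    while open_files.size < window_size && !pending.empty?
      path = pending.shift
      open_files[path] = File.open(path, "rb")
    end

    # One breadth-first pass: a single chunk from each open file in turn.
    open_files.keys.each do |path|
      io    = open_files[path]
      chunk = io.read(CHUNK_SIZE)
      contents[path] << chunk if chunk
      if chunk.nil? || io.eof?
        io.close                  # 'done': close early instead of tailing
        open_files.delete(path)
      end
    end
  end
  contents
end
```

Because the window is capped, at most 256 handles are held per input no matter how many thousands of files the glob matches; a depth-first scan_mode would instead read each file to completion before opening the next.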

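On the "hit eof but it's really still copying" point: one way to make the stat-interval idea concrete is to treat eof as final only once the size reported by stat has held steady for a check or two. A rough reader-side sketch under assumed names (`stable_size?` and its parameters are mine, not an existing filewatch option):

```ruby
# Reader-side sketch: only declare a file 'done' after its size has stayed
# unchanged across a couple of stat checks, so a file that is still being
# copied into the watched folder is not finalised too early.
def stable_size?(path, checks: 2, interval: 1.0)
  last = File.stat(path).size
  checks.times do
    sleep(interval)
    now = File.stat(path).size
    return false if now != last  # changed between stats: still being written
    last = now
  end
  true  # size held steady for `checks` intervals: safe to act on eof
end
```

The rename-on-finish approach makes this wait unnecessary: copying under a name the glob ignores and then renaming to file.json is atomic on the same filesystem, so the reader never sees a half-copied file.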