
path to local files in subdir #20

Open
EvaEibl opened this issue Nov 9, 2022 · 12 comments

@EvaEibl commented Nov 9, 2022

When I provide the searchdir path to the top directory of my files, REDPy does not find the files (which are three subfolder levels down) but instead fills 'flist' with folder names that cannot be read in. I had to manually copy and rearrange my files so that they sit just one level below the top directory.

The description just says: 'If using local files, define the path to the top directory where they exist, ending in / or \ as appropriate for your operating system. If there are files in any subdirectories within this directory, they will be found.'

@ahotovec (Owner) commented Nov 9, 2022

Can you provide me with some additional details on what you're using for your search directory and file pattern? In trigger.py the piece of code that finds the files does a walk of the subfolders, but maybe there's an incompatibility with what you've told it to look for?
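For reference, the difference between a flat directory listing and a walk of the subfolders can be sketched like this (illustrative Python only, not REDPy's actual code; the throwaway layout mimics files sitting a few levels down):

```python
import fnmatch
import glob
import os
import tempfile

# Recreate a layout with a file a few levels below the top directory.
top = tempfile.mkdtemp()
deep = os.path.join(top, "IEB", "HHZ.D")
os.makedirs(deep)
open(os.path.join(deep, "VI.IEB..HHZ.D.2015.200"), "w").close()

# A non-recursive glob of the top directory returns only the first-level
# folder names -- a list of folders that cannot be read in as data.
print([os.path.basename(p) for p in glob.glob(os.path.join(top, "*"))])
# ['IEB']

# A walk of the subfolders finds the files at any depth.
flist = [os.path.join(root, name)
         for root, _dirs, files in os.walk(top)
         for name in files if fnmatch.fnmatch(name, "*")]
print([os.path.basename(p) for p in flist])
# ['VI.IEB..HHZ.D.2015.200']
```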

@EvaEibl (Author) commented Nov 10, 2022

Thank you for your fast reply.
I've tried to reproduce the error. I previously used
server=file
searchdir=/path/to/files/MINISEED/2015/VI/
the mseed files were then in subfolders such as:
IEB/HHZ.D/VI.IEB..HHZ.D.2015.200
IEB/HHN.D/VI.IEB..HHN.D.2015.200
IEB/HHE.D/VI.IEB..HHE.D.2015.200
IEA/HHE.D/VI.IEA..HHE.D.2015.200
...
The flist now actually contains the mseed files (so I must have done something different this time). However, the code cannot read this data in and aborts with 'Could not download or trigger data... moving on'.

When I copy the same data into a single folder, i.e. with:
server=file
searchdir=/path/to/files/MINISEED/2015/VI/folder/
the mseed files are directly in this folder:
VI.IEB..HHZ.D.2015.200
VI.IEB..HHN.D.2015.200
VI.IEB..HHE.D.2015.200
VI.IEA..HHE.D.2015.200
The mseed data can be read in.

@ahotovec (Owner)

Can you try the first case (mseed files within the subfolders) with the -t flag on backfill.py? Adding this flag removes the try/except surrounding the data reading step and will give us a more detailed failure message than just that it couldn't complete. Let me know what that error message contains and it'll help me track down where the problem is.

@EvaEibl (Author) commented Nov 10, 2022

The error is:
Traceback (most recent call last):
File "backfill.py", line 106, in <module>
st, stC = redpy.trigger.getData(tstart+n*opt.nsec-opt.atrig, endtime, opt)
File "/data/REDPy/redpy/trigger.py", line 55, in getData
stmp = obspy.read(f, headonly=True)
File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/site-packages/obspy/core/util/decorator.py", line 291, in _map_example_filename
return func(*args, **kwargs)
File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/site-packages/obspy/core/stream.py", line 208, in read
st = _generic_reader(pathname_or_url, _read, **kwargs)
File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/site-packages/obspy/core/util/base.py", line 657, in _generic_reader
generic = callback_func(pathnames[0], **kwargs)
File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/site-packages/obspy/core/util/decorator.py", line 148, in uncompress_file
if tarfile.is_tarfile(filename):
File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/tarfile.py", line 2442, in is_tarfile
t = open(name)
File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/tarfile.py", line 1575, in open
return func(name, "r", fileobj, **kwargs)
File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/tarfile.py", line 1639, in gzopen
fileobj = GzipFile(name, mode + "b", compresslevel, fileobj)
File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/gzip.py", line 168, in __init__
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
IsADirectoryError: [Errno 21] Is a directory: '/path/to/files/MINISEED/2015/VI/IEA'
Closing remaining open files:redpytable.h5...done
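The bottom of that traceback can be reproduced without obspy at all: several layers down, a plain open() call is handed a directory path. A minimal stand-in sketch (assuming a POSIX system; nothing here is REDPy code):

```python
import errno
import os
import tempfile

# Stand-in for what happens when 'flist' contains a folder name: the
# reader eventually calls open() on a path that is a directory.
top = tempfile.mkdtemp()
station_dir = os.path.join(top, "IEA")
os.mkdir(station_dir)

caught = None
try:
    open(station_dir, "rb")
except IsADirectoryError as exc:
    caught = exc.errno

print(caught)  # 21 (EISDIR) on Linux, matching the traceback above
```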

@ahotovec (Owner)

Ok, seems that it's complaining that the first item in the list is a directory and it can't read it. In the .cfg file, let's try adding filepattern='*.D.*' as I believe all of the mseed files should contain that and none of the folders will...
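As a quick sanity check, fnmatch (the glob-style matching such file patterns typically use) shows that '*.D.*' selects the day files and none of the folder names; the names below are copied from this thread and are illustrative only:

```python
import fnmatch

# Day files vs. folder names from this thread. Note "HHZ.D" has nothing
# after the ".D", so the pattern correctly rejects it too.
names = ["VI.IEB..HHZ.D.2015.200", "VI.IEA..HHE.D.2015.200",  # files
         "IEA", "IEB", "HHZ.D"]                               # folders
matches = [n for n in names if fnmatch.fnmatch(n, "*.D.*")]
print(matches)
# ['VI.IEB..HHZ.D.2015.200', 'VI.IEA..HHE.D.2015.200']
```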

@EvaEibl (Author) commented Nov 10, 2022

After adding this, the output is just:
2015-09-28T00:00:00.000000Z
Couldn't find JOA.HHZ.VI.
Couldn't find JOB.HHZ.VI.
Couldn't find JOD.HHZ.VI.
Couldn't find JOE.HHZ.VI.
Couldn't find JOF.HHZ.VI.
Couldn't find JOG.HHZ.VI.
Couldn't find JOK.HHZ.VI.
Couldn't find IEA.HHZ.VI.
Couldn't find IEB.HHZ.VI.
Couldn't find IED.HHZ.VI.
Couldn't find IEE.HHZ.VI.
Couldn't find IEF.HHZ.VI.
Couldn't find IEG.HHZ.VI.
Couldn't find IEY.HHZ.VI.
Length of Orphan table: 13
Time spent this iteration: 0.0069476922353108725 minutes

@ahotovec (Owner)

And I take it that putting these in the top directory does find the data correctly? I suppose we should also verify that flist does actually contain the filenames of all the data.

@EvaEibl (Author) commented Nov 11, 2022

It finds the files if I remove the inverted commas:
filepattern=*.D.*
However, when I point it at the top directory I get some results after 1.5 minutes. When using the data in subfolders it seems to get stuck somewhere; nothing has happened for 10 minutes now.

@ahotovec (Owner)

Ah, yes without the commas. When you moved your files to the top directory, did you move all of them? I'm wondering if there are a lot of files it's trying to read through. I'll readily admit that the way REDPy parses through files on disk is not very efficient.

A path we might consider going down instead is setting up a portable FDSN. It's got a bunch of setup associated with it but once it's going it'll probably be the fastest way to query your data, and might be useful outside of REDPy as well. If you'd like to try this, send me an email (ahotovec-ellis@usgs.gov) and I'll forward you some notes on installing and setting it up from one of my colleagues.

@EvaEibl (Author) commented Nov 15, 2022

Ok. I see. Yes there are a lot of files in the original folders.
Since we have expertise using pyrocko in our group, I think it might be easier to use the pyrocko pile for reading the data in (or, for the moment, just to copy the event data I want to analyse) than to set up a portable FDSN for this dataset.

@ahotovec (Owner)

Ok, let me know how that goes. I don't usually work with lots of data in files on disk, and tend to favor waveservers and webservices. I've had folks who keep their files in directories sorted by date use shell scripts to change the filepath based on what time they are processing, to cut down the number of files that REDPy needs to search through. I have some other ideas on better ways to handle it but haven't had a chance to test or implement them.
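Such a driver script might look roughly like this (everything here is hypothetical: the .cfg name, the path layout, and where backfill.py would be called):

```shell
#!/bin/sh
# Sketch: rewrite the searchdir line in the config before each run, so
# REDPy only walks the folder for the date range being processed.
cfg=settings.cfg
printf 'server=file\nsearchdir=/old/path/\n' > "$cfg"   # stand-in config

year=2015
newdir="/path/to/files/MINISEED/${year}/VI/"
sed -i.bak "s|^searchdir=.*|searchdir=${newdir}|" "$cfg"

grep '^searchdir=' "$cfg"
# a real driver would now run backfill.py for that date range
```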

@ahotovec (Owner)

Just putting a quick update here that I've been picking at this issue while "cleaning up" the code. In the 'cleanup' branch there is new code that builds a one-time index of all the files in the data search directory, so REDPy knows which files to read instead of redoing the directory query at every time step. I've also added options to load a few days of that data into memory for faster access. I've tested it with both large mseed volumes (~1 GB each per channel, containing several months of data each) and ~35k individual sac files from that same time span. It probably isn't as optimized as using a local waveserver, but it's orders of magnitude more efficient now.
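The idea can be sketched roughly like this (not the actual 'cleanup' branch code; the function name and demo tree are illustrative):

```python
import fnmatch
import os
import tempfile

def build_file_index(searchdir, filepattern="*"):
    """Walk the tree once and cache every matching file path."""
    return sorted(
        os.path.join(root, name)
        for root, _dirs, files in os.walk(searchdir)
        for name in files
        if fnmatch.fnmatch(name, filepattern)
    )

# Demo on a throwaway tree shaped like the one in this thread.
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, "IEB", "HHZ.D"))
open(os.path.join(top, "IEB", "HHZ.D", "VI.IEB..HHZ.D.2015.200"), "w").close()

index = build_file_index(top, "*.D.*")   # walk the disk once...
for _timestep in range(3):               # ...reuse the cached list after
    todo = [f for f in index if f.endswith(".200")]
print(len(index), len(todo))
# 1 1
```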

I'll probably close this issue when I pull 'cleanup' into the 'master' branch. I'd love it if you could test the new code on your dataset and let me know how it works, and what I can improve to align with your use case.

Were you able to get pyrocko to work?
