Adding new metadata to be extracted
Adding a new piece of metadata to be extracted currently requires code changes in several places. The process is as follows:
- Go to metadata/metadata.py and edit the InvenioMetadata class: add a method to set the metadata, add a way to store it, add it to the JSON payload, and add support for it in the from_json and to_json methods.
- Go to metadata/extractor.py and add to the auto-extractor a way to extract the metadata.
- In manager.py, where the auto-extractor is called, add the results to the InvenioMetadata object.
This is a long process for what is often a small change, so an easier way to add metadata needs to be implemented. One suggestion is to have a config file that specifies which metadata should be extracted and how it should be stored in the payload. This would apply specifically to metadata that requires no further processing once it has been read from the file. The proposed solution is as follows:
- A configuration file in YAML, JSON or some other simple markup language. It should specify the metadata to be extracted, which file to extract it from, and where in that file it is found (a path to the attribute or parameter).
- Some code to parse this config file, creating a list of metadata targets to be auto-extracted.
- extractor.py should then be able to use the information in the config file to automatically extract the metadata.
- manager.py should be able to call the auto-extractor as per usual, returning the desired metadata.
- InvenioMetadata should have a single function to ingest this information, and should then shape the payload according to what metadata was extracted and what the configuration file says.
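The config-parsing and extraction steps above could be sketched roughly as follows. This is only an illustration under several assumptions: the config is JSON (so that only the standard library is needed), the config keys (`name`, `file`, `path`, `payload_key`) are invented for this example, and `MetadataTarget`, `parse_config`, `lookup` and `extract_targets` are hypothetical names that do not exist in the codebase. It also assumes the source files have already been parsed into nested dictionaries.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class MetadataTarget:
    """One entry of the (hypothetical) config file."""
    name: str         # label for the metadata item
    file: str         # file the value is read from
    path: str         # dotted path to the value inside that file
    payload_key: str  # key the value is stored under in the payload


def parse_config(text: str) -> List[MetadataTarget]:
    """Turn the JSON config text into a list of extraction targets."""
    return [MetadataTarget(**entry) for entry in json.loads(text)["targets"]]


def lookup(document: Dict[str, Any], dotted_path: str) -> Any:
    """Follow a dotted path such as 'author.name' through nested dicts."""
    value: Any = document
    for key in dotted_path.split("."):
        value = value[key]
    return value


def extract_targets(targets: List[MetadataTarget],
                    documents: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return {payload_key: value} for every target.

    `documents` maps a file name to its already-parsed contents.
    """
    return {t.payload_key: lookup(documents[t.file], t.path) for t in targets}


# Made-up config and file contents, purely for illustration.
EXAMPLE_CONFIG = """
{
  "targets": [
    {"name": "title", "file": "info.json",
     "path": "title", "payload_key": "title"},
    {"name": "author", "file": "info.json",
     "path": "author.name", "payload_key": "creator"}
  ]
}
"""
EXAMPLE_DOCUMENTS = {
    "info.json": {"title": "Example dataset", "author": {"name": "A. Person"}},
}

if __name__ == "__main__":
    targets = parse_config(EXAMPLE_CONFIG)
    print(extract_targets(targets, EXAMPLE_DOCUMENTS))
```

In a sketch like this, extracting one more value means adding one more entry under `targets`; none of the functions need to change.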
This is a rough guide and exact implementation details may vary. With this in place, new metadata can be added by specifying it in the config file, rather than by diving into the codebase.
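As an illustration of the single ingest function, the sketch below uses a stand-in class rather than the real InvenioMetadata, whose internals are not shown here. The class name `MetadataPayloadSketch`, the `ingest` signature, and the idea of a `layout` mapping from metadata names to dotted payload locations are all assumptions made for this example.

```python
import json
from typing import Any, Dict


class MetadataPayloadSketch:
    """Hypothetical stand-in for InvenioMetadata; not the real class."""

    def __init__(self) -> None:
        self.payload: Dict[str, Any] = {}

    def ingest(self, extracted: Dict[str, Any], layout: Dict[str, str]) -> None:
        """Place each extracted value at the payload location given by layout.

        `extracted` maps a metadata name to its value; `layout` maps the same
        name to a dotted location in the payload, e.g. 'metadata.title'.
        """
        for name, value in extracted.items():
            *parents, leaf = layout[name].split(".")
            node = self.payload
            for key in parents:
                # Create intermediate dictionaries as needed.
                node = node.setdefault(key, {})
            node[leaf] = value

    def to_json(self) -> str:
        """Serialise the payload with sorted keys for stable output."""
        return json.dumps(self.payload, sort_keys=True)


# Made-up layout (as it might come from the config file) and extracted values.
layout = {"title": "metadata.title", "author": "metadata.creator.name"}
extracted = {"title": "Example dataset", "author": "A. Person"}

record = MetadataPayloadSketch()
record.ingest(extracted, layout)

# Adding a new item needs only a new layout entry, not a new method.
layout["licence"] = "metadata.rights"
record.ingest({"licence": "CC-BY-4.0"}, layout)
```

The point of the design is that `ingest` never names a specific piece of metadata, so the class does not have to grow a new method, storage attribute and serialisation branch for each addition.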