Improve error handling for blocksize / files_per_partition which can result in error: Invalid file path or buffer object type: <class 'tuple'> #1401

@shaltielshmid

Describe the bug

When using JsonlReader, I needed to set blocksize to avoid running out of RAM on the machine. As soon as I passed a value for that parameter, the pipeline crashed with this error:

ValueError: Invalid file path or buffer object type: <class 'tuple'>

The full stack trace is included below.

Steps/Code to reproduce bug

Below is a minimal script that creates a sample JSONL file with random data and then tries to read it with NeMo Curator in blocks.

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader

path = "/mnt/ephemeral/data"

if __name__ == "__main__":
    
    # generate random data
    import json
    with open(path + '/sample_data.jsonl', 'w') as w:
        from faker import Faker
        f = Faker()
        for _ in range(100):
            w.write(json.dumps(dict(text=f.text())) + '\n')

    # Initialize and start the Ray client
    ray_client = RayClient()
    ray_client.start()

    # Create a pipeline with the stages
    pipeline = Pipeline(
        name="jsonl-test",
        description="jsonl-test",
        stages=[
            JsonlReader(
                file_paths=path,
                files_per_partition=1,
                fields=['text'],
                blocksize=1024
            )
        ]
    )

    results = pipeline.run()

    ray_client.stop()

Expected behavior

The file should be loaded in chunks so that RAM usage stays bounded.
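To make the expected behavior concrete, here is a minimal standard-library sketch of block-wise JSONL reading. It is not NeMo Curator's implementation: `iter_jsonl_blocks` is a hypothetical helper, and treating `blocksize` as an approximate byte budget per block is an assumption made for illustration.

```python
import json
from typing import Iterator


def iter_jsonl_blocks(path: str, blocksize: int) -> Iterator[list]:
    """Yield lists of parsed records, each covering roughly `blocksize` bytes.

    Hypothetical helper for illustration only; not part of NeMo Curator.
    """
    if blocksize <= 0:
        raise ValueError("blocksize must be a positive number of bytes")

    batch, batch_bytes = [], 0
    with open(path, "rb") as fh:
        for raw in fh:  # one line at a time, so the whole file is never in memory
            if not raw.strip():
                continue  # skip blank lines
            batch.append(json.loads(raw))
            batch_bytes += len(raw)
            if batch_bytes >= blocksize:
                # Blocks end on a line boundary, so they may slightly exceed blocksize.
                yield batch
                batch, batch_bytes = [], 0
    if batch:
        yield batch  # trailing partial block
```

With something like `for block in iter_jsonl_blocks(path + '/sample_data.jsonl', 1024): ...`, only one block of records is held in memory at a time.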

Additional context

Full stack trace:

  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/cosmos_xenna/ray_utils/stage_worker.py", line 690, in _process_step
    result = self._process_data(task.data)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/cosmos_xenna/ray_utils/stage_worker.py", line 846, in _process_data
    raise e
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/cosmos_xenna/ray_utils/stage_worker.py", line 827, in _process_data
    result = retry.do_with_retries(func_to_call, max_attempts=self._params.num_run_retries)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/cosmos_xenna/utils/retry.py", line 56, in do_with_retries
    return func()
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/cosmos_xenna/ray_utils/stage_worker.py", line 824, in func_to_call
    return self._stage_interface.process_data(in_data)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/cosmos_xenna/pipelines/private/specs.py", line 528, in process_data
    return self._stage.process_data(data)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/nemo_curator/backends/xenna/adapter.py", line 69, in process_data
    return self.process_batch(tasks)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/nemo_curator/backends/base.py", line 88, in process_batch
    results = self.stage.process_batch(tasks)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/nemo_curator/stages/base.py", line 182, in process_batch
    result = self.process(task)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/nemo_curator/stages/text/io/reader/base.py", line 81, in process
    result = self.read_data(task.data, effective_read_kwargs, self.fields)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/nemo_curator/stages/text/io/reader/jsonl.py", line 71, in read_data
    df = pd.read_json(file_path, **read_kwargs)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/pandas/io/json/_json.py", line 791, in read_json
    json_reader = JsonReader(
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/pandas/io/json/_json.py", line 904, in __init__
    data = self._get_data_from_filepath(filepath_or_buffer)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/pandas/io/json/_json.py", line 944, in _get_data_from_filepath
    self.handles = get_handle(
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/pandas/io/common.py", line 728, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/pandas/io/common.py", line 472, in _get_filepath_or_buffer
    raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'tuple'>
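The trace shows that a tuple reached `pd.read_json` where a single path was expected, but it does not show what the tuple contained or where it was built. As a rough sketch of the kind of error handling the title asks for, a reader could validate its input before calling pandas. `require_single_path` and its message are hypothetical, not existing NeMo Curator code.

```python
import os
from typing import Any


def require_single_path(candidate: Any) -> str:
    """Return `candidate` as a string path, or fail with an actionable message.

    Hypothetical guard for illustration only; not part of NeMo Curator.
    """
    if isinstance(candidate, (str, bytes, os.PathLike)):
        return os.fsdecode(candidate)
    raise TypeError(
        "Expected a single file path but got "
        f"{type(candidate).__name__}: {candidate!r}. "
        "This can happen when blocksize or files_per_partition changes "
        "how input files are grouped into tasks."
    )
```

Called just before the read, a guard like this would surface the offending value and point at the parameters involved instead of pandas' generic "Invalid file path or buffer object type" message.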
