Open
Labels
bug, community-request, good first issue
Description
Describe the bug
When using JsonlReader, I needed to set blocksize to avoid running out of RAM on the machine. As soon as I passed a value for that parameter, the pipeline crashed with this error:
ValueError: Invalid file path or buffer object type: <class 'tuple'>
I placed a more detailed stack trace below.
Steps/Code to reproduce bug
The minimal script below creates a sample JSONL file with random data, then tries to read it in blocks with NeMo Curator.
import json

from faker import Faker
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader

path = "/mnt/ephemeral/data"

if __name__ == "__main__":
    # generate random data
    with open(path + '/sample_data.jsonl', 'w') as w:
        f = Faker()
        for _ in range(100):
            w.write(json.dumps(dict(text=f.text())) + '\n')

    # Initialize and start the Ray client
    ray_client = RayClient()
    ray_client.start()

    # Create a pipeline with the stages
    pipeline = Pipeline(
        name="jsonl-test",
        description="jsonl-test",
        stages=[
            JsonlReader(
                file_paths=path,
                files_per_partition=1,
                fields=['text'],
                blocksize=1024
            )
        ]
    )

    results = pipeline.run()
    ray_client.stop()
Expected behavior
The file should be read in chunks so that memory usage stays bounded.
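To make the expected behavior concrete, here is a minimal standard-library sketch of block-wise JSONL reading. It is not NeMo Curator's implementation. The helper name `iter_jsonl_blocks` is hypothetical, and treating `blocksize` as an approximate byte budget per batch is an assumption about what the parameter is meant to do.

```python
import json
from typing import Iterator


def iter_jsonl_blocks(file_path: str, blocksize: int) -> Iterator[list[dict]]:
    """Yield lists of parsed records, each covering roughly `blocksize` bytes.

    Hypothetical helper for illustration only. Blocks always end on a line
    boundary, so a block may exceed `blocksize` by up to one line.
    """
    if blocksize <= 0:
        raise ValueError("blocksize must be a positive number of bytes")

    batch: list[dict] = []
    batch_bytes = 0
    with open(file_path, "rb") as f:
        for raw_line in f:
            if not raw_line.strip():
                continue  # skip blank lines
            batch.append(json.loads(raw_line))
            batch_bytes += len(raw_line)
            if batch_bytes >= blocksize:
                # Hand the batch to the caller, then start a fresh one so
                # only one block of records is held in memory at a time.
                yield batch
                batch, batch_bytes = [], 0
    if batch:
        yield batch  # trailing partial block
```

Iterating with `for block in iter_jsonl_blocks(path + "/sample_data.jsonl", 1024): ...` would process the sample file a block at a time instead of loading it whole.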
Additional context
Full stack trace:
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/cosmos_xenna/ray_utils/stage_worker.py", line 690, in _process_step
    result = self._process_data(task.data)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/cosmos_xenna/ray_utils/stage_worker.py", line 846, in _process_data
    raise e
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/cosmos_xenna/ray_utils/stage_worker.py", line 827, in _process_data
    result = retry.do_with_retries(func_to_call, max_attempts=self._params.num_run_retries)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/cosmos_xenna/utils/retry.py", line 56, in do_with_retries
    return func()
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/cosmos_xenna/ray_utils/stage_worker.py", line 824, in func_to_call
    return self._stage_interface.process_data(in_data)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/cosmos_xenna/pipelines/private/specs.py", line 528, in process_data
    return self._stage.process_data(data)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/nemo_curator/backends/xenna/adapter.py", line 69, in process_data
    return self.process_batch(tasks)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/nemo_curator/backends/base.py", line 88, in process_batch
    results = self.stage.process_batch(tasks)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/nemo_curator/stages/base.py", line 182, in process_batch
    result = self.process(task)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/nemo_curator/stages/text/io/reader/base.py", line 81, in process
    result = self.read_data(task.data, effective_read_kwargs, self.fields)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/nemo_curator/stages/text/io/reader/jsonl.py", line 71, in read_data
    df = pd.read_json(file_path, **read_kwargs)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/pandas/io/json/_json.py", line 791, in read_json
    json_reader = JsonReader(
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/pandas/io/json/_json.py", line 904, in __init__
    data = self._get_data_from_filepath(filepath_or_buffer)
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/pandas/io/json/_json.py", line 944, in _get_data_from_filepath
    self.handles = get_handle(
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/pandas/io/common.py", line 728, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "/home/ubuntu/test/.venv/lib/python3.10/site-packages/pandas/io/common.py", line 472, in _get_filepath_or_buffer
    raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'tuple'>