data instance for frame cloze problem #383
Conversation
frame cloze pylint formatting
```python
from ...data_indexer import DataIndexer
```

```python
# Functionality supported in this class.
```

Remove these commented-out sections, as they are duplicated below in read_from_line.
```python
return 'FrameInstance( [' + ',\n'.join(self.text) + '] , ' + str(self.label) + ')'
```

```python
@overrides
# accumulate words from each slot's phrase
```

Can you put the comment inside the method, and do the same for the other methods in this class?

Done. I had written it Java-style, which is incorrect here.
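For illustration, a minimal sketch of the requested style (the class body here is hypothetical, not the PR's actual implementation): in Python, an implementation comment belongs inside the method body, rather than Java-style above the method or its decorator.

```python
class FrameInstance:
    def __init__(self, text):
        self.text = text  # One phrase per slot.

    def words(self):
        # Accumulate words from each slot's phrase.
        result = []
        for phrase in self.text:
            result.extend(phrase.split())
        return result
```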
```python
@staticmethod
# s: "event:plant absorb water###participant:water" TAB "participant:water"
def unpack_input(s: str, kv_separator: str="\t"):
```

You only use this static method once, in read_from_line. Just inline it there, rather than splitting it out.

Inlining the data processing in the middle of the logic renders it unreadable. Small functions make the functionality easier to read and debug.
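For context, a sketch of the helper under discussion. Its body is not shown in the diff, so this is an assumption based on the example input in the comment above: split the line at the first tab into the slots string and the label.

```python
def unpack_input(s: str, kv_separator: str = "\t"):
    # "event:plant absorb water###participant:water\tparticipant:water"
    # splits into the slots string and the label at the first separator.
    slots_csv, label = s.split(kv_separator, 1)
    return slots_csv, label
```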
```python
    A FrameInstance is a kind of TextInstance that has text in multiple slots. This generalizes a FrameInstance.
    """
    def __init__(self,
                 # the input can have only few slots and not all and in any random order.
```

Move this comment either above __init__ or below it, rather than into the middle of the argument list.
```python
def test_words_from_frame_aggregated_correctly(self):
    instance = FrameInstance.read_from_line(line)
    assert instance.words()['words'].__len__() == 30
```

In Python you should use the built-in functions rather than these "magic" methods (__len__, __str__, etc.). This should be len(instance.words()["words"]) instead, and the same for str below.
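The difference the reviewer is pointing at, in runnable form (the dictionary here is a made-up stand-in for the words() return value):

```python
words = {"words": ["plant", "absorb", "water"]}

# Preferred: the built-in functions.
assert len(words["words"]) == 3
assert str(words["words"]).startswith("[")

# Discouraged: calling the "magic" methods directly.
assert words["words"].__len__() == 3
```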
```python
@overrides
def to_indexed_instance(self, data_indexer: DataIndexer):
    # a phrase in a slot, is converted from [list of words] to [list of wordids]
    return IndexedFrameInstance([], label=None)
```

Add capital letters and full stops to the comments, and remove the square brackets around the words.
```python
@overrides
def get_padding_lengths(self) -> Dict[str, int]:
```

Take a look at how get_padding_lengths is implemented here - it is similar to your case, where you have a set of slots which need padding. In this method, we only care about the max length of a slot.
https://github.com/allenai/deep_qa/blob/master/deep_qa/data/instances/text_classification/tuple_instance.py#L94
Then, you can see here that the pad method expects to get back this dictionary - this answers the question in your comment above.
https://github.com/allenai/deep_qa/blob/master/deep_qa/data/instances/text_classification/tuple_instance.py#L100
Implementing these two methods like this will allow you to remove the MAX_SLOT_LEN global variable and have the padding for the slots set automatically from your training data.
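A minimal free-function sketch of the pattern the reviewer describes (the real deep_qa methods live on the instance classes; the function names and the right-padding convention here are assumptions, not the library's API):

```python
def get_padding_lengths(word_indices):
    # One list of word ids per slot; only the longest slot matters.
    return {"num_sentence_words": max((len(slot) for slot in word_indices), default=0)}

def pad(word_indices, padding_lengths):
    # Pad every slot with zeros (or truncate it) to the requested length.
    length = padding_lengths["num_sentence_words"]
    return [(slot + [0] * length)[:length] for slot in word_indices]
```

With this pair, the padding length comes from the training data rather than from a MAX_SLOT_LEN global.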
```python
# TODO PR request for having these in the json as an application specific content
# the slotnames can vary according to different end applications, e.g., a HowTo tuple, OpenIE tuple ...
SLOTNAMES_ORDERED = ["agent", "beneficiary", "causer", "context", "definition", "event",
```

If you want to have these global variables, you could consider moving them inside FrameInstance but above __init__. This would then make them class attributes, which would be accessible in your classmethods.
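A sketch of the class-attribute suggestion (the slot_index classmethod is hypothetical, added only to show access through cls):

```python
class FrameInstance:
    # Class attributes: defined inside the class but above __init__,
    # shared by all instances, and reachable from classmethods via cls.
    SLOTNAMES_ORDERED = ["agent", "beneficiary", "causer", "context",
                         "definition", "event"]

    @classmethod
    def slot_index(cls, slot_name: str) -> int:
        return cls.SLOTNAMES_ORDERED.index(slot_name)
```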
```python
def test_slots_unwrap_correctly(self):
    instance = FrameInstance.read_from_line(line)
    # what we get
    machine_label = "" + instance.label
```

Why are you adding the empty string here?
```python
@staticmethod
def given_slots_from(slots_csv: str, values_separator: str="###", kv_separator: str=":"):
    return dict(map(lambda x: x.split(kv_separator), slots_csv.split(values_separator)))
```

Could you add a comment describing this function? It's not immediately clear what it does. You could consider re-writing it like:

```python
key_value_list = slots_csv.split(values_separator)
key_value_tuples = [pair.split(kv_separator) for pair in key_value_list]
key_value_dict = {key: value for key, value in key_value_tuples}
```
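Putting the suggestion together as a documented, runnable function (the maxsplit=1 is an extra safeguard not in the original code, in case a slot value itself contains the key-value separator):

```python
def given_slots_from(slots_csv: str, values_separator: str = "###",
                     kv_separator: str = ":") -> dict:
    # Parse "event:plant absorb water###participant:water" into
    # {"event": "plant absorb water", "participant": "water"}.
    key_value_list = slots_csv.split(values_separator)
    key_value_tuples = [pair.split(kv_separator, 1) for pair in key_value_list]
    return {key: value for key, value in key_value_tuples}
```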
```python
for word in ['plant', 'unk', 'absorb', 'ques', 'water', 'soil']:
    self.data_indexer.add_word_to_index(word)
```

```python
def tearDown(self):
```

This is not required, as it does nothing but call the superclass.
```python
def test_convert_instance_to_indexed_instance(self):
    instance = FrameInstance.read_from_line(self.line)
    indexed_instance = instance.to_indexed_instance(self.data_indexer)
    print(instance.label + "\t padded to: " + indexed_instance.label.__str__())
```

Can you remove these print statements?
```python
machine_slotvals = "" + instance.text.__str__()
# TODO: consider when label is OOV or finalloc instead of finalloc:soil
machine_label = instance.label
machine_slotvals = instance.text
```

No need to shorten variable names here - just use machine_slot_values.
```python
indexed_instance = instance.to_indexed_instance(self.data_indexer)
print(instance.label + "\t padded to: " + indexed_instance.label.__str__())
print(indexed_instance.word_indices.__str__())
assert indexed_instance.label == [self.data_indexer.get_word_index('soil')]  # TODO
```

matt-gardner left a comment:

There's one big question I had around whether you're actually encoding enough information in the inputs to realistically expect the model to get the output correct. Other than that, I think this will work as it is, and while I would recommend changing a few things to fit the API better, if you want to do it like this, it should still work.

Also, just a heads up: we're probably going to be splitting a lot of things out of this repo soon, so that the main repo is more focused and doesn't have so many models in it. We'll probably make a separate Aristo repo for DeepQA models and move this code into it, and then you won't be subject to the whims of our coding style =). I'm still very happy to review PRs and give advice on how to best use the library; it'll just be in a separate repo.
```python
    A FrameInstance is a kind of TextInstance that has text in multiple slots. This generalizes a FrameInstance.
    """
    def __init__(self,
                 padded_slots: List[str],
```

There is no padding done here, so I wouldn't put "padded" in this name.

Oh, it looks like I'm wrong. This would be clearer if there were a comment in the docstring about what this represents. Also, this will run into a naming collision with the padding done in the indexed instances.

Fixed: renamed to sparse and dense frame, instead of unpadded and padded slots.
```python
expected_label = "soil"
# do they match?
assert machine_label == expected_label
assert machine_slot_values == self.padded_slots
```

The label here is what you're trying to predict. You want to have the model output "soil". But it looks like there's nothing in this instance that tells the model what slot you're trying to query. How would a model know which slot to fill when it's trying to predict "soil"?

Very good point. I was hoping to include a one-hot vector indicating the slot to fill. I thought that by specifying a special tag, "ques", I could eventually inform the model. You are probably right that the input in the data instance could explicitly include a one-hot vector. I don't know how to do that, so I will ask Mark.
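One way the one-hot query vector could look (the slot list here is a shortened, hypothetical stand-in for the PR's full SLOTNAMES_ORDERED):

```python
SLOTNAMES_ORDERED = ["agent", "event", "finalloc", "participant"]

def query_one_hot(query_slot: str):
    # A 1 in the position of the slot the model is asked to fill.
    vector = [0] * len(SLOTNAMES_ORDERED)
    vector[SLOTNAMES_ORDERED.index(query_slot)] = 1
    return vector
```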
```python
def get_padding_lengths(self) -> Dict[str, int]:
    # The parent class somehow understands the semantics of these dictionary keys.
    # would be better if these keys were more explicit, e.g., as enumerations.
    return {'num_sentence_words': MAX_PHRASE_LEN}
```

Yeah, there's no need to hard-code a length here; that parameter is better specified in the model you build, anyway. This will actually break the code we have that tries to be smart about letting you swap out word representations. It looks like the tuple instance code that Mark linked to will also break the word representation stuff that we have. Instead, what you should do is something like this.

@DeNeutoy, I think we need way better documentation around how the whole data pipeline works. I think it does some really nice things, but people have a hard time understanding it and how to use it correctly.

I now understand. So something like this?

```python
for key in slot_word_lengths:
    lengths[key] = len(key)
return lengths
```

Yeah, similar. I would do something like this:

```python
all_slot_lengths = [self._get_word_sequence_lengths(slot_indices) for slot_indices in self.word_indices]
lengths = {}
for key in all_slot_lengths[0]:
    lengths[key] = max(slot_lengths[key] for slot_lengths in all_slot_lengths)
return lengths
```
```python
# TODO PR request for having these in the json as an application specific content
# the slotnames can vary according to different end applications, e.g., a HowTo tuple, OpenIE tuple ...
SLOTNAMES_ORDERED = ["agent", "beneficiary", "causer", "context", "definition", "event",
```

If you want to always have a fixed set of slots that every dataset uses, where your input includes a one-hot vector for which slot you're trying to fill, or something like that, this should work. The model's weights will be tied to particular slots, so a model trained on one slot configuration will not possibly generalize to any other slot configuration.

If you want to make this more modular, you could consider using a separate namespace in the data indexer to keep track of this for you, instead of hard-coding it here. Something similar is done with tag labels here.

The reason to make this configurable was to be able to train any variant of the input with the same code, by passing a list in the config file upfront. This configuration allows ordering the input on the fly for any set of slots. It appears that there is no way to specify this list of slot names in the config file, so perhaps having it as a constant may be better than passing it through a function. The data indexer does not really do any bookkeeping, because the frame instance takes care of it. Am I thinking incorrectly?

Oh, yeah, your preprocessing scripts putting everything into this format will take care of that, I think.

Another consideration is that you'll essentially be wasting computation and modeling capacity on all of the unfilled slots by filling them with "unk". You could either: (1) leave them empty and let them be masked in whatever model computation you do later, which will keep the model from wasting its capacity trying to figure out that "unk" means "unfilled"; or (2) instead of having a fixed-size input here, have a key-value input, where you give a list of (slot, value) pairs as input to the model. In this setting, it's also a lot easier to think about how to pass in the query: it's just another (slot, value) pair, with a "QUERY" value or something similar. Then instead of learning what is essentially an embedding of each slot type in the weights of a softmax layer, you can explicitly learn an embedding for them and have that be part of your model.

Happy to talk about this more in person if it's hard to follow what I'm suggesting here.

This is similar to the tuple instance.
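A sketch of option (2), the key-value input (the function name and the "QUERY" sentinel are illustrative, not an existing deep_qa API):

```python
def to_key_value_input(filled_slots: dict, query_slot: str):
    # Represent only the filled slots, so no capacity is spent on "unk";
    # the query is just one more (slot, value) pair with a sentinel value.
    pairs = [(slot, value) for slot, value in sorted(filled_slots.items())]
    pairs.append((query_slot, "QUERY"))
    return pairs
```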