Similar code to create groups #7

Open
inezpereira opened this issue May 19, 2020 · 9 comments

Comments

@inezpereira

Hi! This is more of a question/request.
Do you have code to mind-match participants into groups?
That is, the output CSV would then look something like this:

user_id,match_ids
1,93;217;463;645;783;1101
[...]
93,1;217;463;645;783;1101 # same group as above, so the same ids
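
To make the requested format concrete, here is a minimal sketch that turns a group label per user into `user_id,match_ids` rows, where each user lists every other member of the same group. The function name `groups_to_match_rows` and the toy ids are hypothetical, not part of the repository.

```python
from collections import defaultdict


def groups_to_match_rows(user_ids, group_labels):
    """Return (user_id, match_ids) pairs; match_ids joins the other group members with ';'."""
    members_by_group = defaultdict(list)
    for uid, label in zip(user_ids, group_labels):
        members_by_group[label].append(uid)

    rows = []
    for uid, label in zip(user_ids, group_labels):
        # Everyone in the same group except the user themselves.
        others = [str(m) for m in members_by_group[label] if m != uid]
        rows.append((uid, ";".join(others)))
    return rows


# Toy input (hypothetical): users 1, 93, 217 share group 0; users 5, 8 share group 1.
rows = groups_to_match_rows([1, 93, 217, 5, 8], [0, 0, 0, 1, 1])
print("user_id,match_ids")
for uid, match_ids in rows:
    print(f"{uid},{match_ids}")
```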

@titipata
Owner

Yes, I do have it! I will put the code up as well. Mainly, we apply hierarchical clustering on top of the distance matrix calculated in the mind-matching part!
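
As a rough illustration of that idea (not the repository's actual code), the sketch below runs SciPy hierarchical clustering on a square distance matrix and reads off the leaf order. The toy data, the `average` linkage method, and the Euclidean metric are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)

# Toy stand-in for mind-matching features: 3 tight blobs of 4 users each, shuffled.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((4, 2)) for c in centers])
perm = rng.permutation(len(X))
X = X[perm]
blob_of_row = perm // 4  # which blob each shuffled row came from

# Square, symmetric distance matrix with a zero diagonal.
D = squareform(pdist(X, metric="euclidean"))

# linkage() expects the condensed (upper-triangle) form of the matrix.
condensed = squareform(D)
Z = hierarchy.linkage(condensed, method="average", optimal_ordering=True)

# Leaf order of the dendrogram: similar users end up next to each other.
order = hierarchy.leaves_list(Z)
print("row order:", order)
print("blob of each row in that order:", blob_of_row[order])
```

With `optimal_ordering=True`, SciPy reorders the dendrogram leaves so that adjacent leaves are as close as possible, which is what makes the resulting order useful for grouping.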

@inezpereira
Copy link
Author

Perfect! :D
Oh, I am really looking forward to checking it out, then!

@titipata
Owner

@inezpereira I put the group matching code here. This is the version that I use to rearrange users into order (by distance) and group them. It still needs some tweaks and a lot of documentation tho!

@inezpereira
Author

inezpereira commented May 20, 2020

Very cool, just got a chance to try it out.
A few questions:

  • cluster is an array with the corresponding cluster number for subject at index i, right?
  • Is the t argument under hcluster.fcluster() what I then need to tweak to get groups of size n? How can I do this in a principled way? Right now, with the current value for t=0.01, I get a cluster for each subject (on my data), which according to the documentation means t is too small.
  • In users_group_df the first column is called user_id, but actually it's cluster_id, right?

@titipata
Owner

  • Yes. I use optimal ordering (here), so the row order of the final dataframe should already be good. Basically, I just need to group every 5-6 consecutive users for the final use (or hand it to someone to tweak a bit).
  • Yes, t=0.01 is the threshold for clustering here. However, it tends to cluster everything into one big group, so we have to be careful when using it. I tend to set it low first and then tune it up a little to see what happens.
  • users_group_df is basically users_df rearranged into that order; you can assign groups later based on this ordered output. user_id is still the ID of each user here.
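
To see how the threshold behaves, the sketch below counts the flat clusters that `fcluster()` returns for a few values of `t`. It assumes `hcluster` is an alias for `scipy.cluster.hierarchy` and uses `criterion="distance"` explicitly; the thread does not say which criterion the repository code uses, and the random data is purely illustrative.

```python
import numpy as np
import scipy.cluster.hierarchy as hcluster  # assumed meaning of `hcluster` in the thread
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((30, 5))  # 30 toy "users" with 5 features each
Z = hcluster.linkage(pdist(X), method="average")

heights = Z[:, 2]  # distance at which each merge happens
thresholds = {
    "below the smallest merge": 0.5 * heights.min(),
    "median merge height": float(np.median(heights)),
    "above the largest merge": 1.1 * heights.max(),
}

counts = {}
for name, t in thresholds.items():
    labels = hcluster.fcluster(Z, t=t, criterion="distance")
    counts[name] = len(np.unique(labels))
    print(f"t={t:.3f} ({name}): {counts[name]} clusters")

# Alternative: ask for at most k flat clusters instead of tuning a distance.
labels_k = hcluster.fcluster(Z, t=5, criterion="maxclust")
print("cluster sizes with maxclust=5:", np.bincount(labels_k)[1:])
```

Neither criterion guarantees equal-sized groups, which is one reason to use the ordered output and cut it into fixed-size chunks instead.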

@inezpereira
Author

inezpereira commented May 20, 2020

  1. @titipata ok, so are you saying that I can just take consecutive chunks of the ordered users_group_df and define those as groups?
    For example, if I want groups of 6:
    users_group_df.iloc[0:6, :] # group 1
    users_group_df.iloc[6:12, :] # group 2
    etc.
    Do I understand you correctly?

  2. Also: I thought that, with lower values of t, each individual would be a cluster, so you wouldn't have one big cluster but as many clusters as you have subjects. At least that is what the documentation indicates. So I thought that you were doing the following: (1) get the optimal ordering using linkage() and then, (2) when using fcluster(), just set t to a very small value to not group people by clusters but instead give us back an ordered list of the user_id's. Did I get this at least partially right?
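
The consecutive-chunk idea from point 1 can be sketched as follows, assuming `users_group_df` is already in the optimal order. The helper `assign_consecutive_groups` and the `group_id` column are hypothetical names, and the toy dataframe only stands in for the real one.

```python
import numpy as np
import pandas as pd


def assign_consecutive_groups(ordered_df, group_size=6):
    """Label consecutive rows of an already-ordered dataframe with a 1-based group id."""
    out = ordered_df.reset_index(drop=True)
    # Rows 0..group_size-1 get group 1, the next group_size rows get group 2, and so on.
    return out.assign(group_id=np.arange(len(out)) // group_size + 1)


# Toy stand-in for the ordered users_group_df (14 users, so the last group is short).
users_group_df = pd.DataFrame({"user_id": [251, 1, 56, 72, 4, 3, 9, 12, 40, 41, 77, 80, 90, 95]})
grouped = assign_consecutive_groups(users_group_df, group_size=6)
group_sizes = grouped["group_id"].value_counts().sort_index().tolist()
print(grouped)
print("group sizes:", group_sizes)
```

When the number of users is not a multiple of the group size, the last group is smaller; how to handle that remainder (merge it, or spread it over other groups) is a separate decision.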

Thanks again for sharing the code and for being so responsive in dealing with issues and questions!

@titipata
Owner

Yes to both! I'll elaborate on this more clearly soon!

@inezpereira
Author

inezpereira commented May 20, 2020

Ok cool! I'll add a third question to the lot:
3. Right now, my dataset does not include abstracts per se, just the fields of interest in the abstracts column of the input csv file. And a lot of subjects have exactly the same word or pair of words (meaning the strings are exactly the same) in the abstracts column.
However, when I look at the users_group_df, not all subjects saying they are interested in the exact same field get grouped together. I see you add some randomness to the process. However, removing A_rand from the linkage() call (and thus only looking at A_cluster, instead of A_cluster + A_rand) doesn't solve the issue.

Again, let me illustrate. Without randomness, I would expect users_group_df to look like:
251, John Doe, field A
1, Jane J. Doe, field A
56, James Doe, field A
72, Richard Anonymous, field B
4, Joan Doe, field B
3, Et Al, field C

Where the subjects are neatly grouped by field of interest.

Instead, I get something like:
1, Jane J. Doe, field A
56, James Doe, field A
72, Richard Anonymous, field B
3, Et Al, field C
4, Joan Doe, field B
251, John Doe, field A

Do you know why this persists even after removing randomness (A_rand) from the process?

@titipata
Owner

@inezpereira Maybe the random distance is too small? Depending on your matrix, you might have to try adding a random ±1 matrix instead, since in your case you have a count of overlapping terms. In the example, it's a cosine distance, so adding a small ±0.01 of randomness can tweak the result a bit.

I haven't tried experimenting with adding multiple randomnesses but will try to follow up on this later!
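
One way to experiment with the noise scale is sketched below: it adds symmetric uniform noise in `[-scale, scale]` to a distance matrix while keeping the diagonal at zero and the entries non-negative. The function `add_symmetric_noise` is hypothetical and only mirrors the `A_cluster + A_rand` idea from the thread; the repository may build `A_rand` differently.

```python
import numpy as np


def add_symmetric_noise(A, scale, seed=None):
    """Return A plus symmetric uniform noise in [-scale, scale], with a zero diagonal."""
    rng = np.random.default_rng(seed)
    noise = np.triu(rng.uniform(-scale, scale, size=A.shape), k=1)
    noise = noise + noise.T  # mirror the upper triangle so the result stays symmetric
    A_noisy = np.clip(A + noise, 0.0, None)  # distances must not go negative
    np.fill_diagonal(A_noisy, 0.0)
    return A_noisy


# Toy symmetric distance matrix with values in [0, 1), similar in range to a cosine distance.
rng = np.random.default_rng(1)
B = rng.random((5, 5))
A = (B + B.T) / 2
np.fill_diagonal(A, 0.0)

A_noisy = add_symmetric_noise(A, scale=0.01, seed=0)
print(np.abs(A_noisy - A).max())
```

For a matrix of integer term-overlap counts, a `scale` of 0.01 is tiny relative to the entries, so a larger value such as 1 would be the analogous perturbation.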
